
Page 1:

Build GPU Cluster Hardware for Efficiently Accelerating CNN Training

YIN Jianxiong, Nanyang Technological University

[email protected]

Page 2:

Visual Object Search

Private Large-scale Visual Object Database

Domain-Specific Object Search Engine

Singapore Smart City Plan

Page 3:

Deep Learning Dedicated Cluster

2013: Servers w/ 2 GPUs: 24x E5-2630, 5x K20m, 4x Titan Black, 4x GTX 780

2014: 4x 4-GPU servers: 8x E5-2630v2, 16x K40

2015: 8-GPU server test cluster: 4x E5-2650v2, 16x K40, 4x K80, 8x Titan X, 36x FDR IB switch

2016: Expanded 8-GPU server cluster: 12x E5-2650v2, 16x K40, 8x K80, 8x M60, 8x Titan X, 36x FDR IB switch

Page 4:

Parallel CNN Training is Important

CNNs are becoming deeper and more complicated, while single-thread performance has stopped increasing.

Page 5:

Frameworks Supporting Multi-Node Parallel Training

| Framework | Multi-GPU in Single Server | Multi-GPU across Servers | Data Parallelism | Model Parallelism |
|---|---|---|---|---|
| Caffe | Yes | No | Yes | No |
| CNTK | Yes | Yes | Yes | No |
| MXNet | Yes | Yes | Yes | Yes |
| TensorFlow | Yes | Yes | Yes | No |

HPC Facility Architecture

DNN Training Framework

Training Algorithm

Page 6:

Workflow and Job Handlers

The training workflow runs across host memory, the CPU, the GPUs, and the interconnects:

1. Mini-batch caching: host memory caches mini-batch samples.
2. Mini-batch preparation and loading: the CPU prepares each mini-batch and loads it to the GPUs.
3. Forward: the GPUs compute the forward pass.
4. Synchronization: AllReduce loss sync, or collect + broadcast handled by the GPUs or by the CPU (slower), over the interconnects.
5. Backward: the GPUs compute the backward pass.

Page 7:

Typical HPC Architecture

[Diagram: a typical HPC cluster built from typical HPC hardware; each node pairs one CPU and host RAM with a single GPU and an IB card behind the PCIe host bridge (PHB).]

Page 8:

GPU-Optimal HPC Architecture

• High GPU:CPU ratio (>2)
• High GPU density
• Direct P2P links
• Prioritized GPU cooling

[Diagram: a GPU-dense HPC cluster built from GPU-optimal HPC hardware; each node attaches eight GPUs (GPU #0-#7) in groups of four behind two PCI-E switches, alongside the CPUs, host RAM, and IB cards on the PHBs.]

• Lower TCO / monetary cost
• Lower maintenance cost
• Better performance
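As a rough illustration of why the GPU-dense design leans on PCI-E switches, the sketch below budgets PCIe lanes for a dual-socket node; the lane counts are typical figures (40 lanes per Xeon E5-class CPU, x16 per GPU, x8 for one IB HCA) rather than values taken from a specific board:

```python
# Back-of-the-envelope PCIe lane budgeting for a GPU-dense node (illustrative only).
LANES_PER_CPU = 40
LANES_PER_GPU = 16
LANES_FOR_IB = 8

def gpus_without_switch(n_cpus):
    # Direct attach: every GPU gets a full x16 link to a CPU's host bridge.
    usable = n_cpus * LANES_PER_CPU - LANES_FOR_IB
    return usable // LANES_PER_GPU

def gpus_with_switches(n_cpus, gpus_per_switch=4):
    # Each PCI-E switch fans one x16 uplink out to several GPUs, trading peak
    # host bandwidth for density and direct GPU-GPU P2P paths under the switch.
    uplinks = gpus_without_switch(n_cpus)
    return uplinks * gpus_per_switch

print(gpus_without_switch(2))   # 4 GPUs at full x16 on a dual-socket board
print(gpus_with_switches(2))    # 16 GPU slots behind switches (the deck's nodes use 8)
```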

Page 9:

Per-iteration time is lower-bounded by the GPU compute time and upper-bounded by compute plus synchronization and mini-batch preparation overhead; overall training throughput is the product of the factors listed below.

Key Performance Impactors and Corresponding Hardware

• Single-GPU computing capacity; optimized algorithms (e.g., cuDNN, fbFFT); number of examples fed in
• Sync message payload size; sync frequency; GPU-GPU interconnect topology; intra-/inter-node GPU-GPU link capacity; traffic optimization
• CNN model structure; GPU memory size

| | Data Parallelism | Model Parallelism |
|---|---|---|
| Sync timing | After every forward pass | After each layer is computed |
| Sync frequency | Proportional to # of iterations | Proportional to # of layers x # of iterations |
| Synced msg body | The trained model | Output activation volume of each layer |
| Synced msg size | Mostly large payload, >1MB | Very tiny payload, ~4KB |
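A quick back-of-the-envelope sketch of the two payload magnitudes; the parameter count and output width are illustrative assumptions (roughly a mid-2010s CNN with a 1000-way classifier), not figures from the deck:

```python
BYTES_PER_FLOAT = 4

# Illustrative assumptions only.
total_params = 10_000_000          # ~10M weights in the whole model
fc_output_per_sample = 1000        # e.g. num_classes=1000 logits

# Data parallelism: each sync carries gradients/weights for the whole model.
print(f"data parallel sync: ~{total_params * BYTES_PER_FLOAT / 2**20:.0f} MiB")

# Model parallelism: a sync at the partition boundary carries one layer's
# output activations, e.g. the 1000 logits of a single sample.
print(f"model parallel sync: ~{fc_output_per_sample * BYTES_PER_FLOAT / 2**10:.0f} KiB")
```

With these numbers the data-parallel message is tens of MiB while the model-parallel message is about 4 KiB, matching the ">1MB" versus "~4KB" rows above.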

Page 10:

GPU Card

Page 11:

Must-Have GPU Features

Software features:
• Updated cuDNN v4 support
• GPUDirect P2P/RDMA
• NVIDIA Collective Communication Library (NCCL) support

Hardware features:
• Single-precision performance matters
• Larger read cache preferred
• Maxwell cards strongly preferred over Kepler or other previous generations

Page 12:

VRAM Usage Decomposition

Model Structure Dependent

Batch Size Dependent

Multi-GPU training reduces the per-card VRAM footprint and enables large-model training (see the sketch after the setup details below).

H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, 8x NVIDIA GeForce Titan X, Mellanox ConnectX-3 Pro FDR IB adapter

S/W setup: Ubuntu 14.04.3, CUDA 7.5, GPU driver 352.39, 3.0-OFED.3.0.2.0.0.1, GDR, cuDNN v4 enabled

Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000
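The sketch below separates the two VRAM components for a toy layer list and shows how splitting a fixed global batch across more cards shrinks only the batch-size-dependent part; the layer shapes, and the omission of gradients, optimizer state, and cuDNN workspace, are simplifying assumptions rather than measurements from the deck:

```python
BYTES_PER_FLOAT = 4

# Hypothetical toy network: (name, parameter count, activation floats per sample).
layers = [
    ("conv1", 64 * 3 * 7 * 7,   64 * 112 * 112),
    ("conv2", 192 * 64 * 3 * 3, 192 * 56 * 56),
    ("fc",    1024 * 1000,      1000),
]

weights = sum(p for _, p, _ in layers) * BYTES_PER_FLOAT          # model-structure dependent
acts_per_sample = sum(a for _, _, a in layers) * BYTES_PER_FLOAT  # batch-size dependent

GLOBAL_BATCH = 128
for n_gpus in (1, 2, 4, 8):
    per_card_batch = GLOBAL_BATCH // n_gpus
    activations = acts_per_sample * per_card_batch
    print(f"{n_gpus} GPU(s): weights ~{weights / 2**20:.0f} MiB per card (replicated), "
          f"activations ~{activations / 2**20:.0f} MiB per card")
```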

Page 13:

GPU Memory and System Throughput

• Larger GPU memory allows a larger batch size, which pushes the GPU to higher utilization.

[Charts: GPU idling time for MXNet on 8x K80 chips training VGGNet, batch size 512 vs. batch size 256.]

Page 14:

GPU Throughput and Scalability

Titan X (12GB GDDR5, 384-bit), K40 (12GB GDDR5, 384-bit), Tesla M40 / Quadro M6000 (24GB VRAM)

Page 15:

GPU Form Factors and Availability

The right GPU cooler for a multi-GPU setup: when identical in specs, passively cooled cards offer:

1) Less inter-card heat interference, which increases stability
2) Lower operating temperature
3) No clock-speed throttling
4) Better resistance to cooling failure

(Passive cooling vs. active cooling)

Page 16:

GPU-GPU Interconnect

Page 17:

When and How to Sync

Reducing transfer time overhead involves four factors, each addressed in hardware or software:

• Sync schedule & rules: asynchronous vs. synchronous; frequent sync vs. long-interval sync vs. hybrid sync
• Sync payload size: large (>1MB) vs. small (<4KB); traffic optimization
• Reduce-handling processor: CPU handling, GPU handling, or hybrid handling
• Data path capacities: mainboard IO layout, bandwidth, latency

Page 18:

CPU vs. GPU Reducer

Overhead of LARD: AllReduce by GPUs
Overhead of LARC: AllReduce by CPU + GPUs
Overhead of CPU reduce-broadcast

When D2D and H2D/D2H bandwidth are similar. In practice, BW_H2D/D2H ≈ 1.5 x BW_D2D.

[Chart: reduce-handler benchmarking with K40 GPUs, comparing GPU, CPU, and hybrid reduce handlers across PIX, PXB, and QPI topologies.]

Page 19:

Extend the System to Multiple Nodes

Key issues: inter-node links have 1) much smaller link bandwidth (6GB/s), 2) longer latency (up to 10x that of a PCI-E link), and 3) limited interface bandwidth to the system, so they suffer from low intra-node bandwidth to peer GPU cards.

Pipeline compute & sync: sacrifices immediate consistency, which hurts accuracy.

Traffic optimization: local aggregation, message data compression (a sketch of local aggregation follows below).

H/W architecture innovation: faster inter-node links, bottleneck-removing P2P alternatives.
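Here is a minimal sketch of the local-aggregation idea with mpi4py, assuming one process per GPU and host-side buffers: gradients are first summed inside each node, only one pre-aggregated message per node crosses the slow inter-node link, and the result is broadcast back locally. Frameworks usually implement this inside their communication layer; the names and sizes here are illustrative:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Processes on the same machine share a node communicator.
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED, key=rank)
is_leader = node_comm.Get_rank() == 0
# One leader per node joins the inter-node communicator.
leader_comm = comm.Split(color=0 if is_leader else MPI.UNDEFINED, key=rank)

local = np.random.rand(1 << 20).astype(np.float32)   # hypothetical gradient buffer
node_sum = np.empty_like(local) if is_leader else None

# Step 1: aggregate inside the node over fast PCIe/NVLINK-class links.
node_comm.Reduce(local, node_sum, op=MPI.SUM, root=0)

# Step 2: only one (already aggregated) message per node crosses the IB link.
global_sum = np.empty_like(local)
if is_leader:
    leader_comm.Allreduce(node_sum, global_sum, op=MPI.SUM)

# Step 3: fan the combined result back out inside each node.
node_comm.Bcast(global_sum, root=0)
```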

Page 20:

Intra-Node P2P Topologies

[Diagrams: four candidate intra-node P2P topologies (Topo-1 through Topo-4), ranging from a single GPU per CPU/PHB with its own IB card, to multiple GPUs sharing a PHB, to GPUs grouped behind PCI-E switches.]
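To see which of these topologies a given node actually implements, the PCIe/NUMA placement of the GPUs and the IB NIC can be read from the driver; a small sketch (requires an NVIDIA driver with `nvidia-smi` on the PATH):

```python
import subprocess

# Print the GPU / IB-NIC connection matrix. Labels such as PIX (same PCIe
# switch), PXB (multiple PCIe bridges), PHB (same PCIe host bridge) and
# QPI/SYS (crossing the CPU interconnect) distinguish Topo-1 through Topo-4.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```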

Page 21:

Both Link Bandwidth and Latency Matter for Data Parallelism

H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, NVIDIA Tesla K40, Mellanox ConnectX-3 Pro FDR IB adapter

S/W setup: Ubuntu 14.04.3, CUDA 7.5, GPU driver 352.39, 3.0-OFED.3.0.2.0.0.1, GDR, cuDNN v4 enabled

Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000

[Chart: typical root-node traffic over IB (Caffe variation).]

Page 22:

NUMA-Based or Chassis-Based

H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, 8x NVIDIA Tesla K40, Mellanox ConnectX-3 Pro FDR IB adapter

S/W setup: Ubuntu 14.04.3, CUDA 7.5, GPU driver 352.39, 3.0-OFED.3.0.2.0.0.1, GPUDirect P2P/RDMA, cuDNN v4 enabled

Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000

[Charts: chassis-based vs. NUMA-based GPU placement.]

Page 23:

Other Major Components

Page 24:

Pipelined CPU-Handled IO Overhead

After initializing training, each iteration goes through four phases: Prepare Batch (Phase #1, CPU computing), Forward (Phase #2, GPU computing), Sync & Combine Loss (Phase #3), and Backward (Phase #4, GPU computing). Phase #3 breaks down into three sub-steps (Phases #3.1-#3.3): transfer loss, combine loss, distribute combined loss. While the GPUs run iteration #N, the CPU prepares the batch for iteration #N+1; otherwise it sits waiting.

**The phase lengths in the diagram are not proportional to the actual time spent on each sub-task; the actual time cost depends on the structure of the model.

*Phase #3 handling is implementation dependent, e.g., MXNet supports both GPU and CPU.
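A minimal single-process sketch of the Phase #1 overlap: a background thread keeps a small queue of prepared batches filled while the main loop consumes them, so batch preparation is hidden behind the (here simulated) GPU phases. The timings and function bodies are placeholders, not measurements:

```python
import queue
import threading
import time

def prepare_batch(i):
    # Stand-in for Phase #1: decode, augment, and stage one mini-batch on the CPU.
    time.sleep(0.01)
    return f"batch-{i}"

def producer(n_batches, q):
    # Runs in a background thread so Phase #1 overlaps with Phases #2-#4.
    for i in range(n_batches):
        q.put(prepare_batch(i))
    q.put(None)                      # sentinel: no more batches

def train_step(batch):
    # Stand-in for Forward -> Sync & Combine Loss -> Backward on the GPUs.
    time.sleep(0.03)

q = queue.Queue(maxsize=4)           # small prefetch buffer in host RAM
threading.Thread(target=producer, args=(100, q), daemon=True).start()

while True:
    batch = q.get()
    if batch is None:                # all iterations done
        break
    train_step(batch)                # the CPU is already preparing the next batches
```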

Page 25:

CPU Selection

| Spec | Empirical Minimum Suggestion | Elaboration |
|---|---|---|
| CPU series | Xeon or Core does not matter | Xeon processors are preferred because they come with better motherboards. |
| CPU micro-architecture | Ivy Bridge EP or newer; Sandy Bridge CPUs not recommended [1] | A Sandy Bridge processor's PHB throughput is capped at 0.8GB/s to 5.56GB/s for small and large messages respectively. |
| PCIe lanes | CPU with 40 PCI-E lanes | Even among EP-architecture processors, there are still low-PCIe-lane models. |
| CPU cores | 2x CPU threads for each GPU chip | The CPU mainly handles IO requests and decoding of batch samples. |
| CPU clock | Base clock 2.4GHz, up to 3.2GHz | IO handling overhead can be hidden by the pipeline, so CPU clock has limited effect. |

[1] "Efficient Data Movement on GPU Accelerated Clusters" (GTC 2014)
[2] "MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand" (GTC 2013)
[3] Tim Dettmers: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/

Page 26:

Host RAM & Storage Selection

Host RAM:
• Total amount: 2x total VRAM size. For example, 8x Titan X at 12GB each is 96GB of VRAM, suggesting 192GB of host RAM, which matches the test systems above.
• Specs: DDR3-1066 vs. DDR3-1866 hardly makes a difference, because H2D memory IO is hidden by the pipeline.
• Save the extra RAM budget and invest in SSDs, GPU cards, and servers with better layout and cooling efficiency.

Storage (implementation dependent):
• SSDs are good for checkpoint-saving (CPS) speed, which is fairly important for fault tolerance.
• NFS storage hurts CPS speed.
• Local SATA drives are preferred, to save PCIe lane resources.

Page 27:

Multi-GPU-Friendly Server Airflow Design

[Diagrams: candidate CPU/GPU placements relative to the chassis airflow. The GPUs are the major power horse, so their cooling comes first in the layout.]

Page 28:

Wrap-up

Page 29:

Design Goals and System Architecture Mapping

| Major Metric | Requirement Input | Relevant Key Specs |
|---|---|---|
| Throughput | Unit GPU computing capacity | GPU processor density, GPU architecture, GPU clock speed, GPU RAM size, GPU memory bandwidth |
| Scalability | Node IO layout; inter-node link capacity | PCIe version, PCIe 3.0 x16 slot qty, PCIe 3.0 x8 slot qty, PCIe switch layout, link latency |
| Monetary | Per-throughput $; scale-up $ | PCIe 3.0 x16 slot qty versus CPU socket qty ratio |

Page 30:

Wish-List Architecture

• P2P links with large bandwidth and low latency, intra- and inter-node
• Large, high-bandwidth dedicated local GPU RAM
• Decoupled to-GPU storage IO and P2P compute IO

Mapping to goals and hardware:
• Pipeline throughput: NVLINK + PCIe/CAPI
• Maximize performance: Pascal + HBM
• Scalability: NVLINK + EDR/XDR; avoid bottleneck links, partition the workload, balance the traffic
• Maximize processor utilization: hide overhead and delay

Page 31:

Server Node Recommendations

[Diagrams: recommended server node configurations, labeled "80GB/s G2G Link", "Max GPU Density", and "Balanced". The variants range from EDR-ready 4-GPU nodes with the GPUs behind a PCI-E switch (PIX peer links) on one PHB, to GPU-dense 8-GPU nodes with four GPUs behind each of two PCI-E switches; every node keeps its CPUs, host RAM, and IB/EDR cards.]

Page 32:

Acknowledgement

Credits: Pradeep Gupta (NVIDIA), Project Manager; Wang Xingxing (ROSE Lab, NTU, Singapore); Xiao Bin (ex-WeChat, Tencent)

Funding Agencies

Industry Partner

Open-­source Project:

Page 33:

Q & A