high-performance broadcast for streaming and deep...

27
High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu [email protected] Department of Computer Science and Engineering The Ohio State University

Upload: others

Post on 19-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

High-PerformanceBroadcastforStreamingandDeepLearning

Ching-HsiangChu

[email protected]

TheOhioStateUniversity

Page 2: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 2NetworkBasedComputingLaboratory

• Introduction

• ProposedDesignsinMVAPICH2-GDR

• PerformanceEvaluation

• ConcludingRemarks

Outline

Page 3: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 3NetworkBasedComputingLaboratory

TrendsinModernHPCArchitecture

• Multi-core/many-coretechnologies

• HighPerformanceInterconnects

• Accelerators/Coprocessorsarebecomingcommoninhigh-endsystems

• HighPerformanceStorageandComputedevices

Accelerators/Coprocessorshighcomputedensity,high

performance/watt>1Tflop/sDPonachip

HighPerformanceInterconnects–InfiniBand(IB),Omni-Path

<1μseclatency,100GbpsBandwidth>Multi-coreProcessors SSD,NVMe-SSD,NVRAM

Tianhe– 2 TitanK- ComputerSunwayTaihuLight

Page 4: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 4NetworkBasedComputingLaboratory

ArchitecturesforDeepLearning(DL)

Multi-coreCPUsacrossnodes Multi-coreCPUs+SingleGPUacrossnodes

Multi-coreCPUswithinanode Multi-coreCPUs+Multi-GPUwithinanode

Multi-coreCPUs+Multi-GPUacrossnodes

PastandCurrentTrend Near-future

E.g.,NVIDIADGX-1systems

IBNetworks

IBNetworks

IBNetworks

Page 5: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 5NetworkBasedComputingLaboratory

• StreamingapplicationsonHPCsystems1. Communication(MPI)

• Broadcast-typeoperations

2. Computation(CUDA)• MultipleGPUnodesasworkers

StreamingApplicationsDataSource

SenderHPCresourcesforreal-timeanalytics

Real-timestreaming

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

WorkerCPUGPU

GPU

Datastreaming-likebroadcastoperations

Page 6: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 6NetworkBasedComputingLaboratory

• ComputationusingGPU

• CommunicationusingMPI– Exchanging partial gradients aftereachminibatch

– All-to-all(Multi-Source)communications

Ø E.g.,MPI_Bcast

• Challenges– Highcomputation-communicationoverlap

– Goodscalability forupcominglarge-scaleGPUclusters

– Noapplication-levelmodification

High-performanceDeepLearning

GPUNode1

GPUNode2 GPUNode4

GPUNode3

Page 7: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 7NetworkBasedComputingLaboratory

• Introduction

• ProposedDesignsinMVAPICH2-GDR

• PerformanceEvaluation

• ConcludingRemarks

Outline

Page 8: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 8NetworkBasedComputingLaboratory

IBHCA

CPU

GPU

Source

IBSwitch

Header

Data

IBHCA

CPU

GPU

Destination1Header

Data

IBHCA

CPU

GPU

DestinationNHeader

Data

1.IBGather+GDRRead2.IBHardwareMulticast3.IBScatter+GDRWrite

• ForGPU-residentdata,using– GPUDirectRDMA(GDR)

– InfiniBandHardwareMulticast(IB-MCAST)

• Overhead– IBUDlimit

– GDRlimit

HardwareMulticast-basedBroadcast

A.Venkatesh,H.Subramoni,K.Hamidouche,andD.K.Panda,“AHighPerformanceBroadcastDesignwithHardwareMulticastandGPUDirectRDMAforStreamingApplicationsonInfiniBandClusters,”inHiPC 2014,Dec2014.

Page 9: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 9NetworkBasedComputingLaboratory

• HeterogeneousBroadcastforstreamingapplicationsØ Free-upPCIeresources

HardwareMulticast-basedBroadcast(con’t)

NodeNIBHCA

IBHCA

CPU

GPU

Source

IBSwitch

GPU

CPUNode1

Multicaststeps

C Data

C

IBSLstep

Data

IBHCAGPU

CPU

Data

C

C.-H.Chu,K.Hamidouche,H.Subramoni,A.Venkatesh,B.Elton,andD.K.Panda,"DesigningHighPerformanceHeterogeneousBroadcastforStreamingApplicationsonGPUClusters,"SBAC-PAD'16,Oct.26-28,2016.

Page 10: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 10NetworkBasedComputingLaboratory

• PreparingIntermediatebuffer(im_buf)– Page-locked(pinned)hostbuffer

Ø FastDevice-Hostdatamovement

– AllocatedatinitializationphaseØ Lowoverhead

• Streaming datathroughhost– Fine-tunedchunkeddata

– Asynchronouscopyoperations

Ø Three-stagepipeline

IBHCA

CPU

GPU

Source

IBSwitch

Header

d_out

1.DataPreparation2.IBGather3.IBHardwareMulticast

im_buf

OptimizedBroadcastSend

MPI_Bcast(d_out,…)

C.-H.Chu,X.Lu,A.A.Awan,H.Subramoni,J.Hashmi,B.EltonandD.K.Panda.,"EfficientandScalableMulti-SourceStreamingBroadcastonGPUClustersforDeepLearning,"ICPP2017,Aug14-17,2017.

Page 11: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 11NetworkBasedComputingLaboratory

• Zero-copybroadcastreceive– Pre-posteduserbuffer(d_in)

– Avoidsadditionaldatamovement

– LeveragesIBScatterandGDRfeatures

Ø Low-latency

Ø Free-upPCIeresourcesforapplications

IBSwitch

IBHCA

CPU

GPU

Destination1Header

d_in

IBHCA

CPU

GPU

DestinationNHeader

d_in

IBHardwareMulticastIBScatter(GDRWrite)

OptimizedBroadcastReceiveMPI_Bcast(d_in,…)

C.-H.Chu,X.Lu,A.A.Awan,H.Subramoni,J.Hashmi,B.EltonandD.K.Panda.,"EfficientandScalableMulti-SourceStreamingBroadcastonGPUClustersforDeepLearning,"ICPP2017,Aug14-17,2017.

Page 12: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 12NetworkBasedComputingLaboratory

• ProposedIntra-nodeTopology-AwareBroadcast– CUDAInterProcessCommunication(IPC)

BroadcastonMulti-GPUsystems

Node1

IBSwitch

GPU0 GPU1 GPUN

NodeNGPU

CPU

Source

GPU

CPU

CPUMulticaststeps

cudaMemcpy(Device↔Device)

C.-H.Chu,K.Hamidouche,H.Subramoni,A.Venkatesh,B.Elton,andD.K.Panda,"DesigningHighPerformanceHeterogeneousBroadcastforStreamingApplicationsonGPUClusters,"SBAC-PAD'16,Oct.26-28,2016.

Page 13: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 13NetworkBasedComputingLaboratory

• Whenareceiverexperiencestimeout(lostMCASTpacket)– PerformstheRMAGetoperationtothesender’sbackupbuffertoretrieve

lostMCASTpackets

– Senderisnotinterrupted

EfficientReliabilitySupportforIB-MCAST

BroadcastreceiverBroadcastsenderIBHCA IBHCA MPI

Timeout

MPITime

C.-H.Chu,K.Hamidouche,H.Subramoni,A.Venkatesh,B.Elton,andD.K.Panda,"EfficientReliabilitySupportforHardwareMulticast-basedBroadcastinGPU-enabledStreamingApplications,"COMHPCWorkshop,2016.

Page 14: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 14NetworkBasedComputingLaboratory

• Introduction

• ProposedDesignsinMVAPICH2-GDR

• PerformanceEvaluation

• ConcludingRemarks

Outline

Page 15: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 15NetworkBasedComputingLaboratory

• OhioStateUniversity(OSU)Micro-Benchmark(OMB)http://mvapich.cse.ohio-state.edu/benchmarks/

– osu_bcast- MPI_BcastLatencyTest

– osu_bcast_streaming – MPI_Bcast streamingTest

• Deeplearningframework:CUDA-AwareMicrosoftCognitiveToolkit(CA-CNTK)*– AlexNetandVGGmodelswithImageNetdataset

ExperimentalEnvironments

*D.S.Banerjee,K.HamidoucheandD.K.Panda,"Re-DesigningCNTKDeepLearningFrameworkonModernGPUEnabledClusters," IEEECloudCom,LuxembourgCity,2016,pp.144-151.

Page 16: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 16NetworkBasedComputingLaboratory

• @RI2cluster,16GPUs,1GPU/node

1

10

100

1000

10000

4K 8K 16K

32K

64K

128K

256K

512K 1M 2M 4M 8M 16M

Latency(μs)

MessageSize(bytes)

MV2-GDR-Knomial MV2-GDR-RingMCAST-GDR MCAST-GDR-Opt

BenchmarkEvaluation

1

10

100

1000

10000

2 4 8 16

Latency(μs)

NumberofGPUnodes

2MBMessage

MV2-GDR-Knomial MV2-GDR-RingMCAST-GDR MCAST-GDR-Opt

Lowerisbetter

Near-Constant65%

• Providenear-constantlatencyoverthesystemsizes• Reducesupto65%oflatencyforlargemessages

HitGDRreadlimit

C.-H.Chu,X.Lu,A.A.Awan,H.Subramoni,J.Hashmi,B.EltonandD.K.Panda.,"EfficientandScalableMulti-SourceStreamingBroadcastonGPUClustersforDeepLearning,"ICPP2017,Aug14-17,2017.

Page 17: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 17NetworkBasedComputingLaboratory

0

10

20

30

40

50

60

1 2 4 8 16 32 64 128

256

512 1K 2K 4K 8K 16K

Latency(μs)

MessageSize(Bytes)

MCAST-GDR-OPT MCAST-GDR

• IB-MCAST+GDR+Topology-awareIPC-basedschemes– Upto58%and79%reduction

forsmallandlargemessages

StreamingBenchmark@CSCS(88GPUs)

58%

0

2000

4000

6000

8000

10000

12000

32K 64K 128K 256K 512K 1M 2M 4M

Latency(μs)

MessageSize(Bytes)

MCAST-GDR-OPT MCAST-GDR

79%

C.-H.Chu,K.Hamidouche,H.Subramoni,A.Venkatesh,B.Elton,andD.K.Panda,"DesigningHighPerformanceHeterogeneousBroadcastforStreamingApplicationsonGPUClusters,"SBAC-PAD'16,Oct.26-28,2016.

Page 18: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 18NetworkBasedComputingLaboratory

• @RI2cluster,16GPUs,1GPU/node:– CUDA-AwareMicrosoftCognitiveToolkit(CA-CNTK)withoutmodification

DeepLearningFrameworks

0

100

200

300

8 16

TrainingTim

e(s)

NumberofGPUnodes

AlexNetmodelMV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt

0

1000

2000

3000

8 16

TrainingTim

e(s)

NumberofGPUnodes

VGGmodelMV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt

Lowerisbetter

15% 24% 6%15%

• Reducesupto24%and15%oflatencyforAlexNetandVGGmodels• Higherimprovementisexpectedforlargersystemsizes

Page 19: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 19NetworkBasedComputingLaboratory

• High-performancebroadcastschemestoleverageGDRandIB-

MCASTfeatures forstreaminganddeeplearningapplications

– Optimizedstreamingdesignforlargemessagestransfers

• High-performancereliabilitysupportforIB-MCAST

Ø ThesefeaturesareincludedinMVAPICH2-GDR2.3a

Ø http://mvapich.cse.ohio-state.edu/

Ø http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/

ConcludingRemarks

Page 20: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 20NetworkBasedComputingLaboratory

[email protected]

Network-BasedComputingLaboratoryhttp://nowlab.cse.ohio-state.edu/

TheMVAPICH2Projecthttp://mvapich.cse.ohio-state.edu/

[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance HeterogeneousBroadcast for Streaming Applications on GPU Clusters, " SBAC-PAD'16, Oct. 26-28, 2016.

[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda., "Efficient and Scalable Multi-Source StreamingBroadcast on GPU Clusters for Deep Learning, " ICPP 2017, Aug 14-17, 2017.

[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for HardwareMulticast-based Broadcast in GPU-enabled Streaming Applications, " COMHPC Workshop, 2016.

[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA forEfficient Broadcast , ” submitted to IEEE TPDS. (Under review)

Page 21: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 21NetworkBasedComputingLaboratory

ThankYou!

Network-BasedComputingLaboratoryhttp://nowlab.cse.ohio-state.edu/

TheMVAPICH2Projecthttp://mvapich.cse.ohio-state.edu/

• Joinusformoretechtalks fromMVAPICH2team– http://mvapich.cse.ohio-state.edu/conference/677/talks/

Page 22: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 22NetworkBasedComputingLaboratory

EvaluationParameters

IBHCA

CPU

GPU

Bandwidth

𝑩𝑮

𝑩𝑯 ≫ 𝑩𝑮

𝑩𝑷𝑪𝑰𝒆

Notation Meaning Unit𝒏 Numberofprocesses N/A𝒎 Numberofbroadcastsources N/A𝒕𝒔 Setuptimeforsendingdata sec

𝒕𝒐(𝒏) OverheadforissuinganIB-MCASTpacket sec𝑴 Originalmessagesize bytes𝑪 Sizeofadatachunk bytes

𝑼 MaximumTransmissionUnitforIB-MCAST,providedbyhardwaremanufacturer bytes

𝑩𝑯 BandwidthofreadingHostmemory bytes/sec

𝑩𝑮BandwidthofreadingGPUmemory

(NVIDIAGPUDirect RDMA) bytes/sec

𝑩𝑷𝑪𝑰𝒆PCIeBandwidthbetweenHostandGPU

memory bytes/sec

Message

𝑴

𝑪

𝑼

Page 23: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 23NetworkBasedComputingLaboratory

• Direct • Pipeline • Staging

Ring-basedBroadcast

IBHCACPU

GPU

Source

DataIBHCA

CPU GPUDestination1

Data IBHCA

CPU GPU

Destination3

Data

GDRReadGDRWriteNetworkTransfer

IBHCA

CPU GPU

Destination2

Data

PoorScalability

(𝑛 − 1)× 𝑡7 +𝑀𝐵;

𝑀𝐶+ (𝑛 − 2) × 𝑡7 +

𝐶𝐵;

𝑀𝐵>?@A

+ (𝑛 − 1)× 𝑡7 +𝑀𝐵B

Page 24: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 24NetworkBasedComputingLaboratory

• Direct • Pipeline • Staging

K-nomial-basedBroadcast

Non-optimizedScalability

IBHCACPU

GPU

Source

Data

IBHCA

CPU GPUDestination1

Data

IBHCA

CPU GPU

Destination3

DataGDRReadGDRWriteNetworkTransfer

IBHCACPU

GPU

Destination2

Data

logF 𝑛 × 𝑡7 +𝑀𝐵;

𝑀𝐶× logF 𝑛 × 𝑡7 +

𝐶𝐵;

𝑀𝐵>?@A

+ logF 𝑛 × 𝑡7 +𝑀𝐵B

Page 25: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 25NetworkBasedComputingLaboratory

OverlapOpportunities

BroadcastfromNodeC

BroadcastfromNodeA

BroadcastfromNodeBTimeline

HCA

CPU

GPU

GPU

CPU

HCANod

eB

Nod

eC

GPUCPUHCAN

odeA

:cudaMemcpyAsync:IBHardwareMulticast:cudaStreamSynchronize:GDRWrite

OverlapAcrossNodes

Overlapwithinanode

Page 26: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 26NetworkBasedComputingLaboratory

• NVIDIAGPUDirect[1]

– Remotedirectmemoryaccess(RDMA)transfersbetweenGPUsandotherPCIedevices⇒ GDR

– andmore…

• InfiniBand(IB)hardwaremulticast(IBMCAST)[2]

– Enablesefficientdesignsofbroadcastoperations

• Host-based[3]

• GPU-based[4]

MCAST-basedBroadcast

[1]https://developer.nvidia.com/gpudirect[2]PfisterGF.,“AnIntroductiontotheInfiniBandArchitecture.”HighPerformanceMassStorageandParallelI/O,Chapter42,pp617-632,Jun2001.[3]J.Liu,A.R.Mamidala,andD.K.Panda,“FastandScalableMPI-levelBroadcastusingInfiniBand’sHardwareMulticastSupport,”inIPDPS2004,p.10,April2004.[4]A.Venkatesh,H.Subramoni,K.Hamidouche,andD.K.Panda,“AHighPerformanceBroadcastDesignwithHardwareMulticastandGPUDirectRDMAforStreamingApplicationsonInfiniBandClusters,”inHiPC 2014,Dec2014.

Page 27: High-Performance Broadcast for Streaming and Deep Learningmvapich.cse.ohio-state.edu/static/media/talks/...A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, “A High Performance

OSUBooth- SC17 27NetworkBasedComputingLaboratory

• Extendthedesignforotherbroadcast-basedcollective

algorithmsaswellasnon-blockingoperations

– Allreduce,Allgather,…,andsoon

FutureWork