TRANSCRIPT
High-Performance Broadcast for Streaming and Deep Learning
Ching-Hsiang Chu
The Ohio State University
OSU Booth, SC17
Outline
• Introduction
• Proposed Designs in MVAPICH2-GDR
• Performance Evaluation
• Concluding Remarks
Trends in Modern HPC Architecture
• Multi-core/many-core technologies
• High-performance interconnects
• Accelerators/coprocessors are becoming common in high-end systems
• High-performance storage and compute devices
[Figure: accelerators/coprocessors offer high compute density and high performance per watt (>1 Tflop/s double precision on a chip); high-performance interconnects such as InfiniBand (IB) and Omni-Path deliver <1 μs latency and 100 Gbps bandwidth; multi-core processors; SSD, NVMe-SSD, NVRAM storage; example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight]
Architectures for Deep Learning (DL)
• Past and current trend: multi-core CPUs within a node; multi-core CPUs across nodes over IB networks; multi-core CPUs + a single GPU across nodes; multi-core CPUs + multi-GPU within a node (e.g., NVIDIA DGX-1 systems)
• Near future: multi-core CPUs + multi-GPU across nodes over IB networks
Streaming Applications
• Streaming applications on HPC systems involve:
  1. Communication (MPI): broadcast-type operations
  2. Computation (CUDA): multiple GPU nodes as workers
[Figure: a data source streams in real time to a sender on HPC resources for real-time analytics; data-streaming-like broadcast operations fan the data out to worker nodes, each with a CPU and multiple GPUs]
High-Performance Deep Learning
• Computation using GPUs
• Communication using MPI: exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communications, e.g., MPI_Bcast (see the sketch below)
• Challenges:
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification
[Figure: four GPU nodes, each broadcasting its gradients to the others]
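The all-to-all pattern above requires no application changes; it is a series of MPI_Bcast calls, one rooted at each node. A minimal sketch, assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) so device pointers can be passed directly; the buffer layout (d_grads) and count are hypothetical:

```c
/* Sketch: multi-source gradient exchange as n concurrent broadcasts.
 * Assumes a CUDA-aware MPI, so the device pointers in d_grads go
 * straight to MPI_Bcast with no explicit host staging. */
#include <mpi.h>

/* d_grads[r] is a GPU buffer holding rank r's partial gradients;
 * every rank allocates all of them (receive space for the others). */
void exchange_gradients(float **d_grads, int count, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    /* Each rank takes a turn as broadcast root (all-to-all pattern). */
    for (int root = 0; root < size; ++root)
        MPI_Bcast(d_grads[root], count, MPI_FLOAT, root, comm);
}
```

With n ranks this puts n broadcasts on the network per minibatch, which is exactly the multi-source load the designs in the next section target.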
Outline
• Introduction
• Proposed Designs in MVAPICH2-GDR
• Performance Evaluation
• Concluding Remarks
Hardware Multicast-based Broadcast
• For GPU-resident data, uses (see the usage sketch below):
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overheads:
  – IB UD limit
  – GDR limit
[Figure: the source node (GPU, CPU, IB HCA) sends header + data through the IB switch to Destinations 1..N in three steps: 1. IB gather + GDR read, 2. IB hardware multicast, 3. IB scatter + GDR write]
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
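At the application level this path is invisible: broadcasting GPU-resident data is a plain MPI_Bcast on a device pointer. A minimal sketch, assuming a CUDA-aware MPI build; the MV2_USE_MCAST runtime parameter mentioned in the comment is MVAPICH2's knob for the multicast path, noted here as a usage hint rather than something stated on the slide:

```c
/* Sketch: broadcasting a GPU-resident buffer with CUDA-aware MPI.
 * With MVAPICH2-GDR, the hardware-multicast path is selected at run
 * time (e.g., MV2_USE_MCAST=1); application code is unchanged. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t count = 1 << 20;              /* hypothetical 1M floats */
    float *d_buf;
    cudaMalloc((void **)&d_buf, count * sizeof(float));

    /* Root holds the data in GPU memory; every rank receives into
     * GPU memory. No host staging appears at the application level. */
    MPI_Bcast(d_buf, (int)count, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```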
Hardware Multicast-based Broadcast (cont'd)
• Heterogeneous broadcast for streaming applications
  Ø Frees up PCIe resources
[Figure: the source node's CPU data is sent through the IB switch in multicast steps to the hosts of Node 1..N; an additional IB SL step then delivers the data to each node's GPU]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
Optimized Broadcast Send
• Preparing an intermediate buffer (im_buf):
  – Page-locked (pinned) host buffer
    Ø Fast device-to-host data movement
  – Allocated at the initialization phase
    Ø Low overhead
• Streaming data through the host:
  – Fine-tuned chunked data
  – Asynchronous copy operations
    Ø Three-stage pipeline (see the sketch below)
[Figure: at the source, MPI_Bcast(d_out, ...) proceeds as 1. data preparation (d_out → im_buf), 2. IB gather, 3. IB hardware multicast through the IB switch]
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
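A minimal sketch of this three-stage pipeline, assuming CUDA; CHUNK and send_chunk() (standing in for the IB-gather and hardware-multicast stages) are hypothetical placeholders:

```c
/* Sketch: chunked, double-buffered device-to-host staging through a
 * pinned im_buf, so the copy of chunk c+1 overlaps the network send
 * of chunk c. A simplified illustration, not the library's code. */
#include <cuda_runtime.h>

#define CHUNK (512 * 1024)                 /* hypothetical tuned size */

extern void send_chunk(const void *buf, size_t len);   /* placeholder */

void pipelined_bcast_send(const char *d_out, size_t total)
{
    static char *im_buf[2];
    static cudaStream_t stream[2];
    if (!im_buf[0])                        /* allocate once, at init */
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost((void **)&im_buf[i], CHUNK);   /* pinned */
            cudaStreamCreate(&stream[i]);
        }

    size_t nchunks = (total + CHUNK - 1) / CHUNK;

    /* Stage 1 for chunk 0: start the first device-to-host copy. */
    cudaMemcpyAsync(im_buf[0], d_out,
                    nchunks > 1 ? (size_t)CHUNK : total,
                    cudaMemcpyDeviceToHost, stream[0]);

    for (size_t c = 0; c < nchunks; ++c) {
        size_t next = c + 1;
        if (next < nchunks) {
            /* Stage 1 for chunk c+1 overlaps stages 2-3 for chunk c. */
            size_t off = next * CHUNK;
            size_t len = (off + CHUNK <= total) ? CHUNK : total - off;
            cudaMemcpyAsync(im_buf[next & 1], d_out + off, len,
                            cudaMemcpyDeviceToHost, stream[next & 1]);
        }
        /* Stages 2-3: wait for chunk c's copy, then gather+multicast. */
        cudaStreamSynchronize(stream[c & 1]);
        size_t off = c * CHUNK;
        send_chunk(im_buf[c & 1],
                   (off + CHUNK <= total) ? CHUNK : total - off);
    }
}
```

Double-buffering the pinned buffer is what makes the three stages overlap; a single blocking copy would serialize data movement and network transfer.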
Optimized Broadcast Receive
• Zero-copy broadcast receive (see the sketch below):
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages IB scatter and GDR features
    Ø Low latency
    Ø Frees up PCIe resources for applications
[Figure: at Destinations 1..N, MPI_Bcast(d_in, ...) receives the IB hardware multicast from the switch; the IB scatter (GDR write) places header + data directly into d_in in GPU memory]
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
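The zero-copy receive hinges on registering the user's GPU buffer with the HCA so the scatter lands directly in device memory. A minimal verbs-level sketch, assuming a GPUDirect RDMA-capable stack (e.g., the nvidia-peermem kernel module) so ibv_reg_mr accepts a device pointer; it illustrates the registration only, not the full pre-posting protocol:

```c
/* Sketch: registering a GPU buffer with the HCA for zero-copy receive.
 * Assumes an IB HCA, a CUDA GPU, and GPUDirect RDMA support so that
 * ibv_reg_mr accepts a cudaMalloc'd pointer; error checks abbreviated. */
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no IB device\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* d_in plays the role of the pre-posted user buffer in GPU memory. */
    void *d_in;
    size_t len = 1 << 20;
    cudaMalloc(&d_in, len);

    /* With GDR, the HCA can scatter multicast payloads straight into
     * this registration; no host-memory staging copy is needed. */
    struct ibv_mr *mr = ibv_reg_mr(pd, d_in, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("GPU buffer registered, lkey=0x%x\n", mr->lkey);

    ibv_dereg_mr(mr);
    cudaFree(d_in);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```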
Broadcast on Multi-GPU Systems
• Proposed intra-node topology-aware broadcast:
  – CUDA Inter-Process Communication (IPC); see the sketch below
[Figure: the source multicasts through the IB switch to the CPU of Node 1..N in multicast steps; within each node, data moves to GPU 0..N via cudaMemcpy (device ↔ device)]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
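A minimal sketch of the CUDA IPC mechanism this intra-node broadcast builds on, assuming exactly two MPI ranks on one node with one GPU each (adjust cudaSetDevice otherwise); the buffer size is hypothetical:

```c
/* Sketch: intra-node GPU-to-GPU data movement with CUDA IPC.
 * Rank 0 exports a device buffer; rank 1 maps it and copies
 * device-to-device, with no host staging. Run with 2 ranks/node. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);                 /* assumes one GPU per rank */

    size_t len = 1 << 20;
    if (rank == 0) {
        void *d_src;
        cudaMalloc(&d_src, len);
        cudaIpcMemHandle_t h;
        cudaIpcGetMemHandle(&h, d_src);  /* export GPU 0's buffer */
        MPI_Send(&h, sizeof(h), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);     /* keep d_src alive until read */
        cudaFree(d_src);
    } else {
        cudaIpcMemHandle_t h;
        MPI_Recv(&h, sizeof(h), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *d_mapped, *d_dst;
        cudaIpcOpenMemHandle(&d_mapped, h, cudaIpcMemLazyEnablePeerAccess);
        cudaMalloc(&d_dst, len);
        /* Direct device-to-device copy over PCIe/NVLink. */
        cudaMemcpy(d_dst, d_mapped, len, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_mapped);
        cudaFree(d_dst);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```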
Efficient Reliability Support for IB-MCAST
• When a receiver experiences a timeout (lost MCAST packet), it:
  – Performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets (see the sketch below)
  – Does not interrupt the sender
[Figure: timeline between broadcast sender and receiver (MPI and IB HCA layers); the receiver detects a timeout and fetches the missing packet from the sender's backup buffer]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
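The slide describes the recovery at the IB RMA level; the same idea can be sketched with standard MPI one-sided operations, where a passive-target epoch keeps the sender uninvolved. Names such as backup_buf, PKT_SIZE, and seq are hypothetical:

```c
/* Sketch: timeout-triggered recovery of a lost multicast packet via a
 * one-sided Get from the sender's backup buffer. Passive-target RMA
 * means the sender's process is never interrupted. */
#include <mpi.h>

#define PKT_SIZE 2048             /* hypothetical MCAST packet size */
#define NPKTS    1024

static char backup_buf[NPKTS][PKT_SIZE];  /* sender's backup copies */

/* Window creation, called by all ranks during setup:
 * MPI_Win_create(backup_buf, sizeof backup_buf, 1, MPI_INFO_NULL,
 *                MPI_COMM_WORLD, &win);                             */

void recover_lost_packet(MPI_Win win, int sender, int seq, char *out)
{
    /* Passive-target epoch: no action required at the sender. */
    MPI_Win_lock(MPI_LOCK_SHARED, sender, 0, win);
    MPI_Get(out, PKT_SIZE, MPI_BYTE, sender,
            (MPI_Aint)seq * PKT_SIZE, PKT_SIZE, MPI_BYTE, win);
    MPI_Win_unlock(sender, win);  /* completes the Get locally */
}
```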
Outline
• Introduction
• Proposed Designs in MVAPICH2-GDR
• Performance Evaluation
• Concluding Remarks
Experimental Environments
• Ohio State University (OSU) Micro-Benchmarks (OMB): http://mvapich.cse.ohio-state.edu/benchmarks/
  – osu_bcast: MPI_Bcast latency test
  – osu_bcast_streaming: MPI_Bcast streaming test
• Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  – AlexNet and VGG models with the ImageNet dataset
* D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.
Benchmark Evaluation
• On the RI2 cluster: 16 GPUs, 1 GPU/node
[Figure: MPI_Bcast latency (μs, log scale, lower is better) vs. message size (4 KB to 16 MB) and vs. number of GPU nodes (2 to 16, 2 MB message), comparing MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; annotation: the non-optimized designs hit the GDR read limit for large messages]
• Provides near-constant latency across system sizes
• Reduces latency by up to 65% for large messages
C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
Streaming Benchmark @ CSCS (88 GPUs)
• IB-MCAST + GDR + topology-aware IPC-based schemes:
  – Up to 58% and 79% latency reduction for small and large messages, respectively
[Figure: latency (μs, lower is better) vs. message size for MCAST-GDR-OPT and MCAST-GDR; left panel: 1 B to 16 KB (up to 58% reduction), right panel: 32 KB to 4 MB (up to 79% reduction)]
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
Deep Learning Frameworks
• On the RI2 cluster (16 GPUs, 1 GPU/node): CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification
[Figure: training time (s, lower is better) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; annotated improvements of 15% and 24% (AlexNet) and 6% and 15% (VGG)]
• Reduces training time by up to 24% and 15% for the AlexNet and VGG models, respectively
• Higher improvement is expected for larger system sizes
Concluding Remarks
• High-performance broadcast schemes that leverage GDR and IB-MCAST features for streaming and deep learning applications
  – Optimized streaming design for large-message transfers
• High-performance reliability support for IB-MCAST
Ø These features are included in MVAPICH2-GDR 2.3a
Ø http://mvapich.cse.ohio-state.edu/
Ø http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3a/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," submitted to IEEE TPDS. (Under review)
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
• Join us for more tech talks from the MVAPICH2 team: http://mvapich.cse.ohio-state.edu/conference/677/talks/
Evaluation Parameters
[Figure: bandwidth terms between the IB HCA, CPU, and GPU, with $B_H \gg B_G$; a message of size $M$ is divided into chunks of size $C$, carried in MCAST packets of size $U$]

Notation    Meaning                                                              Unit
$n$         Number of processes                                                  N/A
$m$         Number of broadcast sources                                          N/A
$t_s$       Setup time for sending data                                          sec
$t_o(n)$    Overhead for issuing an IB-MCAST packet                              sec
$M$         Original message size                                                bytes
$C$         Size of a data chunk                                                 bytes
$U$         Maximum Transmission Unit for IB-MCAST (set by the hardware manufacturer)   bytes
$B_H$       Bandwidth of reading host memory                                     bytes/sec
$B_G$       Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)              bytes/sec
$B_{PCIe}$  PCIe bandwidth between host and GPU memory                           bytes/sec
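One relation implicit in these parameters (a restatement for clarity, not on the slide): a message of $M$ bytes staged through the host in chunks of $C$ bytes takes $\lceil M/C \rceil$ pipeline stages, and the multicast layer carries each chunk in $\lceil C/U \rceil$ packets, so a full broadcast issues roughly $\lceil M/U \rceil$ IB-MCAST packets in total.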
Ring-based Broadcast
[Figure: the source GPU forwards data around a ring through Destinations 1..3; legend: GDR read, GDR write, network transfer]
Poor scalability:
• Direct: $(n-1) \times \left( t_s + \frac{M}{B_G} \right)$
• Pipeline: $\left( \frac{M}{C} + n - 2 \right) \times \left( t_s + \frac{C}{B_G} \right)$
• Staging: $\frac{M}{B_{PCIe}} + (n-1) \times \left( t_s + \frac{M}{B_H} \right)$
K-nomial-based Broadcast
[Figure: the source GPU sends data down a k-nomial tree to Destinations 1..3; legend: GDR read, GDR write, network transfer]
Non-optimized scalability (a worked comparison follows below):
• Direct: $\log_k n \times \left( t_s + \frac{M}{B_G} \right)$
• Pipeline: $\frac{M}{C} \times \log_k n \times \left( t_s + \frac{C}{B_G} \right)$
• Staging: $\frac{M}{B_{PCIe}} + \log_k n \times \left( t_s + \frac{M}{B_H} \right)$
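To make the scalability contrast concrete, a small worked comparison with illustrative numbers (not from the slides): ignoring the bandwidth terms and taking $n = 16$ and $k = 2$, the setup-cost factor is $(n-1)\,t_s = 15\,t_s$ for the ring versus $\log_2 16 \cdot t_s = 4\,t_s$ for the k-nomial tree. An IB-MCAST send, by contrast, issues a single packet that the switch replicates in hardware, which is why the MCAST-based designs stay near-constant in the earlier latency-vs-nodes results.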
Overlap Opportunities
[Figure: timeline of broadcasts from Nodes A, B, and C across each node's HCA, CPU, and GPU; stages are cudaMemcpyAsync, IB hardware multicast, cudaStreamSynchronize, and GDR write. Broadcasts overlap both across nodes and within a node]
MCAST-based Broadcast
• NVIDIA GPUDirect [1]:
  – Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
  – and more...
• InfiniBand (IB) hardware multicast (IB-MCAST) [2]:
  – Enables efficient designs of broadcast operations (a verbs-level sketch follows the references below)
    • Host-based [3]
    • GPU-based [4]
[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
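For reference, the hardware-multicast primitive underneath [2] and [4] is an unreliable-datagram (UD) queue pair attached to a multicast group. A minimal verbs-level sketch; the group GID/LID would normally come from the subnet manager and are placeholders here:

```c
/* Sketch: attaching a UD queue pair to an IB multicast group, the
 * primitive beneath IB-MCAST broadcast. GID/LID values are
 * placeholders; a real run obtains them from the subnet manager. */
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) return 1;
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .qp_type = IBV_QPT_UD,          /* MCAST requires UD transport */
        .cap = { .max_send_wr = 32, .max_recv_wr = 32,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    union ibv_gid mgid;
    memset(&mgid, 0, sizeof mgid);       /* placeholder multicast GID */
    uint16_t mlid = 0xc001;              /* placeholder multicast LID */
    if (ibv_attach_mcast(qp, &mgid, mlid))
        perror("ibv_attach_mcast");      /* expected with placeholders */
    else
        printf("QP attached to multicast group\n");

    ibv_detach_mcast(qp, &mgid, mlid);
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```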
Future Work
• Extend the designs to other broadcast-based collective algorithms as well as non-blocking operations (see the sketch below)
  – Allreduce, Allgather, ..., and so on
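A minimal sketch of the non-blocking direction, using the standard MPI_Ibcast to overlap a broadcast with local work; do_compute() is a hypothetical placeholder:

```c
/* Sketch: overlapping a broadcast with computation via MPI_Ibcast,
 * the kind of non-blocking operation the future work targets. */
#include <mpi.h>

extern void do_compute(void);            /* placeholder for local work */

void overlapped_bcast(float *d_buf, int count, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Ibcast(d_buf, count, MPI_FLOAT, 0, comm, &req);
    do_compute();                        /* overlaps with the broadcast */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the broadcast */
}
```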