TRANSCRIPT
Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Ching-Hsiang Chu¹, Xiaoyi Lu¹, Ammar A. Awan¹, Hari Subramoni¹, Jahanzeb Hashmi¹, Bracy Elton², and Dhabaleswar K. (DK) Panda¹
¹ Department of Computer Science and Engineering, The Ohio State University
² Engility Corporation

ICPP 2017 · Network-Based Computing Laboratory
Outline
• Introduction
  – Deep Learning on GPU and InfiniBand (IB) Clusters
  – Multi-Source Broadcast-Type Operations for Deep Learning
• Analysis
• Proposed Design
  – Streaming-Based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Trends in Modern HPC Architecture
• Multi-core/many-core technologies
• High-performance interconnects – InfiniBand (IB), Omni-Path: <1 μsec latency, 100 Gbps bandwidth
• Accelerators/coprocessors are becoming common in high-end systems: high compute density, high performance/watt, >1 Tflop/s DP on a chip
• High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
[Figure: example systems – Tianhe-2, Titan, K Computer, Sunway TaihuLight]
GPU in HPC Systems
• Growth of GPU clusters in the last 3 years – NVIDIA GPUs boost many Top500 and Green500 systems
• "Top 13 systems on the latest Green500 are all equipped with the P100 hardware"*
[Figure: count of NVIDIA Fermi-, Kepler-, and Pascal-based systems in the Top500, June 2014 – June 2017]
*Data collected from http://top500.org
Architectures for Deep Learning (DL)
• Past and current trend:
  – Multi-core CPUs within a node
  – Multi-core CPUs across nodes (IB networks)
  – Multi-core CPUs + single GPU across nodes (IB networks)
  – Multi-core CPUs + multi-GPU within a node
• Near future:
  – Multi-core CPUs + multi-GPU across nodes (IB networks), e.g., NVIDIA DGX-1 systems
High-Performance Deep Learning
• Computation using GPUs
• Communication using MPI
  – Exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communications, e.g., MPI_Bcast (as sketched below)
• Challenges
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification
[Figure: four GPU nodes exchanging gradients all-to-all]
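To make the communication pattern concrete, here is a minimal CUDA-aware MPI sketch of the multi-source exchange (the function name, buffer array, and count are hypothetical, not the library's internals): every rank takes a turn as the MPI_Bcast root, so one iteration performs n broadcasts.

```c
/* Minimal sketch of the multi-source broadcast pattern.
 * Assumptions: a CUDA-aware MPI (e.g., MVAPICH2-GDR), so that
 * d_grads[] may hold GPU (cudaMalloc'd) pointers; names and
 * sizes here are hypothetical. */
#include <mpi.h>

void exchange_gradients(float **d_grads, int count, MPI_Comm comm)
{
    int n;
    MPI_Comm_size(comm, &n);
    /* All-to-all (multi-source) communication built from n
     * broadcasts: rank r is the source of the r-th MPI_Bcast. */
    for (int root = 0; root < n; root++)
        MPI_Bcast(d_grads[root], count, MPI_FLOAT, root, comm);
}
```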
Outline
• Introduction
• Analysis
  – Existing Designs
  – Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work
Evaluation Parameters

Notation  | Meaning                                                                  | Unit
n         | Number of processes                                                      | N/A
m         | Number of broadcast sources                                              | N/A
t_s       | Setup time for sending data                                              | sec
t_o(n)    | Overhead for issuing an IB-MCAST packet                                  | sec
M         | Original message size                                                    | bytes
C         | Size of a data chunk                                                     | bytes
U         | Maximum Transmission Unit for IB-MCAST, provided by hardware manufacturer | bytes
B_H       | Bandwidth of reading host memory                                         | bytes/sec
B_G       | Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)                  | bytes/sec
B_PCIe    | PCIe bandwidth between host and GPU memory                               | bytes/sec

Note: $B_H \gg B_G$; a message of size M is divided into chunks of size C, which are in turn sent as IB-MCAST packets of size U.
Ring-Based Broadcast
[Figure: source GPU forwards data to Destination 1, 2, 3, … in a ring; GDR Read, GDR Write, and network transfers at each hop]
• Direct: $(n-1)\times\left(t_s + \frac{M}{B_G}\right)$
• Pipeline: $\left(\frac{M}{C} + n - 2\right)\times\left(t_s + \frac{C}{B_G}\right)$
• Staging: $\frac{M}{B_{PCIe}} + (n-1)\times\left(t_s + \frac{M}{B_H}\right)$
⇒ Poor scalability: latency grows linearly with the number of processes n.
K-nomial-Based Broadcast
[Figure: source GPU sends to destinations arranged in a k-nomial tree; GDR Read, GDR Write, and network transfers at each level]
• Direct: $\log_k n \times\left(t_s + \frac{M}{B_G}\right)$
• Pipeline: $\frac{M}{C}\times\log_k n \times\left(t_s + \frac{C}{B_G}\right)$
• Staging: $\frac{M}{B_{PCIe}} + \log_k n \times\left(t_s + \frac{M}{B_H}\right)$
⇒ Non-optimized scalability: latency still grows with $\log_k n$.
Hardware Multicast-Based Broadcast*
[Figure: 1. IB Gather + GDR Read at the source; 2. IB hardware multicast through the IB switch; 3. IB Scatter + GDR Write at Destinations 1…N]
• For GPU-resident data, uses
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overheads
  – IB UD packet-size limit
  – GDR read limit
• Cost: $\frac{M}{U}\times\left(t_s + t_o(n) + \frac{U}{B_G}\right)$

*A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
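Worth spelling out about this cost model: the system size n enters only through the packet-issue overhead $t_o(n)$, which the performance-prediction slide later fits as logarithmic, so per-broadcast latency is close to constant in n:

```latex
T_{\mathrm{mcast}} = \frac{M}{U}\left(t_s + t_o(n) + \frac{U}{B_G}\right),
\qquad t_o(n) \approx \tfrac{1}{\alpha}\ln n,\quad 15 \le \alpha \le 20,
```

in contrast to the explicit $(n-1)$ and $\log_k n$ factors of the ring and k-nomial designs. The remaining per-packet bottleneck is the $U/B_G$ term, i.e., the GDR read limit, which the proposed streaming design avoids.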
Problem Statement
• How to leverage IB-MCAST and advanced GPU features such as GDR to design an efficient and scalable broadcast for large messages on GPU clusters?
• How to achieve high overlap and scalability for multi-source broadcast operations?
• How to determine the attainable theoretical and practical performance benefits for deep learning applications?
Outline
• Introduction
• Analysis
• Proposed Design
  – Streaming-Based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Overview of the Proposed Streaming Design
• Optimized broadcast send operation
  – Streams GPU-resident data through host memory
  – Leverages InfiniBand hardware multicast
    Ø Low latency: avoids the GDR read limit
    Ø Overlaps data transfers within and across nodes
• Optimized broadcast receive operation
  – Zero-copy scheme leveraging the GDR feature
    Ø Low latency: avoids unnecessary data transfers
Optimized Broadcast Send: MPI_Bcast(d_out, …)
• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer
    Ø Fast device-host data movement
  – Allocated at initialization phase
    Ø Low overhead
• Streaming data through the host (a sketch follows below)
  – Fine-tuned chunked data
  – Asynchronous copy operations
    Ø Three-stage pipeline
[Figure: source GPU copies d_out into im_buf on the host; 1. Data preparation, 2. IB Gather, 3. IB hardware multicast through the IB switch]
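A minimal sketch of the send-side pipeline, under stated assumptions: im_buf is the page-locked host buffer from above (here sized for two chunks, a double-buffered variant of the three-stage idea), and mcast_post() is a hypothetical stand-in for handing a chunk to IB hardware multicast, assumed to return only once the HCA is done with the buffer.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical stand-in for posting one chunk to IB-MCAST; assumed
 * to return only after the HCA has finished reading buf. */
void mcast_post(const char *buf, size_t len);

/* Stream a GPU-resident message of M bytes through the pinned host
 * buffer im_buf (>= 2*C bytes) in chunks of C bytes, overlapping the
 * device-to-host copy of chunk i with the multicast of chunk i-1. */
void bcast_send_stream(const char *d_out, size_t M, size_t C,
                       char *im_buf, cudaStream_t stream)
{
    cudaEvent_t copied[2];
    cudaEventCreate(&copied[0]);
    cudaEventCreate(&copied[1]);

    size_t nchunks = (M + C - 1) / C;
    for (size_t i = 0; i <= nchunks; i++) {
        if (i < nchunks) {                 /* stage 1: copy chunk i */
            size_t off = i * C;
            size_t len = (M - off < C) ? M - off : C;
            cudaMemcpyAsync(im_buf + (i % 2) * C, d_out + off, len,
                            cudaMemcpyDeviceToHost, stream);
            cudaEventRecord(copied[i % 2], stream);
        }
        if (i > 0) {            /* stages 2-3: multicast chunk i-1 */
            size_t off = (i - 1) * C;
            size_t len = (M - off < C) ? M - off : C;
            cudaEventSynchronize(copied[(i - 1) % 2]);
            mcast_post(im_buf + ((i - 1) % 2) * C, len);
        }
    }
    cudaEventDestroy(copied[0]);
    cudaEventDestroy(copied[1]);
}
```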
Optimized Broadcast Receive: MPI_Bcast(d_in, …)
• Zero-copy broadcast receive (see the usage sketch below)
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages the IB Scatter and GDR features
    Ø Low latency
    Ø Frees up PCIe resources for applications
[Figure: IB hardware multicast through the IB switch; at Destinations 1…N, IB Scatter (GDR Write) lands the payload directly in d_in on the GPU]
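From the application's point of view, the receive side needs no changes: the receiver passes its device buffer straight to MPI_Bcast, and the library scatters the multicast payload into it via GDR write. A minimal usage sketch (count and root are hypothetical):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Receiver-side view of the zero-copy path: with a CUDA-aware MPI,
 * d_in is a plain device buffer; IB Scatter + GDR Write land the
 * broadcast payload in it directly, with no host staging. */
void bcast_recv(int count, int root, MPI_Comm comm)
{
    float *d_in;
    cudaMalloc((void **)&d_in, (size_t)count * sizeof(float));
    MPI_Bcast(d_in, count, MPI_FLOAT, root, comm);
    /* ... d_in is now ready for use by GPU kernels ... */
    cudaFree(d_in);
}
```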
Overlap Opportunities
[Figure: per-node timelines (GPU, CPU, HCA) for broadcasts from Nodes A, B, and C; legend: cudaMemcpyAsync, IB hardware multicast, cudaStreamSynchronize, GDR Write]
• Overlap within a node: chunked cudaMemcpyAsync transfers overlap with IB hardware multicast of earlier chunks
• Overlap across nodes: broadcasts from different source nodes proceed concurrently
• Resulting cost: $\frac{C}{B_{PCIe}} + \frac{M}{U}\times\left(t_s + t_o(n) + \frac{U}{B_H}\right)$
Outline
• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
  – OSU Micro-Benchmarks (OMB)
  – Deep Learning Framework
• Conclusion and Future Work
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,775 organizations in 85 countries
  – More than 420,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
    • 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)
Experimental Environments
• RI2 cluster @ The Ohio State University
  – Two 14-core Intel (Broadwell) Xeon E5-2680 v4 processors
  – 1 NVIDIA K80 GPU per node; used up to 16 GPU nodes
  – One single-port InfiniBand EDR HCA
  – Mellanox SB7790 and SB7800 InfiniBand switches
• Ohio State University (OSU) Micro-Benchmarks (OMB): http://mvapich.cse.ohio-state.edu/benchmarks/
  – osu_bcast – MPI_Bcast latency test
• Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  – AlexNet and VGG models with the ImageNet dataset

*D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.
Evaluation: Benchmark Evaluation
• @ RI2 cluster, 16 GPUs, 1 GPU/node
[Figure: osu_bcast latency (μs, log scale, lower is better) vs. message size (4 KB – 16 MB) and vs. number of GPU nodes (2 – 16, 2 MB message) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; MCAST-GDR hits the GDR read limit, while MCAST-GDR-Opt stays near-constant]
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages
Evaluation: Deep Learning Frameworks
• @ RI2 cluster, 16 GPUs, 1 GPU/node: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification
[Figure: training time (s, lower is better) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt]
• Reduces training time by up to 24% for the AlexNet model and up to 15% for the VGG model
• Higher improvement is expected for larger system sizes
Performance Prediction
• Based on the architecture of the RI2 cluster
[Figure: latency (s, log scale) vs. number of broadcast sources, comparing model-based estimation against experiment (2 – 16 sources) and extrapolating the models further, for K-nomial-based, Ring-based, and MCAST-GDR-Opt designs; model-based estimation is within 10% of the experimental results]
• Model parameters: $M = 2\,\mathrm{MB}$, $C = 512\,\mathrm{KB}$, $U = 4\,\mathrm{KB}$, $B_H \approx 100\,\mathrm{Gbps}$, $B_{PCIe} = 8\,\mathrm{Gbps}$, $t_o(n) \approx \frac{1}{\alpha}\ln n$ with $15 \le \alpha \le 20$
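For reference, the prediction curves can be reproduced by plugging these parameters into the cost expressions reconstructed on the earlier slides. A minimal sketch, with t_s, t_o(n), and B_G left as inputs since the slide fits them per system (k = 2 for the k-nomial case is an assumption):

```c
#include <math.h>

/* Analytical single-broadcast latency models from the earlier
 * slides. Sizes in bytes, bandwidths in bytes/sec, times in sec. */
typedef struct {
    double M, C, U;           /* message, chunk, and MTU sizes */
    double B_H, B_G, B_PCIe;  /* host, GPU (GDR), and PCIe bandwidth */
    double t_s;               /* setup time for sending data */
} model_params;

/* Ring (staging): M/B_PCIe + (n-1)(t_s + M/B_H) */
double ring_staging(const model_params *p, int n)
{
    return p->M / p->B_PCIe + (n - 1) * (p->t_s + p->M / p->B_H);
}

/* K-nomial (staging, k = 2): M/B_PCIe + log2(n)(t_s + M/B_H) */
double knomial_staging(const model_params *p, int n)
{
    return p->M / p->B_PCIe + log2((double)n) * (p->t_s + p->M / p->B_H);
}

/* Proposed streaming design: C/B_PCIe + (M/U)(t_s + t_o(n) + U/B_H),
 * with t_o(n) supplied from the fit t_o(n) ~ (1/alpha) ln n. */
double mcast_gdr_opt(const model_params *p, int n, double t_o_n)
{
    (void)n;  /* n enters only through the measured t_o(n) */
    return p->C / p->B_PCIe
         + (p->M / p->U) * (p->t_s + t_o_n + p->U / p->B_H);
}
```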
Outline
• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work
Conclusion
• Proposed efficient broadcast schemes that leverage the GDR and IB-MCAST features for deep learning applications
  – Optimized streaming design for large-message transfers
• Provided and evaluated analytical models that capture the essential performance behavior of alternative broadcast schemes on GPU clusters
Ø These features are included in the latest release of the MVAPICH2-GDR library
Future Work
• Extend the design to other broadcast-based collective algorithms as well as non-blocking operations
  – Allreduce, Allgather, and so on
• Evaluate the proposed design on upcoming larger-scale GPU clusters
Thank You!
Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. (DK) Panda
{chu.368, lu.932, awan.10, subramoni.1, hashmi.29}@osu.edu, [email protected], [email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.
MCAST-Based Broadcast
• NVIDIA GPUDirect [1]
  – Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
  – and more…
• InfiniBand (IB) hardware multicast (IB-MCAST) [2]
  – Enables efficient designs of broadcast operations
    • Host-based [3]
    • GPU-based [4]

[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.