TRANSCRIPT
Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Ching-Hsiang Chu¹, Xiaoyi Lu¹, Ammar A. Awan¹, Hari Subramoni¹, Jahanzeb Hashmi¹, Bracy Elton², and Dhabaleswar K. (DK) Panda¹
¹ Department of Computer Science and Engineering, The Ohio State University
² Engility Corporation

ICPP 2017 · Network-Based Computing Laboratory
Outline
• Introduction
  – Deep Learning on GPU and InfiniBand (IB) Clusters
  – Multi-Source Broadcast-Type Operations for Deep Learning
• Analysis
• Proposed Design
  – Streaming-Based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Trends in Modern HPC Architecture
• Multi-core/many-core technologies
• High-performance interconnects – InfiniBand (IB), Omni-Path: <1 μsec latency, 100 Gbps bandwidth
• Accelerators/coprocessors are becoming common in high-end systems: high compute density, high performance/watt, >1 Tflop/s DP on a chip
• High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM
[Figure: example systems – Tianhe-2, Titan, K Computer, Sunway TaihuLight]
GPU in HPC Systems
• Growth of GPU clusters in the last 3 years – NVIDIA GPUs boost many Top500 and Green500 systems
• "Top 13 systems on the latest Green500 are all equipped with the P100 hardware"*
[Figure: count of NVIDIA Fermi-, Kepler-, and Pascal-based systems in the Top500, June 2014 – June 2017]
*Data collected from http://top500.org
Architectures for Deep Learning (DL)
• Past and current trend:
  – Multi-core CPUs within a node
  – Multi-core CPUs across nodes (IB networks)
  – Multi-core CPUs + single GPU across nodes (IB networks)
  – Multi-core CPUs + multi-GPU within a node
• Near future:
  – Multi-core CPUs + multi-GPU across nodes (IB networks), e.g., NVIDIA DGX-1 systems
High-Performance Deep Learning
• Computation using GPUs
• Communication using MPI
  – Exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communications, e.g., MPI_Bcast (as sketched below)
• Challenges
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification
[Figure: four GPU nodes exchanging gradients all-to-all]
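To make the communication pattern concrete, here is a minimal CUDA-aware MPI sketch of the multi-source exchange (the function name, buffer array, and count are hypothetical, not the library's internals): every rank takes a turn as the MPI_Bcast root, so one iteration performs n broadcasts.

```c
/* Minimal sketch of the multi-source broadcast pattern.
 * Assumptions: a CUDA-aware MPI (e.g., MVAPICH2-GDR), so that
 * d_grads[] may hold GPU (cudaMalloc'd) pointers; names and
 * sizes here are hypothetical. */
#include <mpi.h>

void exchange_gradients(float **d_grads, int count, MPI_Comm comm)
{
    int n;
    MPI_Comm_size(comm, &n);
    /* All-to-all (multi-source) communication built from n
     * broadcasts: rank r is the source of the r-th MPI_Bcast. */
    for (int root = 0; root < n; root++)
        MPI_Bcast(d_grads[root], count, MPI_FLOAT, root, comm);
}
```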
Outline
• Introduction
• Analysis
  – Existing Designs
  – Problem Statement
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work
Evaluation Parameters

Notation  | Meaning                                                                  | Unit
n         | Number of processes                                                      | N/A
m         | Number of broadcast sources                                              | N/A
t_s       | Setup time for sending data                                              | sec
t_o(n)    | Overhead for issuing an IB-MCAST packet                                  | sec
M         | Original message size                                                    | bytes
C         | Size of a data chunk                                                     | bytes
U         | Maximum Transmission Unit for IB-MCAST, provided by hardware manufacturer | bytes
B_H       | Bandwidth of reading host memory                                         | bytes/sec
B_G       | Bandwidth of reading GPU memory (NVIDIA GPUDirect RDMA)                  | bytes/sec
B_PCIe    | PCIe bandwidth between host and GPU memory                               | bytes/sec

Note: $B_H \gg B_G$; a message of size M is divided into chunks of size C, which are in turn sent as IB-MCAST packets of size U.
Ring-Based Broadcast
[Figure: source GPU forwards data to Destination 1, 2, 3, … in a ring; GDR Read, GDR Write, and network transfers at each hop]
• Direct: $(n-1)\times\left(t_s + \frac{M}{B_G}\right)$
• Pipeline: $\left(\frac{M}{C} + n - 2\right)\times\left(t_s + \frac{C}{B_G}\right)$
• Staging: $\frac{M}{B_{PCIe}} + (n-1)\times\left(t_s + \frac{M}{B_H}\right)$
⇒ Poor scalability: latency grows linearly with the number of processes n.
K-nomial-Based Broadcast
[Figure: source GPU sends to destinations arranged in a k-nomial tree; GDR Read, GDR Write, and network transfers at each level]
• Direct: $\log_k n \times\left(t_s + \frac{M}{B_G}\right)$
• Pipeline: $\frac{M}{C}\times\log_k n \times\left(t_s + \frac{C}{B_G}\right)$
• Staging: $\frac{M}{B_{PCIe}} + \log_k n \times\left(t_s + \frac{M}{B_H}\right)$
⇒ Non-optimized scalability: latency still grows with $\log_k n$.
Hardware Multicast-Based Broadcast*
[Figure: 1. IB Gather + GDR Read at the source; 2. IB hardware multicast through the IB switch; 3. IB Scatter + GDR Write at Destinations 1…N]
• For GPU-resident data, uses
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overheads
  – IB UD packet-size limit
  – GDR read limit
• Cost: $\frac{M}{U}\times\left(t_s + t_o(n) + \frac{U}{B_G}\right)$

*A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.
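Worth spelling out about this cost model: the system size n enters only through the packet-issue overhead $t_o(n)$, which the performance-prediction slide later fits as logarithmic, so per-broadcast latency is close to constant in n:

```latex
T_{\mathrm{mcast}} = \frac{M}{U}\left(t_s + t_o(n) + \frac{U}{B_G}\right),
\qquad t_o(n) \approx \tfrac{1}{\alpha}\ln n,\quad 15 \le \alpha \le 20,
```

in contrast to the explicit $(n-1)$ and $\log_k n$ factors of the ring and k-nomial designs. The remaining per-packet bottleneck is the $U/B_G$ term, i.e., the GDR read limit, which the proposed streaming design avoids.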
Problem Statement
• How to leverage IB-MCAST and advanced GPU features such as GDR to design an efficient and scalable broadcast for large messages on GPU clusters?
• How to achieve high overlap and scalability for multi-source broadcast operations?
• How to determine the attainable theoretical and practical performance benefits for deep learning applications?
Outline
• Introduction
• Analysis
• Proposed Design
  – Streaming-Based Design with IB Multicast and NVIDIA GPUDirect Features
• Performance Evaluation
• Conclusion and Future Work
Overview of the Proposed Streaming Design
• Optimized broadcast send operation
  – Streams GPU-resident data through host memory
  – Leverages InfiniBand hardware multicast
    Ø Low latency: avoids the GDR read limit
    Ø Overlaps data transfers within and across nodes
• Optimized broadcast receive operation
  – Zero-copy scheme leveraging the GDR feature
    Ø Low latency: avoids unnecessary data transfers
Optimized Broadcast Send: MPI_Bcast(d_out, …)
• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer
    Ø Fast device-host data movement
  – Allocated at initialization phase
    Ø Low overhead
• Streaming data through the host (a sketch follows below)
  – Fine-tuned chunked data
  – Asynchronous copy operations
    Ø Three-stage pipeline
[Figure: source GPU copies d_out into im_buf on the host; 1. Data preparation, 2. IB Gather, 3. IB hardware multicast through the IB switch]
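A minimal sketch of the send-side pipeline, under stated assumptions: im_buf is the page-locked host buffer from above (here sized for two chunks, a double-buffered variant of the three-stage idea), and mcast_post() is a hypothetical stand-in for handing a chunk to IB hardware multicast, assumed to return only once the HCA is done with the buffer.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical stand-in for posting one chunk to IB-MCAST; assumed
 * to return only after the HCA has finished reading buf. */
void mcast_post(const char *buf, size_t len);

/* Stream a GPU-resident message of M bytes through the pinned host
 * buffer im_buf (>= 2*C bytes) in chunks of C bytes, overlapping the
 * device-to-host copy of chunk i with the multicast of chunk i-1. */
void bcast_send_stream(const char *d_out, size_t M, size_t C,
                       char *im_buf, cudaStream_t stream)
{
    cudaEvent_t copied[2];
    cudaEventCreate(&copied[0]);
    cudaEventCreate(&copied[1]);

    size_t nchunks = (M + C - 1) / C;
    for (size_t i = 0; i <= nchunks; i++) {
        if (i < nchunks) {                 /* stage 1: copy chunk i */
            size_t off = i * C;
            size_t len = (M - off < C) ? M - off : C;
            cudaMemcpyAsync(im_buf + (i % 2) * C, d_out + off, len,
                            cudaMemcpyDeviceToHost, stream);
            cudaEventRecord(copied[i % 2], stream);
        }
        if (i > 0) {            /* stages 2-3: multicast chunk i-1 */
            size_t off = (i - 1) * C;
            size_t len = (M - off < C) ? M - off : C;
            cudaEventSynchronize(copied[(i - 1) % 2]);
            mcast_post(im_buf + ((i - 1) % 2) * C, len);
        }
    }
    cudaEventDestroy(copied[0]);
    cudaEventDestroy(copied[1]);
}
```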
Optimized Broadcast Receive: MPI_Bcast(d_in, …)
• Zero-copy broadcast receive (see the usage sketch below)
  – Pre-posted user buffer (d_in)
  – Avoids additional data movement
  – Leverages the IB Scatter and GDR features
    Ø Low latency
    Ø Frees up PCIe resources for applications
[Figure: IB hardware multicast through the IB switch; at Destinations 1…N, IB Scatter (GDR Write) lands the payload directly in d_in on the GPU]
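From the application's point of view, the receive side needs no changes: the receiver passes its device buffer straight to MPI_Bcast, and the library scatters the multicast payload into it via GDR write. A minimal usage sketch (count and root are hypothetical):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Receiver-side view of the zero-copy path: with a CUDA-aware MPI,
 * d_in is a plain device buffer; IB Scatter + GDR Write land the
 * broadcast payload in it directly, with no host staging. */
void bcast_recv(int count, int root, MPI_Comm comm)
{
    float *d_in;
    cudaMalloc((void **)&d_in, (size_t)count * sizeof(float));
    MPI_Bcast(d_in, count, MPI_FLOAT, root, comm);
    /* ... d_in is now ready for use by GPU kernels ... */
    cudaFree(d_in);
}
```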
Overlap Opportunities
[Figure: per-node timelines (GPU, CPU, HCA) for broadcasts from Nodes A, B, and C; legend: cudaMemcpyAsync, IB hardware multicast, cudaStreamSynchronize, GDR Write]
• Overlap within a node: chunked cudaMemcpyAsync transfers overlap with IB hardware multicast of earlier chunks
• Overlap across nodes: broadcasts from different source nodes proceed concurrently
• Resulting cost: $\frac{C}{B_{PCIe}} + \frac{M}{U}\times\left(t_s + t_o(n) + \frac{U}{B_H}\right)$
Outline
• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
  – OSU Micro-Benchmarks (OMB)
  – Deep Learning Framework
• Conclusion and Future Work
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,775 organizations in 85 countries
  – More than 420,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
    • 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)
Experimental Environments
• RI2 cluster @ The Ohio State University
  – Two 14-core Intel (Broadwell) Xeon E5-2680 v4 processors
  – 1 NVIDIA K80 GPU per node; used up to 16 GPU nodes
  – One single-port InfiniBand EDR HCA
  – Mellanox SB7790 and SB7800 InfiniBand switches
• Ohio State University (OSU) Micro-Benchmarks (OMB): http://mvapich.cse.ohio-state.edu/benchmarks/
  – osu_bcast – MPI_Bcast latency test
• Deep learning framework: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK)*
  – AlexNet and VGG models with the ImageNet dataset

*D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," IEEE CloudCom, Luxembourg City, 2016, pp. 144-151.
Evaluation: Benchmark Evaluation
• @ RI2 cluster, 16 GPUs, 1 GPU/node
[Figure: osu_bcast latency (μs, log scale, lower is better) vs. message size (4 KB – 16 MB) and vs. number of GPU nodes (2 – 16, 2 MB message) for MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, and MCAST-GDR-Opt; MCAST-GDR hits the GDR read limit, while MCAST-GDR-Opt stays near-constant]
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages
Evaluation: Deep Learning Frameworks
• @ RI2 cluster, 16 GPUs, 1 GPU/node: CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without modification
[Figure: training time (s, lower is better) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt]
• Reduces training time by up to 24% for the AlexNet model and up to 15% for the VGG model
• Higher improvement is expected for larger system sizes
Performance Prediction
• Based on the architecture of the RI2 cluster
[Figure: latency (s, log scale) vs. number of broadcast sources, comparing model-based estimation against experiment (2 – 16 sources) and extrapolating the models further, for K-nomial-based, Ring-based, and MCAST-GDR-Opt designs; model-based estimation is within 10% of the experimental results]
• Model parameters: $M = 2\,\mathrm{MB}$, $C = 512\,\mathrm{KB}$, $U = 4\,\mathrm{KB}$, $B_H \approx 100\,\mathrm{Gbps}$, $B_{PCIe} = 8\,\mathrm{Gbps}$, $t_o(n) \approx \frac{1}{\alpha}\ln n$ with $15 \le \alpha \le 20$
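For reference, the prediction curves can be reproduced by plugging these parameters into the cost expressions reconstructed on the earlier slides. A minimal sketch, with t_s, t_o(n), and B_G left as inputs since the slide fits them per system (k = 2 for the k-nomial case is an assumption):

```c
#include <math.h>

/* Analytical single-broadcast latency models from the earlier
 * slides. Sizes in bytes, bandwidths in bytes/sec, times in sec. */
typedef struct {
    double M, C, U;           /* message, chunk, and MTU sizes */
    double B_H, B_G, B_PCIe;  /* host, GPU (GDR), and PCIe bandwidth */
    double t_s;               /* setup time for sending data */
} model_params;

/* Ring (staging): M/B_PCIe + (n-1)(t_s + M/B_H) */
double ring_staging(const model_params *p, int n)
{
    return p->M / p->B_PCIe + (n - 1) * (p->t_s + p->M / p->B_H);
}

/* K-nomial (staging, k = 2): M/B_PCIe + log2(n)(t_s + M/B_H) */
double knomial_staging(const model_params *p, int n)
{
    return p->M / p->B_PCIe + log2((double)n) * (p->t_s + p->M / p->B_H);
}

/* Proposed streaming design: C/B_PCIe + (M/U)(t_s + t_o(n) + U/B_H),
 * with t_o(n) supplied from the fit t_o(n) ~ (1/alpha) ln n. */
double mcast_gdr_opt(const model_params *p, int n, double t_o_n)
{
    (void)n;  /* n enters only through the measured t_o(n) */
    return p->C / p->B_PCIe
         + (p->M / p->U) * (p->t_s + t_o_n + p->U / p->B_H);
}
```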
Outline
• Introduction
• Analysis
• Proposed Design
• Performance Evaluation
• Conclusion and Future Work
Conclusion
• Proposed efficient broadcast schemes that leverage the GDR and IB-MCAST features for deep learning applications
  – Optimized streaming design for large-message transfers
• Provided and evaluated analytical models that capture the essential performance behavior of alternative broadcast schemes on GPU clusters
Ø These features are included in the latest release of the MVAPICH2-GDR library
Future Work
• Extend the design to other broadcast-based collective algorithms as well as non-blocking operations
  – Allreduce, Allgather, and so on
• Evaluate the proposed design on upcoming larger-scale GPU clusters
Thank You!
Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. (DK) Panda
{chu.368, lu.932, awan.10, subramoni.1, hashmi.29}@osu.edu, [email protected], [email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/

This project is supported under the United States Department of Defense (DOD) High Performance Computing Modernization Program (HPCMP) User Productivity Enhancement and Technology Transfer (PETTT) activity (Contract No. GS04T09DBC0017, Engility Corporation). The opinions expressed herein are those of the authors and do not necessarily reflect the views of the DOD or the employer of the author.
MCAST-Based Broadcast
• NVIDIA GPUDirect [1]
  – Remote direct memory access (RDMA) transfers between GPUs and other PCIe devices ⇒ GDR
  – and more…
• InfiniBand (IB) hardware multicast (IB-MCAST) [2]
  – Enables efficient designs of broadcast operations
    • Host-based [3]
    • GPU-based [4]

[1] https://developer.nvidia.com/gpudirect
[2] G. F. Pfister, "An Introduction to the InfiniBand Architecture," High Performance Mass Storage and Parallel I/O, Chapter 42, pp. 617-632, Jun. 2001.
[3] J. Liu, A. R. Mamidala, and D. K. Panda, "Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support," in IPDPS 2004, p. 10, April 2004.
[4] A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec. 2014.