High Performance Migration Framework for MPI Applications
TRANSCRIPT
High Performance Migration Framework for MPI Applications on HPC Cloud
Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda
{zhanjie, luxi, panda}@cse.ohio-state.edu
Computer Science & Engineering Department,
The Ohio State University
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Cloud Computing and Virtualization
• Cloud Computing focuses on maximizing the effectiveness of the shared resources
• Virtualization is the key technology for resource sharing in the Cloud
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)
Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-core/Many-core technologies
• Accelerators (GPUs/Co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
[Figure: multi-/many-core processors, accelerators (GPUs/co-processors), large memory nodes, and high-performance InfiniBand interconnects with SR-IOV (<1 usec latency, 200 Gbps bandwidth) powering clouds such as SDSC Comet and TACC Stampede]
Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through (see the configuration sketch below)
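To make the PF/VF mechanics concrete, here is a minimal C sketch, assuming a hypothetical PCI address for the IB HCA: writing to the standard Linux sriov_numvfs sysfs attribute asks the PF driver to instantiate Virtual Functions. This is an illustration, not part of the presented framework.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical BDF; substitute the HCA's real PCI address. */
    const char *attr = "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";

    FILE *f = fopen(attr, "w");       /* requires root */
    if (!f) { perror("fopen"); return 1; }

    /* Create 4 VFs; each VF can then be dedicated to one VM via
     * PCI pass-through. */
    fprintf(f, "4\n");
    if (fclose(f) != 0) { perror("fclose"); return 1; }
    return 0;
}
```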
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem brings only small overheads compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data on shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system calls layer, and its interfaces are visible to user space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped to the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
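As an illustration of the host-side component described above, the following minimal C sketch allocates and maps a POSIX shared memory region the way the IVShmem host path does. The region name and size are illustrative; QEMU's actual ivshmem device mmaps such a region and presents it to each guest as a PCI BAR. Compile with -lrt on older glibc.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1 << 20;                     /* 1 MB region */

    /* Create/open the region; it appears as /dev/shm/ivshmem_demo. */
    int fd = shm_open("/ivshmem_demo", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Map it into this process, as QEMU maps it into its own address
     * space before remapping the pages into the guest. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Anything written here is visible to every co-resident VM that
     * maps the same region. */
    snprintf(p, size, "hello from the host");

    munmap(p, size);
    close(fd);
    shm_unlink("/ivshmem_demo");                     /* demo cleanup */
    return 0;
}
```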
[Fig. 2. Overview of Inter-VM Shmem and SR-IOV Communication Mechanisms: (a) Inter-VM Shmem Mechanism [15], with each guest's QEMU process mmap-ing the host's /dev/shm/<name> region and exposing it to the guest as a PCI device, coordinated through a shared memory fd and eventfds; (b) SR-IOV Mechanism [22], with per-guest VF drivers, the hypervisor's PF driver, and the I/O MMU in front of the SR-IOV hardware on PCI Express]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies the native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.
Building HPC Cloud with SR-IOV and InfiniBand
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
– InfiniBand
– 10/40/100 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
– Low latency (few microseconds)
– High bandwidth (200 Gb/s with HDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems (see the verbs sketch below)
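For reference, here is a small, self-contained C example using the OpenFabrics verbs API (libibverbs, link with -libverbs) that enumerates HCAs and queries port 1. It illustrates the software stack mentioned above rather than anything specific to MVAPICH2.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        /* Query port 1 and report its state and active MTU. */
        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 state=%d, active MTU enum=%d\n",
                   ibv_get_device_name(devs[i]),
                   port.state, port.active_mtu);

        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```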
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt supports VM, Docker, Singularity, etc.), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 432,000 (>0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June '17 ranking)
• 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 15th-ranked 241,108-core cluster (Pleiades) at NASA
• 20th-ranked 522,080-core cluster (Stampede) at TACC
• 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
– Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Application Performance on VM with MVAPICH2
• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29% and 20% improvement for INVERSE, RAND, SINE and SPEC
[Bar charts: execution time (s) of SR-IOV vs. Proposed for NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512*512*512 (SPEC, INVERSE, RAND, SINE); both on 32 VMs (8 VMs per node)]
J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
Application Performance on Docker with MVAPICH2
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% of execution time reduction for NAS and Graph500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph500
[Bar charts: execution time of Container-Def, Container-Opt, and Native for NAS Class D (MG.D, FT.D, EP.D, LU.D, CG.D) and Graph500 BFS (scale 20, edgefactor 16; 1Cont*16P, 2Conts*8P, 4Conts*4P)]
J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand, ICPP, 2016
Application Performance on Singularity with MVAPICH2
• 512 processes across 32 nodes
• Less than 16% and 11% overhead for NPB and Graph500, respectively
[Bar charts: BFS execution time (ms) of Singularity vs. Native for Graph500 (scale/edgefactor 22,16 through 26,20), and execution time (s) for NPB Class D (CG, EP, FT, IS, LU, MG)]
J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC, 2017
Application Performance on Nested Virtualization Env. with MVAPICH2
• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces up to 16% (28,16) and 10% (LU) of execution time for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) benefit
[Bar charts: execution time of Default, 1Layer, and 2Layer-Enhanced-Hybrid for NAS Class D (IS, MG, EP, FT, CG, LU) and Graph500 BFS (scale/edgefactor 22,20 through 28,16)]
J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
Execute Live Migration with SR-IOV Device
Overview of Existing Migration Solutions for SR-IOV

| Solution | Platform | NIC | No Guest OS Modification | Device Driver Independent | Hypervisor Independent |
| Zhai, et al. (Linux bonding driver) | Ethernet | N/A | Yes | No | Yes (Xen) |
| Kadav, et al. (shadow driver) | Ethernet | Intel Pro/1000 gigabit NIC | No | Yes | No (Xen) |
| Pan, et al. (CompSC) | Ethernet | Intel 82576, Intel 82599 | Yes | No | No (Xen) |
| Guay, et al. | InfiniBand | Mellanox ConnectX-2 QDR HCA | Yes | No | Yes (Oracle VM Server (OVS) 3.0) |
| Han | Ethernet | Huawei smart NIC | Yes | No | No (QEMU+KVM) |
| Xu, et al. (SRVM) | Ethernet | Intel 82599 | Yes | Yes | No (VMware ESXi) |

Can we have a hypervisor-independent and device driver-independent solution for InfiniBand-based HPC Clouds with SR-IOV?
Problem Statement
• Can we propose a realistic approach to solve this contradiction with some tradeoffs?
• Most HPC applications are based on MPI runtimes nowadays; can we propose a VM migration approach for MPI applications over SR-IOV enabled InfiniBand clusters?
• Can such a migration approach for MPI applications avoid modifying hypervisors and InfiniBand drivers?
• Can the proposed design migrate VMs with SR-IOV IB devices in a high-performance and scalable manner?
• Can the proposed design minimize the overhead for running MPI applications inside the VMs during migration?
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Proposed High Performance SR-IOV enabled VM Migration Framework
[Architecture diagram: a Controller on the login node drives a Migration Trigger, Ready-to-Migrate Detector, Network Suspend Trigger, Network Reactive Notifier, and Migration Done Detector; each host runs guest VMs with MPI, attached SR-IOV VFs, IVShmem, an IB adapter, and an Ethernet adapter on top of the hypervisor]
• Two challenges need to be handled:
– Detachment/re-attachment of the virtualized IB device
– Maintaining the IB connection
• Multiple parallel libraries to coordinate VMs during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, migration status)
• MPI runtime handles the IB connection suspending and reactivating
• Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance (a libvirt-based sketch of the controller flow follows the citation below)
J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017
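The following is a hedged sketch, using the public libvirt C API (link with -lvirt), of what a controller's detach / migrate / re-attach flow for one VM could look like. The domain name, target URI, and VF PCI address are hypothetical, and this is not the framework's actual implementation, which additionally coordinates channel suspension/reactivation with the MPI runtime (e.g., via IVShmem).

```c
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Hypothetical device XML describing the passed-through VF. */
static const char *vf_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source><address domain='0x0000' bus='0x03'"
    "           slot='0x00' function='0x1'/></source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "connect failed\n"); return 1; }

    virDomainPtr dom = virDomainLookupByName(conn, "vm1");
    if (!dom) { fprintf(stderr, "no such domain\n"); virConnectClose(conn); return 1; }

    /* Ready-to-migrate: detach the SR-IOV VF, leaving the VM on its
     * migration-safe (e.g., Ethernet/IPoIB) path. */
    if (virDomainDetachDevice(dom, vf_xml) < 0)
        fprintf(stderr, "VF detach failed\n");

    /* Migration: live-migrate the now-VF-free VM to the target host. */
    if (virDomainMigrateToURI(dom, "qemu+ssh://target-host/system",
                              VIR_MIGRATE_LIVE, NULL, 0) < 0)
        fprintf(stderr, "migration failed\n");

    /* Post-migration: a VF would be re-attached on the target host
     * (virDomainAttachDevice) and the MPI runtime notified to
     * reactivate its IB channels. */
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```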
Sequence Diagram of VM Migration
[Sequence diagram spanning the Login Node (Controller) and the Computer Node (migration source and target), with the Suspend Trigger and RTM Detector on the node side:
Phase 1 (Pre-migration): Controller issues Start Migration; the Suspend Trigger asks the MPI processes to suspend their channels
Phase 2 (Ready-to-migrate): detach VF and detach IVSHMEM from the source VM
Phases 3.a-3.c (Migration): live VM migration from source to target
Phase 4: attach VF and attach IVSHMEM on the target
Phase 5 (Post-migration): reactivate channels; Migration Done is reported to the Controller]
Proposed Design of MPI Runtime
[Timeline diagrams comparing four cases for process P0, with S = Suspend Channel, R = Reactivate Channel, Comp = Computation, plus markers for MPI calls, control messages, lock/unlock of communication, migration signal detection, and the down-time for VM live migration:
– No-migration: P0 alternates computation and MPI calls with no interruption
– Progress Engine (PE) based: the migration signal is only detected inside an MPI call, so pre-migration, ready-to-migrate, migration, and post-migration all serialize with the application
– Migration-thread (MT) based, worst scenario: the thread detects the signal while P0 sits in an MPI call, so little overlap is possible
– Migration-thread (MT) based, typical scenario: the thread suspends and reactivates channels while P0 is computing, overlapping migration with computation]
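To make the PE/MT contrast concrete, here is a simplified C sketch (assumed structure, not MVAPICH2 source) of where each design detects the migration signal. PE polls the flag from inside the progress engine, so detection waits for an MPI call; MT uses a dedicated thread, which can overlap the whole suspend/migrate/reactivate sequence with computation but must lock channel state shared with the application's MPI calls.

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_int migration_requested;   /* set by the controller, e.g., via IVShmem */
pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;

static void suspend_channels(void)        { /* tear down IB connections (stub) */ }
static void wait_for_migration_done(void) { /* VM migrates in this window (stub) */ }
static void reactivate_channels(void)     { /* rebuild IB connections (stub) */ }

/* PE-based: called from inside the MPI progress loop, so the signal is
 * only noticed when the application enters an MPI call. */
void progress_engine_poll(void)
{
    if (atomic_load(&migration_requested)) {
        suspend_channels();               /* Ready-to-migrate */
        wait_for_migration_done();        /* Migration */
        reactivate_channels();            /* Post-migration */
        atomic_store(&migration_requested, 0);
    }
}

/* MT-based: a dedicated thread runs the same sequence concurrently
 * with computation; the lock protecting shared channel state is the
 * lock/unlock cost visible in the benchmarks. */
void *migration_thread(void *arg)
{
    (void)arg;
    for (;;) {
        while (!atomic_load(&migration_requested))
            ;                             /* or block on an eventfd */
        pthread_mutex_lock(&channel_lock);
        suspend_channels();
        wait_for_migration_done();
        reactivate_channels();
        pthread_mutex_unlock(&channel_lock);
        atomic_store(&migration_requested, 0);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, migration_thread, NULL);
    progress_engine_poll();               /* PE path: driven by MPI calls */
    return 0;                             /* process exit ends the demo thread */
}
```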
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Testbed

| Cluster | Chameleon Cloud | Nowlab |
| CPU | Intel Xeon E5-2670 v3, 24-core, 2.3 GHz | Intel Xeon E5-2670, dual 8-core, 2.6 GHz |
| RAM | 128 GB | 32 GB |
| Interconnect | Mellanox ConnectX-3 HCA (FDR 56 Gbps), MLNX OFED LINUX-3.0-1.0.1 as driver | Mellanox ConnectX-3 HCA (FDR 56 Gbps), MLNX OFED LINUX-3.2-2.0.0 |
| OS | CentOS Linux release 7.1.1503 (Core) on both clusters |
| Compiler | GCC 4.8.3 on both clusters |
| MPI | MVAPICH2 and OSU micro-benchmarks v5.3 on both clusters |
Point-to-Point and Collective Performance
• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock
• No benefit from MT because NO computation is involved
[Latency charts: Pt2Pt Latency and Bcast (4 VMs * 2 Procs/VM); latency (us) vs. message size (1 byte to 1 MB), comparing PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
Overlapping Evaluation
• Fix the communication time and increase the computation time
• With 10% computation, partial migration time can be overlapped with computation in MT-typical
• As the computation percentage increases, there is more chance for overlapping
[Line chart: time (s) vs. computation percentage (0-70%) for PE, MT-worst, MT-typical, and NM]
Application Performance
• 8 VMs in total, and 1 VM carries out migration during application running
• Compared with NM, MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Bar charts: execution time (s) of PE, MT-worst, MT-typical, and NM for NAS Class B (LU.B, EP.B, IS.B, MG.B, CG.B, FT.B) and Graph500 (scale/edgefactor 20,10; 20,16; 20,20; 22,10)]
Available Appliances on Chameleon Cloud*

| Appliance | Description |
| CentOS 7 KVM SR-IOV | Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/ |
| MPI bare-metal cluster complex appliance (based on Heat) | This appliance deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/ |
| MPI + SR-IOV KVM cluster (based on Heat) | This appliance deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation and configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/ |
| CentOS 7 SR-IOV RDMA-Hadoop | The CentOS 7 SR-IOV RDMA-Hadoop appliance is built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/ |

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
– High-performance SR-IOV + InfiniBand
– High-performance MVAPICH2 library over bare-metal InfiniBand clusters
– High-performance MVAPICH2 library with virtualization support over SR-IOV enabled KVM clusters
– High-performance Hadoop with RDMA-based enhancements support
[*] Only includes appliances contributed by OSU NowLab
MVAPICH2-Virt Heat-based Complex Appliance
MPI complex appliances based on MVAPICH2 on Chameleon:
1. Load VM Config
2. Allocate Ports
3. Allocate Floating IPs
4. Generate SSH Keypair
5. Launch VM
6. Attach SR-IOV Device
7. Hotplug IVShmem Device
8. Download/Install MVAPICH2-Virt
9. Populate VMs/IPs
10. Associate Floating IPs
Conclusion & Future Work
• Propose a high-performance VM migration framework for MPI applications on SR-IOV enabled InfiniBand clusters
• Hypervisor- and InfiniBand driver-independent
• Design Progress Engine (PE) based and Migration Thread (MT) based MPI runtime designs
• Design a high-performance and scalable controller which works seamlessly with our proposed designs in the MPI runtime
• Evaluate the proposed framework with MPI-level micro-benchmarks and real-world HPC applications
• Could completely hide the overhead of VM migration through computation and migration overlapping
• Future Work
– Evaluate our proposed framework at larger scales and on different hypervisors, such as Xen; the solution will be available in an upcoming release
Thank You!
{zhanjie, luxi, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/