High Performance Migration Framework for MPI Applications
TRANSCRIPT
High Performance Migration Framework for MPI Applications on HPC Cloud
Jie Zhang, Xiaoyi Lu and Dhabaleswar K. Panda
{zhanjie, luxi, panda}@cse.ohio-state.edu
Computer Science & Engineering Department,
The Ohio State University
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Cloud Computing and Virtualization
• Cloud Computing focuses on maximizing the effectiveness of the shared resources
• Virtualization is the key technology for resource sharing in the Cloud
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)
Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-core/Many-core technologies
• Accelerators (GPUs/Co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
[Figure: multi-/many-core processors, accelerators (GPUs/co-processors), large memory nodes, and high-performance InfiniBand interconnects with SR-IOV (<1 usec latency, 200 Gbps bandwidth) powering clouds such as SDSC Comet and TACC Stampede]
Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI pass-through (see the configuration sketch below)
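To make the PF/VF mechanics concrete, here is a minimal C sketch, assuming a hypothetical PCI address for the IB HCA: writing to the standard Linux sriov_numvfs sysfs attribute asks the PF driver to instantiate Virtual Functions. This is an illustration, not part of the presented framework.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical BDF; substitute the HCA's real PCI address. */
    const char *attr = "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs";

    FILE *f = fopen(attr, "w");       /* requires root */
    if (!f) { perror("fopen"); return 1; }

    /* Create 4 VFs; each VF can then be dedicated to one VM via
     * PCI pass-through. */
    fprintf(f, "4\n");
    if (fclose(f) != 0) { perror("fclose"); return 1; }
    return 0;
}
```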
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. The application execution time can be decreased by up to 96%, compared to SR-IOV. The results further show that IVShmem brings only small overheads compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and comparison with native mode. We discuss the related work in Section 5, and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data on shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system calls layer, and its interfaces are visible to user space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped to the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in guest VMs. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
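As an illustration of the host-side component described above, the following minimal C sketch allocates and maps a POSIX shared memory region the way the IVShmem host path does. The region name and size are illustrative; QEMU's actual ivshmem device mmaps such a region and presents it to each guest as a PCI BAR. Compile with -lrt on older glibc.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1 << 20;                     /* 1 MB region */

    /* Create/open the region; it appears as /dev/shm/ivshmem_demo. */
    int fd = shm_open("/ivshmem_demo", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

    /* Map it into this process, as QEMU maps it into its own address
     * space before remapping the pages into the guest. */
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Anything written here is visible to every co-resident VM that
     * maps the same region. */
    snprintf(p, size, "hello from the host");

    munmap(p, size);
    close(fd);
    shm_unlink("/ivshmem_demo");                     /* demo cleanup */
    return 0;
}
```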
[Fig. 2. Overview of Inter-VM Shmem and SR-IOV Communication Mechanisms: (a) Inter-VM Shmem Mechanism [15], with each guest's QEMU process mmap-ing the host's /dev/shm/<name> region and exposing it to the guest as a PCI device, coordinated through a shared memory fd and eventfds; (b) SR-IOV Mechanism [22], with per-guest VF drivers, the hypervisor's PF driver, and the I/O MMU in front of the SR-IOV hardware on PCI Express]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies the native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI pass-through, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.
Building HPC Cloud with SR-IOV and InfiniBand
• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
– InfiniBand
– 10/40/100 Gigabit Ethernet/iWARP
– RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
– Low latency (few microseconds)
– High bandwidth (200 Gb/s with HDR InfiniBand)
– Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems (see the verbs sketch below)
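For reference, here is a small, self-contained C example using the OpenFabrics verbs API (libibverbs, link with -libverbs) that enumerates HCAs and queries port 1. It illustrates the software stack mentioned above rather than anything specific to MVAPICH2.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        /* Query port 1 and report its state and active MTU. */
        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 state=%d, active MTU enum=%d\n",
                   ibv_get_device_name(devs[i]),
                   port.state, port.active_mtu);

        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```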
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for virtualization (MVAPICH2-Virt supports VM, Docker, Singularity, etc.), available since 2015
– Support for energy-awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
– Used by more than 2,825 organizations in 85 countries
– More than 432,000 (>0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (June '17 ranking)
• 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 15th-ranked 241,108-core cluster (Pleiades) at NASA
• 20th-ranked 522,080-core cluster (Stampede) at TACC
• 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
– Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
Application Performance on VM with MVAPICH2
• Proposed design delivers up to 43% (IS) improvement for NAS
• Proposed design brings 29%, 33%, 29% and 20% improvement for INVERSE, RAND, SINE and SPEC
[Bar charts: execution time (s) of SR-IOV vs. Proposed for NAS Class B (FT, LU, CG, MG, IS) and P3DFFT 512*512*512 (SPEC, INVERSE, RAND, SINE); both on 32 VMs (8 VMs per node)]
J. Zhang, X. Lu, J. Jose and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, HiPC, 2014
Application Performance on Docker with MVAPICH2
• 64 containers across 16 nodes, pinning 4 cores per container
• Compared to Container-Def, up to 11% and 73% of execution time reduction for NAS and Graph500
• Compared to Native, less than 9% and 5% overhead for NAS and Graph500
[Bar charts: execution time of Container-Def, Container-Opt, and Native for NAS Class D (MG.D, FT.D, EP.D, LU.D, CG.D) and Graph500 BFS (scale 20, edgefactor 16; 1Cont*16P, 2Conts*8P, 4Conts*4P)]
J. Zhang, X. Lu, D. K. Panda. High Performance MPI Library for Container-based HPC Cloud on InfiniBand, ICPP, 2016
Application Performance on Singularity with MVAPICH2
• 512 processes across 32 nodes
• Less than 16% and 11% overhead for NPB and Graph500, respectively
[Bar charts: BFS execution time (ms) of Singularity vs. Native for Graph500 (scale/edgefactor 22,16 through 26,20), and execution time (s) for NPB Class D (CG, EP, FT, IS, LU, MG)]
J. Zhang, X. Lu, D. K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? UCC, 2017
Application Performance on Nested Virtualization Env. with MVAPICH2
• 256 processes across 64 containers on 16 nodes
• Compared with Default, the enhanced-hybrid design reduces up to 16% (28,16) and 10% (LU) of execution time for Graph500 and NAS, respectively
• Compared with the 1Layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) benefit
[Bar charts: execution time of Default, 1Layer, and 2Layer-Enhanced-Hybrid for NAS Class D (IS, MG, EP, FT, CG, LU) and Graph500 BFS (scale/edgefactor 22,20 through 28,16)]
J. Zhang, X. Lu and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, VEE, 2017
Execute Live Migration with SR-IOV Device
Overview of Existing Migration Solutions for SR-IOV

| Solution | Platform | NIC | No Guest OS Modification | Device Driver Independent | Hypervisor Independent |
| Zhai, et al. (Linux bonding driver) | Ethernet | N/A | Yes | No | Yes (Xen) |
| Kadav, et al. (shadow driver) | Ethernet | Intel Pro/1000 gigabit NIC | No | Yes | No (Xen) |
| Pan, et al. (CompSC) | Ethernet | Intel 82576, Intel 82599 | Yes | No | No (Xen) |
| Guay, et al. | InfiniBand | Mellanox ConnectX-2 QDR HCA | Yes | No | Yes (Oracle VM Server (OVS) 3.0) |
| Han | Ethernet | Huawei smart NIC | Yes | No | No (QEMU+KVM) |
| Xu, et al. (SRVM) | Ethernet | Intel 82599 | Yes | Yes | No (VMware ESXi) |

Can we have a hypervisor-independent and device driver-independent solution for InfiniBand-based HPC Clouds with SR-IOV?
Problem Statement
• Can we propose a realistic approach to solve this contradiction with some tradeoffs?
• Most HPC applications are based on MPI runtimes nowadays; can we propose a VM migration approach for MPI applications over SR-IOV enabled InfiniBand clusters?
• Can such a migration approach for MPI applications avoid modifying hypervisors and InfiniBand drivers?
• Can the proposed design migrate VMs with SR-IOV IB devices in a high-performance and scalable manner?
• Can the proposed design minimize the overhead for running MPI applications inside the VMs during migration?
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Proposed High Performance SR-IOV enabled VM Migration Framework
[Architecture diagram: a Controller on the login node drives a Migration Trigger, Ready-to-Migrate Detector, Network Suspend Trigger, Network Reactive Notifier, and Migration Done Detector; each host runs guest VMs with MPI, attached SR-IOV VFs, IVShmem, an IB adapter, and an Ethernet adapter on top of the hypervisor]
• Two challenges need to be handled:
– Detachment/re-attachment of the virtualized IB device
– Maintaining the IB connection
• Multiple parallel libraries to coordinate VMs during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, migration status)
• MPI runtime handles the IB connection suspending and reactivating
• Propose Progress Engine (PE) based and Migration Thread (MT) based designs to optimize VM migration and MPI application performance (a libvirt-based sketch of the controller flow follows the citation below)
J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017
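The following is a hedged sketch, using the public libvirt C API (link with -lvirt), of what a controller's detach / migrate / re-attach flow for one VM could look like. The domain name, target URI, and VF PCI address are hypothetical, and this is not the framework's actual implementation, which additionally coordinates channel suspension/reactivation with the MPI runtime (e.g., via IVShmem).

```c
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Hypothetical device XML describing the passed-through VF. */
static const char *vf_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source><address domain='0x0000' bus='0x03'"
    "           slot='0x00' function='0x1'/></source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "connect failed\n"); return 1; }

    virDomainPtr dom = virDomainLookupByName(conn, "vm1");
    if (!dom) { fprintf(stderr, "no such domain\n"); virConnectClose(conn); return 1; }

    /* Ready-to-migrate: detach the SR-IOV VF, leaving the VM on its
     * migration-safe (e.g., Ethernet/IPoIB) path. */
    if (virDomainDetachDevice(dom, vf_xml) < 0)
        fprintf(stderr, "VF detach failed\n");

    /* Migration: live-migrate the now-VF-free VM to the target host. */
    if (virDomainMigrateToURI(dom, "qemu+ssh://target-host/system",
                              VIR_MIGRATE_LIVE, NULL, 0) < 0)
        fprintf(stderr, "migration failed\n");

    /* Post-migration: a VF would be re-attached on the target host
     * (virDomainAttachDevice) and the MPI runtime notified to
     * reactivate its IB channels. */
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```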
Sequence Diagram of VM Migration
[Sequence diagram spanning the Login Node (Controller) and the Computer Node (migration source and target), with the Suspend Trigger and RTM Detector on the node side:
Phase 1 (Pre-migration): Controller issues Start Migration; the Suspend Trigger asks the MPI processes to suspend their channels
Phase 2 (Ready-to-migrate): detach VF and detach IVSHMEM from the source VM
Phases 3.a-3.c (Migration): live VM migration from source to target
Phase 4: attach VF and attach IVSHMEM on the target
Phase 5 (Post-migration): reactivate channels; Migration Done is reported to the Controller]
Proposed Design of MPI Runtime
[Timeline diagrams comparing four cases for process P0, with S = Suspend Channel, R = Reactivate Channel, Comp = Computation, plus markers for MPI calls, control messages, lock/unlock of communication, migration signal detection, and the down-time for VM live migration:
– No-migration: P0 alternates computation and MPI calls with no interruption
– Progress Engine (PE) based: the migration signal is only detected inside an MPI call, so pre-migration, ready-to-migrate, migration, and post-migration all serialize with the application
– Migration-thread (MT) based, worst scenario: the thread detects the signal while P0 sits in an MPI call, so little overlap is possible
– Migration-thread (MT) based, typical scenario: the thread suspends and reactivates channels while P0 is computing, overlapping migration with computation]
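To make the PE/MT contrast concrete, here is a simplified C sketch (assumed structure, not MVAPICH2 source) of where each design detects the migration signal. PE polls the flag from inside the progress engine, so detection waits for an MPI call; MT uses a dedicated thread, which can overlap the whole suspend/migrate/reactivate sequence with computation but must lock channel state shared with the application's MPI calls.

```c
#include <pthread.h>
#include <stdatomic.h>

atomic_int migration_requested;   /* set by the controller, e.g., via IVShmem */
pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;

static void suspend_channels(void)        { /* tear down IB connections (stub) */ }
static void wait_for_migration_done(void) { /* VM migrates in this window (stub) */ }
static void reactivate_channels(void)     { /* rebuild IB connections (stub) */ }

/* PE-based: called from inside the MPI progress loop, so the signal is
 * only noticed when the application enters an MPI call. */
void progress_engine_poll(void)
{
    if (atomic_load(&migration_requested)) {
        suspend_channels();               /* Ready-to-migrate */
        wait_for_migration_done();        /* Migration */
        reactivate_channels();            /* Post-migration */
        atomic_store(&migration_requested, 0);
    }
}

/* MT-based: a dedicated thread runs the same sequence concurrently
 * with computation; the lock protecting shared channel state is the
 * lock/unlock cost visible in the benchmarks. */
void *migration_thread(void *arg)
{
    (void)arg;
    for (;;) {
        while (!atomic_load(&migration_requested))
            ;                             /* or block on an eventfd */
        pthread_mutex_lock(&channel_lock);
        suspend_channels();
        wait_for_migration_done();
        reactivate_channels();
        pthread_mutex_unlock(&channel_lock);
        atomic_store(&migration_requested, 0);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, migration_thread, NULL);
    progress_engine_poll();               /* PE path: driven by MPI calls */
    return 0;                             /* process exit ends the demo thread */
}
```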
Outline
• Introduction
• Problem Statement
• Design
• Performance Evaluation
• Conclusion & Future Work
Experimental Testbed

| Cluster | Chameleon Cloud | Nowlab |
| CPU | Intel Xeon E5-2670 v3, 24-core, 2.3 GHz | Intel Xeon E5-2670, dual 8-core, 2.6 GHz |
| RAM | 128 GB | 32 GB |
| Interconnect | Mellanox ConnectX-3 HCA (FDR 56 Gbps), MLNX OFED LINUX-3.0-1.0.1 as driver | Mellanox ConnectX-3 HCA (FDR 56 Gbps), MLNX OFED LINUX-3.2-2.0.0 |
| OS | CentOS Linux release 7.1.1503 (Core) on both clusters |
| Compiler | GCC 4.8.3 on both clusters |
| MPI | MVAPICH2 and OSU micro-benchmarks v5.3 on both clusters |
Point-to-Point and Collective Performance
• Migrate a VM from one machine to another while the benchmark is running inside
• Proposed MT-based designs perform slightly worse than PE-based designs because of lock/unlock
• No benefit from MT because NO computation is involved
[Latency charts: Pt2Pt Latency and Bcast (4 VMs * 2 Procs/VM); latency (us) vs. message size (1 byte to 1 MB), comparing PE-IPoIB, PE-RDMA, MT-IPoIB, and MT-RDMA]
Overlapping Evaluation
• Fix the communication time and increase the computation time
• With 10% computation, partial migration time can be overlapped with computation in MT-typical
• As the computation percentage increases, there is more chance for overlapping
[Line chart: time (s) vs. computation percentage (0-70%) for PE, MT-worst, MT-typical, and NM]
Application Performance
• 8 VMs in total, and 1 VM carries out migration during application running
• Compared with NM, MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Bar charts: execution time (s) of PE, MT-worst, MT-typical, and NM for NAS Class B (LU.B, EP.B, IS.B, MG.B, CG.B, FT.B) and Graph500 (scale/edgefactor 20,10; 20,16; 20,20; 22,10)]
Available Appliances on Chameleon Cloud*

| Appliance | Description |
| CentOS 7 KVM SR-IOV | Chameleon bare-metal image customized with the KVM hypervisor and a recompiled kernel to enable SR-IOV over InfiniBand. https://www.chameleoncloud.org/appliances/3/ |
| MPI bare-metal cluster complex appliance (based on Heat) | This appliance deploys an MPI cluster composed of bare-metal instances using the MVAPICH2 library over InfiniBand. https://www.chameleoncloud.org/appliances/29/ |
| MPI + SR-IOV KVM cluster (based on Heat) | This appliance deploys an MPI cluster of KVM virtual machines using the MVAPICH2-Virt implementation and configured with SR-IOV for high-performance communication over InfiniBand. https://www.chameleoncloud.org/appliances/28/ |
| CentOS 7 SR-IOV RDMA-Hadoop | The CentOS 7 SR-IOV RDMA-Hadoop appliance is built from the CentOS 7 appliance and additionally contains the RDMA-Hadoop library with SR-IOV. https://www.chameleoncloud.org/appliances/17/ |

• Through these available appliances, users and researchers can easily deploy HPC clouds to perform experiments and run jobs with
– High-performance SR-IOV + InfiniBand
– High-performance MVAPICH2 library over bare-metal InfiniBand clusters
– High-performance MVAPICH2 library with virtualization support over SR-IOV enabled KVM clusters
– High-performance Hadoop with RDMA-based enhancements support
[*] Only includes appliances contributed by OSU NowLab
MVAPICH2-Virt Heat-based Complex Appliance
MPI complex appliances based on MVAPICH2 on Chameleon:
1. Load VM Config
2. Allocate Ports
3. Allocate Floating IPs
4. Generate SSH Keypair
5. Launch VM
6. Attach SR-IOV Device
7. Hotplug IVShmem Device
8. Download/Install MVAPICH2-Virt
9. Populate VMs/IPs
10. Associate Floating IPs
Conclusion & Future Work
• Propose a high-performance VM migration framework for MPI applications on SR-IOV enabled InfiniBand clusters
• Hypervisor- and InfiniBand driver-independent
• Design Progress Engine (PE) based and Migration Thread (MT) based MPI runtime designs
• Design a high-performance and scalable controller which works seamlessly with our proposed designs in the MPI runtime
• Evaluate the proposed framework with MPI-level micro-benchmarks and real-world HPC applications
• Could completely hide the overhead of VM migration through computation and migration overlapping
• Future Work
– Evaluate our proposed framework at larger scales and on different hypervisors, such as Xen; the solution will be available in an upcoming release
Thank You!
{zhanjie, luxi, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/