
Achieving High Performance on Microsoft Azure HPC Cloud using MVAPICH2

Microsoft Azure Booth Talk at SC '19

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Increasing Usage of HPC, Big Data and Deep Learning

[Figure: convergence of HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)]

Convergence of HPC, Big Data, and Deep Learning!

Increasing need to run these applications on the Cloud!!


Agenda

• Overview of the MVAPICH2 Project

• Overview of Azure HPC Environments

• Designing MVAPICH2 for Azure

• Performance Results

• Future Plans

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (Supercomputing '02)
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 3,050 organizations in 89 countries
  – More than 615,000 (> 0.6 million) downloads directly from the OSU site
  – Empowering many TOP500 clusters (Nov '19 ranking)
    • 3rd: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 5th: 448,448-core Frontera at TACC
    • 8th: 391,680-core ABCI in Japan
    • 14th: 570,020-core Nurion in South Korea, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads (0 to 600,000) over the release timeline, from MV 0.9.4 through MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]

Architecture of MVAPICH2 Software Family (HPC and DL)

[Figure: layered architecture diagram]

• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM
• Modern features: UMR, ODP, SR-IOV, multi-rail, NVLink, Optane*, CAPI* (* upcoming)
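To make the hybrid "MPI + X" model listed above concrete, the following is a minimal generic sketch combining MPI ranks with OpenMP threads. It is an illustration of the programming model only, not code from the MVAPICH2 distribution, and assumes a compiler wrapper such as mpicc with OpenMP enabled.

/* Generic MPI + OpenMP hybrid hello world (illustrative of the MPI + X model). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request thread support since OpenMP threads coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}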

Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Fault tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi, and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi, and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold, and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15

Application-Level Performance on Chameleon

[Figure: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) across problem sizes (Scale, Edgefactor) from 22,20 to 26,16, comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 procs
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 procs


Agenda

• Overview of the MVAPICH2 Project

• Overview of Azure HPC Environments

• Designing MVAPICH2 for Azure

• Performance Results

• Future Plans

Azure HPC Environments

• Azure has been using RDMA-enabled networks and software stacks for the last several years

• Moved to native InfiniBand support with Mellanox OFED for new instances (HB, HC, and upcoming ones)

• Uses SR-IOV support for virtualization

• Uses one VM per node


Agenda

• Overview of the MVAPICH2 Project

• Overview of Azure HPC Environments

• Designing MVAPICH2 for Azure

• Performance Results

• Future Plans

MVAPICH2 Software Family

Requirements                                                             Library
MPI with IB, iWARP, Omni-Path, and RoCE                                  MVAPICH2
Advanced MPI Features/Support, OSU INAM, PGAS and
  MPI+PGAS with IB, Omni-Path, and RoCE                                  MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning                    MVAPICH2-GDR
HPC Cloud with MPI & IB                                                  MVAPICH2-Virt
Energy-aware MPI with IB, iWARP, and RoCE                                MVAPICH2-EA
MPI Energy Monitoring Tool                                               OEMT
InfiniBand Network Analysis and Monitoring                               OSU INAM
Microbenchmarks for Measuring MPI and PGAS Performance                   OMB

MVAPICH2-Azure 2.3.2

• Released on 08/16/2019
• Major Features and Enhancements
  – Based on MVAPICH2-2.3.2
  – Enhanced tuning for point-to-point and collective operations
  – Targeted for Azure HB & HC virtual machine instances
  – Flexibility for 'one-click' deployment
  – Tested with Azure HB & HC VM instances
• Available for download from http://mvapich.cse.ohio-state.edu/downloads/
• Detailed User Guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/
• On-going Work
  – Optimizing MVAPICH2-X features on Azure
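As a quick sanity check after installing MVAPICH2-Azure on HB or HC instances, a trivial MPI program such as the sketch below can be built with the library's mpicc and launched across VMs (for example with mpirun_rsh and a hostfile). The program is illustrative only and is not part of the MVAPICH2 distribution.

/* hello_mvapich.c: minimal MPI sanity check (illustrative example).
 * Typical usage, assuming MVAPICH2-Azure's bin directory is in PATH:
 *   mpicc hello_mvapich.c -o hello_mvapich
 *   mpirun_rsh -np 4 -hostfile hosts ./hello_mvapich
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}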

Agenda

• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans

Performance Test Configurations

• VM type: Azure HC
• CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
• Cores: 2 sockets, 44 cores
• MVAPICH2 version: latest MVAPICH2-X w/ XPMEM support
• HPCx version: built-in HPCx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64
• OMB version: OSU Micro-Benchmarks 5.6.2
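The point-to-point results on the next two slides come from the OSU Micro-Benchmarks. The sketch below shows the basic ping-pong pattern such a latency test uses; it is a simplified illustration, not the actual osu_latency source.

/* Simplified ping-pong latency test between ranks 0 and 1,
 * in the spirit of osu_latency (illustrative sketch, not the OMB code). */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define MSG_SIZE 8   /* bytes; the real benchmark sweeps a range of sizes */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* One-way latency is half of the average round-trip time */
        double usec = (MPI_Wtime() - start) * 1e6 / (2.0 * ITERS);
        printf("%d bytes: %.2f us\n", MSG_SIZE, usec);
    }

    MPI_Finalize();
    return 0;
}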

Intra-node Point-to-Point (HC)

[Figure: latency (us) and bandwidth vs. message size for small, medium, and large messages, MVAPICH2-X vs. HPCx; callouts: 1.6x better, 1.7x better]

Inter-node Point-to-Point (HC)

[Figure: latency (us) and bandwidth vs. message size for small, medium, and large messages, MVAPICH2-X vs. HPCx]

Single-node Collectives (HC)

[Figure: latency (us) vs. message size for Bcast, Gather, Reduce, and Scatter at 44 ppn (MVAPICH2-X); callouts: 3.7x better, 5.8x better, 3x better]
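The collective numbers on this and the following slides are OMB-style measurements at 44 processes per node. A simplified sketch of how such a test times MPI_Bcast is shown below; it is illustrative only, not the osu_bcast source.

/* Simplified broadcast-latency loop in the spirit of osu_bcast
 * (illustrative sketch, not the OMB code). */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define MSG_SIZE 1024  /* bytes; the real benchmark sweeps a range of sizes */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(buf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - start) * 1e6 / ITERS;  /* us per broadcast */

    /* Report the slowest rank's average, as collective benchmarks typically do */
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d bytes, %d ranks: %.2f us\n", MSG_SIZE, size, worst);

    MPI_Finalize();
    return 0;
}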

2-node Collectives (HC)

[Figure: latency (us) vs. message size for Bcast, Gather, Reduce, and Scatter at 44 ppn across 2 nodes (MVAPICH2-X); callouts: 9x better, 12x better, 5x better]

4-node Collectives (HC)

[Figure: latency (us) vs. message size for Bcast, Gather, Reduce, and Scatter at 44 ppn across 4 nodes (MVAPICH2-X); callouts: 20x better, 23x better, 40% better, 2.2x better]

8-node Collectives (HC)

[Figure: latency (us) vs. message size for Bcast, Gather, Reduce, and Scatter at 44 ppn across 8 nodes (MVAPICH2-X); callouts: 21x better, 26x better, 2.1x better, 2.3x better]

Radix Performance (HC)

[Figure: total execution time in seconds (lower is better) vs. processes (nodes x PPN), from 16 (1x16) to 352 processes across 1 to 8 nodes, MVAPICH2-X vs. HPCx; callout: 3x faster]

FDS Performance (HC)

[Figure: total execution time in seconds (lower is better) vs. processes (nodes x PPN); single node: 16 (1x16), 32 (1x32), 44 (1x44); multi-node: 88 (2x44), 176 (4x44); callout: 1.11x better]

Part of input parameter: &MESH IJK=5,5,5, XB=-1.0,0.0,-1.0,0.0,0.0,1.0, MULT_ID='mesh array'

Agenda

• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans

Continuing Work and Future Plans

• Additional optimizations and tuning of MVAPICH2-X
• A new version will be released soon
• Making it available in an integrated manner through the Azure portal
• Making it available through the Azure Marketplace
• Commercial support available for end users, ISVs, and organizations
  – Through X-ScaleSolutions (http://x-scalesolutions.com)

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project

http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project

http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project

http://hibd.cse.ohio-state.edu/

Follow us on

https://twitter.com/mvapich