TRANSCRIPT
Achieving High Performance on Microsoft Azure HPC Cloud using
MVAPICH2
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Microsoft Azure Booth Talk at SC ‘19
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these applications on the Cloud!!
Agenda
• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002 (Supercomputing '02)
– MVAPICH2-X (MPI + PGAS), available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
– Support for Virtualization (MVAPICH2-Virt), available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '19 ranking)
• 3rd: 10,649,600 cores (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th: 448,448 cores (Frontera) at TACC
• 8th: 391,680 cores (ABCI) in Japan
• 14th: 570,020 cores (Nurion) in South Korea, and many others
– Available with software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads over time, annotated with releases from MV 0.9.4 through MV2 0.9.x, MV2 1.x, MV2-GDR, MV2-MIC, MV2-X, MV2-Virt, OSU INAM, MV2 2.3.2, MV2-Azure 2.3.2, and MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family (HPC and DL)
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC
• Transport mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• Modern features: UMR, ODP, SR-IOV, Multi-Rail, Optane*, NVLink, CAPI* (* upcoming)
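The hybrid MPI + X programming model listed above can be made concrete with a short sketch (an illustrative example under assumed settings, not code from the talk): MPI ranks across nodes with OpenMP threads inside each rank, using the standard MPI_Init_thread entry point.

```c
/* Hybrid MPI + OpenMP sketch: one MPI rank per node, threads within the rank.
 * Build (for example): mpicc -fopenmp hybrid.c -o hybrid */
#include <stdio.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the main thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        local += 1.0;                /* per-thread work stands in here */
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Sum over all threads of all ranks: %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```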
Can HPC and Virtualization be Combined?
• Virtualization has many benefits: fault-tolerance, job migration, compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports OpenStack, Docker, and Singularity
References:
J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15
Application-Level Performance on Chameleon
[Charts: SPEC MPI2007 execution time (s) for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time (ms) across problem sizes (Scale, Edgefactor), each comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]
• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 processes
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes
Agenda
• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans
Azure HPC Environments
• Azure has been using RDMA-enabled networks and software stacks for the last several years
• Moved to native InfiniBand support with Mellanox OFED for the new instances (HB, HC, and upcoming ones)
• Uses SR-IOV support for virtualization
• Uses one VM per node
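On HB and HC instances the SR-IOV InfiniBand virtual function shows up inside the VM as a standard verbs device. As a minimal sketch (not from the talk, and assuming the libibverbs development headers are installed in the VM image), a few lines of C can confirm the device is visible to MPI libraries and other RDMA applications:

```c
/* List RDMA (InfiniBand) devices visible in the VM via libibverbs.
 * Illustrative sketch only; build with: gcc list_ib.c -o list_ib -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found (is the IB driver loaded?)\n");
        return 1;
    }
    for (int i = 0; i < num_devices; i++)
        printf("Found device: %s\n", ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}
```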
Agenda
• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans
MVAPICH2 Software Family (Library: Requirements)
• MVAPICH2: MPI with IB, iWARP, Omni-Path, and RoCE
• MVAPICH2-X: Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE
• MVAPICH2-GDR: MPI with IB, RoCE & GPU, and support for Deep Learning
• MVAPICH2-Virt: HPC Cloud with MPI & IB
• MVAPICH2-EA: Energy-aware MPI with IB, iWARP, and RoCE
• OEMT: MPI energy monitoring tool
• OSU INAM: InfiniBand network analysis and monitoring
• OMB: Microbenchmarks for measuring MPI and PGAS performance
MVAPICH2-Azure 2.3.2
• Released on 08/16/2019
• Major features and enhancements
– Based on MVAPICH2-2.3.2
– Enhanced tuning for point-to-point and collective operations
– Targeted for Azure HB & HC virtual machine instances
– Flexibility for 'one-click' deployment
– Tested with Azure HB & HC VM instances
• Available for download from http://mvapich.cse.ohio-state.edu/downloads/
• Detailed User Guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/
• On-going work
– Optimizing MVAPICH2-X features on Azure
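As a quick sanity check of a fresh MVAPICH2-Azure install (a minimal sketch, not taken from the release notes or user guide), the standard MPI hello-world compiled with mpicc and launched with mpirun exercises the library across the VM instances:

```c
/* Minimal MPI check: each rank reports its host and rank.
 * Build:  mpicc hello_mpi.c -o hello_mpi
 * Run:    mpirun -np 4 ./hello_mpi   (hostfile/launcher flags depend on the cluster) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Rank %d of %d running on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```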
Agenda
• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans
Performance Test Configurations
• VM type: Azure HC
• CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz
• Cores: 2 sockets, 44 cores
• MVAPICH2 version: latest MVAPICH2-X w/ XPMEM support
• HPCx version: built-in HPCx-v2.5.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64
• OMB version: OSU Micro-Benchmarks 5.6.2
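The point-to-point numbers that follow come from the OSU Micro-Benchmarks (osu_latency, osu_bw). As a rough, simplified sketch of what such a latency test measures (not the actual OMB source, which adds warm-up, windowing, and buffer alignment), a ping-pong loop between ranks 0 and 1 times round trips and halves them:

```c
/* Simplified ping-pong latency sketch between ranks 0 and 1.
 * Run with at least 2 ranks; illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, msg_size = 8;            /* bytes; vary to sweep message sizes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n", msg_size,
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    free(buf);
    MPI_Finalize();
    return 0;
}
```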
Intra-node Point-to-Point (HC)
[Charts: latency (us) and bandwidth for small, medium, and large messages vs. message size (bytes), comparing MVAPICH2-X and HPCx; annotations: MVAPICH2-X 1.6x better and 1.7x better]
Inter-node Point-to-Point (HC)
[Charts: latency (us) and bandwidth for small, medium, and large messages vs. message size (bytes), comparing MVAPICH2-X and HPCx]
Single-node Collectives (HC)
[Charts: latency (us) vs. message size (bytes) for Bcast, Gather, Reduce, and Scatter at 44 ppn (MVAPICH2-X); annotations: 3.7x better, 5.8x better, 3x better]
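These collective results are again from the OSU Micro-Benchmarks (osu_bcast, osu_gather, osu_reduce, osu_scatter). As a hedged sketch of the measurement pattern (not the OMB code itself), an MPI_Bcast latency loop looks roughly like this:

```c
/* Simplified broadcast-latency sketch (illustrative; the real osu_bcast
 * adds warm-up iterations and sweeps over message sizes). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, nprocs, msg_size = 1024;     /* bytes; vary to sweep sizes */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(msg_size);
    double total = 0.0;

    for (int i = 0; i < ITERS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);        /* align ranks before timing */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, msg_size, MPI_CHAR, 0, MPI_COMM_WORLD);
        total += MPI_Wtime() - t0;
    }

    double avg = total / ITERS, max_avg;
    MPI_Reduce(&avg, &max_avg, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Bcast, %d bytes, %d ranks: %.2f us\n",
               msg_size, nprocs, max_avg * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```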
2-node Collectives (HC)
[Charts: latency (us) vs. message size (bytes) for Bcast, Gather, Reduce, and Scatter at 44 ppn (MVAPICH2-X); annotations: 9x better, 12x better, 5x better]
4-node Collectives (HC)
[Charts: latency (us) vs. message size (bytes) for Bcast, Gather, Reduce, and Scatter at 44 ppn (MVAPICH2-X); annotations: 20x better, 23x better, 40% better, 2.2x better]
8-node Collectives (HC)
[Charts: latency (us) vs. message size (bytes) for Bcast, Gather, Reduce, and Scatter at 44 ppn (MVAPICH2-X); annotations: 21x better, 26x better, 2.1x better, 2.3x better]
Radix Performance (HC)
[Chart: total execution time in seconds (lower is better) vs. processes (Nodes x PPN) at 16(1x16), 32(1x32), 44(1x44), 88(2x44), 172(4x28), and 352(8x28), comparing MVAPICH2-X and HPCx; MVAPICH2-X up to 3x faster]
FDS Performance (HC)
[Charts: total execution time in seconds (lower is better); single node at 16(1x16), 32(1x32), and 44(1x44) processes, and multi-node at 88(2x44) and 176(4x44) processes (Nodes x PPN); annotation: 1.11x better]
Part of input parameters: &MESH IJK=5,5,5, XB=-1.0,0.0,-1.0,0.0,0.0,1.0, MULT_ID='mesh array'
Agenda
• Overview of the MVAPICH2 Project
• Overview of Azure HPC Environments
• Designing MVAPICH2 for Azure
• Performance Results
• Future Plans
Continuing Work and Future Plans
• Additional optimizations and tuning of MVAPICH2-X
• A new version will be released soon
• Making it available in an integrated manner in the Azure portal
• Making it available through Azure Market Place
• Commercial Support available for End-Users, ISVs, and Organizations
• Through X-ScaleSolutions (http://x-scalesolutions.com)
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Follow us on
https://twitter.com/mvapich