TRANSCRIPT
MVAPICH Performance on Arm at Scale
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Arm HPC User Group Talk (SC ’19)
High-End Computing (HEC): PetaFlop to ExaFlop
• 100 PFlops in 2017
• 143 PFlops in 2018
• 1 EFlops in 2020-2021?
Expected to have an ExaFlop system in 2020-2021!
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.
[Figure: Multi-core Processors; High-Performance Interconnects (InfiniBand, <1 usec latency, 200 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Systems pictured: K Computer, Sunway TaihuLight, Summit, Sierra.]
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Figure: co-design stack, from applications down to hardware]
• Application Kernels/Applications (HPC and DL)
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Networking Technologies (InfiniBand, 40/100/200 GigE, Slingshot, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)
• Middleware co-design opportunities and challenges across the various layers: performance, scalability, resilience
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Scalable job start-up
  – Low memory footprint
• Scalable collective communication
  – Offload
  – Non-blocking (see the sketch after this list)
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for accelerators (GPGPUs and FPGAs)
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
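Several of the items above, offloaded and non-blocking collectives in particular, hinge on overlapping communication with computation. As a concrete illustration, here is a minimal, self-contained C sketch using the standard MPI-3 non-blocking collective MPI_Iallreduce; it is illustrative only and not code from the talk.

/* Minimal sketch of a non-blocking collective (MPI_Iallreduce).
 * The reduction is started, independent work can proceed, and the
 * result is consumed only after MPI_Wait completes the operation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction without blocking the caller. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation can proceed here, overlapping
     * with the collective in progress ... */

    /* Complete the collective before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}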
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jun ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 15th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Partner in the TACC Frontera System
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from the OSU site, Sep-04 through Apr-19, rising to nearly 600,000. Release milestones marked along the timeline: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, MV2-AWS 2.3.]
Architecture of MVAPICH2 Software Family (HPC and DL)
[Figure: layered architecture]
• High Performance Parallel Programming Models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)
MVAPICH2 Software Family
Requirements                                                    Library
MPI with IB, iWARP, Omni-Path, and RoCE                         MVAPICH2
Advanced MPI features/support, OSU INAM, PGAS, and
  MPI+PGAS with IB, Omni-Path, and RoCE                         MVAPICH2-X
MPI with IB, RoCE & GPU, and support for Deep Learning          MVAPICH2-GDR
HPC Cloud with MPI & IB                                         MVAPICH2-Virt
Energy-aware MPI with IB, iWARP, and RoCE                       MVAPICH2-EA
MPI energy monitoring tool                                      OEMT
InfiniBand network analysis and monitoring                      OSU INAM
Microbenchmarks for measuring MPI and PGAS performance          OMB
Features and Improvements in MVAPICH2-X for ARM
• Enhanced architecture and IB HCA detection for various ARM systems
• Optimization and tuning for
  – Intra-node and inter-node point-to-point operations
  – Intra-node shared-memory communication protocols
  – Collective operations for different message sizes and job/system sizes using the existing collective algorithms in MVAPICH2-X
• Optimizations to job startup performance to achieve scalable job startup when running large-scale jobs on ARM systems
• Support for the latest GCC and ARM compilers
Performance Evaluation of Optimized MVAPICH2-X
• EPCC Fulhame Cluster
  – Nodes: 16 x ARM ThunderX2
  – Processor: 2x 32-core ARM ThunderX2
  – Network: EDR 100Gbps MT4119
  – Operating System: Linux 4.12.14-23-default
  – MPI and Communication Libraries
    • MVAPICH2-X (latest)
    • HPCX-v2.4.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-suse15.0-aarch64
    • OpenMPI-4.0.2 w/ latest UCX
  – OSU Micro-benchmarks v5.6.2
• Mayer Cluster
  – Nodes: 14 x ARM ThunderX2
  – Processor: 2x 28-core ARM ThunderX2
  – Network: EDR 100Gbps MT4119
  – Operating System: Linux 4.14.0-115.13
  – MPI and Communication Libraries
    • MVAPICH2-X (latest)
    • OpenMPI 4.0.1
    • UCX 1.5.2
  – OSU Micro-benchmarks v5.6.2
Evaluation of Point-to-point on EPCC ARM System
• EPCC Fulhame ARM cluster with up to 16 dual-socket 32-core ThunderX2 nodes
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• OSU Micro-benchmark Suite (OMB) v5.6.2
• Measure MPI-level communication performance: latency, bandwidth, bi-directional bandwidth, and message rate (a minimal latency sketch follows this list)
• Three different configurations
  – Intra-socket
  – Inter-socket
  – Inter-node
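For context on what these microbenchmarks measure, below is a minimal C sketch of a ping-pong latency test in the spirit of OMB's osu_latency; the message size and iteration count are assumptions for illustration, and this is not the OMB source.

/* Minimal ping-pong latency sketch. Rank 0 and rank 1 exchange a
 * fixed-size message many times and report the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* message size in bytes (assumed) */
#define ITERS    10000    /* timed iterations (assumed) */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 'a', MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One round trip carries two messages, hence the divide by 2. */
    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               MSG_SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}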
Point-to-point: Latency & Bandwidth (Intra-socket)
[Charts: intra-socket latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 70% better.]
Point-to-point: Bi-Bandwidth (Intra-socket)
[Charts: intra-socket bi-directional bandwidth and message rate for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 37% better.]
Point-to-point: Latency & Bandwidth (Inter-socket)
[Charts: inter-socket latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 42% and 76% better.]
Point-to-point: Bi-Bandwidth (Inter-socket)
[Charts: inter-socket bi-directional bandwidth and message rate for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 24% better.]
Point-to-point: Latency & Bandwidth (Inter-Node)
[Charts: inter-node latency for small (0-32 B), medium (128 B-8 KB), and large (32 KB-2 MB) messages, and bandwidth for small (1-64 B), medium (256 B-16 KB), and large (64 KB-4 MB) messages; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX.]
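The bandwidth numbers above are typically gathered with a windowed test: the sender keeps many non-blocking sends in flight so the link stays busy. Below is a minimal C sketch in the spirit of OMB's osu_bw; the message size, window depth, and iteration count are assumptions, and this is not the OMB source.

/* Minimal windowed bandwidth sketch. Rank 0 posts a window of
 * non-blocking sends to rank 1, waits for an acknowledgement, and
 * reports the achieved MB/s. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)  /* 1 MB messages (assumed) */
#define WINDOW   64         /* sends in flight per iteration (assumed) */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank;
    MPI_Request reqs[WINDOW];
    char *buf = malloc(MSG_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            /* Wait for the receiver's zero-byte acknowledgement. */
            MPI_Recv(NULL, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(NULL, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.2f MB/s\n",
               (double)MSG_SIZE * WINDOW * ITERS / 1e6 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}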
Evaluation of Collective Communication on EPCC ARM System
• Fulhame cluster with up to 16 dual-socket 32-core ThunderX2 nodes
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• OSU Micro-benchmark Suite (OMB) v5.6.2
• Measure MPI-level collective communication latency (a minimal timing sketch follows this list)
• Evaluate single-socket (half-subscription) and dual-socket (full-subscription) scenarios at varying scales
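As a reference for how such collective latencies are typically measured, here is a minimal C sketch in the spirit of OMB's osu_allreduce; the buffer size and iteration count are assumptions, and this is not the OMB source.

/* Minimal collective-latency sketch. All ranks time a loop of
 * MPI_Allreduce calls on a fixed-size buffer; rank 0 reports the
 * average per-call latency across ranks. */
#include <mpi.h>
#include <stdio.h>

#define COUNT 1024   /* doubles per reduction (assumed) */
#define ITERS 1000   /* timed iterations (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    double in[COUNT], out[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < COUNT; i++)
        in[i] = 1.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* Average the per-rank latency over all ranks. */
    double local = (t1 - t0) * 1e6 / ITERS, sum;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Allreduce: %.2f us average latency\n", sum / size);

    MPI_Finalize();
    return 0;
}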
Collectives: Single Node (64-ppn)
[Charts: single-node latency (64 ppn) of Allreduce, Bcast, Reduce, and Scatter for message sizes up to 1 MB; MVAPICH2-X-Next vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 3x, 2x, and 2.2x better.]
Collectives: 4 & 16 Nodes (32-ppn)
[Charts: Allreduce and Bcast latency on 4 nodes and 16 nodes (32 ppn) for message sizes up to 1 MB; MVAPICH2-X-Next vs. HPCX vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 5.7x, 7.6x, 3.7x, and 9.5x better.]
Collectives: 16 Nodes (64-ppn)
[Charts: 16-node latency (64 ppn) of Allreduce, Bcast, Reduce, and Scatter for message sizes up to 1 MB; MVAPICH2-X-Next vs. OpenMPI+UCX. MVAPICH2-X-Next is up to 10x, 5x, 7.5x, and 8x better.]
MPI Job Startup Evaluation on Different ARM Clusters
[Charts: startup time (ms) vs. number of processes on EPCC Fulhame (64-2048 processes, 64 ppn) and Mayer (56-448 processes); MVAPICH2-X, OpenMPI+UCX, and MVAPICH2-X-Next. Up to 1.6x better on Fulhame and 6.4x better on Mayer.]
• Up to 1.6x speedup over OpenMPI w/ UCX on the Fulhame system
• Up to 6.4x speedup over OpenMPI w/ UCX on the Mayer system (a startup-timing sketch follows)
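Startup time here is the wall-clock time for all ranks to launch and complete MPI initialization. Below is a minimal C sketch of one way to measure it; this is an illustration under stated assumptions, not the harness used for the results above.

/* Minimal job-startup timing sketch. Wall-clock time is taken from
 * program entry until every rank has initialized and synchronized.
 * CLOCK_MONOTONIC is used because MPI_Wtime is unavailable before
 * MPI_Init. Launch with the desired process count, e.g. 2048. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    double t0 = now_sec();

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks are up and connected */

    double elapsed = now_sec() - t0, max_elapsed;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Report the slowest rank's startup time. */
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("startup time: %.1f ms\n", max_elapsed * 1e3);

    MPI_Finalize();
    return 0;
}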
Evaluation of Application Kernels
• Evaluation of the NAS Parallel Benchmarks, MiniAMR, and CloverLeaf kernels
• Comparison among MVAPICH2-X (Next), OpenMPI+UCX, and HPCX communication libraries
• Measure application communication performance at varying scales, with full-subscription scenarios on up to 1,024 processes
• Significant performance improvement is observed when using MVAPICH2-X
Application Evaluation – NAS Parallel Benchmarks
[Charts: NPB-CG execution time (s) on 32-512 processes and NPB-FT execution time (s) on 32-256 processes (32 ppn); MVAPICH2-X-Next vs. HPCX. Up to 30% (CG) and 29% (FT) better.]
• NPB-3.4 Class D, comparing MVAPICH2-X (upcoming) and HPCX on EPCC Fulhame
• Up to 30% and 29% improvement over HPCX for the CG and FT kernels, respectively
Application Evaluation – MiniAMR
[Chart: MiniAMR execution time (s) on 32-256 processes (32 ppn); MVAPICH2-X-Next vs. HPCX. Up to 23% better.]
Input parameters: --percent_sum 0 --num_vars 10 --stencil 21 --report_diffusion 0 --report_perf 2 --num_tsteps 100 --num_spikes 1
• MiniAMR kernel comparing MVAPICH2-X (upcoming) and HPCX on EPCC Fulhame
• Up to 23% improvement over HPCX is observed
Conclusions
• ARM has emerged as a new platform for HPC systems
• It requires high-performance middleware designs that exploit modern interconnects (InfiniBand)
• Presented the approaches taken by the MVAPICH2 project to provide high-performance MPI support
• Will continue to optimize and tune the MVAPICH2 stack for higher performance and scalability on ARM platforms
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Multiple Events at SC ‘19
• Presentations at the OSU and X-Scale booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams
  – External speakers
• Presentations in the SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/
Funding Acknowledgments
[Logos: funding support by sponsoring agencies; equipment support by vendors]
Personnel Acknowledgments
Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), C.-H. Chu (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Kandadi (M.S.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), Kamal Raj (M.S.), A. Quentin (Ph.D.), B. Ramesh (M.S.), S. Xu (M.S.), Q. Zhou (Ph.D.)
Current Students (Undergraduate): V. Gangal (B.S.), N. Sarkauskas (B.S.)
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), R. Biswas (M.S.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), S. Chakraborthy (Ph.D.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), D. Shankar (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientist: H. Subramoni
Past Research Scientists: K. Hamidouche, S. Sur, X. Lu
Current Post-docs: M. S. Ghazimeersaeed, K. Manian, A. Ruhela
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Current Research Specialist: J. Smith
Past Research Specialist: M. Arnold
Past Programmers: D. Bureddy, J. Perkins
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/