
MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning

Dhabaleswar K. (DK) Panda

The Ohio State University
E-mail: panda@cse.ohio-state.edu

http://www.cse.ohio-state.edu/~panda

Talk at Mellanox booth (SC ‘19)


Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing 2002)

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 3,050 organizations in 89 countries

– More than 615,000 (> 0.6 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘19 ranking)

• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 5th, 448,448 cores (Frontera) at TACC

• 8th, 391,680 cores (ABCI) in Japan

• 14th, 570,020 cores (Nurion) in South Korea and many others

– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system


MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site (0 to 600,000) over the timeline Sep 2004 to Apr 2019, annotated with major releases from MV 0.9.4 and MV2 0.9.0 through MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3.]


Architecture of MVAPICH2 Software Family (HPC and DL)

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
• Transport Protocols: RC, SRD, UD, DC
• Modern Features: UMR, ODP, SR-IOV, Multi Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• Modern Features: Optane*, NVLink, CAPI*

* Upcoming


MVAPICH2 Software Family

Requirements | Library
MPI with IB, iWARP, Omni-Path, and RoCE | MVAPICH2
Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE | MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning | MVAPICH2-GDR
HPC Cloud with MPI & IB | MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE | MVAPICH2-EA
MPI Energy Monitoring Tool | OEMT
InfiniBand Network Analysis and Monitoring | OSU INAM
Microbenchmarks for Measuring MPI and PGAS Performance | OMB


MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters

[Figure: two nodes, each with CPUs connected over QPI, multiple GPUs attached over PCIe, and an IB adapter, illustrating memory buffers and the communication paths below.]

1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .

• Connected as PCIe devices: flexibility but complexity
• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• Critical for runtimes to optimize data movement while hiding the complexity


GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender:
    MPI_Send(s_devbuf, size, …);

At Receiver:
    MPI_Recv(r_devbuf, size, …);

(data movement is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
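To make the pattern concrete, here is a minimal sketch (not taken from the MVAPICH2 sources): it assumes a CUDA-aware MPI library such as MVAPICH2-GDR and a two-rank launch (e.g., mpirun -np 2), and simply passes cudaMalloc'ed device pointers straight to MPI_Send/MPI_Recv, leaving staging, pipelining, and GPUDirect RDMA to the library.

/* Minimal CUDA-aware MPI sketch: device buffers passed directly to MPI.
 * Assumes a CUDA-aware build (e.g., MVAPICH2-GDR) and at least 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;   /* 1M floats */
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));   /* GPU device memory */

    if (rank == 0) {
        cudaMemset(devbuf, 0, n * sizeof(float));
        /* Device pointer passed directly; the library moves the data. */
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into device memory\n", n);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}

With MVAPICH2-GDR, such a program is typically run with GPU support enabled; the run lines later in this talk use MV2_USE_CUDA=1 for that purpose.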


CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory


MVAPICH2-GDR 2.3.2

• Released on 08/08/2019

• Major Features and Enhancements

– Based on MVAPICH2 2.3.1

– Support for CUDA 10.1

– Support for PGI 19.x

– Enhanced intra-node and inter-node point-to-point performance

– Enhanced MPI_Allreduce performance for DGX-2 system

– Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode

– Enhanced performance of datatype support for GPU-resident data

• Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe

– Enhanced GPU-based point-to-point and collective tuning

• OpenPOWER systems such as ORNL Summit and LLNL Sierra, ABCI system @ AIST, and Owens and Pitzer systems @ Ohio Supercomputer Center

– Scaled Allreduce to 24,576 Volta GPUs on Summit

– Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and IBM POWER9 systems

– Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems

– Enhanced small message performance for CUDA-Aware MPI_Put and MPI_Get

– Flexible support for running TensorFlow (Horovod) jobs


Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size (1 B-8 KB), comparing MV2 (no GDR) with MV2-GDR 2.3; small-message latency of 1.85 us (about 11X better) and roughly 9X-10X higher bandwidth and bi-bandwidth.]

Configuration: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA



Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) vs. number of processes (4-32) for 64K-particle and 256K-particle HOOMD-blue runs, comparing MV2 with MV2+GDR; MV2-GDR delivers about 2X higher TPS in both cases.]

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384


Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing the default, callback-based, and event-based designs.]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/


Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
  – Multi-stream Communication for IPC
  – CMA-based Intra-node Communication Support
  – Support for OpenPOWER and NVLink with GDRCOPY2
  – Maximal overlap in MPI Datatype Processing
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect

• Up to 30% higher D2D bandwidth on DGX-1 with NVLink

[Figure: point-to-point device-to-device bandwidth (MB/s) vs. message size, comparing 1-stream and 4-stream CUDA IPC designs; up to 16% better on OpenPOWER with NVLink and up to 30% better on DGX-1.]

Available since MVAPICH2-GDR-2.3a


CMA-based Intra-node Communication Support

[Figure: intra-node point-to-point host-to-host (H2H) latency and bandwidth vs. message size, MV2-GDR without CMA vs. with CMA; roughly 30% better in both metrics.]

• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth

Configuration: MVAPICH2-GDR 2.3.2, Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA


Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & Alltoall)

[Figure: intra-node (Nodes=1, PPN=20) latency vs. message size for Reduce and Alltoall, small messages (4 B-4 KB) and large messages (8 KB-1 MB), comparing MVAPICH2-GDR, SpectrumMPI 10.1.0.2, and OpenMPI 3.0.0; annotated speedups for MVAPICH2-GDR range from 1.2X to 5.2X.]

Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively.


D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

[Figure: MVAPICH2-GDR intra-node and inter-node device-to-device latency (small and large messages) and bandwidth vs. message size with GDRCopy on NVLink2 + Volta.]

Intra-node latency: 0.76 us (with GDRCopy); intra-node bandwidth: 65.48 GB/sec for 4 MB (via NVLink2)
Inter-node latency: 2.18 us (with GDRCopy 2.0); inter-node bandwidth: 23 GB/sec for 4 MB (via 2-port EDR)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect


D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect

Intra-node D-H latency: 0.49 us (with GDRCopy); intra-node D-H bandwidth: 16.70 GB/sec for 2 MB (via NVLink2)
Intra-node H-D latency: 0.49 us (with GDRCopy 2.0); intra-node H-D bandwidth: 26.09 GB/sec for 2 MB (via NVLink2)

[Figure: intra-node device-to-host and host-to-device latency (small and large messages) and bandwidth vs. message size, Spectrum MPI vs. MV2-GDR.]


Managed Memory Performance (OpenPOWER Intra-node)

[Figure: latency, bandwidth, and bi-bandwidth for managed (MD-to-MD) buffers.]


MVAPICH2 with SHARP Support (Preliminary Results)

[Figure: Allreduce latency vs. message size (4 B-512 B) with and without SHARP, for GPU buffers (MVAPICH2-GDR-Next on 2 and 4 nodes) and host buffers (MVAPICH2 2.3.2 on 2 and 8 nodes); annotated small-message latencies with SHARP are in the 2.34-3.97 us range.]

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect


Non-contiguous Data Exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous on one dimension
  – Non-contiguous on other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains.]


MPI Datatype Support in MVAPICH2

• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable MPI library to optimize non-contiguous data

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Inside MVAPICH2
  – Use datatype-specific CUDA kernels to pack data in chunks
  – Efficiently move data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
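A slightly fuller, self-contained sketch of the sender-side snippet above follows; the matrix size, the choice of sending one column of a row-major N x N matrix, and the two-rank layout are illustrative assumptions, not part of the original slide.

/* Sketch: send one column of a row-major N x N matrix of doubles using a
 * vector datatype (N blocks of 1 element, stride N). Run with 2 ranks.
 * With a CUDA-aware MPI such as MVAPICH2-GDR, buf could also be a device
 * pointer; a host buffer is used here for brevity, and its contents are
 * left uninitialized since this sketch only shows the datatype plumbing. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc((size_t)N * N * sizeof(double));
    MPI_Datatype column;

    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);  /* one matrix column */
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(buf, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(buf);
    MPI_Finalize();
    return 0;
}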


MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  – Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM


MPI Datatype Processing (Communication Optimization)

Common scenario (existing designs waste computing resources on the CPU and GPU):

    MPI_Isend(A, .., Datatype, …);
    MPI_Isend(B, .., Datatype, …);
    MPI_Isend(C, .., Datatype, …);
    MPI_Isend(D, .., Datatype, …);
    …
    MPI_Waitall(…);

*A, B, … contain non-contiguous MPI datatypes
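The fragment above is pseudocode; the hedged sketch below fills it in as a small helper that posts several non-blocking sends of derived-datatype data and completes them with a single MPI_Waitall, which is what gives the library room to overlap packing kernels with communication. The buffer array, count, destination, and tags are placeholders, not values from the slide.

/* Sketch: post several non-blocking sends of non-contiguous
 * (derived-datatype) data, then complete them with one MPI_Waitall so the
 * library can overlap packing with transfers. */
#include <mpi.h>

void send_noncontig_batch(void *bufs[], int nbufs, MPI_Datatype dtype,
                          int dest, MPI_Comm comm)
{
    MPI_Request reqs[16];   /* assumes nbufs <= 16 for this sketch */

    for (int i = 0; i < nbufs; i++)
        MPI_Isend(bufs[i], 1, dtype, dest, /* tag = */ i, comm, &reqs[i]);

    /* Independent application work could be overlapped here. */

    MPI_Waitall(nbufs, reqs, MPI_STATUSES_IGNORE);
}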


Application: COMB

16 GPUs on POWER9 system (test: Comm mpi Mesh cuda Device Buffers mpi_type); timings, lower is better

Library                         pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
Spectrum MPI 10.3               0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
MVAPICH2-GDR 2.3.2              0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558     4.4396
MVAPICH2-GDR 2.3.3 (Upcoming)   0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133     0.1602

On bench-comm, MVAPICH2-GDR 2.3.2 is about 18x faster than Spectrum MPI 10.3, and the upcoming 2.3.3 is about a further 27x faster.

– Improvements are due to enhanced support for GPU-kernel based packing/unpacking routines

Run scripts pushed to the COMB GitHub repo: https://github.com/LLNL/Comb/pull/2


Application: HYPRE - BoomerAMG

Sum of setup-phase and solve-phase times in seconds (lower is better):

                         16 GPUs   32 GPUs   64 GPUs   128 GPUs
Spectrum-MPI 10.3.0.1    0.97      1.196     1.436     1.756
MVAPICH2-GDR 2.3.2       0.94      1.01      1.082     1.628

MVAPICH2-GDR 2.3.2 is about 16% faster at 32 GPUs and about 25% faster at 64 GPUs.

RUN MVAPICH2-GDR 2.3.2:
export MV2_USE_CUDA=1 MV2_USE_GDRCOPY=0 MV2_USE_RDMA_CM=0
export MV2_USE_GPUDIRECT_LOOPBACK=0 MV2_HYBRID_BINDING_POLICY=spread MV2_IBA_HCA=mlx5_0:mlx5_3
OMP_NUM_THREADS=20 lrun -n 128 -N 32 mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18

RUN Spectrum-MPI 10.3.0.1:
OMP_NUM_THREADS=20 lrun -n 128 -N 32 --smpiargs "-gpu --disable_gdr" mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18


Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
  – Benefits of CUDA-Aware MPI with TensorFlow
  – Optimized Collectives for Deep Learning
  – Out-of-core DNN Training
• Conclusions


Data Parallel Training with TensorFlow (TF)

• Need to understand several options currently available
• gRPC (official support)
  – Open-source; can be enhanced by others
  – Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in "X"
  – X: MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu: the first one to use MPI collectives for TF
  – Horovod: use NCCL, MPI, or any other future library (e.g., IBM DDL support recently added); a sketch of the underlying allreduce step follows below

A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19. https://arxiv.org/abs/1810.11112
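In both the Baidu and Horovod approaches, the communication-critical step is averaging each rank's gradients with an allreduce. The sketch below shows only that core step and is not Horovod's actual implementation; the gradient buffer, its length, and the use of a host buffer for the final scaling are assumptions for illustration. With a CUDA-aware MPI such as MVAPICH2-GDR, the MPI_Allreduce itself can also be issued on a GPU device buffer.

/* Sketch of the core step in MPI-based data-parallel training: sum each
 * rank's local gradients with MPI_Allreduce, then divide by the number of
 * ranks to obtain the averaged gradient. grad is assumed to be a host
 * buffer here so the scaling loop can run on the CPU. */
#include <mpi.h>

void average_gradients(float *grad, int count, MPI_Comm comm)
{
    int nranks;
    MPI_Comm_size(comm, &nranks);

    /* In-place element-wise sum across all ranks. */
    MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, comm);

    for (int i = 0; i < count; i++)
        grad[i] /= (float)nranks;
}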


Exploiting CUDA-Aware MPI for TensorFlow (Horovod)

• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 11% better performance on the RI2 cluster (16 GPUs)
• Near-ideal, 98% scaling efficiency

[Figure: images/second (higher is better) vs. number of GPUs (1-16) for Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (proposed), and Ideal; MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (basic CUDA support).]

A. A. Awan et al., “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ‘19, https://arxiv.org/abs/1810.11112


MVAPICH2-GDR vs. NCCL2: Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

[Figure: Allreduce latency vs. message size on 16 GPUs, MVAPICH2-GDR vs. NCCL2; about 3X better for small messages (4 B-64 KB) and about 1.2X better for large messages.]

Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect


MVAPICH2-GDR vs. NCCL2: Allreduce Optimization (DGX-2)

• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on a DGX-2 machine

[Figure: Allreduce latency (up to 5.8X better) and bandwidth (up to 2.4X better) vs. message size (4 B-64 MB), MVAPICH2-GDR 2.3.2 vs. NCCL 2.4.]

Platform: NVIDIA DGX-2 system @ PSC (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2


MVAPICH2-GDR: MPI_Allreduce (Device Buffers) on Summit

• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

[Figure: on 1,536 GPUs, Allreduce latency (4 B-16 KB, 1.6X better) and bandwidth (32-256 MB, 1.7X better) for MVAPICH2-GDR 2.3.2 vs. NCCL 2.4; a 128 MB-message scaling plot from 24 to 1,536 GPUs also compares SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4, and MVAPICH2-GDR 2.3.2, with MVAPICH2-GDR up to 1.7X better.]

Platform: dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect


Distributed Training with TensorFlow and MVAPICH2-GDR on Summit

• ResNet-50 training using the TensorFlow benchmark on Summit -- 1,536 Volta GPUs!
• 1,281,167 (1.2 million) images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 = 324 seconds, i.e. under 5.5 minutes!

[Figure: images per second (thousands) vs. number of GPUs (1-1,536) for NCCL 2.4 and MVAPICH2-GDR 2.3.2.]

MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k, which has 1.2 million images!

*We observed errors for NCCL2 beyond 96 GPUs

Platform: the Summit supercomputer (#1 on Top500.org), 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2

(Note: the time is calculated from the images/second performance; it may differ when actual training happens.)


New Benchmark for Image Segmentation on Summit

• Near-linear scaling may be achieved by tuning Horovod/MPI
• Optimizing MPI/Horovod towards large message sizes for high-resolution images
• Developed a generic image segmentation benchmark
• Tuned DeepLabV3+ model using the benchmark and Horovod, up to 1.3X better than default

*Anthony et al., “Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems” (Submission under review)


OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enable scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset
  – Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset

[Figure: GoogLeNet (ImageNet) training time in seconds vs. number of GPUs (8-128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); "invalid use case" marks the configuration plain Caffe does not support.]

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/


Scalability and Large (Out-of-core) Models?

• Large DNNs cannot be trained on GPUs due to memory limitation!
  – ResNet-50 for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models such as Neural Machine Translation (NMT)
    • Ridiculously large (billions of parameters)
    • Will require even more memory!
  – Can we exploit new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• General intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7

A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18
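The out-of-core designs discussed above rest on CUDA managed (unified) memory, which on Pascal/Volta lets an allocation exceed the GPU's physical memory and be paged between host and device on demand. The sketch below only illustrates that mechanism; the 32 GB size and the prefetch hint are illustrative assumptions and are not taken from OC-Caffe.

/* Sketch: oversubscribing GPU memory with a CUDA managed allocation, the
 * mechanism that out-of-core designs such as OC-Caffe build on. The 32 GB
 * size is illustrative and may exceed a V100's 16 GB HBM; on Pascal/Volta
 * the driver pages data between host and device on demand. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 32ULL << 30;   /* 32 GB, deliberately larger than GPU memory */
    float *buf = NULL;

    cudaError_t err = cudaMallocManaged((void **)&buf, bytes, cudaMemAttachGlobal);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Optional hint: prefetch the first 1 GB of the range onto GPU 0. */
    cudaMemPrefetchAsync(buf, 1ULL << 30, 0, 0);

    /* ... kernels and CUDA-aware MPI calls can use buf directly ... */

    cudaFree(buf);
    return 0;
}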


HyPar-Flow (HF): Hybrid Parallelism for TensorFlow

• CPU-based results
  – AMD EPYC
  – Intel Xeon
• Excellent speedups for
  – VGG-19
  – ResNet-110
  – ResNet-1000 (1k layers)
• Able to train "future" models
  – e.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)

*Awan et al., “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, arXiv ’19. https://arxiv.org/pdf/1911.05146.pdf

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 Cluster)


Outline

• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions


Conclusions

• MVAPICH2-GDR Library provides optimized MPI communication on InfiniBand and RoCE clusters with GPUs

• Supports both X86 and OpenPower with NVLink

• Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family

• Allows flexible solutions for streaming applications with GPUs

• Provides optimized solutions (scale-up and scale-out) for High-Performance Deep Learning


Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:

– Help and guidance with installation of the library

– Platform-specific optimizations and tuning

– Timely support for operational issues encountered with the library

– Web portal interface to submit issues and tracking their progress

– Advanced debugging techniques

– Application-specific optimizations and tuning

– Obtaining guidelines on best practices

– Periodic information on major fixes and updates

– Information on major releases

– Help with upgrading to the latest release

– Flexible Service Level Agreements

• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years


Multiple Events at SC '19

• Presentations at OSU and X-Scale Booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams

– External speakers

• Presentations at SC main program (Tutorials, Workshops, BoFs, Posters, and Doctoral Showcase)

• Presentation at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events

• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/


Funding Acknowledgments

Funding support by: [logos of funding agencies]
Equipment support by: [logos of equipment partners]


Personnel Acknowledgments

Current Students (Graduate)

– A. Awan (Ph.D.)

– M. Bayatpour (Ph.D.)

– C.-H. Chu (Ph.D.)

– J. Hashmi (Ph.D.)

– A. Jain (Ph.D.)

– K. S. Kandadi (M.S.)

Past Students

– A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– R. Biswas (M.S.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– S. Chakraborthy (Ph.D.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– R. Rajachandrasekar (Ph.D.)

– D. Shankar (Ph.D.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

– J. Zhang (Ph.D.)

Past Research Scientist

– K. Hamidouche

– S. Sur

– X. Lu

Past Post-Docs

– D. Banerjee

– X. Besseron

– H.-W. Jin

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– Kamal Raj (M.S.)

– K. S. Khorassani (Ph.D.)

– P. Kousha (Ph.D.)

– A. Quentin (Ph.D.)

– B. Ramesh (M. S.)

– S. Xu (M.S.)

– J. Lin

– M. Luo

– E. Mancini

Past Programmers

– D. Bureddy

– J. Perkins

Current Research Specialist

– J. Smith

– S. Marcarelli

– J. Vienne

– H. Wang

Current Post-doc

– M. S. Ghazimeersaeed

– A. Ruhela

– K. Manian

Current Students (Undergraduate)

– V. Gangal (B.S.)

– N. Sarkauskas (B.S.)

Past Research Specialist

– M. Arnold

Current Research Scientist

– H. Subramoni

– Q. Zhou (Ph.D.)


Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

panda@cse.ohio-state.edu

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Follow us on

https://twitter.com/mvapich
