Accelerating High Performance Computing with GPUDirect RDMA
Wednesday, August 7, 2013 - 10AM-11AM PST

TRANSCRIPT

Page 1: Accelerating High Performance Computing with GPUDirect RDMA

Wednesday, August 7, 2013 - 10AM-11AM PST

Page 2: Leading Supplier of End-to-End Interconnect Solutions

[Slide diagram: Mellanox end-to-end portfolio built around Virtual Protocol Interconnect, spanning storage front/back-end and server/compute switch/gateway; 56G InfiniBand and FCoIB, 10/40/56GbE and FCoE; host/fabric software, ICs, switches/gateways, adapter cards, and cables. "Comprehensive End-to-End InfiniBand and Ethernet Portfolio."]

Page 3: I/O Offload Frees Up CPU for Application Processing

[Slide chart: CPU utilization split into user space and system space. Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle. With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle.]

Page 4: Mellanox Interconnect Development Timeline (Technology and Solutions Leadership)

[Slide timeline, 2008-2013: QDR InfiniBand end-to-end; GPUDirect technology released; MPI/SHMEM collectives offloads (FCA), Scalable HPC (MXM), OpenSHMEM, PGAS/UPC; long-haul solutions; FDR InfiniBand end-to-end; GPUDirect RDMA; CORE-Direct technology; InfiniBand-Ethernet bridging; world's first Petaflop systems; Connect-IB 100Gb/s HCA with Dynamically Connected Transport.]

Page 5: GPUDirect History

The GPUDirect project was announced in November 2009
• "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks"

GPUDirect was developed jointly by Mellanox and NVIDIA
• New interface (API) within the Tesla GPU driver
• New interface within the Mellanox InfiniBand drivers
• Linux kernel modification to allow direct communication between the drivers

GPUDirect 1.0 was announced in Q2'10
• "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"
• "Mellanox was the lead partner in the development of NVIDIA GPUDirect"

The GPUDirect RDMA alpha release is available today
• Mellanox has over two dozen developers using it and providing feedback
• "Proof of concept" designs are in flight today with commercial end customers and government entities
• Ohio State University has a version of MVAPICH2 for GPUDirect RDMA available to MPI application developers

GPUDirect RDMA is targeted for a GA release in Q4'13

Page 6: GPU-InfiniBand Bottleneck (pre-GPUDirect)

GPU communication uses "pinned" buffers for data movement
• A section of host memory dedicated to the GPU
• Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance

InfiniBand uses "pinned" buffers for efficient RDMA transactions
• Zero-copy data transfers, kernel bypass
• Reduces CPU overhead

[Slide diagram: CPU, chipset, GPU and GPU memory, InfiniBand adapter, and system memory; before GPUDirect, data is staged through two separate pinned buffers in system memory (steps 1 and 2) on its way between GPU memory and the InfiniBand adapter.]
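To make the "pinned buffer" idea concrete, here is a minimal sketch (buffer size, stream usage, and variable names are illustrative, not from the slide) showing how a page-locked host buffer is allocated with the CUDA runtime so that GPU-host copies can run asynchronously and overlap with other work:

    /* Pinned (page-locked) host staging buffer for asynchronous GPU<->host copies.
     * Illustrative sketch only. */
    #include <cuda_runtime.h>

    #define NBYTES (4 << 20)   /* 4 MB staging buffer (arbitrary example size) */

    int main(void) {
        float *pinned_host, *dev;
        cudaStream_t stream;

        cudaMallocHost((void **)&pinned_host, NBYTES);  /* pinned: DMA-able, enables async copies */
        cudaMalloc((void **)&dev, NBYTES);
        cudaStreamCreate(&stream);

        /* The async copy returns immediately; the copy engine drains the transfer
         * while the host (or other CUDA streams) keeps working. */
        cudaMemcpyAsync(pinned_host, dev, NBYTES, cudaMemcpyDeviceToHost, stream);

        /* ... overlap independent host or GPU work here ... */

        cudaStreamSynchronize(stream);   /* wait for the transfer to complete */

        cudaStreamDestroy(stream);
        cudaFree(dev);
        cudaFreeHost(pinned_host);
        return 0;
    }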

Page 7: GPUDirect 1.0

[Slide diagrams: transmit and receive paths without GPUDirect vs. with GPUDirect 1.0. Without GPUDirect, data moves through two pinned buffers in system memory (steps 1 and 2); with GPUDirect 1.0, the GPU and the InfiniBand adapter share a single pinned buffer in system memory (step 1), eliminating the extra host copy.]

Page 8: GPUDirect 1.0 - Application Performance

• LAMMPS: 3 nodes, 10% gain
• Amber (Cellulose): 8 nodes, 32% gain
• Amber (FactorIX): 8 nodes, 27% gain

[Slide charts: results shown for 3 nodes with 1 GPU per node and 3 nodes with 3 GPUs per node.]

Page 9: GPUDirect RDMA

[Slide diagrams: transmit and receive paths with GPUDirect 1.0 vs. GPUDirect RDMA. With GPUDirect 1.0, data still passes through a pinned buffer in system memory; with GPUDirect RDMA, the InfiniBand adapter reads and writes GPU memory directly over PCIe, bypassing system memory entirely.]

Page 10: Hardware Considerations for GPUDirect RDMA

[Slide diagram: two configurations, with the GPU and the InfiniBand adapter attached under the same CPU's chipset or under a common PCIe switch.]

Note: a requirement for GPUDirect RDMA to work properly is that the NVIDIA GPU and the Mellanox InfiniBand adapter share the same PCIe root complex.

Page 11: How to Get Started Evaluating GPUDirect RDMA

How do I get started with the GPUDirect RDMA alpha code release?
The only way to get access to the alpha release is by sending an email to [email protected]. You will receive a response within 24 hours and will be able to download the code via an FTP site. If you would like to evaluate MVAPICH2-1.9-GDR, please state this in the email request, and you will receive a separate email explaining how to download it.

How can I ensure I get the latest updates and information?
The Mellanox Community site (http://community.mellanox.com) is a great place to get the very latest information on GPUDirect RDMA. You will also be able to connect with your peers, ask questions, exchange ideas, find additional resources, and share best practices.

Page 12: A Community for Mellanox Technology Enthusiasts

Page 13: Thank You

Page 14: MVAPICH2-GDR: MVAPICH2 with GPUDirect RDMA

Webinar on Accelerating High Performance Computing with GPUDirect RDMA
by Prof. Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
MVAPICH Project: https://mvapich.cse.ohio-state.edu/

Page 15: Large-Scale InfiniBand Cluster Installations

• 205 InfiniBand clusters (41%) in the June 2013 Top500 list (http://www.top500.org)
• Installations in the Top 40 (18 systems):
  - 462,462 cores (Stampede) at TACC (6th)
  - 147,456 cores (SuperMUC) in Germany (7th)
  - 110,400 cores (Pangea) at France/Total (11th)
  - 73,584 cores (Spirit) at USA/Air Force (14th)
  - 77,184 cores (Curie thin nodes) at France/CEA (15th)
  - 120,640 cores (Nebulae) at China/NSCS (16th)
  - 72,288 cores (Yellowstone) at NCAR (17th)
  - 125,980 cores (Pleiades) at NASA/Ames (19th)
  - 70,560 cores (Helios) at Japan/IFERC (20th)
  - 73,278 cores (Tsubame 2.0) at Japan/GSIC (21st)
  - 138,368 cores (Tera-100) at France/CEA (25th)
  - 53,504 cores (PRIMERGY) at Australia/NCI (27th)
  - 77,520 cores (Conte) at Purdue University (28th)
  - 48,896 cores (MareNostrum) at Spain/BSC (29th)
  - 78,660 cores (Lomonosov) in Russia (31st)
  - 137,200 cores (Sunway Blue Light) in China (33rd)
  - 46,208 cores (Zin) at LLNL (34th)
  - 38,016 cores at India/IITM (36th)
• More are getting installed!

Page 16: MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2012
  - Used by more than 2,055 organizations (HPC centers, industry, and universities) in 70 countries
  - More than 180,000 downloads directly from the OSU site
  - Empowering many Top500 clusters:
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 75th-ranked 16,896-core cluster (Keeneland) at GaTech
    • and many others
  - Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu

Page 17: MVAPICH2 1.9 and MVAPICH2-X 1.9

• Released on 05/06/13
• Major features and enhancements
  - Based on MPICH-3.0.3
    • Support for all MPI-3 features: non-blocking collectives, neighborhood collectives, etc.
  - Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  - Checkpoint/restart using LLNL's Scalable Checkpoint/Restart library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  - Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  - Improved job startup time
    • New runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  - Revamped build system with support for parallel builds
  - Many GPU-related enhancements
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  - Based on MVAPICH2 1.9, including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d

Page 18: Designing a GPU-Aware MPI Library

• OSU started this research and development direction in 2011
• Initial support was provided in MVAPICH2 1.8a (SC '11)
• Since then, many enhancements and new designs related to GPU communication have been incorporated in the 1.8 and 1.9 series
• The OSU Micro-Benchmark suite (OMB) has also been extended to test and evaluate:
  - GPU-aware MPI communication
  - OpenACC

Page 19: What is a GPU-Aware MPI Library?

Page 20: MPI + CUDA - Naive

[Slide diagram: GPU and NIC attached to the CPU over PCIe; the NIC connects to the switch.]

• Data movement in applications with standard MPI and CUDA interfaces:

At sender:
    cudaMemcpy(s_hostbuf, s_devbuf, ...);
    MPI_Send(s_hostbuf, size, ...);

At receiver:
    MPI_Recv(r_hostbuf, size, ...);
    cudaMemcpy(r_devbuf, r_hostbuf, ...);

High productivity and low performance
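Fleshed out, the naive pattern looks like the sketch below. This is illustrative only: the message size, rank roles, and buffer names are my own choices, and error checking is omitted.

    /* Naive GPU-to-GPU transfer: stage through host memory with blocking calls.
     * Assumes two MPI ranks, each with a CUDA device. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N (1 << 20)   /* floats per message (illustrative) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *devbuf, *hostbuf = (float *)malloc(N * sizeof(float));
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0) {
            /* Sender: copy GPU -> host, then send the host buffer. */
            cudaMemcpy(hostbuf, devbuf, N * sizeof(float), cudaMemcpyDeviceToHost);
            MPI_Send(hostbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver: receive into the host buffer, then copy host -> GPU. */
            MPI_Recv(hostbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(devbuf, hostbuf, N * sizeof(float), cudaMemcpyHostToDevice);
        }

        cudaFree(devbuf);
        free(hostbuf);
        MPI_Finalize();
        return 0;
    }

The GPU copy and the network transfer run strictly one after the other, which is exactly the serialization the pipelined version on the next slide attacks.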

Page 21: MPI + CUDA - Advanced

[Slide diagram: GPU and NIC attached to the CPU over PCIe; the NIC connects to the switch.]

• Pipelining at the user level with non-blocking MPI and CUDA interfaces:

At sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, ...);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(...);
            if (j > 0) MPI_Test(...);
        }
        MPI_Isend(s_hostbuf + j * blksz, blksz, ...);
    }
    MPI_Waitall(...);

(Similar at the receiver.)

Low productivity and high performance
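A compilable sketch of the sender side of this pipeline follows, under assumptions the slide does not spell out: one CUDA stream per chunk, a pinned host staging buffer, and illustrative names (pipelined_send, PIPELINE_LEN, BLKSZ, peer).

    /* Pipelined GPU send: overlap device-to-host copies with MPI_Isend.
     * s_hostbuf must be pinned (cudaMallocHost) for the copies to be truly asynchronous. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define PIPELINE_LEN 4
    #define BLKSZ (256 * 1024)   /* floats per chunk (illustrative) */

    void pipelined_send(const float *s_devbuf, float *s_hostbuf, int peer) {
        cudaStream_t stream[PIPELINE_LEN];
        MPI_Request  req[PIPELINE_LEN];

        /* Stage 1: launch all asynchronous device-to-host chunk copies. */
        for (int j = 0; j < PIPELINE_LEN; j++) {
            cudaStreamCreate(&stream[j]);
            cudaMemcpyAsync(s_hostbuf + (size_t)j * BLKSZ,
                            s_devbuf  + (size_t)j * BLKSZ,
                            BLKSZ * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[j]);
        }

        /* Stage 2: as each chunk lands in host memory, send it; poll the
         * previous send while waiting so MPI can make progress. */
        for (int j = 0; j < PIPELINE_LEN; j++) {
            while (cudaStreamQuery(stream[j]) != cudaSuccess) {
                if (j > 0) {
                    int flag;
                    MPI_Test(&req[j - 1], &flag, MPI_STATUS_IGNORE);
                }
            }
            MPI_Isend(s_hostbuf + (size_t)j * BLKSZ, BLKSZ, MPI_FLOAT,
                      peer, j, MPI_COMM_WORLD, &req[j]);
        }

        MPI_Waitall(PIPELINE_LEN, req, MPI_STATUSES_IGNORE);
        for (int j = 0; j < PIPELINE_LEN; j++)
            cudaStreamDestroy(stream[j]);
    }

Getting the chunk size, the pipeline depth, and the progress polling right on every platform is exactly the burden that motivates pushing the pipeline into the MPI library, as the next slide shows.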

Page 22: GPU-Aware MPI Library: MVAPICH2-GPU

At sender:   MPI_Send(s_devbuf, size, ...);
At receiver: MPI_Recv(r_devbuf, size, ...);

(Pipelining is handled inside MVAPICH2.)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers

High performance and high productivity
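Written against a CUDA-aware MPI such as MVAPICH2-GPU, the whole exchange collapses to passing device pointers straight to MPI. The sketch below assumes a CUDA-aware build (for MVAPICH2, run with MV2_USE_CUDA=1); sizes and names are illustrative.

    /* CUDA-aware MPI: device pointers are passed directly to MPI calls;
     * the library does the staging and pipelining internally. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *devbuf;   /* plays the role of s_devbuf on rank 0, r_devbuf on rank 1 */
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0)
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }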

Page 23: MPI Micro-Benchmark Performance

• 45% improvement over a naive user-level implementation (Memcpy+Send) for 4MB messages
• 24% improvement over an advanced user-level implementation (MemcpyAsync+Isend) for 4MB messages

[Slide chart: time (us) vs. message size, 32KB to 4MB, for Memcpy+Send, MemcpyAsync+Isend, and MVAPICH2-GPU; lower is better.]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters," ISC '11

Page 24: MVAPICH2 1.9 Features for NVIDIA GPU Clusters

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPUs (GPU-GPU, GPU-Host, and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication between multiple GPUs per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

Page 25: GPUDirect RDMA with CUDA 5.0

• Fastest possible communication between the GPU and other PCIe devices
• The network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Allows for better asynchronous communication
• OFED with GPUDirect RDMA support is being developed by NVIDIA and Mellanox

[Slide diagram: InfiniBand adapter, GPU with GPU memory, CPU, chipset, and system memory; the adapter accesses GPU memory directly.]

Page 26: Initial Design of OSU-MVAPICH2 with GPUDirect RDMA

• Peer-to-peer (P2P) bottlenecks on Sandy Bridge (SNB E5-2670): P2P write ~5.2 GB/s, P2P read < 1.0 GB/s
• MVAPICH2 uses a hybrid design:
  - Takes advantage of GPUDirect RDMA for writes to the GPU
  - Uses the host-based buffered design in current MVAPICH2 for reads
  - Works around the bottlenecks transparently

[Slide diagram: IB adapter, system memory, GPU memory, GPU, CPU, and chipset on a Sandy Bridge node.]

Page 27: Performance of MVAPICH2 with GPUDirect RDMA (GPU-GPU Internode MPI Latency)

[Slide charts: latency (us) vs. message size for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; lower is better. Small-message latency (1B-4KB) drops from 19.1 us to 6.2 us, a 67.5% improvement; large-message latency (16KB-4MB) is also shown.]

Based on MVAPICH2-1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with the GPUDirect RDMA patch.

Page 28: Performance of MVAPICH2 with GPUDirect RDMA (GPU-GPU Internode MPI Uni-Directional Bandwidth)

[Slide charts: bandwidth (MB/s) vs. message size for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; higher is better. Small-message bandwidth (1B-4KB) improves by about 2.8x; large-message bandwidth (8KB-2MB) improves by about 33%.]

Based on MVAPICH2-1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with the GPUDirect RDMA patch.

Page 29: Performance of MVAPICH2 with GPUDirect RDMA (GPU-GPU Internode MPI Bi-Directional Bandwidth)

[Slide charts: bi-directional bandwidth (MB/s) vs. message size for MVAPICH2-1.9 and MVAPICH2-1.9-GDR; higher is better. Small-message bi-bandwidth (1B-4KB) improves by about 3x; large-message bi-bandwidth (8KB-2MB) improves by about 54%.]

Based on MVAPICH2-1.9. Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K20c GPU, Mellanox ConnectX-3 FDR HCA, CUDA 5.5, OFED 1.5.4.1 with the GPUDirect RDMA patch.

Page 30: How Will It Help Me?

• MPI applications can be made GPU-aware to communicate directly from/to GPU buffers, as supported by MVAPICH2 1.9, and extract performance benefits
• GPU-aware MPI applications using short and medium messages can extract additional performance and scalability benefits with MVAPICH2-GPUDirect RDMA (MVAPICH2-GDR)

Page 31: How Can I Get Started with GDR Experimentation?

• Two modules are needed:
  - GPUDirect RDMA (GDR) driver from Mellanox
  - MVAPICH2-GDR from OSU
• Send a note to [email protected]
• You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 1.9 release)
• You can get started with this version
• The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
• As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

Page 32: Will It Be Too Hard to Use GDR?

• No
• First install the OFED-GDR driver from Mellanox
• Then install MVAPICH2-GDR
• Current GPU-aware features in MVAPICH2 are triggered with a runtime parameter: MV2_USE_CUDA=1
• To activate GDR functionality, you just need one more runtime parameter: MV2_USE_GPUDIRECT=1
• A short demo follows to illustrate how easy MVAPICH2-GDR is to use; see the launch-line sketch after this list
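For illustration only, a launch line in the style MVAPICH2's mpirun_rsh uses for passing runtime parameters; the host names and the benchmark binary are placeholders, and the authoritative invocation is the one in the MVAPICH2-GDR user guide:

    mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 ./osu_latency D D

Here MV2_USE_CUDA=1 enables the GPU-aware path, MV2_USE_GPUDIRECT=1 additionally enables the GPUDirect RDMA path, and the two D arguments ask the OSU latency benchmark to place both the send and receive buffers in GPU device memory.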

Page 34: Accelerating High Performance Computing with GPUDirect™ RDMA

NVIDIA webinar, 8/7/2013

Page 35: Outline

• GPUDirect technology family
• Current NVIDIA software and hardware requirements
• Current MPI status
• Using GPUDirect with IB Verbs extensions
• Using GPUDirect RDMA and MPI
• CUDA 6: moving GPUDirect from alpha to beta to GA
• Team Q & A

Page 36: GPUDirect is a Family of Technologies

• GPUDirect Shared GPU-Sysmem, for inter-node copy optimization
  - How: use GPUDirect-aware 3rd-party network drivers
• GPUDirect P2P, for intra-node, accelerated GPU-GPU memcpy
  - How: use CUDA APIs directly in the application
  - How: use a P2P-aware MPI implementation
• GPUDirect P2P, for intra-node, inter-GPU LD/ST access
  - How: access remote data by address directly in GPU device code
• GPUDirect RDMA, for inter-node copy optimization
  - What: 3rd-party PCIe devices can read and write GPU memory
  - How: use GPUDirect RDMA-aware 3rd-party network drivers* and MPI implementations*, or custom device drivers for other hardware

* forthcoming

Page 37: NVIDIA Software and Hardware Requirements

• What drivers and CUDA versions are required to support GPUDirect?
  - Alpha patches work with CUDA 5.0 or CUDA 5.5
  - The final release is based on CUDA 6.0 (beta in October)
  - New driver required, probably version 331
  - Register at developer.nvidia.com for early access
• NVIDIA hardware requirements
  - GPUDirect RDMA is available on Tesla and Quadro Kepler-class hardware

Page 38: GPU-Aware MPI Libraries - Current Status

All libraries allow:
• The GPU and the network device to share the same sysmem buffers
• Use of the best transfer mode (such as CUDA IPC direct transfer between GPUs within a node)
• Send and receive of GPU buffers, and most collectives

Versions:
• MVAPICH2 1.9
• Open MPI 1.7.2
• IBM Platform MPI V9.1

Reference: NVIDIA GPUDirect Technology Overview

Page 39: IB Verbs Extensions for GPUDirect RDMA

• Developers may program at the IB verbs level or with MPI
• The current version with RDMA support (available via Mellanox) gives application developers early access to an RDMA path
• IB verbs was extended to provide:
  - Extended memory registration APIs to support GPU buffers
  - A GPU memory de-allocation callback (for efficient MPI implementations)

Page 40: IB Verbs with GPUDirect RDMA

Use the existing memory registration APIs:

    struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
                              size_t length, int access);

• pd: protection domain
• access flags:
    IBV_ACCESS_LOCAL_WRITE   = 1,
    IBV_ACCESS_REMOTE_WRITE  = (1 << 1),
    IBV_ACCESS_REMOTE_READ   = (1 << 2),
    IBV_ACCESS_REMOTE_ATOMIC = (1 << 3),
    IBV_ACCESS_MW_BIND       = (1 << 4)

    int ibv_dereg_mr(struct ibv_mr *mr);

Page 41: Example

    cudaMalloc(&d_buf, size);
    mr = ibv_reg_mr(pd, d_buf, size, ...);
    /* ... RDMA on the buffer here ... */
    ibv_dereg_mr(mr);
    cudaFree(d_buf);
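A slightly fuller, compilable sketch of the same flow follows. It assumes a GPUDirect-RDMA-capable verbs stack and a single HCA; the device selection, buffer size, access flags, and error handling are illustrative.

    /* Register a CUDA device buffer with IB verbs so the HCA can RDMA it directly. */
    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t size = 1 << 20;   /* 1 MB (illustrative) */
        void *d_buf = NULL;

        /* Open the first available HCA and create a protection domain. */
        int num = 0;
        struct ibv_device **dev_list = ibv_get_device_list(&num);
        if (!dev_list || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }
        struct ibv_context *ctx = ibv_open_device(dev_list[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Allocate GPU memory and register it as an RDMA-able memory region. */
        cudaMalloc(&d_buf, size);
        struct ibv_mr *mr = ibv_reg_mr(pd, d_buf, size,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) { fprintf(stderr, "ibv_reg_mr on GPU memory failed\n"); return 1; }

        /* ... post RDMA work requests that reference mr->lkey / mr->rkey ... */

        ibv_dereg_mr(mr);
        cudaFree(d_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(dev_list);
        return 0;
    }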

Page 42: GPUDirect RDMA

[Slide diagram: GPU with its GPU memory on PCIe, CPU/IOH with CPU memory, and a third-party PCIe device reading/writing GPU memory directly. Today: Mellanox. Tomorrow: other third-party hardware?]

Page 43: GPUDirect RDMA: Common Use Cases

• Inter-node MPI communication
  - Transfer data between local GPU memory and a remote node
• Interfacing with third-party hardware
  - Requires adopting the NVIDIA GPUDirect-Interop API in the vendor's software stack

Page 44: GPUDirect RDMA: What Does It Get You?

• MPI_Send latency of ~20 us with Shared GPU-Sysmem
  - No overlap possible
  - Bidirectional transfer is difficult
• MPI_Send latency of ~6 us with RDMA
  - Does not affect running kernels
  - Unlimited concurrency
  - RDMA possible!

Page 45: So What Happens at CUDA 6?

• No change to MPI programs
• Interfaces are simplified, reducing the work for MPI implementors
• Programmers working at the verbs level also benefit
• Requires upgrading to CUDA 6 and the then-current NVIDIA driver
• Register to receive updates on:
  - The release of CUDA 6 (RC1, then final)
  - The release of the MVAPICH2 beta, then final
  - Progress from other MPI vendors

Page 46: Contacts and Resources

• NVIDIA
  - Register at developer.nvidia.com for early access to CUDA 6
  - Developer Zone GPUDirect page
  - Developer Zone RDMA page
• Mellanox
  - Register at community.mellanox.com
• Ohio State
  - http://mvapich.cse.ohio-state.edu
• Emails
  - [email protected]
  - [email protected]
  - [email protected]

Page 47: The End

Question Time

Page 48: Upcoming GTC Express Webinars

August 13 - GPUs in the Film Visual Effects Pipeline

August 14 - Beyond Real-time Video Surveillance Analytics with GPUs

August 15 - CUDA 5.5 Production Release: Features Overview

September 5 - Data Discovery through High-Data-Density Visual Analysis using NVIDIA GRID GPUs

September 12 - Guided Performance Analysis with NVIDIA Visual Profiler

Register at www.gputechconf.com/gtcexpress

Page 49: GTC 2014 Call for Submissions

Looking for submissions in the fields of:
• Science and research
• Professional graphics
• Mobile computing
• Automotive applications
• Game development
• Cloud computing

Submit by September 27 at www.gputechconf.com