cliff woolley, sr. manager, developer technology software ...scatter/gather distinct messages from...

56
Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS

Upload: others

Post on 21-Aug-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

Cliff Woolley, Sr. Manager, Developer Technology Software, NVIDIA

NCCL: ACCELERATED MULTI-GPU COLLECTIVE COMMUNICATIONS

Page 2: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

2

BACKGROUND

Efficiency of parallel computation tasks

• Amount of exposed parallelism

• Amount of work assigned to each processor

Expense of communications among tasks

• Amount of communication

• Degree of overlap of communication with computation

What limits the scalability of parallel applications?

Page 3: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

3

COMMON COMMUNICATION PATTERNS

Page 4: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

4

COMMUNICATION AMONG TASKS

Point-to-point communication

• Single sender, single receiver

• Relatively easy to implement efficiently

Collective communication

• Multiple senders and/or receivers

• Patterns include broadcast, scatter, gather, reduce, all-to-all, …

• Difficult to implement efficiently

What are common communication patterns?

Page 5: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

5

POINT-TO-POINT COMMUNICATION

Most common pattern in HPC, where communication is usually to nearest neighbors

Single-sender, single-receiver per instance

2D Decomposition

Boundary

exchanges

Page 6: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

7

COLLECTIVE COMMUNICATIONMultiple senders and/or receivers

Page 7: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

8

BROADCASTOne sender, multiple receivers

GPU0 GPU1 GPU2 GPU3

A

GPU0 GPU1 GPU2 GPU3

A A A A

broadcast

Page 8: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

9

SCATTEROne sender; data is distributed among multiple receivers

scatter

GPU0 GPU1 GPU2 GPU3

A B C D

GPU0 GPU1 GPU2 GPU3

A

B

C

D

Page 9: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

10

GATHERMultiple senders, one receiver

GPU0 GPU1 GPU2 GPU3

A

B

C

D

GPU0 GPU1 GPU2 GPU3

A B C D

gather

Page 10: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

11

ALL-GATHERGather messages from all; deliver gathered data to all participants

GPU0 GPU1 GPU2 GPU3

A B C D

GPU0 GPU1 GPU2 GPU3

A A A A

B B B B

C C C C

D D D D

all-gather

Page 11: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

12

REDUCECombine data from all senders; deliver the result to one receiver

GPU0 GPU1 GPU2 GPU3

A B C D

reduce

GPU0 GPU1 GPU2 GPU3

A

+

B

+

C

+

D

Page 12: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

13

ALL-REDUCECombine data from all senders; deliver the result to all participants

GPU0 GPU1 GPU2 GPU3

A B C D

all-reduce

GPU0 GPU1 GPU2 GPU3

A

+

B

+

C

+

D

A

+

B

+

C

+

D

A

+

B

+

C

+

D

A

+

B

+

C

+

D

Page 13: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

14

REDUCE-SCATTERCombine data from all senders; distribute result across participants

reduce-scatter

GPU0 GPU1 GPU2 GPU3

A0+B0+

C0+D0

A1+B1+

C1+D1

A2+B2+

C2+D2

A3+B3+

C3+D3

GPU0 GPU1 GPU2 GPU3

A0 B0 C0 D0

A1 B1 C1 D1

A2 B2 C2 D2

A3 B3 C3 D3

Page 14: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

15

ALL-TO-ALLScatter/Gather distinct messages from each participant to every other

GPU0 GPU1 GPU2 GPU3

A0 B0 C0 D0

A1 B1 C1 D1

A2 B2 C2 D2

A3 B3 C3 D3

all-to-all

GPU0 GPU1 GPU2 GPU3

A0 A1 A2 A3

B0 B1 B2 B3

C0 C1 C2 C3

D0 D1 D2 D3

Page 15: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

16

THE CHALLENGE OF COLLECTIVES

Page 16: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

17

THE CHALLENGE OF COLLECTIVES

Having multiple senders and/or receivers compounds communication inefficiencies

• For small transfers, latencies dominate; more participants increase latency

• For large transfers, bandwidth is key; bottlenecks are easily exposed

• May require topology-aware implementation for high performance

• Collectives are often blocking/non-overlapped

Collectives are often avoided because they are expensive. Why?

Page 17: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

18

THE CHALLENGE OF COLLECTIVES

Collectives are central to scalability in a variety of key applications:

• Deep Learning (All-reduce, broadcast, gather)

• Parallel FFT (Transposition is all-to-all)

• Molecular Dynamics (All-reduce)

• Graph Analytics (All-to-all)

• …

If collectives are so expensive, do they actually get used? YES!

Page 18: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

19

THE CHALLENGE OF COLLECTIVES

Scaling requires efficient communication algorithms and careful implementation

Communication algorithms are topology-dependent

Topologies can be complex – not every system is a fat tree

Most collectives amenable to bandwidth-optimal implementation on rings, andmany topologies can be interpreted as one or more rings [P. Patarasuk and X. Yuan]

Many implementations seen in the wild are suboptimal

Page 19: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

20

RING-BASED COLLECTIVES: A PRIMER

Page 20: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

21

BROADCASTwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 21: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

22

BROADCAST

Step 1: ∆𝑡 = 𝑁/𝐵

𝑁: bytes to broadcast

𝐵: bandwidth of each link

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 22: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

23

BROADCAST

Step 1: ∆𝑡 = 𝑁/𝐵

Step 2: ∆𝑡 = 𝑁/𝐵

𝑁: bytes to broadcast

𝐵: bandwidth of each link

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 23: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

24

BROADCAST

Step 1: ∆𝑡 = 𝑁/𝐵

Step 2: ∆𝑡 = 𝑁/𝐵

Step 3: ∆𝑡 = 𝑁/𝐵

𝑁: bytes to broadcast

𝐵: bandwidth of each link

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 24: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

25

BROADCAST

Step 1: ∆𝑡 = 𝑁/𝐵

Step 2: ∆𝑡 = 𝑁/𝐵

Step 3: ∆𝑡 = 𝑁/𝐵

Total time: 𝑘 − 1 𝑁/𝐵

𝑁: bytes to broadcast

𝐵: bandwidth of each link

𝑘: number of GPUs

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 25: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

26

BROADCASTwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 26: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

27

BROADCAST

Split data into 𝑆 messages

Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 27: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

28

BROADCAST

Split data into 𝑆 messages

Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 28: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

29

BROADCAST

Split data into 𝑆 messages

Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 29: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

30

BROADCAST

Split data into 𝑆 messages

Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 30: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

31

BROADCAST

Split data into 𝑆 messages

Step 1: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 2: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 3: ∆𝑡 = 𝑁/(𝑆𝐵)

Step 4: ∆𝑡 = 𝑁/(𝑆𝐵)

...

Total time: S𝑁/(𝑆𝐵) + (𝑘 − 2) 𝑁 (𝑆𝐵)= 𝑁(𝑆 + 𝑘 − 2)/(𝑆𝐵) → 𝑁/𝐵

with unidirectional ring

GPU0 GPU1 GPU2 GPU3

Page 31: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

32

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 0

Page 32: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

33

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 1

Page 33: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

34

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 2

Page 34: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

35

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 3

Page 35: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

36

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 4

Page 36: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

37

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 5

Page 37: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

38

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 6

Page 38: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

39

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 1Step: 7

Page 39: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

40

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 2Step: 0

Page 40: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

41

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 2Step: 1

Page 41: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

42

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

Chunk: 2Step: 2

Page 42: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

43

ALL-REDUCEwith unidirectional ring

GPU0 GPU1 GPU2 GPU3

done

Page 43: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

44

RING-BASED COLLECTIVESA primer

GPU0 GPU1

CPU

4-GPU-PCIe

GPU2 GPU3

Switch

PCIe Gen3 x16 ~12 GB/s

Page 44: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

45

RING-BASED COLLECTIVESA primer

GPU0 GPU1

CPU

4-GPU-PCIe

GPU2 GPU3

Switch

PCIe Gen3 x16 ~12 GB/s

Page 45: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

46

RING-BASED COLLECTIVES…apply to lots of possible topologies

GPU0 GPU1

CPU

GPU2 GPU3

Switch

GPU4 GPU5

CPU

GPU6 GPU7

Switch

PCIe Gen3 x16 ~12 GB/s

SMP Connection (e.g., QPI)

Page 46: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

47

RING-BASED COLLECTIVES…apply to lots of possible topologies

GPU2

CPU

4-GPU-FC

GPU1

GPU3

GPU0

Switch

32 GB/s

GPU2

CPU

4-GPU-Ring

GPU1

GPU3

GPU0

Switch

32 GB/s

NVLink~16 GB/s

PCIe Gen3 x16 ~12 GB/s

Page 47: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

48

RING-BASED COLLECTIVES…apply to lots of possible topologies

GPU2

CPU

4-GPU-FC

GPU1

GPU3

GPU0

Switch

GPU2

CPU

4-GPU-Ring

GPU1

GPU3

GPU0

Switch

NVLink~16 GB/s

PCIe Gen3 x16 ~12 GB/s

Page 48: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

49

INTRODUCING NCCL (“NICKEL”):ACCELERATED COLLECTIVES

FOR MULTI-GPU SYSTEMS

Page 49: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

50

INTRODUCING NCCL

GOAL:

• Build a research library of accelerated collectives that is easily integrated and topology-aware so as to improve the scalability of multi-GPU applications

APPROACH:

• Pattern the library after MPI’s collectives

• Handle the intra-node communication in an optimal way

• Provide the necessary functionality for MPI to build on top to handle inter-node

Accelerating multi-GPU collective communications

Page 50: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

51

NCCL FEATURES AND FUTURES

Collectives

• Broadcast

• All-Gather

• Reduce

• All-Reduce

• Reduce-Scatter

• Scatter

• Gather

• All-To-All

• Neighborhood

Key Features

• Single-node, up to 8 GPUs

• Host-side API

• Asynchronous/non-blocking interface

• Multi-thread, multi-process support

• In-place and out-of-place operation

• Integration with MPI

• Topology Detection

• NVLink & PCIe/QPI* support

(Green = Currently available)

Page 51: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

52

NCCL IMPLEMENTATION

Implemented as monolithic CUDA C++ kernels combining the following:

• GPUDirect P2P Direct Access

• Three primitive operations: Copy, Reduce, ReduceAndCopy

• Intra-kernel synchronization between GPUs

• One CUDA thread block per ring-direction

Page 52: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

53

NCCL EXAMPLE

#include <nccl.h>

ncclComm_t comm[4];

ncclCommInitAll(comm, 4, {0, 1, 2, 3});

foreach g in (GPUs) { // or foreach thread

cudaSetDevice(g);

double *d_send, *d_recv;

// allocate d_send, d_recv; fill d_send with data

ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum, comm[g], stream[g]);

// consume d_recv

}

All-reduce

Page 53: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

54

NCCL PERFORMANCEBandwidth at different problem sizes (4 Maxwell GPUs)

All-Gather

All-Reduce

Reduce-Scatter

Broadcast

Page 54: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

55

AVAILABLE NOWgithub.com/NVIDIA/nccl

Page 55: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2

56

THANKS TO MY COLLABORATORS

Nathan Luehr

Jonas Lippuner

Przemek Tredak

Sylvain Jeaugey

Natalia Gimelshein

Simon Layton

This research is funded in part by the U.S. DOE DesignForward program

Page 56: Cliff Woolley, Sr. Manager, Developer Technology Software ...Scatter/Gather distinct messages from each participant to every other GPU0 GPU1 GPU2 GPU3 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2