TRANSCRIPT
Sylvain Jeaugey
NCCL 2.0
DEEP LEARNING ON GPUS
Making DL training times shorter
[Diagram: the progression from a multi-core CPU, to a single GPU (CUDA), to multiple GPUs in one system (NCCL 1), to multiple GPUs across multiple nodes (NCCL 2).]
Deeper neural networks, larger data sets… training is a very, very long operation!
NCCL
A multi-GPU communication library
[Diagram: within a system, NCCL communicates over PCIe, NVLink and GPU Direct P2P; to other systems, over sockets (Ethernet) or InfiniBand with GPU Direct RDMA.]
NCCL Architecture
[Diagram: deep learning frameworks (Caffe, Caffe2, Torch, TF, MXNet, CNTK) sit on top of NCCL, cuBLAS and cuDNN, which in turn run on CUDA and NVIDIA GPUs.]
AGENDA
NCCL: History, Design
NCCL 2.0: New features, API changes
Performance
Future
HISTORY
Q4 2015: NCCL 1.x
Open-source research project on GitHub, helping deep learning frameworks compute on multiple GPUs with efficient collective operations.
Limited to intra-node communication.
Q2 2017: NCCL 2.x and beyond
NVIDIA library, with multi-node support and an improved API.
DESIGN
What is NCCL ?
An optimized collective communication library between CUDA devices.
Easy to integrate into any DL framework, as well as into traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions and NVLink usage.
Operates on CUDA pointers. Operations are tied to a CUDA stream.
Uses as few threads as possible, so that other computation can progress simultaneously.
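To illustrate the stream semantics above, a minimal sketch (sendbuff, recvbuff, count, comm and stream are assumed to be already allocated and initialized):

  // The collective is enqueued on a CUDA stream and runs asynchronously,
  // like any other CUDA kernel.
  ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
  // Other work can proceed concurrently on other streams; synchronize
  // this stream when the reduced result is needed.
  cudaStreamSynchronize(stream);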
DESIGN
Rings
NCCL uses rings to move data across all GPUs and perform reductions.
PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings
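To make the ring algorithm concrete, here is simplified pseudocode (not NCCL's actual implementation; sendToNext, recvFromPrev and add are hypothetical helpers). An allreduce over k GPUs splits each buffer into k chunks and circulates them around the ring twice, once to reduce and once to broadcast:

  // Phase 1, reduce-scatter: after k-1 steps, each GPU holds the
  // fully reduced version of exactly one chunk.
  for (int step = 0; step < k-1; step++) {
    int sendChunk = (rank - step + k) % k;
    int recvChunk = (rank - step - 1 + k) % k;
    sendToNext(chunk[sendChunk]);
    recvFromPrev(tmp);
    add(chunk[recvChunk], tmp);        // chunk[recvChunk] += tmp
  }
  // Phase 2, allgather: circulate the reduced chunks so every GPU
  // ends up with all k fully reduced chunks.
  for (int step = 0; step < k-1; step++) {
    int sendChunk = (rank - step + 1 + k) % k;
    int recvChunk = (rank - step + k) % k;
    sendToNext(chunk[sendChunk]);
    recvFromPrev(chunk[recvChunk]);
  }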
DESIGN
Kernels
[Diagram: on each GPU, a CUDA kernel reads data from the previous GPU in the ring through a FIFO, reduces it with the local sendbuff, stores the result in recvbuff, and forwards it through a FIFO to the next GPU in the ring.]
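Conceptually, the per-GPU kernel in the diagram might look like the following hypothetical sketch (not NCCL source; waitForPrev and signalNext stand in for the flag-based synchronization around the FIFO):

  __global__ void ringStep(const float* sendbuff, float* recvbuff,
                           const float* prevFifo, float* nextFifo,
                           int slot, int n) {
    for (int s = 0; s < n/slot; s++) {
      waitForPrev(s);                                 // hypothetical FIFO sync
      for (int i = threadIdx.x; i < slot; i += blockDim.x) {
        float v = prevFifo[i] + sendbuff[s*slot + i]; // reduce on the fly
        recvbuff[s*slot + i] = v;                     // keep the partial result
        nextFifo[i] = v;                              // forward to the next GPU
      }
      signalNext(s);                                  // hypothetical FIFO sync
    }
  }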
NCCL 2.0
NCCL 2.0
Inter-node communication
Inter-node communication using sockets or InfiniBand verbs, with multi-rail support, topology detection and automatic use of GPU Direct RDMA.
Optimal combination of NVLink, PCIe and network interfaces to maximize bandwidth and create rings across nodes.
[Diagram: rings spanning nodes over PCIe and InfiniBand; on DGX-1 systems, over NVLink and 4x InfiniBand.]
NCCL 2.0
Processes, threads and GPUs
Supports any combination of processes (potentially across nodes), threads per process and GPUs per thread.
[Diagram: n nodes (Node 0 to Node n-1), each with 2 CPU sockets (CPU0, CPU1) and 4 GPUs per socket, i.e. 8 GPUs (GPU 0 to GPU 7) per node.]
[Diagram: the same machine driven with 1 process per GPU: processes P0 through P8n-1, one per GPU.]
[Diagram: the same machine driven with 1 process per socket and 1 thread per GPU: processes 0 through 2n-1, each running threads t0 to t3.]
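A minimal sketch of this one-thread-per-GPU launch mode (assuming pthreads; ThreadArg and the surrounding setup are hypothetical). Each thread sets its device and joins the communicator under its own rank; ncclCommInitRank blocks until all ranks have joined:

  struct ThreadArg { int device, rank, nranks; ncclUniqueId id; ncclComm_t comm; };

  void* initOneGpu(void* p) {
    struct ThreadArg* a = (struct ThreadArg*)p;
    cudaSetDevice(a->device);                       // one GPU per thread
    ncclCommInitRank(&a->comm, a->nranks, a->id, a->rank);
    return NULL;
  }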
[Diagram: the same machine driven with 1 process per node (Process 0 to Process n-1), 8 GPUs per process.]
NCCL 2.0 API
Group calls
NCCL 2.0 introduces new, mandatory verbs, ncclGroupStart/ncclGroupEnd, when managing multiple devices from a single thread.

NCCL 1.x:
  for (int i=0; i<ngpus; i++) {
    cudaSetDevice(devices[i]);
    ncclAllReduce(…, comms[i], streams[i]);
  }

NCCL 2.0:
  ncclGroupStart();
  for (int i=0; i<ngpus; i++) {
    ncclAllReduce(…, comms[i], streams[i]);
  }
  ncclGroupEnd();
[Diagram: a single process (Process 0) driving all 8 GPUs of a node.]
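Putting it together, a minimal self-contained sketch of the single-process case (one communicator, stream and buffer per device; in-place allreduce assumed; error checking omitted for brevity):

  #include <nccl.h>
  #include <cuda_runtime.h>
  #include <stdlib.h>

  int main() {
    int ngpus;
    cudaGetDeviceCount(&ngpus);
    ncclComm_t* comms = malloc(ngpus * sizeof(ncclComm_t));
    cudaStream_t* streams = malloc(ngpus * sizeof(cudaStream_t));
    float** buf = malloc(ngpus * sizeof(float*));
    size_t count = 1 << 20;

    ncclCommInitAll(comms, ngpus, NULL);   // one communicator per device

    for (int i = 0; i < ngpus; i++) {
      cudaSetDevice(i);
      cudaStreamCreate(&streams[i]);
      cudaMalloc((void**)&buf[i], count * sizeof(float));
    }

    ncclGroupStart();                      // issue every device's call at once
    for (int i = 0; i < ngpus; i++)
      ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                    comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ngpus; i++) {      // wait for completion
      cudaSetDevice(i);
      cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < ngpus; i++) ncclCommDestroy(comms[i]);
    return 0;
  }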
NCCL 2.0 API
Integration with parallel environments
Inter-node communicator creation still uses the NCCL 1.x verbs ncclGetUniqueId/ncclCommInitRank:

  if (rank == 0) ncclGetUniqueId(&id);
  My_Bcast(&id);   // user-provided broadcast, e.g. MPI_Bcast
  ncclCommInitRank(&comm, nranks, id, rank);

Multi-process + multi-GPU per process (from a single thread): combine ncclCommInitRank with ncclGroupStart/ncclGroupEnd:

  if (rank == 0) ncclGetUniqueId(&id);
  My_Bcast(&id);
  ncclGroupStart();
  for (int i=0; i<ndev; i++) {
    cudaSetDevice(devices[i]);
    ncclCommInitRank(&comms[i], ndev*nranks, id, ndev*rank+i);
  }
  ncclGroupEnd();
[Diagram: left, one process per GPU (P0 to P7); right, a single process (Process 0) driving all 8 GPUs.]
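For the common one-process-per-GPU case, My_Bcast above would typically be MPI_Bcast. A minimal sketch under those assumptions (one MPI rank per GPU, a hypothetical 8 GPUs per node with ranks packed by node, error checking omitted):

  #include <mpi.h>
  #include <nccl.h>
  #include <cuda_runtime.h>

  int main(int argc, char* argv[]) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    cudaSetDevice(rank % 8);   // assumption: 8 GPUs per node, ranks packed by node

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);                      // created once...
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  // ...shared with all ranks

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // ... collectives on this communicator ...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
  }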
NCCL 2.0 API
Others
Other small API adjustments over the NCCL 1.x API:
Counts are now of type size_t instead of int.
The allGather argument order has been fixed to match the other operations.
Additions/clarifications on datatypes:
  integral: int8 (= char), uint8, int32 (= int), uint32, int64, uint64
  floating point: float16 (= half), float32 (= float), float64 (= double)
Clarifications and fixes for allgather and reduce_scatter send/receive counts and in-place operations.
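As an illustration of the allGather change (the 1.x prototype below is reconstructed from memory; treat it as an approximation, and note the count also moves from int to size_t):

  // NCCL 1.x: recvbuff separated from sendbuff, int count
  ncclAllGather(sendbuff, count, ncclFloat, recvbuff, comm, stream);
  // NCCL 2.0: recvbuff next to sendbuff, size_t sendcount, as in other collectives
  ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);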
PERFORMANCE
PERFORMANCE
Intra-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), y-axis 0 to 60, comparing four platforms labeled 4 QPI, 4 CPU, 4 PCI and DGX-1.]
PERFORMANCE
Inter-node performance
[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), y-axis 0 to 45, comparing MPI, Baidu Allreduce and NCCL on 2 nodes x 4 GPUs (IB EDR, PCI switch) and on 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink).]
PERFORMANCE
Deep Learning - CNTK
[Chart: CNTK scaling, ResNet50, images/s on 0 to 32 GPUs, comparing Ideal, MPI and NCCL. Labeled values: 217 images/s at the single-GPU baseline; 1684, 3281 and 6569 images/s with NCCL (near-ideal scaling) versus 1645, 1744 and 3360 images/s with MPI on 8, 16 and 32 GPUs.]
FUTURE
FUTURE
Top asked features
Additional communication primitives:
  point-to-point communication
  scatter (1 to N), gather (N to 1), alltoall (N to N)
  neighbor collectives (send/receive in multiple dimensions)
User-defined reduction operations; also, trying to better merge computation and communication.
Windows support.
Please let us know your needs! Connect with experts / NCCL session: Wed Apr 10, 4pm.