Lecture 11: Distributed Training and Communication Protocols
CSE599W: Spring 2018
Where are we

The system stack, from the user-facing layers down to hardware:
● High level Packages / User API
● Programming API
● System Components: Gradient Calculation (Differentiation API); Computational Graph Optimization and Execution; Runtime Parallel Scheduling; GPU Kernels, Optimizing Device Code
● Architecture: Accelerators and Hardware
Where are we

The same stack, with this lecture's focus added to the scheduling layer:
● Programming API
● Gradient Calculation (Differentiation API)
● Computational Graph Optimization and Execution
● Runtime Parallel Scheduling / Networks (this lecture)
● GPU Kernels, Optimizing Device Code
● Accelerators and Hardware
Recap: Parallel Scheduling Engine

● The tagged data: each array (A.data, B.data) carries a variable tag (v1, v2) that the engine uses to track dependencies
● The execution function: pack references to the related things into a closure, e.g. lambda: B.data=A.data+1
● Push the operation to the engine, declaring which tags it reads and mutates:

    engine.push(exec_function, read=[A.var], mutate=[B.var])
Recap: Example Scheduling

The program

    A = 2
    B = A + 1
    D = A * B

becomes three pushes to the engine:

    engine.push(lambda: A.data=2, read=[], mutate=[A.var])
    engine.push(lambda: B.data=A.data+1, read=[A.var], mutate=[B.var])
    engine.push(lambda: D.data=A.data * B.data, read=[A.var, B.var], mutate=[D.var])
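Below is a minimal runnable sketch of this push interface (hypothetical Var/Data/Engine names; the slide's lambda: A.data=2 is pseudocode, so the sketch uses setattr, and where a real engine would dispatch each function asynchronously as soon as its read/mutate dependencies are satisfied, this one just records the declared tags and runs in push order):

```python
class Var:
    """Tag attached to a piece of data; used to declare dependencies."""
    def __init__(self, name):
        self.name = name

class Data:
    """A value plus its dependency tag."""
    def __init__(self, name):
        self.var = Var(name)
        self.data = None

class Engine:
    def __init__(self):
        self.queue = []

    def push(self, exec_fn, read, mutate):
        # A real engine schedules exec_fn once all `read` vars are ready
        # and no other op is mutating the `mutate` vars; this sketch only
        # records the declaration and executes in push order.
        self.queue.append((exec_fn, read, mutate))

    def run_all(self):
        for exec_fn, _read, _mutate in self.queue:
            exec_fn()

A, B, D = Data("A"), Data("B"), Data("D")
engine = Engine()
engine.push(lambda: setattr(A, "data", 2), read=[], mutate=[A.var])
engine.push(lambda: setattr(B, "data", A.data + 1), read=[A.var], mutate=[B.var])
engine.push(lambda: setattr(D, "data", A.data * B.data), read=[A.var, B.var], mutate=[D.var])
engine.run_all()
assert D.data == 6   # A=2, B=3, D=2*3
```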
Data Parallelism
● Train a replicated version of the model on each machine
● Synchronize the gradients across machines
How to do Synchronization over the Network

[Diagram: the training graph. data and label feed fullc-forward → fullc-forward → softmax-forward → log-loss; the backward pass (softmax-backward, fullc-backward, fullc-backward) produces gradients g1 and g2 for weights w1 and w2. As each gradient becomes ready, it is synchronized over the network and the corresponding weight is updated: sync g1, update w1; sync g2, update w2. These sync steps are the topic of this lecture.]
Distributed Gradient Aggregation, Local Update

[Diagram: the same training graph; the computed gradient g1 is passed through a network protocol that aggregates it across machines.]

Many replicas of the same graph run in parallel; the gradients are summed over the replicas and each replica updates its weights locally:

    G1 = sum(g1 over replicas)
    w1 -= lr * G1
Allreduce: Collective Reduction

Interface:

    result = allreduce(float buffer[size])

Running example (two machines call allreduce with different inputs; both receive the elementwise sum):

    Machine 1:                          Machine 2:
    comm = communicator.create()        comm = communicator.create()
    a = [1, 2, 3]                       a = [1, 0, 1]
    b = comm.allreduce(a, op=sum)       b = comm.allreduce(a, op=sum)
    assert b == [2, 2, 4]               assert b == [2, 2, 4]
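For reference, here is a runnable analogue of the example using mpi4py (an assumption on my part; the slides do not name a library). Launch two processes with mpiexec -n 2 python allreduce_demo.py:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Rank 0 plays Machine 1, rank 1 plays Machine 2.
a = (np.array([1, 2, 3], dtype=np.float32) if comm.rank == 0
     else np.array([1, 0, 1], dtype=np.float32))
b = np.empty_like(a)
comm.Allreduce(a, b, op=MPI.SUM)  # every rank receives the elementwise sum
assert (b == [2, 2, 4]).all()     # holds on both ranks
```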
Use Allreduce for Data Parallel Training
grad = gradient(net, w)
for epoch, data in enumerate(dataset):
    g = net.run(grad, in=data)
    gsum = comm.allreduce(g, op=sum)
    w -= lr * gsum / num_workers
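To make the loop concrete, here is a self-contained simulation (a toy least-squares problem of my own; net.run and comm.allreduce are replaced by a local gradient function and a plain sum over simulated workers):

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, lr = 4, 0.1
w_true = np.array([2.0, -1.0])   # target weights the workers should recover
w = np.zeros(2)

def worker_gradient(w, batch=32):
    # Each simulated worker draws its own minibatch and returns the
    # gradient of the mean squared error on that batch.
    X = rng.normal(size=(batch, 2))
    y = X @ w_true
    return 2 * X.T @ (X @ w - y) / batch

for step in range(200):
    grads = [worker_gradient(w) for _ in range(num_workers)]
    gsum = np.sum(grads, axis=0)      # stands in for allreduce(g, op=sum)
    w -= lr * gsum / num_workers      # every worker applies the same update
assert np.allclose(w, w_true, atol=1e-3)
```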
Common Connection Topologies
● All-to-all: every node plugged into the same switch
● Ring (e.g. NVLink)
● Tree-shape
Discussion: 3min
● How to implement Allreduce over the network?
● What is the impact of the network topology on this?
Tree Shape Reduction

● Logically form a reduction tree between the nodes
● Aggregate values up to the root, then broadcast the result back down

[Animation over four slides: the node values (2, 3, 1, 1, 2) are summed up the tree until the root holds 9; the root then broadcasts 9 so that every node ends with the total.]

Question: What is the time complexity of tree shape reduction?
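One common answer can be read off a small sketch (simulating all nodes in one process; a real implementation exchanges messages): the reduce phase takes O(log n) rounds up the tree, and the broadcast takes another O(log n) rounds back down.

```python
def tree_allreduce(values):
    """Simulate tree-shape allreduce: sum up a binary tree, broadcast back."""
    p = len(values)
    vals = list(values)
    # Reduce phase: in each round, node i absorbs the partial sum held
    # by node i + r; the stride r doubles every round.
    r = 1
    while r < p:
        for i in range(0, p - r, 2 * r):
            vals[i] = vals[i] + vals[i + r]
        r *= 2
    # Broadcast phase: the root's total is sent back to every node.
    return [vals[0]] * p

# Matches the slides' animation: 2 + 3 + 1 + 1 + 2 = 9 at every node.
assert tree_allreduce([2, 3, 1, 1, 2]) == [9, 9, 9, 9, 9]
```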
Ring based Reduction

● Form a logical ring between the nodes
● Streaming aggregation: each node's buffer is split into segments that flow around the ring, with each node adding the segment it receives to its own copy

[Animation over five slides: after the reduction pass, each node has the correctly reduced result of one segment. This is called reduce_scatter.]

Ring based Reduction: Allgather phase

[Animation over four slides: the fully reduced segments circulate around the ring once more, so every node ends with the complete result.]

Question: What is the time complexity of ring based reduction?
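Here is a pure-Python simulation of the two phases (my own sketch; real libraries pipeline the p - 1 steps of each phase and overlap communication with reduction, so the total time stays roughly independent of the number of nodes):

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring allreduce over p equal-length arrays, one per node."""
    p = len(buffers)
    # Each node splits its buffer into p segments.
    chunks = [np.array_split(np.asarray(b, dtype=float), p) for b in buffers]

    # Phase 1: reduce_scatter. In step s, node i sends segment (i - s) % p
    # to its right neighbor, which adds it to its own copy.
    for s in range(p - 1):
        sends = [chunks[i][(i - s) % p].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][(i - s) % p] += sends[i]
    # Now node i holds the fully reduced segment (i + 1) % p.

    # Phase 2: allgather. In step s, node i forwards the already reduced
    # segment (i + 1 - s) % p to its right neighbor.
    for s in range(p - 1):
        sends = [chunks[i][(i + 1 - s) % p].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][(i + 1 - s) % p] = sends[i]

    return [np.concatenate(c) for c in chunks]

# The slides' running example: every node ends with the elementwise sum.
out = ring_allreduce([[1, 2, 3], [1, 0, 1]])
assert all((o == [2, 2, 4]).all() for o in out)
```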
Allreduce Libraries
● MPI offers efficient CPU allreduce
● dmlc/rabit: fault tolerant variant
● facebookincubator/gloo
● Parameter Hub: from UW
● NCCL: NVIDIA's efficient multi-GPU collective
GPUDirect and RDMA

[Figure from NVIDIA]
NCCL: Nvidia’s Efficient Multi-GPU Collective
● Uses unified GPU direct memory access
● Each GPU launches a working kernel; the kernels cooperate with each other to perform the ring based reduction
● A single C++ kernel implements the intra-GPU synchronization and the reduction
Discussion: 4min
● What are the advantages and limitations of Allreduce?
● How to integrate allreduce with the dependency scheduler?
Schedule Allreduce Asynchronously

Make use of the mutation semantics! The program

    A = 2
    B = comm.allreduce(A)
    D = A * B

becomes:

    engine.push(lambda: A.data=2, read=[], mutate=[A.var])
    engine.push(lambda: B.data=allreduce(A.data), read=[A.var], mutate=[B.var, comm.var])
    engine.push(lambda: D.data=A.data * B.data, read=[A.var, B.var], mutate=[D.var])

Because every allreduce also mutates comm.var, the engine serializes the allreduce calls, so all workers issue them in the same order.
Distributed Gradient Aggregation, Remote Update

[Diagram: the same training graph; each replica sends its gradient g1 to a parameter server that holds w1.]

Many replicas of the same graph run in parallel. The result is updated on the remote server, which sends the updated parameters back:

    w1 -= lr * sum(g1 over replicas)
Parameter Server Abstraction

Interface:

    ps.push(index, gradient)
    ps.pull(index)
PS Interface for Data Parallel Training
grad = gradient(net, w)
for epoch, data in enumerate(dataset):
    g = net.run(grad, in=data)
    ps.push(weight_index, g)
    w = ps.pull(weight_index)
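A minimal single-process sketch of this abstraction (hypothetical ParameterServer class; a real server shards parameters by key across machines and serves push/pull as RPCs):

```python
import numpy as np

class ParameterServer:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.store = {}                  # index -> parameter array

    def init(self, index, value):
        self.store[index] = np.asarray(value, dtype=float)

    def push(self, index, gradient):
        # Remote update: the server applies the gradient to its copy.
        self.store[index] -= self.lr * np.asarray(gradient, dtype=float)

    def pull(self, index):
        # Send the updated parameters back to the worker.
        return self.store[index].copy()

ps = ParameterServer(lr=0.1)
ps.init("w", [0.0, 0.0])
ps.push("w", [1.0, -1.0])                # a worker pushes one gradient
assert (ps.pull("w") == [-0.1, 0.1]).all()
```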
PS Data Consistency: BSP

[Timeline: in each round every worker pulls the weight, pushes its gradient, and the server updates the weight once all pushes arrive.]

● "Synchronized"
  ○ Gradients are aggregated over all workers
  ○ All workers receive the same parameters
  ○ Gives the same result as a single batch update
  ○ Brings challenges to synchronization
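The lockstep behavior can be simulated in one process with a barrier (my own sketch; the barrier's action plays the server's role of applying the aggregated update exactly once per round):

```python
import threading
import numpy as np

NUM_WORKERS, LR = 4, 0.1
w = np.zeros(2)
grad_sum = np.zeros(2)
lock = threading.Lock()

def apply_update():
    # Runs exactly once, after the last worker has pushed its gradient.
    global w
    w -= LR * grad_sum / NUM_WORKERS
    grad_sum[:] = 0

barrier = threading.Barrier(NUM_WORKERS, action=apply_update)

def worker(g):
    with lock:
        grad_sum[:] = grad_sum + g   # push gradient (accumulate on "server")
    barrier.wait()                   # BSP: block until every worker pushed
    return w.copy()                  # pull: every worker sees the same w

threads = [threading.Thread(target=worker, args=(np.ones(2),))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert (w == -0.1).all()   # exactly one synchronized update of lr * mean(g)
```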
PS Consistency: Asynchronous

[Timeline: workers push gradients and pull weights without waiting for one another, so a worker may compute with slightly stale parameters.]
The Cost of PS Model: All to All Pattern
● Each worker talks to all servers
● Shard the parameters over different servers
● What is the time complexity of communication?
Integrate the Scheduler with Networking using Events

The receive function pushed to the engine is an asynchronous function that takes a callback from the engine:

    lambda cb: ps.receive(A.data)

    def event.on_data_received():
        # notify engine that the receive is complete
        cb()

Use the callback to notify the engine that the data receive has finished.
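A sketch of this callback pattern (hypothetical names; a real engine would return immediately from push and release the mutated variables only when the callback fires, whereas this sketch blocks on an event to keep the example short):

```python
import threading

class AsyncEngine:
    def push_async(self, exec_fn, read, mutate):
        # exec_fn takes a completion callback instead of returning
        # synchronously; dependents may run only after it is called.
        done = threading.Event()
        exec_fn(lambda: done.set())
        done.wait()   # stand-in for releasing the read/mutate dependencies

def fake_network_receive(on_complete):
    # Simulate a receive that completes later on another thread.
    threading.Timer(0.1, on_complete).start()

engine = AsyncEngine()
engine.push_async(lambda cb: fake_network_receive(cb), read=[], mutate=[])
print("receive finished; dependent operations can now be scheduled")
```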
Model Parallel Training
● Map parts of the workload to different devices
● Requires special dependency patterns (wave style)
  ○ e.g. LSTM
Question: How to Write a Model Parallel Program?

    for i in range(num_layers):
        for t in range(num_time_stamp):
            out, state = layer[i].forward(data[i][t], state)
            data[i+1][t] = out.copyto(device[i])

The scheduler tracks these dependencies.
Discussion: What’s Special about Communication
Requirements● Track dependency correctly● Resolve resource contention and allocation● Some special requirement on channel
○ Allreduce: ordered call
Most of them are simplified by a scheduler