Crossbow: Scaling Deep Learning on Multi-GPU Servers
TRANSCRIPT
Crossbow: Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch, with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa
Imperial College London · http://lsds.doc.ic.ac.uk · <[email protected]>
CASTOR Software Days – Stockholm, Sweden – October 2019
Large-Scale Data & Systems Group
Machine Learning with Deep Neural Networks (DNN)
Revolutionised solutions in vision, speech recognition, …
DNN models are trained by giving examples (instead of programming)
[Diagram: DNNs map audio → words (e.g. "hello audience"), text → topics, and images → labels]
Peter Pietzuch - Imperial College London 2
When DNN output is wrong, tweak its parameters
Training DNNs
• Obtain DNN model that minimises classification error
• Use Stochastic Gradient Descent (SGD) for training:
1. Begin with a random model
2. Consider a mini-batch of training data
3. Iteratively calculate gradients & update model parameters w
[Plot: error vs. model parameters w – SGD converges from a random model towards the lowest-error (optimal) parameters]
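The three SGD steps above can be sketched as a short loop (an illustrative sketch; `sgd` and its defaults are not from the Crossbow codebase):

```python
import numpy as np

def sgd(grad_fn, w, data, batch_size=32, lr=0.01, steps=1000, seed=0):
    """Minimal mini-batch SGD loop following the three steps above.
    grad_fn(w, batch) returns the average gradient over the batch;
    all names and defaults here are illustrative."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        # step 2: consider a mini-batch of training data
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        # step 3: calculate gradients and update model parameters w
        w = w - lr * grad_fn(w, batch)
    return w
```

For a quadratic loss per example, `grad_fn` reduces to `w - batch.mean()` and the loop converges to the data mean.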
Training DNNs on GPUs
• GPUs are good at parallelising gradient computation
Training DNNs in Parallel with GPUs
• With large datasets, speed up by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)
[Diagram: a mini-batch of training data is split into Shards 1–3; GPUs 1–3 each compute a gradient against their own model replica. But model replicas would diverge over time…]
Model Synchronisation among GPUs
• Parameter server: Maintains global model
[Diagram: GPUs 1–3 send gradients to a global model held by the parameter server and receive back the updated model]
• GPUs:
1. Send gradients to update the global model
2. Synchronise local model replicas with the global model
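One synchronisation round of the scheme above can be sketched as follows (an illustrative sketch, not Crossbow's code; `parameter_server_round` is a hypothetical helper):

```python
import numpy as np

def parameter_server_round(global_model, replicas, grads, lr=0.01):
    """One parameter-server synchronisation round. Each entry of
    grads is the gradient computed by one GPU on its replica."""
    # 1. GPUs send gradients; the server updates the global model
    global_model = global_model - lr * np.mean(grads, axis=0)
    # 2. Local model replicas are synchronised with the global model
    replicas = [global_model.copy() for _ in replicas]
    return global_model, replicas
```

After the round, every replica is identical to the global model, which is what prevents the replicas from diverging.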
The Problem with Large Batch Sizes
Training with large mini-batches is bad for your health.
More importantly, it’s bad for your test error.
Friends don’t let friends use mini-batches larger than 32.
Yann LeCun @ylecun
2:00 PM – 26 Apr 2018
Why Use Large Batch Sizes?
[Diagram: per-shard gradients over the dataset are averaged into the weights, shown for a batch, a bigger batch, and an even bigger batch]
Keep work per GPU constant to scale
E.g. ~32 to 256 labelled images
What is the Best Batch Size on a GPU?
• ResNet-32 on NVIDIA Titan X GPU
[Bar plot: time to accuracy (sec) vs. batch size b for TensorFlow – bar values of 1134, 361, 302, 354, 379 and 445 sec across b = 32 to 1024; mid-range batch sizes are fastest]
Training DNNs Favours Small Batch Sizes
We want frequent, less “noisy” updates
[Diagram: with a small batch, each GPU computes a gradient and immediately updates the weights w; with a large batch, gradients are averaged before a single update]
Statistical Efficiency Needs Small Batch Sizes
[Plot: test accuracy (%) vs. epochs for batch sizes b = 64 to 4096 – small batch sizes converge to higher accuracy faster than large batch sizes]
Hardware Efficiency Needs Large Batch Sizes
Keep work per GPU constant → increase batch size with #GPUs
[Timeline diagram: GPUs 1 and 2 each compute gradients on batches of size ½b, average them, and update their replicas – work per GPU stays constant]
Tension between Hardware & Statistical Efficiency
• Practitioners increase batch size due to hardware efficiency
• But best batch size depends on both hardware & statistical efficiency
Training with large mini-batches is bad for your health.
More importantly, it’s bad for your test error.
Friends don’t let friends use mini-batches larger than 32.
Yann LeCun @ylecun
2:00 PM – 26 Apr 2018
Current Practice: Hyper-Parameter Tuning
• Adjust hyper-parameters (e.g. learning rate, momentum, etc.) to avoid a reduction in statistical efficiency
• Linear scaling rule:"When mini-batch size is multiplied by k, multiply learning rate by k”
• Goyal et al. (2017)
• Drawbacks:
– Manual, labour-intensive process
– Highly model-specific: not portable and does not work for some models
– Less effective for very large batch sizes…
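The linear scaling rule itself is a one-liner (the function name is illustrative):

```python
def linear_scaling_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: when the mini-batch size is multiplied
    by k, multiply the learning rate by k (Goyal et al., 2017)."""
    k = new_batch / base_batch
    return base_lr * k
```

For example, scaling a base learning rate of 0.1 at batch size 256 up to batch size 8,192 gives 0.1 × 32 = 3.2.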
Limits of Hyper-Parameter Tuning
“When mini-batch size is multiplied by k, multiply learning rate by k”
[Plot: top-1 validation error vs. mini-batch size (256 to 65,536) for ResNet-50 with 32 images/GPU – error stays flat up to a mini-batch size of 8,192, then rises sharply]
Fundamental Challenge of GPU Scaling
• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018
• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?
(1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?
[Diagram: tasks 1–3 and model replicas on one GPU; model replicas across GPU-1 and GPU-2 synchronised with a global model; reusable data buffers (buffer-1 to buffer-4, some currently in use) feeding tasks]
Problem: Small Batch Sizes Underutilise GPUs
[CDF plot: GPU occupancy (%) vs. CDF (%) – the GPU often runs far below full occupancy (under-used resources)]
How to Process Small Batches Efficiently?
One batch per GPU → not enough data and instruction parallelism for every operator
[Diagram: with one batch per GPU, per-operator GPU utilisation (%) fluctuates and rarely reaches 100%]
Idea: Train Multiple Model Replicas per GPU
One learning process (or learner) per GPU stream
[Diagram: a scheduler on one GPU dispatches batches to multiple learners, each computing gradients for its own replica (weights) on a separate GPU stream]
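The one-learner-per-stream idea can be sketched as a scheduler that hands each incoming batch to whichever learner is free (`Learner` and `run_scheduler` are illustrative stand-ins; the real learners run on CUDA streams):

```python
from queue import Queue

class Learner:
    """One learner = one model replica bound to its own GPU stream.
    A plain Python object stands in for the stream here."""
    def __init__(self):
        self.batches_seen = 0

    def train_step(self, batch):
        # The real learner would launch gradient kernels on its stream.
        self.batches_seen += 1

def run_scheduler(learners, batches):
    """Dispatch each incoming small batch to the next available
    learner, keeping several batches in flight on one GPU at a time."""
    ready = Queue()
    for learner in learners:
        ready.put(learner)
    for batch in batches:
        learner = ready.get()       # next available learner
        learner.train_step(batch)   # would execute asynchronously
        ready.put(learner)          # learner is available again
```

With several learners in flight, small batches can fill the GPU's idle resources instead of leaving them unused.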
Effect of Training Multiple Model Replicas per GPU
• But now we must synchronise a large number of learners/model replicas...
[Bar plot: throughput increase (%) vs. batch size b = 32 to 1024 when training multiple replicas per GPU – the regained resources]
(1) How to increase efficiency with small batches?
→ Train multiple model replicas per GPU
(2) How to synchronise model replicas?
[Recap diagram: tasks and model replicas on one GPU; replicas across GPUs 1–2 synchronised with a global model]
Problem: Why not Synchronous Parallel SGD?
All learners always start from the same point
Limited exploration of parameter space
Idea: Maintain Independent Model Replicas
• Benefits:
– Increased exploration of space through parallelism
– Each model replica uses a small batch size
[Diagram: from the initial weights, replica X's and replica Y's trajectories explore independently around the average model trajectory]
Crossbow: Synchronous Model Averaging
Allow learners to diverge, but correct their trajectories based on the average model
Accelerate the average model trajectory with momentum to find minima faster
[Diagram: corrections pull each replica's trajectory towards the momentum-accelerated average model trajectory]
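The correction scheme can be sketched as follows (a sketch in the spirit of synchronous model averaging; `alpha`, `mu` and the exact update form are illustrative assumptions, not Crossbow's published algorithm):

```python
import numpy as np

def sma_step(replicas, grads, central, velocity, lr=0.01, alpha=0.1, mu=0.9):
    """One synchronous model-averaging step. Each replica takes its
    own SGD step but is corrected towards the central average model,
    which in turn moves towards the replicas with momentum."""
    corrections = [alpha * (w - central) for w in replicas]
    # each replica: SGD step plus a pull towards the central model
    replicas = [w - lr * g - c for w, g, c in zip(replicas, grads, corrections)]
    # central model accumulates the corrections, momentum-accelerated
    velocity = mu * velocity + sum(corrections)
    central = central + velocity
    return replicas, central, velocity
```

The corrections let learners explore independently while keeping them anchored to a shared, momentum-accelerated average trajectory.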
GPUs with Synchronous Model Averaging
[Diagram: GPUs 1–3, each running multiple learners with their own model replicas]
• Synchronously apply corrections to model replicas
GPUs with Synchronous Model Averaging
[Diagram: as before, now with an average model on one GPU and reference models on the others]
• Synchronously apply corrections to model replicas
GPUs with Synchronous Model Averaging
[Diagram: learners on GPUs 1–3 synchronised through the average model and per-GPU reference models]
Synchronous Model Averaging
Synchronous Model Averaging
• Ensures a consistent view of the average model
• Takes GPU bandwidth into account during synchronisation
(1) How to increase efficiency with small batches?
→ Train multiple model replicas per GPU
(2) How to synchronise model replicas?
→ Use synchronous model averaging
(3) How to reduce scheduling and synchronisation overheads?
[Recap diagram: tasks and replicas on one GPU; replicas synchronised across GPUs; reusable data buffers feeding tasks]
Crossbow Architecture
Components: auto-tuner; data pre-processors (dataset → input batches); task scheduler with dataflows and ready queues; model replicas; learner streams; per-GPU learner and synchronisation tasks (GPU 1, GPU 2)
Implementation: Java (24K LOC) and C/C++ (15K LOC), with integration with TensorFlow
github.com/lsds/crossbow
Efficient Task Scheduling
• Execute compute and synchronisation tasks
• Fine-grained concurrency
• Need efficient scheduler to feed all GPUs with tasks
[Diagram: the task scheduler (multiple threads) takes dataflow tasks from ready queues and dispatches them to model replicas and learner streams; data pre-processors produce input batches from the dataset, and the auto-tuner adjusts the configuration]
Interleaving Compute & Synchronisation Tasks
[Timeline diagram: per-replica queues on GPUs 1 and 2 interleave compute-gradient, elastic-average, average-gradients, update-replica and update-device-model tasks across batches N, N+1 and N+2; worker threads schedule the next available replica, new replica queues are created on the fly, synchronisation queues coordinate across GPUs, and the auto-tuner and monitor observe progress through mapped memory]
Auto-Tuning the Number of Model Replicas
• Monitor training throughput
• Dynamically adjust the number of learners
• Uses object pooling & lazy materialisation
[Diagram: the auto-tuner sits between the task manager, data pre-processors, input batches, model replicas and learner streams, adding or removing learners at runtime]
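The monitoring loop can be sketched as simple hill-climbing (`measure_throughput` is a hypothetical stand-in for monitoring the live system; Crossbow's actual tuning policy may differ):

```python
def autotune_learners(measure_throughput, max_learners=8):
    """Add learners one at a time while the monitored training
    throughput keeps improving; stop at the first drop."""
    best_n = 1
    best_tp = measure_throughput(best_n)
    for n in range(2, max_learners + 1):
        tp = measure_throughput(n)
        if tp <= best_tp:        # adding a learner no longer helps
            break
        best_n, best_tp = n, tp
    return best_n
```

Object pooling and lazy materialisation keep the cost of adding or removing a learner low enough to do this while training runs.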
Experimental Evaluation
Does Crossbow Train Effectively with Small Batch Sizes?
Multiple learners per GPU improve hardware efficiency
[Bar plot: time to test accuracy (sec) for TensorFlow (b = 64 and b = 256) vs. Crossbow (b = 64 with 1 learner and with 4 learners)]
• ResNet-32 with ImageNet dataset on 1 Titan X GPU
Does Crossbow Scale to Multiple GPUs?
[Bar plot: time to test accuracy, TTA(53%) (sec), vs. number of GPUs (up to 8) for ResNet-50 – TensorFlow (32), Crossbow with m = 1 (32), and Crossbow (16)]
Synchronous Model Averaging improves statistical efficiency
• ResNet-50 with ImageNet dataset on 8 Titan X GPUs
Does Crossbow Train Effectively Across Models?
Training with multiple learners is always better than training with large batches
[Bar plot: speed-up over TensorFlow on 1 GPU and 8 GPUs for LeNet, ResNet-32, VGG, ResNet-50 and ResNet-101 – speed-ups range from 1.2× to 4.3×]
• ResNet-50 with ImageNet dataset on Titan X GPUs
What is the Statistical Efficiency with Many Learners?
[Plot: test accuracy (%) vs. epochs for m = 1 to 32 learners – the curves largely overlap]
Increasing parallelism has less effect on statistical efficiency
Crossbow: Scaling GPU Deep Learning
• Need to make training throughput independent from hyper-parameters
– Rethink the design of future deep learning systems
• Crossbow: scaling DNN training with small batch sizes on many GPUs
– Multiple model replicas per GPU for high hardware efficiency
– Synchronous model averaging for high statistical efficiency
• Exciting research challenges for next generation deep learning systems
Peter Pietzuch · https://lsds.doc.ic.ac.uk · [email protected]
Thank You — Any Questions?
github.com/lsds/crossbow