Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch (with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa)
Imperial College London
http://lsds.doc.ic.ac.uk – [email protected]
SICS – Sweden – September 2018
Large-Scale Data & Systems Group
Deep Neural Networks (DNN) at Scale
Train DNNs with a lot of data.
[Figure: speech recognition maps audio data to “Hello”; image classification maps image data to “Person”]
Training DNNs with Stochastic Gradient Descent
• Goal: Obtain DNN model that minimises classification error
• Stochastic Gradient Descent (SGD):
  – Consider a mini-batch of training data
  – Iteratively calculate gradients and update model parameters w
[Figure: error as a function of the model parameters w; starting from random initial weights, SGD converges towards the optimum with the lowest error]
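To make the loop concrete, here is a minimal NumPy sketch of mini-batch SGD as described above (grad_fn, data and labels are hypothetical placeholders for the model's gradient computation and the training set):

    import numpy as np

    def sgd(w, data, labels, grad_fn, lr=0.01, batch_size=32, epochs=10):
        # Minimal mini-batch SGD: sample a batch, compute the gradient,
        # update the model parameters w, and repeat.
        n = len(data)
        for _ in range(epochs):
            perm = np.random.permutation(n)             # reshuffle every epoch
            for i in range(0, n, batch_size):
                idx = perm[i:i + batch_size]            # mini-batch of training data
                g = grad_fn(w, data[idx], labels[idx])  # gradient on the batch
                w = w - lr * g                          # update model parameters w
        return w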
The Problem With Large Batch Sizes
• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”
  – Y. LeCun (@ylecun), April 26, 2018
Scaling DNN Training on GPUs
• Manager: Need a highly accurate image classification model ASAP
• Highly-Paid Data Scientist: OK, I'll use ResNet-50. It will take ½ month on 1 GPU
• M: Throw more hardware at the problem! We need this sooner.
• HPDS: …
Parallel DNN Training
• With large training datasets, speed up training by calculating gradients in parallel
[Figure: the training data is split into shards; GPU 1, GPU 2 and GPU 3 each hold a model replica and compute gradients on mini-batches drawn from their own shard (Shard 1, Shard 2, Shard 3); without synchronisation, the model replicas diverge over time]
Synchronisation Among GPUs
• Parameter server: Maintains global model
• GPUs:
  1. Send gradients to update the global model
  2. Synchronise local model replicas with the global model
[Figure: GPU 1, GPU 2 and GPU 3 each hold a local model replica and send gradients to the global model maintained by the parameter server]
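A schematic sketch of this protocol, with the parameter server as a plain Python object (the names ParameterServer, push_gradient and pull_model are illustrative, not the API of any particular system):

    import numpy as np

    class ParameterServer:
        # Maintains the global model; workers push gradients and pull the model.
        def __init__(self, w, lr=0.01):
            self.w = w
            self.lr = lr

        def push_gradient(self, g):
            # (1) a GPU sends its gradient to update the global model
            self.w -= self.lr * g

        def pull_model(self):
            # (2) a GPU synchronises its local replica with the global model
            return self.w.copy()

    def worker_step(ps, replica, batch, grad_fn):
        # One training step of a GPU worker against the parameter server.
        g = grad_fn(replica, batch)   # compute gradient on the local mini-batch
        ps.push_gradient(g)
        return ps.pull_model()        # fresh replica for the next iteration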
What is the Best Batch Size?
• ResNet-32 on a Titan X GPU
[Figure: time to accuracy (sec) for TensorFlow as a function of batch size b]
  Batch size b:           32    64    128   256   512   1024
  Time to accuracy (sec): 1134  361   302   354   379   445
Intuition for Small Batch Sizes
• Practical considerations:
  – More frequent model updates minimise the bias to initial conditions faster (i.e. the initial weights are forgotten faster)
• Theoretical considerations:
  – Small batch sizes better explore the wider minima, which are known to exhibit better test accuracy
Statistical Efficiency Needs Small Batch Sizes
[Figure: test accuracy (%) vs epochs for batch sizes b = 64, 128, 256, 512, 1024, 2048 and 4096; convergence to high test accuracy degrades as the batch size grows]
Hardware Efficiency Needs Large Batch Sizes
Keep work per GPU constant => scale batch size with #GPUs (e.g. a per-GPU batch of 64 on 8 GPUs yields a global batch of 512; see the sketch below)
[Figure: execution timeline for GPU 1 and GPU 2: each GPU computes a gradient on batch N, blocks while the gradients are aggregated and averaged, updates its replica, and only then starts batch N+1]
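The weak-scaling arithmetic behind this is simple; a minimal sketch (the per-GPU batch size of 64 is only an example):

    def global_batch_size(per_gpu_batch, num_gpus):
        # Weak scaling: work per GPU stays constant, so the effective
        # (global) batch size grows linearly with the number of GPUs.
        return per_gpu_batch * num_gpus

    assert global_batch_size(64, 8) == 512  # 8 GPUs at b=64 train like b=512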
Hardware & Statistical Efficiency
• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”
  – Y. LeCun (@ylecun), April 26, 2018
• The only reason practitioners increase batch size is hardware efficiency
• But best batch size depends on both hardware efficiency & statistical efficiency
Limits of Scaling DNN Training
• HPDS: Managed to train it on 100s of GPUs in 1h!
• M: Great! Make it faster. Use as many resources as you need!
• HPDS: Can’t :( Beyond this point statistical efficiency collapses. Actually, I struggled to maintain it up to this point…
• M: …
• The story continues. New hardware becomes available. Communication between workers improves. The latest results train ResNet-50 in just a few minutes. But the problem remains.
Hyper-Parameter Tuning
The Fundamental Challenge of GPU Scaling
• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
  – J. Dean, D. Patterson, X. […], “The Golden Age”, IEEE Micro
• How to design a system that can scale training with multiple GPUs even when the preferred batch size is small?
Problem: Small Batch Sizes Underutilise GPUs
[Figure: CDF (%) of GPU occupancy (%): with small batch sizes, GPU occupancy stays low for much of the time, i.e. resources are under-used]
Idea: Train Multiple Model Replicas per GPU
• Fully exploits task parallelism on GPU
• But we now need to synchronise a large number of model replicas…
[Figure: throughput increase (%) from training multiple model replicas per GPU, as a function of batch size b (32 to 1024); small batch sizes regain the under-used resources]
How to Synchronise Many Model Replicas?
• Synchronised SGD:
  – Average the gradients of multiple workers
  – Start the next training step from the averaged model
• Workers start the next exploration from the same point in weight space
[Figure: replica X’s and replica Y’s trajectories start from the same initial weights and are collapsed into a single average gradient trajectory at each synchronisation point]
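A minimal sketch of one such synchronised step (shards and grad_fn are hypothetical placeholders for the per-worker data and gradient computation):

    import numpy as np

    def synchronous_sgd_step(model, shards, grad_fn, lr=0.01):
        # Every worker computes a gradient on its own shard, the gradients
        # are averaged, and all workers start the next step from the same
        # updated model, i.e. the same point in weight space.
        grads = [grad_fn(model, shard) for shard in shards]
        return model - lr * np.mean(grads, axis=0)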
Idea: Sample Multiple Nearby Points in Space
• Synchronisation using Elastic Averaging SGD (EA-SGD)
• Benefits:
  1. Each replica reuses the learning set-up of a small batch size
  2. Increased exploration through parallelism
  3. The average replica helps with the direction to good minima
[Figure: replica X’s and replica Y’s trajectories start from the same initial weights; a correction rule pulls each replica towards the average model trajectory]
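The correction rule can be written down compactly. A NumPy sketch of one synchronous step, following the elastic averaging update of Zhang et al.'s EA-SGD (the learning rate lr and elasticity alpha are illustrative values):

    import numpy as np

    def easgd_step(replicas, center, grads, lr=0.01, alpha=0.1):
        # Each replica takes its own SGD step plus a correction that pulls
        # it towards the central (average) model, while the central model
        # in turn moves towards the replicas.
        new_replicas = []
        total_diff = np.zeros_like(center)
        for x, g in zip(replicas, grads):
            diff = x - center
            new_replicas.append(x - lr * g - alpha * diff)  # correction rule
            total_diff += diff
        center = center + alpha * total_diff  # average model trajectory
        return new_replicas, center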
When To Apply Corrections?
• Synchronously apply corrections to model replicas
• Other techniques:
  – Control the impact of the average model using momentum
  – Auto-tune the number of model replicas
Crossbow: Multi-GPU Deep Learning System
(1) Train multiple models in parallel
(2) Synchronous hierarchical model averaging
(3) Efficient fine-grained task engine
• Execute concurrent tasks to leverage GPU concurrency
• Schedule fine-grained compute & synchronisation tasks
• Minimise the amount of data transfer among different GPUs
[Figure: per-GPU tasks (task1, task2, task3), each training its own model replica from reusable data buffers (buffer-1 to buffer-4, some currently in use), with replicas averaged within and across GPU-1 and GPU-2]
GPU Parallelism with Multiple Model Replicas
• On each GPU, use different streams for training tasks
• Task: a series of operations based on the computation of the model layers
  – Each task is associated with (a) a model replica and (b) a batch of training data
[Figure: tasks task1, task2 and task3, each pairing a model replica (model1, model2, model3) with a batch of data (data1, data2, data3), are dispatched to separate GPU streams (stream 1, stream 2, stream 3)]
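As a rough analogy in PyTorch (Crossbow implements its own task engine, so this only illustrates the idea of one stream per replica):

    import torch

    def dispatch(models, batches, loss_fn):
        # Dispatch one training task per model replica to its own CUDA
        # stream, so independent replicas can run concurrently on one GPU.
        streams = [torch.cuda.Stream() for _ in models]
        for model, (x, y), stream in zip(models, batches, streams):
            with torch.cuda.stream(stream):
                loss_fn(model(x), y).backward()  # forward + backward task
        torch.cuda.synchronize()  # wait for all streams to finish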
Synchronous Hierarchical Elastic Model Averaging
• Intra-GPU, cross-GPU, and CPU-GPU communication have different performance characteristics
• Crossbow uses hierarchical aggregation to avoid bottlenecks
[Figure: model replicas are first synchronised within each GPU (GPU-1, GPU-2); the per-GPU results are then synchronised across GPUs to form the global average model]
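A sketch of the two-level averaging over plain NumPy arrays (the real system moves buffers between device-local and cross-device memory; here the hierarchy is just nested lists of replicas):

    import numpy as np

    def hierarchical_average(replicas_per_gpu):
        # Level 1: average the model replicas within each GPU (cheap,
        # device-local traffic). Level 2: average the per-GPU results
        # across GPUs (expensive transfers, but only one buffer per GPU).
        local_means = [np.mean(replicas, axis=0)
                       for replicas in replicas_per_gpu]
        return np.mean(local_means, axis=0)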
CrossBow Task Execution Engine
• Many small tasks for (1) replica training and (2) synchronisation
• CrossBow has a GPU/CPU task engine for multiplexing & overlapping small tasks
[Figure: task dataflow across GPU 1 and GPU 2 for batches N, N+1 and N+2: worker threads schedule the next available replica; tasks compute gradients, compute elastic averages, average gradients, and update replicas and the device model via mapped memory; replica and synchronisation queues are created on the fly, and a monitor with an auto-tuner decides when a replica is ready to compute another gradient]
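A toy version of such an engine, with worker threads multiplexing small tasks from a shared queue (illustrative only; the real engine schedules CUDA tasks and manages reusable GPU buffers):

    import queue
    import threading

    task_queue = queue.Queue()

    def worker():
        # Worker thread: repeatedly pick the next available task
        # (replica training or synchronisation) and run it.
        while True:
            task = task_queue.get()
            if task is None:   # poison pill: shut this worker down
                break
            task()             # e.g. compute a gradient, update a replica
            task_queue.task_done()

    workers = [threading.Thread(target=worker) for _ in range(4)]
    for t in workers:
        t.start()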
CrossBow: Benefit of Synchronous Model Averaging
• VGG-16 with the CIFAR-100 dataset on Titan X GPUs
[Figure: time to 69% test accuracy (seconds) on 1, 2, 4 and 8 GPUs for TensorFlow, CrossBow 1-replica, and CrossBow]
CrossBow: Statistical Efficiency with Many Models
• ResNet-50 with the ImageNet dataset on Titan X GPUs
[Figure: test accuracy (%) vs epochs for m = 1, 2, 4, 8, 16 and 32 model replicas]
• Increasing parallelism does not collapse convergence, while each replica reuses the same learning set-up
Summary: Scaling Deep Learning on Multi-GPU Servers
• Need to make training throughput independent of hyper-parameters
  – The current need for hyper-parameter tuning is too complex
  – Need new designs for deep learning systems
• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous hierarchical model averaging for high statistical efficiency
  – Requires a new CPU/GPU task engine design
Peter Pietzuch
https://lsds.doc.ic.ac.uk — [email protected]
Thank You — Any Questions?