Allreduce for Parallel Learning (mltf/parallel_learning.pdf, 2017-05-25)
TRANSCRIPT
Allreduce for Parallel Learning
John Langford, Microsoft Research, NYC
May 8, 2017
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.
Why is it hard?
Everyone’s first instinct: Try using parameter servers.
[Diagram: three model shards in a parameter-server layer, each connected to data shards holding the training data.]

1. Hsu, Karampatziakis, Langford, and Smola, "Parallel Online Learning", https://arxiv.org/abs/1103.4204
2. Dean et al., "Large Scale Distributed Deep Networks", NIPS 2012.

Big problems in practice:
1. Overwhelmingly inefficient. Best case: marginally faster with 100x the electricity.
2. Nondeterministic. Minor bugs incredibly difficult to track.
Terascale Linear Learning ACDL11
Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = ∑_i w_i x_i?

17B examples
16M parameters
1K nodes

How long does it take?

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.
MPI-style AllReduce
[Diagrams: seven nodes holding the values 1 through 7. A binary tree is created over them; reducing step 1 produces partial sums (8 and 13), reducing step 2 produces the total 28 at the root; broadcast steps then copy 28 back down until, in the final state, every node holds 28.]

AllReduce = Reduce + Broadcast

Properties:
1. Easily pipelined, so no latency concerns.
2. Bandwidth ≤ 6n.
3. No need to rewrite code!
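The reduce-then-broadcast sequence above can be simulated in a few lines. This is an illustrative sketch of mine, not code from the talk; it uses an implicit heap-layout binary tree over the nodes.

```python
def tree_allreduce(values):
    """Simulate AllReduce = Reduce + Broadcast over a binary tree.

    Node i's children are 2i+1 and 2i+2 (heap layout). Reduce: each
    node folds its children's partial sums into its own. Broadcast:
    the root's total is copied back down to every node.
    """
    n = len(values)
    partial = list(values)
    for i in reversed(range(n)):             # visit leaves before parents
        for c in (2 * i + 1, 2 * i + 2):
            if c < n:
                partial[i] += partial[c]     # reduce: fold child into parent
    return [partial[0]] * n                  # broadcast: root's total to all

print(tree_allreduce([1, 2, 3, 4, 5, 6, 7]))  # [28, 28, 28, 28, 28, 28, 28]
```

With seven nodes holding 1..7 this reproduces the 28-everywhere final state of the diagrams.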
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max):
1. While (examples left): do online update.
2. AllReduce(weights)
3. For each weight: w ← w / n
Other algorithms implemented:
1. Nonuniform averaging for online learning
2. Conjugate gradient
3. L-BFGS
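The weight-averaging loop above can be sketched as a small simulation. This is my own illustration under assumptions: `allreduce` is a local stand-in for the MPI-style call, and the online update is squared-loss SGD.

```python
import numpy as np

def allreduce(arrays):
    """Local stand-in for MPI-style AllReduce: every node gets the sum."""
    total = np.sum(arrays, axis=0)
    return [total.copy() for _ in arrays]

def averaged_sgd(node_data, dim, passes, lr=0.1):
    """Each node runs online updates on its shard; models are then averaged."""
    n = len(node_data)                        # n = AllReduce(1) in the slide
    weights = [np.zeros(dim) for _ in range(n)]
    for _ in range(passes):
        for k, shard in enumerate(node_data):
            for x, y in shard:                # 1. do online (SGD) updates
                grad = (weights[k] @ x - y) * x   # squared-loss gradient
                weights[k] -= lr * grad
        weights = allreduce(weights)          # 2. AllReduce(weights): sum
        weights = [w / n for w in weights]    # 3. w <- w / n: the average
    return weights[0]                         # all nodes now hold the same model
```

After each pass every node holds the identical averaged model, which is what makes the scheme deterministic.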
Hadoop AllReduce
1. "Map" job moves the program to the data.
2. Delayed initialization: most failures are disk failures, so first read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.
3. Speculative execution: in a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers; we use the first to finish reading all the data.
The net effect: Reliable execution out to perhaps 10K node-hours.
Robustness & Speedup
[Plot: speedup vs. number of nodes (10 to 100) for Average_10, Min_10, and Max_10, against a linear reference.]
Splice Site Recognition
[Plots: auPRC vs. iteration for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS; a second panel zooms in on auPRC 0.466 to 0.484 over the first 20 iterations.]
Splice Site Recognition
[Plot: auPRC vs. effective number of passes over the data (0 to 20) for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
What about parallel deep learning?
Needs to work with a GPU.
GPU Allreduce optimization 1: Minibatch gradient descent
GPUs have much more computation than communication.
Give every GPU n examples and compute the average gradient on them. Synchronize gradients.
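A minimal sketch of one synchronized step (my own illustration, not from the talk): the squared loss is an assumption, and the gradient allreduce-average across GPUs is modeled by `np.mean`.

```python
import numpy as np

def sync_minibatch_step(w, shards, lr=0.1):
    """One synchronous minibatch step: each 'GPU' averages the squared-loss
    gradient over its local examples, an allreduce averages across GPUs,
    and every replica applies the identical update."""
    local = [X.T @ (X @ w - y) / len(y) for X, y in shards]  # per-GPU mean gradient
    grad = np.mean(local, axis=0)     # stands in for the gradient allreduce
    return w - lr * grad              # same update on every replica
```

Because every replica applies the same averaged gradient, all copies of the model stay bit-identical without a parameter server.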
GPU Allreduce optimization 1: Half-Batch
Do communication asynchronously with computation.

One minibatch communicates while the other computes on older parameters.
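One way to picture the overlap (my own sketch; real implementations run the allreduce on a separate stream or thread) is a one-step-stale update schedule: the gradient computed at step t is applied at step t+1, so its communication can hide behind step t+1's computation.

```python
def pipelined_sgd(w, minibatches, grad_fn, lr=0.1):
    """Apply each gradient one step late: while a gradient is 'in flight'
    (being synchronized), the next minibatch computes on older parameters."""
    in_flight = None                    # gradient currently communicating
    for batch in minibatches:
        g = grad_fn(w, batch)           # compute on possibly-stale parameters
        if in_flight is not None:
            w = w - lr * in_flight      # previous gradient's allreduce is done
        in_flight = g                   # start communicating this gradient
    if in_flight is not None:
        w = w - lr * in_flight          # drain the pipeline
    return w
```

For small step sizes the one-step staleness barely changes the trajectory, while the communication cost is fully hidden.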
![Page 31: Allreduce for Parallel Learningmltf/parallel_learning.pdf · 2017-05-25 · Langford A Reliable E ective Terascale Linear Learning System, Arxiv 2011/JMLR 2014. DistBeliefDean et](https://reader034.vdocument.in/reader034/viewer/2022042219/5ec5caf3dfe04557bd2f2376/html5/thumbnails/31.jpg)
GPU Allreduce optimization 1: Half-Batch
Do communication asynchronous to computation.
1 minibatch communicates while the other computes on olderparameters.
![Page 32: Allreduce for Parallel Learningmltf/parallel_learning.pdf · 2017-05-25 · Langford A Reliable E ective Terascale Linear Learning System, Arxiv 2011/JMLR 2014. DistBeliefDean et](https://reader034.vdocument.in/reader034/viewer/2022042219/5ec5caf3dfe04557bd2f2376/html5/thumbnails/32.jpg)
GPU Allreduce optimization 1: 1-bit SGD
Discretize the gradient to 1 bit per coordinate before communicating off the GPU.

Keep and accumulate the discretization errors on the GPU.
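A sketch of the error-feedback trick (mine; variable names assumed): quantize to one sign bit per coordinate plus a single shared scale, and carry the quantization error into the next step. The key invariant is that the transmitted messages plus the remaining residual exactly equal the sum of the true gradients, so no gradient mass is ever lost.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Return the 1-bit message (sign times a shared scale) and new residual."""
    corrected = grad + residual            # fold in last step's error
    scale = np.mean(np.abs(corrected))     # one float sent alongside the bits
    message = np.sign(corrected) * scale   # what actually leaves the GPU
    return message, corrected - message    # error stays on the GPU

g = np.array([0.5, -0.2, 0.1])
r = np.zeros(3)
msg1, r = one_bit_quantize(g, r)
msg2, r = one_bit_quantize(g, r)
# Invariant: msg1 + msg2 + r equals g + g exactly.
```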
![Page 33: Allreduce for Parallel Learningmltf/parallel_learning.pdf · 2017-05-25 · Langford A Reliable E ective Terascale Linear Learning System, Arxiv 2011/JMLR 2014. DistBeliefDean et](https://reader034.vdocument.in/reader034/viewer/2022042219/5ec5caf3dfe04557bd2f2376/html5/thumbnails/33.jpg)
GPU Allreduce optimization 1: 1-bit SGD
Discretize the gradient to 1 bit before communicating off GPU.
Keep and accumulate discretization errors on GPU.
![Page 34: Allreduce for Parallel Learningmltf/parallel_learning.pdf · 2017-05-25 · Langford A Reliable E ective Terascale Linear Learning System, Arxiv 2011/JMLR 2014. DistBeliefDean et](https://reader034.vdocument.in/reader034/viewer/2022042219/5ec5caf3dfe04557bd2f2376/html5/thumbnails/34.jpg)
GPU Allreduce Optimization 2: Ring Allreduce
Every node masters a subset and messages travel in a ring.
1. Downside: latency increase?
2. Upside: perfectly efficient synchronization.
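A simulation of the ring schedule (my own sketch, not from the talk): with p nodes and p chunks, a reduce-scatter phase of p-1 steps leaves each node mastering one fully-summed chunk, and an all-gather phase of p-1 more steps circulates the finished chunks. Each node sends and receives only 2(p-1)/p of the data, which is why bandwidth use is near-perfect.

```python
def ring_allreduce(node_chunks):
    """node_chunks[i][c]: node i's local value for chunk c (p nodes, p chunks)."""
    p = len(node_chunks)
    acc = [row[:] for row in node_chunks]
    # Reduce-scatter: at step s, node i passes its partial sum for chunk
    # (i - s) % p to its right neighbor, which folds it in.
    for s in range(p - 1):
        sent = [(i, (i - s) % p, acc[i][(i - s) % p]) for i in range(p)]
        for i, c, val in sent:
            acc[(i + 1) % p][c] += val
    # Node i now masters the complete sum for chunk (i + 1) % p.
    # All-gather: p - 1 more steps circulate the finished chunks.
    for s in range(p - 1):
        sent = [(i, (i + 1 - s) % p, acc[i][(i + 1 - s) % p]) for i in range(p)]
        for i, c, val in sent:
            acc[(i + 1) % p][c] = val
    return acc

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every node ends with the per-chunk sums [12, 15, 18]
```

The latency question in the slide shows up here directly: 2(p-1) sequential steps, versus O(log p) for the tree.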
Bibliography
L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.

grad sum: C. Teo, Q. Le, A. Smola, V. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.

Rings 1: Pitch Patarasuk and Xin Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, JPDC 2009.

avg.: G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.

ov. avg: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.

P. online: D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.

D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, JMLR 2012.
Bibliography II
Tera: Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford, A Reliable Effective Terascale Linear Learning System, arXiv 2011 / JMLR 2014.

DistBelief: Dean et al., Large Scale Distributed Deep Networks, NIPS 2012.

One-bit: Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014.

Rings 2: Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, ICML 2016.