Brief Introduction to Distributed Deep Learning
Distributed Deep Learning - An Overview
Adam Gibson, Skymind, May 2016, Korea

Neural Net Training Basics
- Vectorization / different kinds of data
- Parameters: a whole neural net consists of a graph and a parameter vector
- Minibatches: neural net data requires lots of RAM, so training must be done in minibatches

Vectorization
- Images
- Text
- Audio
- Video
- CSVs / structured data
- Web logs

Parameters / Neural Net Structure
- Computation graph: a neural net is just a DAG of ndarrays/tensors
- The parameters of a neural net can be flattened into a single vector representing all the connections/weights in the graph (see the sketch below)

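As a concrete illustration of that last point, here is a minimal NumPy sketch (the layer shapes and helper names are made up for this example, not from the talk) that flattens per-layer weight matrices into one parameter vector and restores them. It is this flat parameter vector that schemes like parameter averaging operate on.

```python
import numpy as np

# Illustrative per-layer weight shapes for a small feed-forward net.
layer_shapes = [(784, 256), (256, 10)]
layers = [np.random.randn(*s).astype(np.float32) for s in layer_shapes]

def flatten_params(layers):
    """Concatenate every layer's weights into one 1-D parameter vector."""
    return np.concatenate([w.ravel() for w in layers])

def unflatten_params(vector, shapes):
    """Split a flat parameter vector back into per-layer weight matrices."""
    out, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(vector[offset:offset + size].reshape(shape))
        offset += size
    return out

flat = flatten_params(layers)              # shape: (784*256 + 256*10,)
restored = unflatten_params(flat, layer_shapes)
assert all(np.array_equal(a, b) for a, b in zip(layers, restored))
```
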
Minibatches
- Data is partitioned into sub-samples (sketched below)
- Fits on the GPU
- Trains faster
- Each minibatch should be a representative sample (every label present), as evenly as possible

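A minimal sketch of the idea, assuming a NumPy feature matrix and label vector (all names illustrative): the data is shuffled and sliced into fixed-size minibatches. Plain shuffling only approximates the "every label present" goal; a stratified split would enforce it strictly.

```python
import numpy as np

def minibatches(features, labels, batch_size, rng):
    """Shuffle the dataset, then yield fixed-size (features, labels) minibatches."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784)).astype(np.float32)   # toy feature matrix
y = rng.integers(0, 10, size=1000)                     # toy labels

for xb, yb in minibatches(X, y, batch_size=128, rng=rng):
    pass  # one gradient step per minibatch would go here
```
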
Distributed Training
- Multiple computers
- Multiple GPUs
- Multiple GPUs AND multiple computers
- Different kinds of parallelism
- Lots of different algorithms

Multiple Computers
- Distributed systems: connect/coordinate computers over a cluster
- Hadoop
- HPC (MPI and friends)
- Client/server architecture

Multiple GPUs
- Single box
- Could be multiple host threads
- RDMA (Remote Direct Memory Access) interconnect
- NVLink
- Typically used on a data center rack
- Break the problem up; share data across GPUs

Multiple GPUs and Multiple Computers
- Coordinate the problem over the cluster
- Use GPUs for compute
- Can be done via MPI or Hadoop (host-thread coordination)
- Parameter server: synchronizes parameters through a master node, and also handles things like the GPU interconnect (a minimal sketch follows this list)

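As a rough, single-process simulation of the parameter-server pattern (the linear model, learning rate, and class names are illustrative assumptions, not any particular framework's API): each worker pulls the current parameters, computes a gradient on its own data shard, and pushes it to a central server that averages the gradients and applies the update.

```python
import numpy as np

class ParameterServer:
    """Holds the master copy of the parameters and applies averaged gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def push(self, gradients):
        """Average the workers' gradients and take one SGD step."""
        self.params -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        """Hand a copy of the current parameters back to a worker."""
        return self.params.copy()

def worker_gradient(params, X_shard, y_shard):
    """Mean-squared-error gradient for a linear model on this worker's shard."""
    preds = X_shard @ params
    return X_shard.T @ (preds - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
y = X @ rng.normal(size=10)                        # synthetic regression targets
shards = np.array_split(np.arange(len(X)), 4)      # 4 simulated workers

server = ParameterServer(dim=10)
for step in range(200):
    params = server.pull()                                          # workers pull
    grads = [worker_gradient(params, X[s], y[s]) for s in shards]   # local compute
    server.push(grads)                                              # server averages + updates
```
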
Different Kinds of Parallelism
- Data parallelism
- Model parallelism
- Both? (the sketch below contrasts the first two)

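In data parallelism every worker holds the full model but only a slice of the minibatch; in model parallelism the model itself is split across workers, each seeing the whole minibatch. A tiny NumPy forward-pass sketch of the contrast (weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(256, 784))                 # one minibatch
W1 = rng.normal(size=(784, 512)) * 0.01             # layer 1 weights
W2 = rng.normal(size=(512, 10)) * 0.01              # layer 2 weights

# Data parallelism: every worker holds the full model (W1, W2) but only a
# slice of the minibatch; in training, their gradients would be averaged.
data_shards = np.array_split(batch, 4)
partial_outputs = [np.maximum(shard @ W1, 0) @ W2 for shard in data_shards]
data_parallel_out = np.concatenate(partial_outputs)

# Model parallelism: the model itself is split across workers; here worker A
# owns layer 1 and worker B owns layer 2, so activations flow between them.
hidden = np.maximum(batch @ W1, 0)                  # computed on worker A
model_parallel_out = hidden @ W2                    # computed on worker B

assert np.allclose(data_parallel_out, model_parallel_out)
```
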
Lots of Different Algorithms
- All-reduce
- Iterative reduce
- Pure model parallelism
- Parameter averaging is key here

Core Ideas
- Partition the problem into chunks
- Can be the neural net as well as the data
- Use as many CUDA or CPU cores as possible

How Does Parameter Averaging Work?
- Replicate the model across the cluster
- Train on different portions of the data with the same model
- Synchronize as little as possible while still producing a good model (the sketch below syncs every k local steps)
- Hyperparameters should be more aggressive (higher learning rates)

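A minimal simulation of the procedure, assuming a linear least-squares model and made-up constants (the sync interval, learning rate, and shard sizes are illustrative): each replica starts a round from the shared parameters, runs k local minibatch SGD steps on its own shard, and the replicas' parameter vectors are then averaged to form the new shared model.

```python
import numpy as np

def local_sgd_steps(params, X_shard, y_shard, lr, steps, batch, rng):
    """Run `steps` minibatch SGD updates for a linear least-squares model."""
    w = params.copy()
    for _ in range(steps):
        idx = rng.integers(0, len(X_shard), size=batch)
        grad = X_shard[idx].T @ (X_shard[idx] @ w - y_shard[idx]) / batch
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 20))
true_w = rng.normal(size=20)
y = X @ true_w

n_workers, sync_every = 4, 10          # average parameters every 10 local steps
shards = np.array_split(np.arange(len(X)), n_workers)
global_w = np.zeros(20)

for round_ in range(50):
    # Each replica starts the round from the shared parameters, trains locally
    # on its own shard, then the resulting parameter vectors are averaged.
    local = [local_sgd_steps(global_w, X[s], y[s], lr=0.05, steps=sync_every,
                             batch=64, rng=rng) for s in shards]
    global_w = np.mean(local, axis=0)

print(np.linalg.norm(global_w - true_w))   # shrinks as the rounds progress
```
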
All-Reduce
http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf

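The linked lecture covers all-reduce; the sketch below is a single-process simulation of the ring variant (indices, sizes, and names are illustrative): each worker's gradient is split into chunks that travel around a ring, first accumulating partial sums (reduce-scatter) and then being redistributed (all-gather), so every worker ends up with the full sum without any central node.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends up with the element-wise
    sum of all workers' gradient vectors, with no central coordinator."""
    n = len(grads)
    # Each worker splits its own gradient into n chunks.
    chunks = [np.array_split(g.astype(float).copy(), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the fully summed chunk (i+1) % n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t - 1) % n
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % n][c]

    # All-gather: the fully summed chunks travel around the ring until
    # every worker holds all of them.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            chunks[i][c] = chunks[(i - 1) % n][c].copy()

    return [np.concatenate(c) for c in chunks]

rng = np.random.default_rng(0)
worker_grads = [rng.normal(size=16) for _ in range(4)]
reduced = ring_allreduce(worker_grads)
expected = np.sum(worker_grads, axis=0)
assert all(np.allclose(r, expected) for r in reduced)
```
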
Iterative Reduce (Parameter Averaging)
Natural Gradient (ICLR 2015)
https://arxiv.org/abs/1410.7455 - syncs every k data points

Tuning Distributed Training
- Averaging acts as a form of regularization
- Needs more aggressive hyperparameters (see the scaling heuristic below)
- Not always going to be faster; account for how many data points you have
- Distributed-systems wisdom applies here: send code to the data, not the other way around
- Reduce communication overhead for maximum performance
- Still a lot of experimentation to be done here

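One common way to make hyperparameters "more aggressive" under data parallelism is to scale the learning rate with the number of replicas, since the effective batch grows with the worker count. This linear-scaling rule is a widely used heuristic rather than something the slides prescribe, so treat the sketch below purely as a tuning starting point.

```python
def scaled_hyperparameters(base_lr, base_batch, n_workers):
    """Heuristic (not from the slides): with n_workers data-parallel replicas
    the effective batch grows n_workers-fold, so the learning rate is often
    scaled up by the same factor. Treat the result as a starting point to tune."""
    return {
        "effective_batch": base_batch * n_workers,
        "learning_rate": base_lr * n_workers,   # linear scaling heuristic
    }

print(scaled_hyperparameters(base_lr=0.01, base_batch=128, n_workers=8))
# {'effective_batch': 1024, 'learning_rate': 0.08}
```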