Brief Introduction to Distributed Deep Learning
TRANSCRIPT
Distributed Deep Learning - An Overview
Adam Gibson, Skymind - May 2016, Korea
Neural Net Training Basics
- Vectorization / different kinds of data
- Parameters - a whole neural net consists of a graph and a parameter vector
- Minibatches - neural net data requires lots of RAM, so training is done on minibatches
Vectorization
- Images
- Text
- Audio
- Video
- CSVs / structured data
- Web logs
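Whatever the modality, each input ends up as a vector (or tensor) of floats. A minimal sketch of two common cases, with made-up shapes and field names (nothing here is prescribed by the talk):

    import numpy as np

    # An image: scale 8-bit pixels to [0, 1] and flatten to one feature vector.
    image = np.random.randint(0, 256, size=(28, 28))     # stand-in for real pixels
    image_vector = (image / 255.0).reshape(-1)           # shape: (784,)

    # A structured CSV row: numeric fields pass through; categorical
    # fields become one-hot vectors.
    categories = ["cat", "dog", "bird"]
    row = {"height": 12.0, "weight": 3.5, "label": "dog"}
    one_hot = [1.0 if row["label"] == c else 0.0 for c in categories]
    row_vector = np.concatenate([[row["height"], row["weight"]], one_hot])
    print(image_vector.shape, row_vector)                # (784,) [12.  3.5 0. 1. 0.]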
Parameters / Neural Net Structure
- Computation graph - a neural net is just a DAG of ndarrays/tensors
- The parameters of a neural net can be turned into a single vector representing all the connections/weights in the graph (sketch below)
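That "one big parameter vector" view is what makes the synchronization schemes later in the talk simple to state. A sketch of flattening layer weights into one vector and restoring them (the layer shapes are illustrative):

    import numpy as np

    # Per-layer weight matrices of a small net (shapes made up for illustration).
    layers = [np.random.randn(784, 128), np.random.randn(128, 10)]

    # Flatten every layer into one parameter vector...
    flat = np.concatenate([w.ravel() for w in layers])

    # ...and restore the original layer shapes from that vector.
    def unflatten(vector, shapes):
        out, offset = [], 0
        for shape in shapes:
            size = int(np.prod(shape))
            out.append(vector[offset:offset + size].reshape(shape))
            offset += size
        return out

    restored = unflatten(flat, [w.shape for w in layers])
    assert all(np.array_equal(a, b) for a, b in zip(layers, restored))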
Minibatches
- Data is partitioned into subsamples
- Fits on a GPU
- Trains faster
- Should be a representative sample (every label present) as evenly as possible (sketch below)
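One simple way to keep every label present in each batch is to interleave the label groups before cutting batches. A sketch, with function and variable names of my own choosing:

    import numpy as np

    def stratified_minibatches(features, labels, batch_size):
        """Yield minibatches whose label mix roughly matches the dataset's."""
        order = np.argsort(labels, kind="stable")   # group example indices by label
        n = len(labels)
        # Striding through the label-sorted order interleaves the classes,
        # so each batch sees every label about equally often.
        stride = max(1, n // batch_size)
        interleaved = np.concatenate([order[i::stride] for i in range(stride)])
        for start in range(0, n, batch_size):
            idx = interleaved[start:start + batch_size]
            yield features[idx], labels[idx]

    X = np.random.randn(1000, 20)
    y = np.random.randint(0, 10, size=1000)
    for xb, yb in stratified_minibatches(X, y, batch_size=100):
        # with ~100 examples per label, every batch contains all 10 labels
        assert len(np.unique(yb)) == 10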
Distributed Training
- Multiple computers
- Multiple GPUs
- Multiple GPUs AND multiple computers
- Different kinds of parallelism
- Lots of different algorithms
Multiple Computers
- Distributed systems - connect/coordinate computers over a cluster
- Hadoop
- HPC (MPI and friends; sketch below)
- Client/server architecture
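On the HPC route, MPI gives every process a rank plus collective operations to coordinate with. A minimal sketch using mpi4py (my choice of binding; the talk doesn't prescribe one), broadcasting initial parameters so all machines start from the same model:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Rank 0 owns the initial parameters; Bcast copies them to every worker.
    params = np.random.randn(1000) if rank == 0 else np.empty(1000)
    comm.Bcast(params, root=0)
    print(f"worker {rank}/{size} received {params.shape[0]} parameters")

Launched with, say, "mpirun -n 4 python script.py", each process prints the same parameter count.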
Multiple GPUs
- Single box
- Could be multiple host threads
- RDMA (Remote Direct Memory Access) interconnect
- NVLink
- Typically used on a data center rack
- Break the problem up, share data across GPUs (sketch below)
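"Break the problem up" most often means splitting each minibatch into one shard per GPU and combining the resulting gradients. A device-agnostic sketch of that split (NumPy stands in for real GPU buffers, and the per-shard "gradient" is a placeholder):

    import numpy as np

    num_gpus = 4
    batch = np.random.randn(512, 784)          # one minibatch on the host

    # One shard per GPU; each would be copied to its device and run
    # through a replica of the model there.
    shards = np.array_split(batch, num_gpus)

    # Placeholder per-shard "gradients"; a real setup would backprop on-device.
    grads = [shard.mean(axis=0) for shard in shards]

    # Combine shard gradients into one update, as if reduced over the interconnect.
    combined = np.mean(grads, axis=0)
    print([s.shape for s in shards], combined.shape)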
Multiple GPUs and Multiple Computers
- Coordinate the problem over a cluster
- Use GPUs for compute
- Can be done via MPI or Hadoop (host thread coordination)
- Parameter server - synchronizes parameters through a master, and also handles things like the GPU interconnect (sketch below)
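The parameter server pattern is simple at its core: workers pull the current parameters, compute gradients on their own data, and push updates back to the master. A single-process sketch of that protocol (class and method names are illustrative; real systems do this over the network):

    import numpy as np

    class ParameterServer:
        """Master that owns the canonical parameters."""
        def __init__(self, dim, lr=0.1):
            self.params = np.zeros(dim)
            self.lr = lr

        def pull(self):
            return self.params.copy()

        def push(self, grad):
            self.params -= self.lr * grad          # apply a worker's gradient

    def worker_step(server, data_shard):
        params = server.pull()                     # fetch latest parameters
        grad = params - data_shard.mean(axis=0)    # placeholder gradient
        server.push(grad)

    server = ParameterServer(dim=10)
    shards = np.array_split(np.random.randn(400, 10), 4)  # one shard per worker
    for shard in shards:
        worker_step(server, shard)
    print(server.params)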
Different Kinds of Parallelism
- Data parallelism
- Model parallelism
- Both?
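The distinction in one toy sketch: data parallelism replicates the whole model and splits the data, while model parallelism splits the model itself across devices (NumPy stands in for the devices; the two-layer shapes are made up):

    import numpy as np

    X = np.random.randn(256, 784)
    W1, W2 = np.random.randn(784, 128), np.random.randn(128, 10)

    # Data parallelism: every device holds (W1, W2); each gets a slice of X.
    data_shards = np.array_split(X, 4)             # 4 devices, 4 data slices

    # Model parallelism: each device holds one layer; activations flow between them.
    hidden = np.maximum(X @ W1, 0)                 # "device 0" computes layer 1
    logits = hidden @ W2                           # "device 1" computes layer 2
    print(len(data_shards), logits.shape)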
Lots of Different Algorithms
- All-reduce
- Iterative reduce
- Pure model parallelism
- Parameter averaging is key here
Core Ideas
- Partition the problem into chunks
- Can be the neural net as well as the data
- Use as many CUDA or CPU cores as possible
How Does Parameter Averaging Work?
- Replicate the model across the cluster
- Train on different portions of the data with the same model
- Synchronize as minimally as possible while still producing a good model
- Hyperparameters should be more aggressive (higher learning rates); see the sketch below
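Putting those bullets together, combined with the "synchronize every k data points" idea from the Natural Gradient slide below: each replica takes k local SGD steps on its own shard, then all replicas average their parameter vectors. A self-contained sketch (all names and the toy least-squares objective are mine):

    import numpy as np

    def local_sgd_steps(params, shard, lr, k):
        """Take k SGD steps on a toy least-squares objective."""
        for i in range(k):
            x = shard[i % len(shard)]
            grad = 2 * (params - x)          # gradient of ||params - x||^2
            params = params - lr * grad
        return params

    num_workers, dim, k, lr = 4, 10, 50, 0.05   # note the fairly aggressive lr
    shards = [np.random.randn(200, dim) + 3.0 for _ in range(num_workers)]
    params = np.zeros(dim)                      # replicated starting model

    for sync_round in range(20):
        # Each replica trains independently on its own shard...
        replicas = [local_sgd_steps(params.copy(), s, lr, k) for s in shards]
        # ...then synchronize once per round by averaging parameter vectors.
        params = np.mean(replicas, axis=0)

    print(params.round(2))                      # converges near the data mean (~3)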
All Reduce
http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf
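All-reduce removes the central master entirely: every worker contributes its gradient, and every worker receives the combined result. A minimal sketch with mpi4py's built-in Allreduce (the local gradient is a placeholder):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    # Each worker computes a gradient on its own data shard (placeholder here).
    local_grad = np.random.randn(1000)

    # Sum the gradients across all workers; every worker gets the same result.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= size                         # average rather than sum

    # Every worker now applies the identical update, so replicas stay in sync.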
Iterative Reduce (Parameter Averaging)
Natural Gradient (ICLR 2015)
https://arxiv.org/abs/1410.7455 - synchronize every k data points
Tuning Distributed Training
- Averaging acts as a form of regularization
- Needs more aggressive hyperparameters
- Not always going to be faster - account for how many data points you have
- Distributed systems principles apply here: send the code to the data, not the other way around
- Reduce communication overhead for maximum performance
- There is still a lot of experimentation to be done here