Brief Introduction to Distributed Deep Learning
Distributed Deep Learning - An Overview
Adam Gibson, Skymind, May 2016, Korea

Neural Net Training Basics
- Vectorization / different kinds of data
- Parameters: a whole neural net consists of a graph and a parameter vector
- Minibatches: neural net data requires lots of RAM, so training must be done in minibatches

Vectorization
- Images
- Text
- Audio
- Video
- CSVs / structured data
- Web logs

Parameters / Neural Net Structure
- Computation graph: a neural net is just a DAG of ndarrays/tensors
- The parameters of a neural net can be flattened into a single vector representing all the connections/weights in the graph (see the sketch below)

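As a concrete illustration of that last point, here is a minimal NumPy sketch (the layer shapes and helper names are made up for this example, not from the talk) that flattens per-layer weight matrices into one parameter vector and restores them. It is this flat parameter vector that schemes like parameter averaging operate on.

```python
import numpy as np

# Illustrative per-layer weight shapes for a small feed-forward net.
layer_shapes = [(784, 256), (256, 10)]
layers = [np.random.randn(*s).astype(np.float32) for s in layer_shapes]

def flatten_params(layers):
    """Concatenate every layer's weights into one 1-D parameter vector."""
    return np.concatenate([w.ravel() for w in layers])

def unflatten_params(vector, shapes):
    """Split a flat parameter vector back into per-layer weight matrices."""
    out, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(vector[offset:offset + size].reshape(shape))
        offset += size
    return out

flat = flatten_params(layers)              # shape: (784*256 + 256*10,)
restored = unflatten_params(flat, layer_shapes)
assert all(np.array_equal(a, b) for a, b in zip(layers, restored))
```
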
Minibatches
- Data is partitioned into sub-samples (sketched below)
- Fits on the GPU
- Trains faster
- Each minibatch should be a representative sample (every label present), as evenly as possible

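A minimal sketch of the idea, assuming a NumPy feature matrix and label vector (all names illustrative): the data is shuffled and sliced into fixed-size minibatches. Plain shuffling only approximates the "every label present" goal; a stratified split would enforce it strictly.

```python
import numpy as np

def minibatches(features, labels, batch_size, rng):
    """Shuffle the dataset, then yield fixed-size (features, labels) minibatches."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784)).astype(np.float32)   # toy feature matrix
y = rng.integers(0, 10, size=1000)                     # toy labels

for xb, yb in minibatches(X, y, batch_size=128, rng=rng):
    pass  # one gradient step per minibatch would go here
```
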
Distributed Training
- Multiple computers
- Multiple GPUs
- Multiple GPUs AND multiple computers
- Different kinds of parallelism
- Lots of different algorithms

Multiple Computers
- Distributed systems: connect/coordinate computers over a cluster
- Hadoop
- HPC (MPI and friends)
- Client/server architecture

Multiple GPUs
- Single box
- Could be multiple host threads
- RDMA (Remote Direct Memory Access) interconnect
- NVLink
- Typically used on a data center rack
- Break the problem up; share data across GPUs

Multiple GPUs and Multiple Computers
- Coordinate the problem over the cluster
- Use GPUs for compute
- Can be done via MPI or Hadoop (host-thread coordination)
- Parameter server: synchronizes parameters through a master node, and also handles things like the GPU interconnect (a minimal sketch follows this list)

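As a rough, single-process simulation of the parameter-server pattern (the linear model, learning rate, and class names are illustrative assumptions, not any particular framework's API): each worker pulls the current parameters, computes a gradient on its own data shard, and pushes it to a central server that averages the gradients and applies the update.

```python
import numpy as np

class ParameterServer:
    """Holds the master copy of the parameters and applies averaged gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def push(self, gradients):
        """Average the workers' gradients and take one SGD step."""
        self.params -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        """Hand a copy of the current parameters back to a worker."""
        return self.params.copy()

def worker_gradient(params, X_shard, y_shard):
    """Mean-squared-error gradient for a linear model on this worker's shard."""
    preds = X_shard @ params
    return X_shard.T @ (preds - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 10))
y = X @ rng.normal(size=10)                        # synthetic regression targets
shards = np.array_split(np.arange(len(X)), 4)      # 4 simulated workers

server = ParameterServer(dim=10)
for step in range(200):
    params = server.pull()                                          # workers pull
    grads = [worker_gradient(params, X[s], y[s]) for s in shards]   # local compute
    server.push(grads)                                              # server averages + updates
```
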
Different Kinds of Parallelism
- Data parallelism
- Model parallelism
- Both? (the sketch below contrasts the first two)

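In data parallelism every worker holds the full model but only a slice of the minibatch; in model parallelism the model itself is split across workers, each seeing the whole minibatch. A tiny NumPy forward-pass sketch of the contrast (weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(256, 784))                 # one minibatch
W1 = rng.normal(size=(784, 512)) * 0.01             # layer 1 weights
W2 = rng.normal(size=(512, 10)) * 0.01              # layer 2 weights

# Data parallelism: every worker holds the full model (W1, W2) but only a
# slice of the minibatch; in training, their gradients would be averaged.
data_shards = np.array_split(batch, 4)
partial_outputs = [np.maximum(shard @ W1, 0) @ W2 for shard in data_shards]
data_parallel_out = np.concatenate(partial_outputs)

# Model parallelism: the model itself is split across workers; here worker A
# owns layer 1 and worker B owns layer 2, so activations flow between them.
hidden = np.maximum(batch @ W1, 0)                  # computed on worker A
model_parallel_out = hidden @ W2                    # computed on worker B

assert np.allclose(data_parallel_out, model_parallel_out)
```
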
Lots of Different Algorithms
- All-reduce
- Iterative reduce
- Pure model parallelism
- Parameter averaging is key here

Core Ideas
- Partition the problem into chunks
- Can be the neural net as well as the data
- Use as many CUDA or CPU cores as possible

How Does Parameter Averaging Work?
- Replicate the model across the cluster
- Train on different portions of the data with the same model
- Synchronize as little as possible while still producing a good model (the sketch below syncs every k local steps)
- Hyperparameters should be more aggressive (higher learning rates)

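A minimal simulation of the procedure, assuming a linear least-squares model and made-up constants (the sync interval, learning rate, and shard sizes are illustrative): each replica starts a round from the shared parameters, runs k local minibatch SGD steps on its own shard, and the replicas' parameter vectors are then averaged to form the new shared model.

```python
import numpy as np

def local_sgd_steps(params, X_shard, y_shard, lr, steps, batch, rng):
    """Run `steps` minibatch SGD updates for a linear least-squares model."""
    w = params.copy()
    for _ in range(steps):
        idx = rng.integers(0, len(X_shard), size=batch)
        grad = X_shard[idx].T @ (X_shard[idx] @ w - y_shard[idx]) / batch
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 20))
true_w = rng.normal(size=20)
y = X @ true_w

n_workers, sync_every = 4, 10          # average parameters every 10 local steps
shards = np.array_split(np.arange(len(X)), n_workers)
global_w = np.zeros(20)

for round_ in range(50):
    # Each replica starts the round from the shared parameters, trains locally
    # on its own shard, then the resulting parameter vectors are averaged.
    local = [local_sgd_steps(global_w, X[s], y[s], lr=0.05, steps=sync_every,
                             batch=64, rng=rng) for s in shards]
    global_w = np.mean(local, axis=0)

print(np.linalg.norm(global_w - true_w))   # shrinks as the rounds progress
```
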
All-Reduce
http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf

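The linked lecture covers all-reduce; the sketch below is a single-process simulation of the ring variant (indices, sizes, and names are illustrative): each worker's gradient is split into chunks that travel around a ring, first accumulating partial sums (reduce-scatter) and then being redistributed (all-gather), so every worker ends up with the full sum without any central node.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends up with the element-wise
    sum of all workers' gradient vectors, with no central coordinator."""
    n = len(grads)
    # Each worker splits its own gradient into n chunks.
    chunks = [np.array_split(g.astype(float).copy(), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the fully summed chunk (i+1) % n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t - 1) % n
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % n][c]

    # All-gather: the fully summed chunks travel around the ring until
    # every worker holds all of them.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            chunks[i][c] = chunks[(i - 1) % n][c].copy()

    return [np.concatenate(c) for c in chunks]

rng = np.random.default_rng(0)
worker_grads = [rng.normal(size=16) for _ in range(4)]
reduced = ring_allreduce(worker_grads)
expected = np.sum(worker_grads, axis=0)
assert all(np.allclose(r, expected) for r in reduced)
```
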
Iterative Reduce (Parameter Averaging)
Natural Gradient (ICLR 2015)
https://arxiv.org/abs/1410.7455 - syncs every k data points

Tuning Distributed Training
- Averaging acts as a form of regularization
- Needs more aggressive hyperparameters (see the scaling heuristic below)
- Not always going to be faster; account for how many data points you have
- Distributed-systems wisdom applies here: send code to the data, not the other way around
- Reduce communication overhead for maximum performance
- Still a lot of experimentation to be done here

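One common way to make hyperparameters "more aggressive" under data parallelism is to scale the learning rate with the number of replicas, since the effective batch grows with the worker count. This linear-scaling rule is a widely used heuristic rather than something the slides prescribe, so treat the sketch below purely as a tuning starting point.

```python
def scaled_hyperparameters(base_lr, base_batch, n_workers):
    """Heuristic (not from the slides): with n_workers data-parallel replicas
    the effective batch grows n_workers-fold, so the learning rate is often
    scaled up by the same factor. Treat the result as a starting point to tune."""
    return {
        "effective_batch": base_batch * n_workers,
        "learning_rate": base_lr * n_workers,   # linear scaling heuristic
    }

print(scaled_hyperparameters(base_lr=0.01, base_batch=128, n_workers=8))
# {'effective_batch': 1024, 'learning_rate': 0.08}
```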