Brief Introduction to Distributed Deep Learning

Distributed Deep Learning - An Overview. Adam Gibson, Skymind. May 2016, Korea.


TRANSCRIPT

Page 1: Brief Introduction to Distributed Deep Learning

Distributed Deep Learning - An Overview

Adam Gibson, Skymind. May 2016, Korea

Page 2: Neural net training basics

- Vectorization / different kinds of data
- Parameters - a whole neural net consists of a graph and a parameter vector
- Minibatches - neural net data requires lots of RAM, so training has to be done in minibatches

Page 3: Vectorization

- Images
- Text
- Audio
- Video
- CSVs / structured data
- Web logs
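Whatever the raw data type, it has to end up as ndarrays before training. Below is a minimal sketch in plain NumPy of vectorizing a structured CSV row and a short piece of text; the column names, label set, and toy vocabulary are made up for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical CSV row: age, height_cm, label ("cat" / "dog")
row = ["42", "170.5", "cat"]
labels = {"cat": 0, "dog": 1}

# Numeric columns become floats; the label becomes a one-hot vector.
features = np.array([float(row[0]), float(row[1])])
label = np.zeros(len(labels))
label[labels[row[2]]] = 1.0

# Text: a bag-of-words count vector over a toy vocabulary.
vocab = {"deep": 0, "learning": 1, "gpu": 2}
sentence = "deep learning on a gpu gpu".split()
text_vec = np.zeros(len(vocab))
for word in sentence:
    if word in vocab:
        text_vec[vocab[word]] += 1

print(features, label, text_vec)
```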

Page 4: Parameters / neural net structure

- Computation graph - a neural net is just a DAG of ndarrays/tensors
- The parameters of a neural net can be made into a single vector representing all the connections/weights in the graph
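As a rough illustration of the "parameters as one vector" view, here is a small NumPy sketch that flattens a two-layer net's weights into a single vector and restores them; the layer sizes are arbitrary, and this ignores how a real framework tracks the graph structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer net: each layer is just a weight matrix and a bias.
params = {
    "W1": rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 2)), "b2": np.zeros(2),
}

# Flatten every parameter array into one long vector ...
flat = np.concatenate([p.ravel() for p in params.values()])

# ... and restore the original shapes from that vector.
def unflatten(vec, template):
    out, offset = {}, 0
    for name, p in template.items():
        out[name] = vec[offset:offset + p.size].reshape(p.shape)
        offset += p.size
    return out

restored = unflatten(flat, params)
assert all(np.array_equal(params[k], restored[k]) for k in params)
print("total parameters:", flat.size)
```

This flat-vector view is what makes the averaging and synchronization schemes on the later slides easy to state: they operate on one long vector per worker.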

Page 5: Minibatches

- Data is partitioned into sub-samples
- Fits on the GPU
- Trains faster
- Should be a representative sample (every label present) spread as evenly as possible
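A minimal sketch of minibatch iteration, using plain NumPy and random toy data; a real pipeline would also stratify or otherwise balance the labels rather than rely on shuffling alone.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))        # toy features
y = rng.integers(0, 10, size=1000)     # toy labels, 10 classes

def minibatches(X, y, batch_size=64):
    # Shuffle once per epoch so each minibatch is roughly a
    # representative sample of the label distribution.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

for xb, yb in minibatches(X, y):
    pass  # one gradient step per minibatch would go here
```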

Page 6: Distributed training

- Multiple computers
- Multiple GPUs
- Multiple GPUs AND multiple computers
- Different kinds of parallelism
- Lots of different algorithms

Page 7: Multiple computers

- Distributed systems - connect/coordinate computers over a cluster
- Hadoop
- HPC (MPI and friends)
- Client/server architecture

Page 8: Multiple GPUs

- Single box
- Could be multiple host threads
- RDMA (Remote Direct Memory Access) interconnect
- NVLink
- Typically used on a data center rack
- Break the problem up and share data across GPUs

Page 9: Multiple GPUs and multiple computers

- Coordinate the problem over a cluster
- Use GPUs for compute
- Can be done via MPI or Hadoop (host thread coordination)
- Parameter server - synchronizes parameters through a master and also handles things like GPU interconnect
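A toy, single-process sketch of the parameter-server idea: one process holds the authoritative parameters, workers pull them, compute an update on their own data shard, and push it back. The class name and the fake gradient function are made up for illustration; real systems run the workers in parallel and also deal with GPU memory and interconnect, which this ignores.

```python
import numpy as np

class ParameterServer:
    """Holds the authoritative copy of the parameter vector."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def pull(self):
        return self.params.copy()

    def push(self, gradient, lr=0.1):
        # Workers push gradients; the server applies them.
        self.params -= lr * gradient

def worker_step(params, data_shard):
    # Stand-in for a real gradient computation on this worker's shard.
    return params - data_shard.mean(axis=0)

rng = np.random.default_rng(0)
server = ParameterServer(dim=8)
shards = [rng.normal(size=(100, 8)) for _ in range(4)]  # one shard per worker

for step in range(10):
    for shard in shards:             # sequential stand-in for parallel workers
        grad = worker_step(server.pull(), shard)
        server.push(grad)
```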

Page 10: Different kinds of parallelism

- Data parallelism
- Model parallelism
- Both?
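The difference can be shown on a toy two-layer matrix-multiply "network" (NumPy, arbitrary sizes). Both variants compute the same result; they just split the work differently - data parallelism splits the batch, model parallelism splits the layers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
W1, W2 = rng.normal(size=(4, 6)), rng.normal(size=(6, 2))

# Data parallelism: every worker holds the full model (W1, W2)
# but only sees its own slice of the batch.
data_shards = np.array_split(X, 2)
data_parallel_out = np.concatenate([s @ W1 @ W2 for s in data_shards])

# Model parallelism: every worker sees the full batch but holds
# only part of the model (here, one layer each).
hidden = X @ W1                    # worker A owns W1
model_parallel_out = hidden @ W2   # worker B owns W2

assert np.allclose(data_parallel_out, X @ W1 @ W2)
assert np.allclose(model_parallel_out, X @ W1 @ W2)
```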

Page 11: Lots of different algorithms

- All-reduce
- Iterative reduce
- Pure model parallelism
- Parameter averaging is key here

Page 12: Core ideas

- Partition the problem into chunks - can be the neural net as well as the data
- Use as many CUDA or CPU cores as possible

Page 13: How does parameter averaging work?

- Replicate the model across the cluster
- Train on different portions of the data with the same model
- Synchronize as little as possible while still producing a good model
- Hyperparameters should be more aggressive (higher learning rates)
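A rough single-process sketch of round-by-round parameter averaging: replicate the current global parameters, let each worker train locally on its own shard, then average the resulting parameter vectors. The local_sgd function is a stand-in for real minibatch SGD, and the shard contents are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(params, shard, lr=0.5, steps=5):
    # Stand-in for a few local gradient steps on one worker's shard.
    for _ in range(steps):
        params = params - lr * (params - shard.mean(axis=0))
    return params

global_params = np.zeros(8)
shards = [rng.normal(loc=i, size=(200, 8)) for i in range(4)]  # 4 workers

for round_ in range(3):
    # 1. Replicate the current global parameters to every worker.
    # 2. Each worker trains on its own portion of the data.
    replicas = [local_sgd(global_params.copy(), shard) for shard in shards]
    # 3. Synchronize by averaging the workers' parameter vectors.
    global_params = np.mean(replicas, axis=0)

print(global_params)
```

Averaging only every few local steps is what keeps communication low; the trade-off is that each replica drifts between synchronizations, which is why the slides suggest more aggressive hyperparameters.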

Page 14: All-reduce

http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf
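The linked NYU lecture notes cover the details; as a rough sketch of what all-reduce delivers (every worker ends up holding the same reduced value), here is a naive NumPy stand-in, not an efficient ring or tree implementation and not an actual MPI call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each worker computes a gradient on its own minibatch.
local_grads = [rng.normal(size=8) for _ in range(4)]

def allreduce_mean(grads):
    # Naive all-reduce: sum every worker's contribution, then give
    # every worker the same averaged result. Real implementations
    # (ring/tree all-reduce, MPI_Allreduce) avoid gathering
    # everything in one place.
    total = np.sum(grads, axis=0)
    return [total / len(grads) for _ in grads]

synced = allreduce_mean(local_grads)
assert all(np.array_equal(synced[0], g) for g in synced)
```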

Page 15: Iterative Reduce (Parameter Averaging)

Page 16: Natural Gradient (ICLR 2015)

https://arxiv.org/abs/1410.7455 - sync every k data points

Page 17: Tuning distributed training

- Averaging acts as a form of regularization
- Needs more aggressive hyperparameters
- Not always going to be faster - account for the number of data points you have
- Distributed systems thinking applies here: send code to the data, not the other way around
- Reduce communication overhead for maximum performance
- Lots of experimentation still to be done here