Brief Introduction to Distributed Deep Learning


Distributed Deep Learning - An Overview

Adam Gibson, Skymind - May 2016, Korea

Neural Net Training Basics
- Vectorization / different kinds of data
- Parameters - a whole neural net consists of a graph and a parameter vector
- Minibatches - neural net data requires lots of RAM; need to do minibatch training

Vectorization
- Images
- Text
- Audio
- Video
- CSVs / structured data
- Web logs
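As a rough illustration of what vectorization means in practice, here is a minimal NumPy sketch that turns one CSV-style record into a feature ndarray and a one-hot label. The column layout and the label vocabulary are made up for the example.

```python
import numpy as np

LABELS = ["cat", "dog", "bird"]                # assumed label vocabulary

def vectorize_row(row):
    """Split 'feat,feat,...,label' and return (feature vector, one-hot label)."""
    *features, label = row.strip().split(",")
    x = np.array([float(v) for v in features], dtype=np.float32)
    y = np.zeros(len(LABELS), dtype=np.float32)
    y[LABELS.index(label)] = 1.0
    return x, y

x, y = vectorize_row("0.5,1.2,3.3,dog")
print(x, y)                                    # [0.5 1.2 3.3] [0. 1. 0.]
```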

Parameters / Neural Net Structure
- Computation graph - a neural net is just a DAG of ndarrays/tensors
- The parameters of a neural net can be made into a vector representing all the connections/weights in the graph
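A minimal sketch of the parameters-as-a-vector idea, assuming a made-up two-layer MLP: the graph fixes the shapes, and all of the weights/biases can be flattened into (and recovered from) a single vector.

```python
import numpy as np

rng = np.random.default_rng(0)
params = {                                     # a tiny two-layer MLP (hypothetical sizes)
    "W1": rng.standard_normal((4, 8)), "b1": np.zeros(8),
    "W2": rng.standard_normal((8, 3)), "b2": np.zeros(3),
}

def flatten(params):
    """Concatenate every weight/bias into one flat parameter vector."""
    return np.concatenate([p.ravel() for p in params.values()])

def unflatten(vec, like):
    """Rebuild the original shapes from the flat vector (inverse of flatten)."""
    out, i = {}, 0
    for name, p in like.items():
        out[name] = vec[i:i + p.size].reshape(p.shape)
        i += p.size
    return out

theta = flatten(params)                        # shape (4*8 + 8 + 8*3 + 3,) = (67,)
assert np.allclose(unflatten(theta, params)["W2"], params["W2"])
```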

Minibatches
- Data is partitioned into subsamples
- Fits on the GPU
- Trains faster
- Should be a representative sample (every label present) as evenly as possible
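A minimal minibatching sketch in NumPy with hypothetical sizes: shuffling once per epoch spreads the labels roughly evenly across the fixed-size chunks that get fed to the GPU.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once, then yield (features, labels) chunks of batch_size rows."""
    idx = rng.permutation(len(X))              # shuffling spreads labels across batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32)).astype(np.float32)
y = rng.integers(0, 3, size=1000)              # 3 hypothetical classes
for xb, yb in minibatches(X, y, batch_size=128, rng=rng):
    pass                                       # one SGD step per minibatch would go here
```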

Distributed Training
- Multiple computers
- Multiple GPUs
- Multiple GPUs AND multiple computers
- Different kinds of parallelism
- Lots of different algorithms

Multiple Computers
- Distributed systems - connect/coordinate computers over a cluster
- Hadoop
- HPC (MPI and friends)
- Client/server architecture

Multiple GPUs
- Single box
- Could be multiple host threads
- RDMA (Remote Direct Memory Access) interconnect
- NVLink
- Typically used on a data center rack
- Break the problem up, share data across GPUs
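A toy sketch of the "break the problem up, share data across GPUs" idea, with the devices only simulated: a real implementation would run each shard on its own GPU, typically from its own host thread, over NVLink or RDMA.

```python
import numpy as np

NUM_GPUS = 4                                      # made-up device count
big_batch = np.random.randn(512, 32).astype(np.float32)

shards = np.array_split(big_batch, NUM_GPUS)      # one shard per device
for device_id, shard in enumerate(shards):
    # a real implementation would launch this shard's forward/backward
    # pass on GPU `device_id`, typically from a separate host thread
    print(f"gpu {device_id}: shard shape {shard.shape}")
```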

Multiple GPUs and Multiple Computers
- Coordinate the problem over a cluster
- Use GPUs for compute
- Can be done via MPI or Hadoop (host thread coordination)
- Parameter server - synchronizes parameters over a master, as well as handling things like the GPU interconnect
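A single-process sketch of the parameter-server idea, with hypothetical class and method names: workers push locally trained parameter copies to a master, which averages them and publishes the result. Real systems do this over the network and the GPU interconnect; only the data flow is shown here.

```python
import numpy as np

class ParameterServer:
    """Holds the master copy of the parameters and averages worker pushes."""
    def __init__(self, init_params):
        self.params = init_params.copy()
        self.pending = []

    def push(self, worker_params):               # a worker sends its local copy
        self.pending.append(worker_params)

    def synchronize(self):                       # master averages and publishes
        self.params = np.mean(self.pending, axis=0)
        self.pending = []
        return self.params

server = ParameterServer(np.zeros(10))
for worker in range(4):                          # four simulated workers
    server.push(np.random.randn(10))             # their locally updated parameters
new_global = server.synchronize()                # each worker would pull this copy
```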

Different Kinds of Parallelism
- Data parallelism
- Model parallelism
- Both?
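A small NumPy contrast of the two schemes, with made-up sizes: data parallelism splits the batch and replicates the weights, model parallelism splits the weights and replicates the batch; both produce the same layer output, just partitioned differently.

```python
import numpy as np

devices = 2
X = np.random.randn(256, 64)                   # one minibatch
W = np.random.randn(64, 128)                   # one layer's weight matrix

# Data parallelism: replicate W, split the batch across devices
out_data = np.concatenate(
    [x @ W for x in np.array_split(X, devices, axis=0)], axis=0)

# Model parallelism: replicate the batch, split W's output columns across devices
out_model = np.concatenate(
    [X @ w for w in np.array_split(W, devices, axis=1)], axis=1)

# Both schemes compute the same layer output
assert np.allclose(out_data, X @ W) and np.allclose(out_model, X @ W)
```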

Lots of Different Algorithms
- All-reduce
- Iterative reduce
- Pure model parallelism
- Parameter averaging is key here

Core Ideas
- Partition the problem into chunks
- Can be the neural net as well as the data
- Use as many CUDA or CPU cores as possible

How Does Parameter Averaging Work?
- Replicate the model across the cluster
- Train on different portions of the data with the same model
- Synchronize as minimally as possible while still producing a good model
- Hyperparameters should be more aggressive (higher learning rates)
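A toy end-to-end sketch of parameter averaging on a linear model (NumPy only; sizes, learning rate, and step counts are invented): the parameters are replicated, each replica runs several local gradient steps on its own data shard, and the replicas are then averaged, so synchronization happens only once per round.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 10))
true_w = rng.standard_normal(10)
y = X @ true_w                                    # noiseless toy regression targets

def local_train(w, Xs, ys, lr=0.05, steps=20):
    """Plain full-batch gradient steps on one worker's shard, starting from w."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * Xs.T @ (Xs @ w - ys) / len(Xs)
        w -= lr * grad
    return w

workers = 4
w_global = np.zeros(10)
for sync_round in range(10):
    replicas = [local_train(w_global, Xs, ys)     # same model, different data
                for Xs, ys in zip(np.array_split(X, workers),
                                  np.array_split(y, workers))]
    w_global = np.mean(replicas, axis=0)          # the parameter averaging step

print(np.linalg.norm(w_global - true_w))          # distance shrinks toward 0
```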

All-Reduce
http://cilvr.cs.nyu.edu/diglib/lsml/lecture04-allreduce.pdf
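A minimal all-reduce sketch using mpi4py (assumes MPI and mpi4py are installed; launch with something like `mpirun -np 4 python allreduce_sketch.py`): every rank contributes a local gradient and every rank receives the sum, which is then averaged - the building block behind this style of data-parallel SGD.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grad = np.full(8, comm.Get_rank(), dtype=np.float64)   # stand-in gradient

summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)   # every rank receives the sum
avg_grad = summed / comm.Get_size()              # averaged gradient on every rank
print(f"rank {comm.Get_rank()}: avg_grad[0] = {avg_grad[0]}")
```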

Iterative Reduce (Parameter Averaging)

Natural Gradient (ICLR 2015)
https://arxiv.org/abs/1410.7455 - sync every k data points

Tuning Distributed Training
- Averaging acts as a form of regularization
- Needs more aggressive hyperparameters
- Not always going to be faster - account for the number of data points you have
- Distributed systems wisdom applies here: send the code to the data, not the other way around
- Reduce communication overhead for maximum performance
- Lots of experimentation still to do here
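A back-of-envelope illustration of "not always going to be faster", with every number invented: per-synchronization communication cost is roughly fixed, so a small dataset may not amortize it while a large one does.

```python
compute_per_example = 0.002      # seconds of compute per example (invented)
sync_cost = 1.5                  # seconds per parameter-averaging round (invented)
syncs_per_epoch = 50             # how often the cluster synchronizes (invented)

def single_machine_time(n):
    return n * compute_per_example

def cluster_time(n, workers):
    # compute scales down with workers, communication does not
    return n * compute_per_example / workers + syncs_per_epoch * sync_cost

for n in (10_000, 1_000_000):
    print(f"{n:>9} examples: single {single_machine_time(n):7.0f}s"
          f" vs 8 workers {cluster_time(n, 8):7.0f}s")
```

With these invented numbers the small run gets slower under distribution while the large one speeds up, which is the slide's point about accounting for how much data you actually have.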
