Parallel Neural Networks
Joe Bradish

Background
- Deep Neural Networks (DNNs) have become one of the leading technologies in artificial intelligence and machine learning.
- Used extensively by major corporations such as Google and Facebook.
- Very expensive to train: the datasets are large, often in the terabytes.
- The larger the dataset, the more accurately the network can model the underlying classification function.
- The limiting factor has almost always been computational power, but we are starting to reach levels that can solve previously impossible problems.

Quick Neural Net Basics
- A network is a set of layers of neurons joined by synapses (weighted connections).
- Each neuron has a set of inputs: either inputs into the entire network, or outputs from previous neurons, usually from the previous layer.
- The underlying algorithm of the network can vary greatly:
  - Multilayer feedforward
  - Feedback network
  - Self-organizing maps (map from a high dimension to a lower one in a single layer)
  - Sparse Distributed Memory (two-layer feedforward, associative memory)

Learning / Training
- Everything relies on the weights: they determine the importance of each signal, which is essential to the network output.
- The training cycle adjusts the weights. It is by far the most critical step of a successful neural network; no training = useless network.
- Network topology is also key to training, and especially to parallelization.

Where can we use parallelization?
- The typical structure of neural network training is a set of nested loops:
  - For each training session
  - For each training example in the session
  - For each layer of the network
  - For each neuron in the layer
  - For all the weights of the neuron
  - For all the bits of the weight value
- This implies these ways of parallelization: training session parallelism, training example parallelism, layer parallelism, neuron parallelism, weight parallelism, and bit parallelism.

Example: Network-Level Parallelism
- Notice there are many different neural networks, some of which feed into each other.
- The outputs are sent to different machines and then aggregated once more.

Example: Neuron-Level Parallelism
- Each neuron is assigned a specific controlling entity on the communication network.
- Each computer is responsible for forwarding its weights to the hub, so that the computer controlling the next layer can feed them into the neural network.
- Uses a broadcast system.

Parallelism by Node
- Used to parallelize serial backpropagation, which is usually implemented as a series of matrix-vector operations.
- Achieved using all-to-all broadcasts.
- Each node (on a cluster) is responsible for a subset of the network.
- Uses a master broadcaster.

Parallelization by Node, Cont.
- Backward propagation is more complicated:
  1. The master scatters the error vector for the current layer.
  2. Each process computes the weight changes for its subset.
  3. Each process computes its contribution to the error vector for the previous layer.
  4. Each process sends its contribution to the error vector to the master.
  5. The master sums the contributions and prepares the previous layer's error vector for broadcast.
- Forward propagation is straightforward (see the sketch after the node-parallelization results below):
  1. The master broadcasts the previous layer's output vector.
  2. Each process computes its subset of the current layer's output vector.
  3. The master gathers the pieces from all processes and prepares the vector for the next broadcast.

Results of Node Parallelization
- MPI was used for communication between the nodes.
- Run on a 32-machine cluster of Intel Pentium IIs.
- Up to 16.36x speedup with 32 processes.

Results of Node Parallelization, Cont.
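To make the forward-propagation steps above concrete, here is a minimal sketch of the master-broadcast / local-compute / master-gather cycle, assuming mpi4py and NumPy. The layer sizes, the tanh activation, and all variable names are illustrative choices, not details taken from the slides.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_in, n_out = 64, 32              # illustrative layer sizes
rows_per_proc = n_out // size     # assumes n_out is divisible by the process count

# Each process stores only the weight rows for its own subset of neurons.
rng = np.random.default_rng(rank)
local_weights = rng.standard_normal((rows_per_proc, n_in))

# 1. The master broadcasts the previous layer's output vector.
prev_output = np.ones(n_in) if rank == 0 else None
prev_output = comm.bcast(prev_output, root=0)

# 2. Each process computes its subset of the current layer's output vector.
local_output = np.tanh(local_weights @ prev_output)

# 3. The master gathers the pieces and assembles the full output vector,
#    ready to be broadcast again for the next layer.
pieces = comm.gather(local_output, root=0)
if rank == 0:
    layer_output = np.concatenate(pieces)
    print(layer_output.shape)     # (n_out,) when n_out divides evenly across processes
```

Launched with, for example, `mpirun -n 4 python node_forward.py` (the file name is hypothetical), each rank owns a slice of the layer's neurons, mirroring the "each process computes its subset of the current layer's output vector" step.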
Parallelism by Training Example (Exemplar)
- Each process determines the weight change on a disjoint subset of the training population.
- The changes are aggregated and applied to the neural network after each epoch (one pass over the training set).
- Low levels of synchronization are needed: only two additional steps are required, and it is very simple to implement.
- Uses a master-slave style topology (a minimal sketch follows at the end of this transcript).

Speedups using Exemplar Parallelization
- Maximum speedup with 32 processes: 16.66x.

Conclusion
- There are many different strategies for parallelization; which one is best depends on the shape, size, and type of the training data.
- Node parallelism excels on small datasets and on-line learning; exemplar parallelism gives the best performance on large training datasets.
- Different topologies will perform radically differently under the same parallelization strategy.

On-going Research
- GPUs have become very prevalent due to their ability to perform matrix operations in parallel, although it is sometimes harder to link multiple GPUs together.
- Large clusters of weaker machines have also become prevalent due to their reduced cost; Amazon, Google, and Microsoft offer commercial products for scalable neural networks on their clouds.

Questions?
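A minimal sketch of the exemplar (training-example) parallelism described above, in the master-slave style, assuming mpi4py and NumPy. The single-layer linear model, squared-error gradient, learning rate, and all names are illustrative assumptions rather than details from the slides; the point is the scatter-once / broadcast-weights / reduce-contributions / apply-per-epoch pattern.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n_features, lr, n_epochs = 8, 0.5, 50

# The master holds the full training set and the current weights.
if rank == 0:
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, n_features))
    y = X @ np.arange(n_features, dtype=float)   # synthetic targets
    x_chunks = np.array_split(X, size)           # disjoint subsets, one per process
    y_chunks = np.array_split(y, size)
    weights = np.zeros(n_features)
    n_total = len(y)
else:
    x_chunks = y_chunks = weights = n_total = None

# Each process receives its disjoint subset of the training examples once.
x_local = comm.scatter(x_chunks, root=0)
y_local = comm.scatter(y_chunks, root=0)
n_total = comm.bcast(n_total, root=0)

for epoch in range(n_epochs):
    # The master broadcasts the current weights at the start of each epoch.
    weights = comm.bcast(weights, root=0)

    # Each process computes the weight change implied by its own examples.
    err = x_local @ weights - y_local
    local_grad = x_local.T @ err

    # Contributions are summed on the master and applied once per epoch,
    # matching the "aggregate, then apply after each epoch" step above.
    total_grad = comm.reduce(local_grad, op=MPI.SUM, root=0)
    if rank == 0:
        weights -= lr * total_grad / n_total

if rank == 0:
    print("weights after", n_epochs, "epochs:", np.round(weights, 2))
```

Only two collective operations are added per epoch (the weight broadcast and the gradient reduction), which is what keeps the synchronization cost of this scheme low.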