Deep Learning at Scale
TRANSCRIPT
Mateusz Dymczyk, Software Engineer
H2O.ai
Strata+Hadoop Singapore, 08.12.2016
About me
• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed systems @ Fujitsu Laboratories and en-japan inc.
Agenda
• Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future
Deep Learning
Use Cases
• Text classification (item prediction)
• Fraud detection (11% accuracy boost in production)
• Image classification
• Machine translation
• Recommendation systems
Deep Learning
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.*
*https://en.wikipedia.org/wiki/Deep_learning
Deep Learning
SGD
• Memory efficient
• Fast
• Not easy to parallelize without speed degradation

1. Initialize parameters
2. Get training sample i
3. Update parameters
Repeat until converged
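The loop above can be sketched as a minimal single-machine SGD routine; the toy linear model, learning rate, and epoch count below are illustrative assumptions, not H2O's implementation:

```python
import random

def sgd(samples, lr=0.1, epochs=50):
    w = 0.0                              # 1. initialize parameters
    for _ in range(epochs):              # repeat until converged (fixed epochs here)
        random.shuffle(samples)
        for x, y in samples:             # 2. get training sample i
            grad = 2 * (w * x - y) * x   # gradient of the squared error (w*x - y)^2
            w -= lr * grad               # 3. update parameters
    return w

# toy data drawn from y = 3x; SGD should recover the slope
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = sgd(data)   # converges close to 3.0
```

Because each update touches the shared parameter vector, naively running many of these loops in parallel forces synchronization, which is exactly the "not easy to parallelize without speed degradation" point above.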
Deep Learning
PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in multiple fields

CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits
Why scale?
Grid search?
Why not to scale?
• Distribution isn’t free: overhead due to network traffic, synchronization etc.
• Small neural network (not many layers and/or neurons)
  • small + shallow network = not much computation per iteration
• Small data
• Very slow network communication
Distribution
Distribution models
• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs peer-to-peer*
• Communication: asynchronous vs synchronous
*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
Model parallelism
[Diagram: Nodes 1-3 each holding a different part of the network, exchanging parameters or deltas with a Parameter Server]
• Each node computes a different part of the network
• Potential to scale to large models
• Rather difficult to implement and reason about
• Originally designed for the large convolutional layers in GoogLeNet
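The layer-split idea can be sketched with two simulated "nodes", each owning one layer of a tiny network; the layers and weights below are made up for illustration and the network hop is just a function call:

```python
# Toy model parallelism: each "node" owns one layer; activations flow
# from node 1 to node 2 (in a real cluster, over the network).
def make_layer(weights):
    # a linear layer: each output is a dot product of one weight row with the input
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return layer

node1 = make_layer([[1.0, 0.0], [0.0, 1.0]])   # layer held on node 1 (identity)
node2 = make_layer([[1.0, 1.0]])               # layer held on node 2 (sums inputs)

def forward(x):
    h = node1(x)        # computed on node 1
    return node2(h)     # activations shipped to node 2

forward([2.0, 3.0])   # -> [5.0]
```

The backward pass would have to ship gradients in the opposite direction, which is why this scheme is "rather difficult to implement and reason about".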
Data parallelism
Node 1
Parameter Server
Node 3
Node 2
Node 4
• Each node computes all the parameters
• Using part (or all) of local data • Results have to be combined
Parameters or deltas
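A minimal sketch of this scheme, assuming a simple parameter-server protocol in which each worker returns a gradient over its own shard and the server averages them (an illustration, not H2O's actual protocol):

```python
# Data parallelism: same model everywhere, different data shard per worker.
def local_gradient(w, shard):
    # mean gradient of the squared error for y = w * x on this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, shards, lr=0.05):
    deltas = [local_gradient(w, s) for s in shards]   # computed in parallel on workers
    return w - lr * sum(deltas) / len(deltas)         # parameter server combines results

# two shards drawn from y = 3x, one per worker
shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
# w approaches the true slope 3.0
```

The "results have to be combined" bullet is the `train_step` reduction: the server only sees deltas, never the raw shards.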
Mixed/composed
• Node distribution (model or data)
• In-node, per-GPU/CPU/thread concurrent/parallel computation
• Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data
[Diagram: nodes containing multiple CPUs/GPUs, each running multiple threads]
Sync vs. Async
At some point parameters need to be collected within the nodes and between them.

Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*

Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (you make mistakes)

*https://arxiv.org/abs/1604.00981
H2O’s architecture
• H2O in-memory non-blocking hash map (resides on all nodes)
• Initial model (weights & biases)
• Per node: computation (threads, async) + node communication
• MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
H2O Frame
[Diagram: frame split into per-node data (Node 1, Nodes 2-N), each further split into per-thread data]
• Each row is fully stored on the same node
• Each chunk contains thousands/millions of rows
• All the data is compressed (sometimes 2-4x) in a lossless fashion
Inside a node
• Each thread works on a part of data
• All threads update weights/biases concurrently
• Race conditions possible (hard to reproduce, but the added noise can help against overfitting)
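A rough single-machine illustration of such lock-free in-node updates, using Python threads on a toy model (an assumption-level sketch, not H2O's fork/join code):

```python
import threading

w = [0.0]                                   # shared weight, updated by all threads

def worker(slice_, lr=0.05, steps=2000):
    # each thread fits y = w * x on its own slice of the data
    for _ in range(steps):
        for x, y in slice_:
            grad = 2 * (w[0] * x - y) * x   # read the shared weight (may be stale)
            w[0] -= lr * grad               # lock-free write: races possible

slices = [[(1.0, 3.0)], [(2.0, 6.0)]]       # one data slice per thread, from y = 3x
threads = [threading.Thread(target=worker, args=(s,)) for s in slices]
for t in threads:
    t.start()
for t in threads:
    t.join()
# w[0] ends up close to the true slope 3.0 despite unsynchronized updates
```

Individual runs differ slightly (lost or stale updates), which is exactly the "hard to reproduce" point; on a convex toy problem the result still lands near the optimum.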
H2O’s architecture
• H2O in-memory non-blocking hash map (resides on all nodes)
• MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
• REDUCE: model averaging - average weights and biases from all the nodes
• Updated model (weights & biases)
• Here: averaging. New: elastic averaging
* Communication frequency is auto-tuned and user-controllable (affects convergence)
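The REDUCE step (plain model averaging) can be sketched as follows; the dict-based model layout is a made-up convenience for the example, not H2O's internal representation:

```python
# Model averaging: element-wise mean of per-node weights and biases.
def average_models(models):
    # models: list of {"weights": [...], "biases": [...]} dicts, one per node
    n = len(models)
    return {
        key: [sum(m[key][i] for m in models) / n
              for i in range(len(models[0][key]))]
        for key in ("weights", "biases")
    }

node_models = [
    {"weights": [0.9, 2.1], "biases": [0.2]},   # model trained on node 1's data
    {"weights": [1.1, 1.9], "biases": [0.0]},   # model trained on node 2's data
]
avg = average_models(node_models)   # weights ~[1.0, 2.0], biases ~[0.1]
```

Elastic averaging differs in that each node is only pulled part-way toward the average instead of being overwritten by it.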
Benchmarks
• Problem: MNIST - hand-written digits, 28x28 pixels (784 features), 10-class classification
• Hardware: 10x Dual E5-2650 (8 cores, 2.6GHz), 10Gb network
• Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes
Demo
Airline Delays
• Data: airline data, 116M rows (6GB), 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10x Dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
GPUs and other networks
• What if you want to use GPUs?
• What if you want to train arbitrary networks?
• What if you want to compare different frameworks?
DeepWater
DeepWater Architecture
Other frameworks

MXNet
• Data parallelism (default)
• Model parallelism also available, for example for multi-layer LSTMs*
• Supports both sync and async communication
• Can perform updates on the GPU or CPU
*http://mxnet.io/how_to/model_parallel_lstm.html

TensorFlow
• Data and model parallelism
• Both sync and async updates supported*
*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
Summary
Single node
• small NN / small data

Multi CPU/GPU
• data fits on a single node but is too much for a single processing unit

Multi node
• data/model doesn’t fit on one node
• computation too long

Model parallelism
• network parameters don’t fit on a single machine
• faster computation

Data parallelism
• data doesn’t fit on a single node
• faster computation

Sync
• best accuracy
• lots of workers, or ok with slower training

Async
• faster convergence
• ok with potentially lower accuracy
Open Source
• GitHub:
  https://github.com/h2oai/h2o-3
  https://github.com/h2oai/deepwater
• Community:
  https://groups.google.com/forum/?hl=en#!forum/h2ostream
  http://jira.h2o.ai
  https://community.h2o.ai/index.html
• @h2oai
• http://www.h2o.ai
Q&A