
Page 1: Deep Learning at Scale

Deep Learning at Scale

Mateusz Dymczyk, Software Engineer
H2O.ai

Strata+Hadoop Singapore, 08.12.2016

Page 2: Deep Learning at Scale

About me

• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed systems @ Fujitsu Laboratories and en-japan inc.

Page 3: Deep Learning at Scale

Agenda

• Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future

Page 4: Deep Learning at Scale

Deep Learning

Page 5: Deep Learning at Scale

Use Cases

• Text classification (item prediction)
• Fraud detection (11% accuracy boost in production)
• Image classification
• Machine translation
• Recommendation systems

Page 6: Deep Learning at Scale

Deep Learning

Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.*

*https://en.wikipedia.org/wiki/Deep_learning

Page 7: Deep Learning at Scale

Deep Learning

Page 8: Deep Learning at Scale

SGD

• Memory efficient
• Fast
• Not easy to parallelize without speed degradation

[Diagram: the SGD loop — (1) initialize parameters, (2) get training sample i, (3) update the parameters; repeat until converged]
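For reference, a minimal sketch of that loop in plain NumPy (a linear model with squared loss; purely illustrative, not H2O's implementation):

    import numpy as np

    # Minimal SGD sketch: (1) initialize, (2) pick sample i, (3) update, until converged.
    def sgd(X, y, lr=0.01, epochs=10):
        w = np.zeros(X.shape[1])               # (1) initialize parameters
        for _ in range(epochs):
            for i in np.random.permutation(len(X)):
                xi, yi = X[i], y[i]            # (2) get training sample i
                grad = (xi @ w - yi) * xi      # gradient of 0.5 * (x·w - y)^2
                w -= lr * grad                 # (3) update parameters
        return w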

Page 9: Deep Learning at Scale

Deep Learning

PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in multiple fields

CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits

Page 10: Deep Learning at Scale

Why scale?

(The same PRO/CON list as on the previous slide — the CON side is what motivates scaling: slow, data-hungry, CPU/GPU-hungry training with many hyper-parameters to tune.)

Grid search?

Page 11: Deep Learning at Scale

Why not to scale?

• Distribution isn’t free: overhead due to network traffic, synchronization etc.
• Small neural network (not many layers and/or neurons)
  • small + shallow network = not much computation per iteration
• Small data
• Very slow network communication

Page 12: Deep Learning at Scale

Distribution

Page 13: Deep Learning at Scale

Distribution models

• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs peer-to-peer*
• Communication: asynchronous vs synchronous

*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf

Page 14: Deep Learning at Scale

Model parallelism

[Diagram: Nodes 1-3 each holding a different part of the network, exchanging parameters or deltas with a parameter server]

• Each node computes a different part of the network
• Potential to scale to large models
• Rather difficult to implement and reason about
• Originally designed for the large convolutional layer in GoogleNet
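A toy illustration of the idea (single process, NumPy; the two-"node" split and layer sizes are made up): each "node" owns only its layer's weights, and only activations travel between them.

    import numpy as np

    # Toy model parallelism: each "node" owns one layer; in a real system the
    # activations passed between them would cross the network.
    class NodeLayer:
        def __init__(self, n_in, n_out):
            self.W = np.random.randn(n_in, n_out) * 0.01   # weights live on this node only
        def forward(self, x):
            return np.maximum(0, x @ self.W)               # ReLU layer

    node1 = NodeLayer(784, 256)   # "Node 1" holds layer 1
    node2 = NodeLayer(256, 10)    # "Node 2" holds layer 2

    x = np.random.randn(1, 784)
    out = node2.forward(node1.forward(x))   # activations flow Node 1 -> Node 2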

Page 15: Deep Learning at Scale

Data parallelism

[Diagram: Nodes 1-4 each holding a full copy of the network, exchanging parameters or deltas with a parameter server]

• Each node computes all the parameters
• Using part (or all) of local data
• Results have to be combined
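A toy sketch of the data-parallel pattern (NumPy, single process; the shard count, model and learning rate are made up): every "node" keeps a full copy of the weights, computes a gradient on its own shard, and the results are combined by averaging.

    import numpy as np

    # Toy data parallelism: one gradient per shard ("node"), then average and update.
    def local_gradient(w, X_shard, y_shard):
        return X_shard.T @ (X_shard @ w - y_shard) / len(X_shard)   # linear-model gradient

    def data_parallel_step(w, shards, lr=0.01):
        grads = [local_gradient(w, X, y) for X, y in shards]   # computed on each node
        return w - lr * np.mean(grads, axis=0)                 # combine the results

    # 4 shards stand in for 4 nodes
    X, y = np.random.randn(400, 5), np.random.randn(400)
    shards = list(zip(np.split(X, 4), np.split(y, 4)))
    w = np.zeros(5)
    for _ in range(100):
        w = data_parallel_step(w, shards)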

Page 16: Deep Learning at Scale

Mixed/composed

• Node distribution (model or data)
• In-node concurrent/parallel computation per GPU/CPU/thread
• Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data

[Diagram: a node with several threads (T); a node with several CPUs/GPUs]

Page 17: Deep Learning at Scale

Sync vs. Async

At some point parameters need to be collected within the nodes and between them.

Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*

Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (you make mistakes)

*https://arxiv.org/abs/1604.00981

Page 18: Deep Learning at Scale

H2O’s architecture

[Diagram: the initial model (weights & biases) lives in H2O's in-memory, non-blocking hash map, which resides on all nodes; each node runs its computation (threads, async) and communicates with the other nodes]

MAP: each node trains a copy of the whole network with its local data (part or all) using the async fork/join framework

Page 19: Deep Learning at Scale

H2O Frame

[Diagram: an H2O Frame distributed across the cluster in chunks — Node 1 data, Node 2-N data, Thread 1 data]

• Each row is fully stored on the same node
• Each chunk contains thousands/millions of rows
• All the data is compressed (sometimes 2-4x) in a lossless fashion
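Loading data into such a frame from H2O's Python client looks roughly like this (the file path is a placeholder; h2o.init() will also start a local single-node cluster if none is running):

    import h2o

    h2o.init()                                   # connect to (or start) an H2O cluster
    frame = h2o.import_file("path/to/data.csv")  # placeholder path; parsed into a distributed, compressed H2OFrame
    frame.describe()                             # per-column types and a frame summary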

Page 20: Deep Learning at Scale

Inside a node

• Each thread works on a part of the data
• All threads update weights/biases concurrently
• Race conditions possible (hard to reproduce, but the added noise helps against overfitting)
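A toy, Hogwild-style sketch of what happens inside a node (NumPy plus Python threads; the model and slice count are made up): all threads share one weight vector and update it without locks.

    import threading
    import numpy as np

    w = np.zeros(5)   # weights shared by all threads on this node

    def train_slice(X, y, lr=0.01):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi
            w[:] = w - lr * grad        # unsynchronized update: races are possible

    X, y = np.random.randn(1000, 5), np.random.randn(1000)
    threads = [threading.Thread(target=train_slice, args=(Xs, ys))
               for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
    for t in threads: t.start()
    for t in threads: t.join()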

Page 21: Deep Learning at Scale

H2O’s architecture

[Diagram: as before — the in-memory, non-blocking hash map on all nodes now also receives the updated model (weights & biases) after each round]

MAP: each node trains a copy of the whole network with its local data (part or all) using the async fork/join framework

REDUCE: model averaging — average the weights and biases from all the nodes (here: plain averaging; new: elastic averaging)

* Communication frequency is auto-tuned and user-controllable (affects convergence)
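A toy sketch of one such MAP/REDUCE round (NumPy; the linear model, node count and learning rate are made up, and real H2O does this over the wire): each "node" trains its own copy on local rows, then the copies are averaged.

    import numpy as np

    def local_train(w, X, y, lr=0.01):
        w = w.copy()                        # MAP: this node's private copy
        for xi, yi in zip(X, y):
            w -= lr * (xi @ w - yi) * xi    # train on local data only
        return w

    def training_round(w, node_data):
        local = [local_train(w, X, y) for X, y in node_data]
        return np.mean(local, axis=0)       # REDUCE: average weights from all nodes

    X, y = np.random.randn(400, 5), np.random.randn(400)
    node_data = list(zip(np.split(X, 4), np.split(y, 4)))
    w = np.zeros(5)
    for _ in range(10):                     # communication rounds
        w = training_round(w, node_data)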

Page 22: Deep Learning at Scale

Benchmarks

• Problem: MNIST (hand-written digits, 28x28 pixels = 784 features), 10-class classification
• Hardware: 10x dual E5-2650 (8 cores, 2.6GHz), 10Gb
• Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes

Page 23: Deep Learning at Scale

Demo

Page 24: Deep Learning at Scale

Airline Delays

• Data:
  • airline data
  • 116M rows (6GB)
  • 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10x dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
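The demo code itself is not part of the transcript; a minimal H2O Python sketch of the same kind of job might look as follows (the file path, target column and hidden-layer sizes are placeholders, not the values used on stage).

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()

    airlines = h2o.import_file("path/to/airlines.csv")              # placeholder path
    airlines["IsDepDelayed"] = airlines["IsDepDelayed"].asfactor()  # binary target

    model = H2ODeepLearningEstimator(hidden=[200, 200],  # two hidden layers (illustrative)
                                     epochs=10)
    model.train(y="IsDepDelayed",
                x=[c for c in airlines.columns if c != "IsDepDelayed"],
                training_frame=airlines)
    print(model.auc(train=True))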

Page 25: Deep Learning at Scale

GPUs and other networks

• What if you want to use GPUs?

• What if you want to train arbitrary networks?

• What if you want to compare different frameworks?

Page 26: Deep Learning at Scale

DeepWater

Page 27: Deep Learning at Scale

DeepWater

Page 28: Deep Learning at Scale

DeepWater Architecture

Page 29: Deep Learning at Scale

Other frameworks

mxnet
• Data parallelism (default)
• Model parallelism also available, for example for multi-layer LSTM*
• Supports both sync and async communication
• Can perform updates on the GPU or CPU

*http://mxnet.io/how_to/model_parallel_lstm.html

TensorFlow
• Data and model parallelism
• Both sync and async updates supported*

*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
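For reference (using today's TensorFlow API, which is newer than this 2016 talk), synchronous data parallelism across the local GPUs takes only a few lines with tf.distribute.MirroredStrategy; MNIST and the layer sizes below are just illustrative.

    import tensorflow as tf

    # Each replica gets a slice of every batch; gradients are all-reduced
    # before the (synchronous) update.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="sgd",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=5)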

Page 30: Deep Learning at Scale

Summary

Single Node
• small NN / small data

Multi CPU/GPU
• data fits on a single node but is too much for a single processing unit

Multi Node
• data/model doesn’t fit on one node
• computation too long

Model parallelism
• network parameters don’t fit on a single machine
• faster computation

Data parallelism
• data doesn’t fit on a single node
• faster computation

Sync
• best accuracy
• lots of workers, or OK with slower training

Async
• faster convergence
• OK with potentially lower accuracy

Page 31: Deep Learning at Scale

Open Source

• GitHub:
  https://github.com/h2oai/h2o-3
  https://github.com/h2oai/deepwater
• Community:
  https://groups.google.com/forum/?hl=en#!forum/h2ostream
  http://jira.h2o.ai
  https://community.h2o.ai/index.html

@h2oai
http://www.h2o.ai

Page 32: Deep Learning at Scale

Thank you!

@mdymczyk

Mateusz Dymczyk

[email protected]

Page 33: Deep Learning at Scale

Q&A