
Page 1: Deep Learning at Scale

Deep Learning at Scale

Mateusz Dymczyk, Software Engineer
H2O.ai

Strata+Hadoop Singapore, 08.12.2016

Page 2: Deep Learning at Scale

About me

• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed systems @ Fujitsu Laboratories and en-japan inc.

Page 3: Deep Learning at Scale

Agenda

• Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future

Page 4: Deep Learning at Scale

Deep Learning

Page 5: Deep Learning at Scale

Use Cases

• Text classification (item prediction)
• Fraud detection (11% accuracy boost in production)
• Image classification
• Machine translation
• Recommendation systems

Page 6: Deep Learning at Scale

Deep Learning

Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.*

*https://en.wikipedia.org/wiki/Deep_learning

Page 7: Deep Learning at Scale

Deep Learning

Page 8: Deep Learning at Scale

SGD

• Memory efficient
• Fast
• Not easy to parallelize without speed degradation

[Diagram: the SGD loop — (1) initialize parameters, (2) get training sample i, (3) update the parameters; repeat until converged]
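For reference, a minimal sketch of that loop in plain NumPy (a linear model with squared loss; purely illustrative, not H2O's implementation):

    import numpy as np

    # Minimal SGD sketch: (1) initialize, (2) pick sample i, (3) update, until converged.
    def sgd(X, y, lr=0.01, epochs=10):
        w = np.zeros(X.shape[1])               # (1) initialize parameters
        for _ in range(epochs):
            for i in np.random.permutation(len(X)):
                xi, yi = X[i], y[i]            # (2) get training sample i
                grad = (xi @ w - yi) * xi      # gradient of 0.5 * (x·w - y)^2
                w -= lr * grad                 # (3) update parameters
        return w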

Page 9: Deep Learning at Scale

Deep Learning

PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in multiple fields

CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits

Page 10: Deep Learning at Scale

Why scale?

(The same PRO/CON list as on the previous slide — the CON side is what motivates scaling: slow, data-hungry, CPU/GPU-hungry training with many hyper-parameters to tune.)

Grid search?

Page 11: Deep Learning at Scale

Why not to scale?

• Distribution isn’t free: overhead due to network traffic, synchronization etc.
• Small neural network (not many layers and/or neurons)
  • small + shallow network = not much computation per iteration
• Small data
• Very slow network communication

Page 12: Deep Learning at Scale

Distribution

Page 13: Deep Learning at Scale

Distribution models

• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs peer-to-peer*
• Communication: asynchronous vs synchronous

*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf

Page 14: Deep Learning at Scale

Model parallelism

[Diagram: Nodes 1-3 each holding a different part of the network, exchanging parameters or deltas with a parameter server]

• Each node computes a different part of the network
• Potential to scale to large models
• Rather difficult to implement and reason about
• Originally designed for the large convolutional layer in GoogleNet
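A toy illustration of the idea (single process, NumPy; the two-"node" split and layer sizes are made up): each "node" owns only its layer's weights, and only activations travel between them.

    import numpy as np

    # Toy model parallelism: each "node" owns one layer; in a real system the
    # activations passed between them would cross the network.
    class NodeLayer:
        def __init__(self, n_in, n_out):
            self.W = np.random.randn(n_in, n_out) * 0.01   # weights live on this node only
        def forward(self, x):
            return np.maximum(0, x @ self.W)               # ReLU layer

    node1 = NodeLayer(784, 256)   # "Node 1" holds layer 1
    node2 = NodeLayer(256, 10)    # "Node 2" holds layer 2

    x = np.random.randn(1, 784)
    out = node2.forward(node1.forward(x))   # activations flow Node 1 -> Node 2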

Page 15: Deep Learning at Scale

Data parallelism

[Diagram: Nodes 1-4 each holding a full copy of the network, exchanging parameters or deltas with a parameter server]

• Each node computes all the parameters
• Using part (or all) of local data
• Results have to be combined
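A toy sketch of the data-parallel pattern (NumPy, single process; the shard count, model and learning rate are made up): every "node" keeps a full copy of the weights, computes a gradient on its own shard, and the results are combined by averaging.

    import numpy as np

    # Toy data parallelism: one gradient per shard ("node"), then average and update.
    def local_gradient(w, X_shard, y_shard):
        return X_shard.T @ (X_shard @ w - y_shard) / len(X_shard)   # linear-model gradient

    def data_parallel_step(w, shards, lr=0.01):
        grads = [local_gradient(w, X, y) for X, y in shards]   # computed on each node
        return w - lr * np.mean(grads, axis=0)                 # combine the results

    # 4 shards stand in for 4 nodes
    X, y = np.random.randn(400, 5), np.random.randn(400)
    shards = list(zip(np.split(X, 4), np.split(y, 4)))
    w = np.zeros(5)
    for _ in range(100):
        w = data_parallel_step(w, shards)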

Page 16: Deep Learning at Scale

Mixed/composed

• Node distribution (model or data)
• In-node concurrent/parallel computation per GPU/CPU/thread
• Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data

[Diagram: a node with several threads (T); a node with several CPUs/GPUs]

Page 17: Deep Learning at Scale

Sync vs. Async

At some point parameters need to be collected within the nodes and between them.

Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*

Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (you make mistakes)

*https://arxiv.org/abs/1604.00981

Page 18: Deep Learning at Scale

H2O’s architecture

[Diagram: the initial model (weights & biases) lives in H2O's in-memory, non-blocking hash map, which resides on all nodes; each node runs its computation (threads, async) and communicates with the other nodes]

MAP: each node trains a copy of the whole network with its local data (part or all) using the async fork/join framework

Page 19: Deep Learning at Scale

H2O Frame

[Diagram: an H2O Frame distributed across the cluster in chunks — Node 1 data, Node 2-N data, Thread 1 data]

• Each row is fully stored on the same node
• Each chunk contains thousands/millions of rows
• All the data is compressed (sometimes 2-4x) in a lossless fashion
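Loading data into such a frame from H2O's Python client looks roughly like this (the file path is a placeholder; h2o.init() will also start a local single-node cluster if none is running):

    import h2o

    h2o.init()                                   # connect to (or start) an H2O cluster
    frame = h2o.import_file("path/to/data.csv")  # placeholder path; parsed into a distributed, compressed H2OFrame
    frame.describe()                             # per-column types and a frame summary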

Page 20: Deep Learning at Scale

Inside a node

• Each thread works on a part of the data
• All threads update weights/biases concurrently
• Race conditions possible (hard to reproduce, but the added noise helps against overfitting)
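A toy, Hogwild-style sketch of what happens inside a node (NumPy plus Python threads; the model and slice count are made up): all threads share one weight vector and update it without locks.

    import threading
    import numpy as np

    w = np.zeros(5)   # weights shared by all threads on this node

    def train_slice(X, y, lr=0.01):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi
            w[:] = w - lr * grad        # unsynchronized update: races are possible

    X, y = np.random.randn(1000, 5), np.random.randn(1000)
    threads = [threading.Thread(target=train_slice, args=(Xs, ys))
               for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
    for t in threads: t.start()
    for t in threads: t.join()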

Page 21: Deep Learning at Scale

H2O’s architecture

[Diagram: as before — the in-memory, non-blocking hash map on all nodes now also receives the updated model (weights & biases) after each round]

MAP: each node trains a copy of the whole network with its local data (part or all) using the async fork/join framework

REDUCE: model averaging — average the weights and biases from all the nodes (here: plain averaging; new: elastic averaging)

* Communication frequency is auto-tuned and user-controllable (affects convergence)
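A toy sketch of one such MAP/REDUCE round (NumPy; the linear model, node count and learning rate are made up, and real H2O does this over the wire): each "node" trains its own copy on local rows, then the copies are averaged.

    import numpy as np

    def local_train(w, X, y, lr=0.01):
        w = w.copy()                        # MAP: this node's private copy
        for xi, yi in zip(X, y):
            w -= lr * (xi @ w - yi) * xi    # train on local data only
        return w

    def training_round(w, node_data):
        local = [local_train(w, X, y) for X, y in node_data]
        return np.mean(local, axis=0)       # REDUCE: average weights from all nodes

    X, y = np.random.randn(400, 5), np.random.randn(400)
    node_data = list(zip(np.split(X, 4), np.split(y, 4)))
    w = np.zeros(5)
    for _ in range(10):                     # communication rounds
        w = training_round(w, node_data)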

Page 22: Deep Learning at Scale

Benchmarks

• Problem: MNIST (hand-written digits, 28x28 pixels = 784 features), 10-class classification
• Hardware: 10x dual E5-2650 (8 cores, 2.6GHz), 10Gb
• Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes

Page 23: Deep Learning at Scale

Demo

Page 24: Deep Learning at Scale

Airline Delays

• Data:
  • airline data
  • 116M rows (6GB)
  • 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10x dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
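The demo code itself is not part of the transcript; a minimal H2O Python sketch of the same kind of job might look as follows (the file path, target column and hidden-layer sizes are placeholders, not the values used on stage).

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()

    airlines = h2o.import_file("path/to/airlines.csv")              # placeholder path
    airlines["IsDepDelayed"] = airlines["IsDepDelayed"].asfactor()  # binary target

    model = H2ODeepLearningEstimator(hidden=[200, 200],  # two hidden layers (illustrative)
                                     epochs=10)
    model.train(y="IsDepDelayed",
                x=[c for c in airlines.columns if c != "IsDepDelayed"],
                training_frame=airlines)
    print(model.auc(train=True))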

Page 25: Deep Learning at Scale

GPUs and other networks

• What if you want to use GPUs?

• What if you want to train arbitrary networks?

• What if you want to compare different frameworks?

Page 26: Deep Learning at Scale

DeepWater

Page 27: Deep Learning at Scale

DeepWater

Page 28: Deep Learning at Scale

DeepWater Architecture

Page 29: Deep Learning at Scale

Other frameworks

mxnet
• Data parallelism (default)
• Model parallelism also available, for example for multi-layer LSTM*
• Supports both sync and async communication
• Can perform updates on the GPU or CPU

*http://mxnet.io/how_to/model_parallel_lstm.html

TensorFlow
• Data and model parallelism
• Both sync and async updates supported*

*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
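For reference (using today's TensorFlow API, which is newer than this 2016 talk), synchronous data parallelism across the local GPUs takes only a few lines with tf.distribute.MirroredStrategy; MNIST and the layer sizes below are just illustrative.

    import tensorflow as tf

    # Each replica gets a slice of every batch; gradients are all-reduced
    # before the (synchronous) update.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="sgd",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
    model.fit(x_train, y_train, batch_size=256, epochs=5)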

Page 30: Deep Learning at Scale

Summary

Single Node
• small NN / small data

Multi CPU/GPU
• data fits on a single node but is too much for a single processing unit

Multi Node
• data/model doesn’t fit on one node
• computation too long

Model parallelism
• network parameters don’t fit on a single machine
• faster computation

Data parallelism
• data doesn’t fit on a single node
• faster computation

Sync
• best accuracy
• lots of workers, or OK with slower training

Async
• faster convergence
• OK with potentially lower accuracy

Page 31: Deep Learning at Scale

Open Source

• GitHub:
  https://github.com/h2oai/h2o-3
  https://github.com/h2oai/deepwater
• Community:
  https://groups.google.com/forum/?hl=en#!forum/h2ostream
  http://jira.h2o.ai
  https://community.h2o.ai/index.html

@h2oai
http://www.h2o.ai

Page 32: Deep Learning at Scale

Thank you!

@mdymczyk

Mateusz Dymczyk

[email protected]

Page 33: Deep Learning at Scale

Q&A