Scalable Deep Learning Platform on Spark in Baidu

Scalable Deep Learning in Baidu. Weide Zhang, Kyle Tsai, Jiang Wang (Baidu USDC)


TRANSCRIPT

Page 1: Scalable Deep Learning Platform On Spark In Baidu

Scalable Deep Learning in Baidu

Weide Zhang, Kyle Tsai, Jiang Wang (Baidu USDC)

Page 2: Scalable Deep Learning Platform On Spark In Baidu

Background
• Spark
  – Batch/streaming processing, Spark SQL, MLlib
• Deep learning has many use cases in Baidu and has shown significant quality improvements
  – Image retrieval ranking
  – Ads CTR prediction
  – Machine translation
  – Speech recognition
• Goal: be able to use distributed deep learning in Spark

Page 3: Scalable Deep Learning Platform On Spark In Baidu

Typical Training Data in Baidu
• Image recognition: 100 million
• OCR: 100 million
• Speech: 10 billion
• CTR: 100 billion
• Grows every year since 2012

Page 4: Scalable Deep Learning Platform On Spark In Baidu

Baidu Spark One

Page 5: Scalable Deep Learning Platform On Spark In Baidu

Paddle
• Parallel Asynchronous Distributed Deep Learning Library
  – Supports distributed parameter servers for synchronous/asynchronous parameter updates
  – Supports multi-GPU/CPU training
  – Supports sparse models
  – Easy-to-understand API for users to add new layers
  – Rich features for deep learning use cases, especially for NLP

Page 6: Scalable Deep Learning Platform On Spark In Baidu

Deep learning options comparison

                              Caffe     TensorFlow            Torch                 Paddle
  Distributed training        Yes       Yes                   No                    Yes
  Communication cost          Medium    High                  N/A                   Medium to low
  Easy to customize and code  Yes       More learning curve   More learning curve   Yes
  Sparse model support        No        Yes                   Yes                   Yes
  Area of focus               Vision    All                   All                   All
  Integration with Spark      Yes       No                    No                    Yes

Page 7: Scalable Deep Learning Platform On Spark In Baidu

High Level Goals
• Implement Spark ML abstractions to let users train deep learning models with minimal code change
• Leverage Paddle's native training and parameter server mechanisms, scheduled inside Spark deep learning jobs
• Handle multi-tenancy and heterogeneity
• Parallelize hyperparameter selection
• Batch and streaming learning

Page 8: Scalable Deep Learning Platform On Spark In Baidu

Paddle on Spark
• Deep learning algorithm – Paddle
• Training communication – parameter server
• Training environment – Spark
• Resource management – YARN

Page 9: Scalable Deep Learning Platform On Spark In Baidu

Training Data Flow

Page 10: Scalable Deep Learning Platform On Spark In Baidu

System Architecture

Page 11: Scalable Deep Learning Platform On Spark In Baidu

Spark ML's Abstraction
• Train
• Predict
(usage sketched below)
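For readers unfamiliar with the abstraction being reused here, below is a minimal, generic PySpark ML example (not Baidu's code) of the pattern the platform follows: calling fit() on an Estimator is the "Train" step and yields a Model, and calling transform() on that Model is the "Predict" step. LogisticRegression simply stands in for any learner.

```python
# Generic PySpark ML example of the Train/Predict abstraction:
# Estimator.fit() -> Model, Model.transform() -> DataFrame with predictions.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("spark-ml-abstraction").getOrCreate()

train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0)],
    ["features", "label"])

estimator = LogisticRegression(maxIter=10)   # any Spark ML learner works here
model = estimator.fit(train_df)              # "Train": Estimator.fit -> Model
predictions = model.transform(train_df)      # "Predict": Model.transform adds columns
predictions.select("features", "prediction").show()
```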

Page 12: Scalable Deep Learning Platform On Spark In Baidu

Simple Parameter Is Not Enough
[Diagram: parameters for a CNN that maps images to labels (bird, cat) through convolution, pooling, and fully connected layers into a cost layer.]

Page 13: Scalable Deep Learning Platform On Spark In Baidu

Use Paddle As Estimator
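The transcript does not include the code from this slide, so here is a hedged Python sketch of what wrapping Paddle as a Spark ML Estimator could look like. PaddleEstimator, PaddleModel, launch_paddle_training, and paddle_predict are hypothetical names used only to illustrate the Estimator/Model contract; they are not Baidu's actual API.

```python
# Hedged sketch of wrapping Paddle behind the Spark ML Estimator/Model contract.
# launch_paddle_training and paddle_predict are hypothetical placeholders for
# launching Paddle trainers/parameter servers and scoring with the trained
# network; they are not a real Paddle or Baidu API.
from pyspark.ml import Estimator, Model


class PaddleModel(Model):
    """Result of training: holds learned parameters and scores DataFrames."""

    def __init__(self, weights):
        super().__init__()
        self.weights = weights                  # parameters fetched from the parameter servers

    def _transform(self, dataset):
        # Hypothetical: run the Paddle network over each partition of `dataset`
        # and return a DataFrame with an added prediction column.
        return paddle_predict(dataset, self.weights)


class PaddleEstimator(Estimator):
    """fit() launches distributed Paddle training and returns a PaddleModel."""

    def __init__(self, config_file):
        super().__init__()
        self.config_file = config_file          # code-level network configuration

    def _fit(self, dataset):
        # Hypothetical: start parameter servers, run one Paddle trainer per
        # partition of `dataset`, then collect the trained parameters.
        weights = launch_paddle_training(self.config_file, dataset)
        return PaddleModel(weights)
```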

Page 14: Scalable Deep Learning Platform On Spark In Baidu

Code your Configuration
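As an illustration of code-level configuration, here is a small network definition written in the style of Paddle's early Python trainer_config_helpers (circa 2016). The import path and helper names are assumptions from that era and may differ between Paddle releases; the point is that the topology is ordinary Python and can be generated and checked programmatically rather than written by hand.

```python
# Illustrative code-level network configuration in the style of Paddle's early
# Python config helpers; names below are assumptions and may vary by release.
from paddle.trainer_config_helpers import *  # assumed v1-era helpers

settings(batch_size=128, learning_rate=0.01)

img = data_layer(name="image", size=28 * 28)        # flattened input image
lbl = data_layer(name="label", size=10)             # class label

hidden = fc_layer(input=img, size=256, act=ReluActivation())
prediction = fc_layer(input=hidden, size=10, act=SoftmaxActivation())

outputs(classification_cost(input=prediction, label=lbl))
```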

Page 15: Scalable Deep Learning Platform On Spark In Baidu

Example of Caffe prototxt

Page 16: Scalable Deep Learning Platform On Spark In Baidu

Design decisions
• Spark ML compatible API
  – Being compatible with Spark matters more than being implemented inside Spark
• Code-level configuration
  – Easy and flexible
  – Manual configuration is error-prone

Page 17: Scalable Deep Learning Platform On Spark In Baidu

PADDLE: Scalable Deep Learning Platform at Baidu

Page 18: Scalable Deep Learning Platform On Spark In Baidu

Sharded Parameter Server
• One parameter server shard and one trainer co-locate on each machine.
• Parameters are sharded, not replicated.
• All-to-all communication (see the sketch below).
• Our environment:
  – 4 GPUs per machine
  – 4-10 machines
  – all machines on one switch
  – reliable data center
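A minimal sketch, assuming plain NumPy and single-process objects in place of real machines, of the sharded layout described above: parameter blocks are partitioned across shards rather than replicated, and every trainer pushes gradients to and pulls fresh values from every shard, which is what makes the traffic all-to-all. This is illustrative, not Paddle's implementation.

```python
# Minimal sketch of a sharded parameter server: parameters are partitioned
# across shards, and every trainer exchanges data with every shard.
import numpy as np

NUM_SHARDS = 4                       # e.g. one shard co-located with each trainer


def shard_of(block_id):
    """Map a parameter block to the shard that owns it."""
    return block_id % NUM_SHARDS


class ParameterShard:
    def __init__(self):
        self.blocks = {}             # block_id -> np.ndarray owned by this shard

    def push(self, block_id, grad, lr=0.01):
        """Apply a gradient sent by any trainer (plain SGD update)."""
        block = self.blocks.setdefault(block_id, np.zeros_like(grad))
        block -= lr * grad

    def pull(self, block_id):
        return self.blocks[block_id]


shards = [ParameterShard() for _ in range(NUM_SHARDS)]

# One trainer's step: compute gradients for all blocks, send each block to the
# shard that owns it, then pull fresh values back from all shards.
grads = {i: np.random.randn(1024).astype(np.float32) for i in range(8)}
for block_id, grad in grads.items():
    shards[shard_of(block_id)].push(block_id, grad)
latest = {block_id: shards[shard_of(block_id)].pull(block_id) for block_id in grads}
```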

Page 19: Scalable Deep Learning Platform On Spark In Baidu

GPU Ring Synchronization
• Each parameter only needs to go through the slow connection twice (sketched below):
  – once for the reduce pass
  – once for the scatter pass
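A NumPy simulation of the ring pattern, not Paddle's GPU code: each worker's gradient is split into one chunk per worker, a reduce pass sums each chunk as it travels around the ring, and a scatter (all-gather) pass circulates the finished sums, so every element crosses the slow link exactly twice.

```python
# Simulation of ring synchronization: reduce pass then scatter/all-gather pass.
import numpy as np

N = 4                                                  # workers (GPUs) in the ring
grads = [np.random.randn(8).astype(np.float32) for _ in range(N)]
chunks = [np.array_split(g, N) for g in grads]         # chunks[worker][chunk_id]

# Reduce pass: after N-1 steps, worker w holds the full sum of chunk (w + 1) % N.
for step in range(N - 1):
    for w in range(N):
        c = (w - step) % N                             # chunk worker w forwards now
        chunks[(w + 1) % N][c] = chunks[(w + 1) % N][c] + chunks[w][c]

# Scatter pass: each worker forwards its finished chunk around the ring until
# every worker holds every fully summed chunk.
for step in range(N - 1):
    for w in range(N):
        c = (w + 1 - step) % N
        chunks[(w + 1) % N][c] = chunks[w][c]

expected = sum(grads)
assert all(np.allclose(np.concatenate(chunks[w]), expected) for w in range(N))
```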

Page 20: Scalable Deep Learning Platform On Spark In Baidu

ImageNet Scale Experiments
[Chart: time per 100 batches (seconds) versus number of machines (1-5), comparing TCP and RDMA.]
• AlexNet on ImageNet
• batch size = 64
• 1 machine has 4 K10 GPUs

Page 21: Scalable Deep Learning Platform On Spark In Baidu

Sparse Training

Page 22: Scalable Deep Learning Platform On Spark In Baidu

Sparse Training Experiment
[Chart: time per 100 batches (seconds) versus number of nodes (1, 2, 4, 8, 16), comparing non-sparse and sparse training.]
• 1,451,594-dimensional sparse feature
• embedded to 128d, 96d, 96d, and 128d
• ranking cost on top
• batch size = 128
(the sparse-update idea is sketched below)
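A rough sketch, under the assumption of a simple dictionary-backed embedding table rather than Paddle's data structures, of why sparse training wins here: a batch activates only a small set of the 1,451,594 feature ids, so only those embedding rows need gradient updates and parameter-server traffic.

```python
# Sketch of sparse-row updates for a wide sparse feature (illustrative only).
import numpy as np

VOCAB, EMB = 1_451_594, 128      # dimensions taken from the experiment slide
embedding = {}                    # feature id -> 128-d row, created lazily


def sparse_update(active_ids, row_grads, lr=0.01):
    """Apply gradients only to the embedding rows touched by the batch."""
    for fid, grad in zip(active_ids, row_grads):
        row = embedding.setdefault(int(fid), np.zeros(EMB, dtype=np.float32))
        row -= lr * grad


# A batch of 128 examples activates only a few thousand distinct ids, so the
# work and the communication scale with the active rows, not with VOCAB.
active_ids = np.random.choice(VOCAB, size=3000, replace=False)
row_grads = np.random.randn(len(active_ids), EMB).astype(np.float32)
sparse_update(active_ids, row_grads)
```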

Page 23: Scalable Deep Learning Platform On Spark In Baidu

Flexible and Efficient RNN Implementation

Page 24: Scalable Deep Learning Platform On Spark In Baidu

RNN Performance Comparison with TensorFlow
[Chart: time per batch (milliseconds) versus RNN hidden size (200, 650, 1500), comparing PADDLE and TensorFlow.]
• machine translation task
• batch size = 64
• embedding size = hidden_size
• dictionary size = 10000

Page 25: Scalable Deep Learning Platform On Spark In Baidu

Distributed Training Performance: Character Neural Machine Translation
• 8 machines, each with 4 K40 GPUs
• RNN encoder layers: 9, RNN decoder layers: 7
• word embedding size: 256, RNN size: 512
• batch size: 25,000 characters
• Speed:
  – attention: 25 minutes / 100 batches
  – encoder-decoder: 9 minutes / 100 batches

Page 26: Scalable Deep Learning Platform On Spark In Baidu

Future work
• Streaming training
• Dynamic trainer allocation
• FairScheduler
• Model serving

Page 27: Scalable Deep Learning Platform On Spark In Baidu

THANK YOU.

zhangweide/kyletsai/[email protected]