Large Scale Distributed Deep Networks

Yifu Huang

School of Computer Science, Fudan University

[email protected]

COMP630030 Data Intensive Computing Report, 2013


Outline

1 Motivation

2 Software Framework: DistBelief

3 Distributed Algorithm: Downpour SGD, Sandblaster L-BFGS

4 Experiments

5 Discussion


Motivation

Why did I choose this paper [1]?

Google + Stanford; Jeffrey Dean + Andrew Y. Ng

Why do we need large-scale distributed deep networks?

Large models can dramatically improve performance

Training examples + model parameters

Existing methods have limitations

GPU, MapReduce, GraphLab

What can we learn from this paper?

The best parallelism design ideas for deep networks to date

Model parallelism + data parallelism


Preliminaries

Neural Networks [6]

Deep Networks [7]

Multiple hidden layers: these allow us to compute much more complex features of the input (a minimal sketch follows)
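To make the depth point concrete, here is a minimal NumPy sketch of a feed-forward pass through stacked hidden layers. The layer sizes, tanh activations, and random weights are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def forward(x, weights, biases):
    """Feed-forward pass: each hidden layer re-represents the output of the
    layer below, so stacking layers composes increasingly complex features."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)  # the nonlinearity is what makes depth pay off
    return a

# Illustrative shape: 8 inputs -> two hidden layers of 16 units -> 2 outputs
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 2]
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.standard_normal(8), weights, biases))
```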


Software Framework: DistBelief

Model parallelism

"Inside" parallelism: multi-threading + message passing -> large scale (see the sketch below)

User defines

Computation in each node, messages passed upward/downward

Framework manages

Synchronization, data transfer

Performance depends on

Connectivity structure, computational needs
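As a minimal sketch of that division of labor, the toy below partitions a network one layer per process and uses multiprocessing pipes as the framework-managed message channel; the partition scheme and sizes are assumptions for illustration. DistBelief itself is a C++ framework that partitions arbitrary network graphs across machines, so this shows only the user-code/framework split, not the real system.

```python
import numpy as np
from multiprocessing import Pipe, Process

def layer_worker(W, inbox, outbox):
    """User-defined part: the computation inside one partition of the model.
    The pipes stand in for the framework, which moves activations across
    machine boundaries wherever a network edge crosses the partition."""
    x = inbox.recv()              # message from the partition below
    outbox.send(np.tanh(W @ x))   # message to the partition above

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((16, 8))   # layer 1, owned by process 1
    W2 = rng.standard_normal((4, 16))   # layer 2, owned by process 2
    a_src, a_dst = Pipe()   # input   -> layer 1
    b_src, b_dst = Pipe()   # layer 1 -> layer 2
    c_src, c_dst = Pipe()   # layer 2 -> final output
    p1 = Process(target=layer_worker, args=(W1, a_dst, b_src))
    p2 = Process(target=layer_worker, args=(W2, b_dst, c_src))
    p1.start(); p2.start()
    a_src.send(rng.standard_normal(8))  # feed the input layer
    print(c_dst.recv())                 # read the top-layer activations
    p1.join(); p2.join()
```

Performance then depends, as the slide says, on how many edges cross the partition boundary relative to the computation inside each partition.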


Software Framework: DistBelief (cont.)

An example of model parallelism in DistBelief


Distributed Algorithm

Data parallelism

"Outside" parallelism: multiple model instances optimize a single objective -> high speed

A centralized sharded parameter server (see the sketch below)

Different model replicas retrieve/update their own parameters

Load balancing, robustness

Tolerates variance in the processing speed of different model replicas, and even the wholesale failure of replicas, which may be taken offline or restarted at random
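A single-process sketch of the sharded parameter server, with hypothetical fetch/push methods; in the real system each shard lives on its own machine and model replicas talk to the shards over the network, asynchronously.

```python
import numpy as np

class ShardedParameterServer:
    """Toy sharded parameter server: the parameter vector is split across
    shards, and a replica fetches and pushes only the shards its partition
    of the model needs, so no single node serves the whole model."""
    def __init__(self, dim, n_shards):
        self.shards = np.array_split(np.zeros(dim), n_shards)

    def fetch(self, shard_id):
        return self.shards[shard_id].copy()

    def push(self, shard_id, grad, lr=0.1):
        self.shards[shard_id] -= lr * grad  # apply a replica's gradient

# A replica touches only its own shards; the rest are unaffected.
ps = ShardedParameterServer(dim=10, n_shards=5)
w = ps.fetch(2)
ps.push(2, grad=np.ones_like(w))
print(ps.fetch(2))   # shard 2 moved; shards 0, 1, 3, 4 still zero
```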


Downpour SGD

SGD [8]

Minimize the objective function F(ω); update parameters ω′ = ω − η∆ω

Asynchronous SGD [3]

Downpour

Parameters retrieved and updated on a massive scale

AdaGrad learning rate [4]: $\eta_{i,K} = \gamma / \sqrt{\sum_{j=1}^{K} \Delta\omega_{i,j}^{2}}$

Improves both robustness and scale


Downpour SGD (cont.)

Algorithm visualization


Downpour SGD (cont.)

Algorithm pseudo code
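The pseudo code on this slide is an image in the original PDF and does not survive extraction. As a stand-in, here is a minimal single-machine sketch of the Downpour idea: threads play the role of model replicas, a toy quadratic stands in for F(ω), updates to the shared state are deliberately lock-free, and each coordinate gets the AdaGrad rate from the previous slide. The constants and the minibatch noise model are illustrative assumptions.

```python
import threading
import numpy as np

# Shared globals stand in for the sharded parameter server; replicas read
# and write them asynchronously and without locks, as in Downpour.
w_true = np.array([1.0, -2.0, 3.0, 0.5])  # toy target parameters
w = np.zeros_like(w_true)                  # shared parameters
accum = np.zeros_like(w_true)              # per-coordinate sum of squared grads
gamma = 0.5                                # constant scale in the AdaGrad rate

def replica(steps, seed):
    global w, accum
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        local_w = w.copy()                             # fetch parameters
        grad = 2.0 * (local_w - w_true)                # grad of ||w - w_true||^2
        grad += 0.1 * rng.standard_normal(grad.shape)  # minibatch noise
        accum = accum + grad ** 2                      # AdaGrad accumulator
        eta = gamma / np.sqrt(accum + 1e-8)            # eta_{i,K}, per coordinate
        w = w - eta * grad                             # push update, no lock

replicas = [threading.Thread(target=replica, args=(200, s)) for s in range(4)]
for t in replicas: t.start()
for t in replicas: t.join()
print("learned:", np.round(w, 2), "target:", w_true)
```

The stale reads and racy writes are the point: Downpour tolerates them in exchange for throughput, which is also the consistency drawback noted later in the Discussion.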


Sandblaster L-BFGS

BFGS [9]

An iterative method for unconstrained nonlinear optimization that computes an approximation B to the Hessian matrix; Limited-memory BFGS [10] avoids ever forming B (see the sketch below)

Sandblaster

Massive numbers of commands issued by a coordinator

Load balancing scheme

Work assigned dynamically by the coordinator
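For the L-BFGS background, here is a sketch of the standard two-loop recursion, which produces an approximate Newton direction from the last few (s, y) pairs without ever forming the Hessian approximation B; this is textbook L-BFGS, not code from the paper.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: implicitly applies the inverse Hessian estimate to
    grad using only stored pairs s_k = w_{k+1} - w_k, y_k = g_{k+1} - g_k."""
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    if s_list:  # gamma_k * I is a standard choice for the initial H_0
        q *= s_list[-1].dot(y_list[-1]) / y_list[-1].dot(y_list[-1])
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * y.dot(q)
        q += (a - b) * s
    return q  # the step direction is -q
```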


Sandblaster L-BFGS (cont.)

Algorithm visualization


Sandblaster L-BFGS (cont.)

Algorithm pseudo code
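This slide's pseudo code is likewise an image in the original. Below is a minimal sketch of the load-balancing scheme it describes, under the paper's key idea: the coordinator splits each batch into far more small work portions than there are workers, so faster workers simply take more portions (the backup-task refinement near the end of a batch is omitted). Worker speeds are simulated and the numbers are illustrative.

```python
import queue
import threading
import time

# Coordinator: issue many more small portions than workers, so fast workers
# drain more of the queue. This is Sandblaster's dynamic work assignment.
work = queue.Queue()
for portion in range(100):     # ~12x more portions than workers below
    work.put(portion)

completed = {}                 # portions finished per worker

def worker(name, seconds_per_portion):
    while True:
        try:
            portion = work.get_nowait()   # ask the coordinator for work
        except queue.Empty:
            return                        # batch finished
        time.sleep(seconds_per_portion)   # simulate gradient computation
        completed[name] = completed.get(name, 0) + 1

workers = [threading.Thread(target=worker, args=(f"w{i}", 0.001 * (i + 1)))
           for i in range(8)]
for t in workers: t.start()
for t in workers: t.join()
print(completed)   # faster workers end up with more portions
```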


Experiments

Setup

Object recognition in still images [5]

Acoustic processing for speech recognition [2]

Model parallelism benchmarks


Experiments (cont.)

Optimization method comparisons

Application to ImageNet [2]

This network achieved a cross-validated classification accuracy of over 15%, a relative improvement of more than 60% over the best performance we are aware of on the 21k-category ImageNet classification task


Discussion

Contributions

Increases the scale and speed of deep network training

Drawbacks

There is no guarantee that parameters are consistent across model replicas


References I

[1] Large Scale Distributed Deep Networks. NIPS. 2012.

[2] Building High-level Features Using Large Scale Unsupervised Learning. ICML. 2012.

[3] Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS. 2011.

[4] Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR. 2011.

[5] Improving the Speed of Neural Networks on CPUs. NIPS. 2011.

[6] http://ufldl.stanford.edu/wiki/index.php/Neural_Networks

[7] http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview

[8] http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization


References II

[9] http://en.wikipedia.org/wiki/BFGS

[10] http://en.wikipedia.org/wiki/Limited-memory_BFGS
