
Page 1:

S. Eliuk Ph.D. CS (Project Lead, Chief Architect)

C. Upright M.Sc. CS (Lead Developer)

Computing Science Innovation Center

Samsung Research America

April 7th 2016

Demystifying Learning @ Scale: From Distributed Math to Effective HPC Infra

Page 2:

Outline

i. Why a distributed Mathematics (dMath) library?
   i. A distributed world means we need unique perspectives on algorithms

ii. Current Open-Source Offerings
   i. Audience perspective
   ii. Our sentiments

iii. {Data, Model, Hybrid} Parallelism
   i. Expresso (Caffe master) powered by dMath
   ii. Distributed Kaldi (Kaldi master) powered by dMath

iv. Best Known Practices
   i. Common issues in the NN pipeline
   ii. Tips & Tricks

v. Real-World Results
   i. Strong & weak scaling, training time

vi. HPC Environment
   i. What makes for an effective environment?

vii. Q & A

Page 3:

Deep Learning Platform:

A deep learning platform that enables Samsung to efficiently prototype distributed DL algorithms on cutting-edge hardware infrastructure for large-scale, computation-intensive applications and services.

[Diagram: the SRA DL Platform (large scale, distributed & multi-GPU, with model parallelism), fed by training data and serving applications such as speech translation, AIR, ASR, handwriting recognition, autonomous vehicles, object & speech recognition, machine translation & search, ad targeting, and scene understanding]

Page 4:

Deep Learning Platform: Framework

Current Approaches:
• Application (DNN – Kaldi ASR, Caffe AIR, Theano, NLP)
• CUDA, cuBLAS, cuDNN; LAPACK, FFT, general-purpose math
• Node 1: GPU #1 … GPU #16
• Tractability issues
• Does not “scale out” effectively
• Accuracy can degrade
• Small models

Samsung – SRA Approach:
• Application (DNN – Kaldi ASR, Caffe AIR, Theano, Torch, etc.)
• dMath framework over CUDA, cuBLAS, cuDNN; LAPACK, FFT, general-purpose math
• Node 1 … Node N, each with GPU #1 … GPU #16
• “Scales out” effectively
• Comprehensive library developed
• Larger models – frontier research
• Extended feature set
• Abstracts away complicated distributed & multi-GPU programming from the high-level machine learning user
• Integrates with existing open-source frameworks

Page 5:

Long Training Times Impact Time to Market

Jeffrey Dean, Senior Fellow, Google Brain

Effect of Experiment Time

Minutes, Hours

• Interactive investigation! Instant gratification!

• Parameter exploration

1-4 Days

• Tolerable

• Interactivity replaced by parallelization of experiments

1-4 Weeks

• High value experiments only

• Progress stalls

> 1 Month

• Don’t even try

(Keynote at the 2015 NVIDIA GPU Conference)

Page 6:

Current Open-Source Offerings: Discussion

Thoughts?

Page 7:

Current Open-Source Offerings:

[Comparison of frameworks grouped into Single-GPU, Multi-GPU, and Distributed]

Page 8:

Best Known Practices: Data Layout

[Figure: four example layouts (a)–(d) of the same matrix distributed across four workers; the numbers indicate which worker stores each block]

• Matrix data is divided into multiple blocks and distributed across the workers
• Arbitrary layouts are supported
• On the left are four examples of distributing the same matrix across four workers; the numbers represent the worker that stores each block
• Block size and location are important for efficiency
• Most algorithms require consistent block sizes; some require a consistent layout across the workers (see the layout sketch below)
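
A minimal sketch of one common convention, a block-cyclic mapping over a 2×2 worker grid; the struct and its owner() rule are illustrative assumptions, not dMath's API. Printing the owner of each block reproduces a layout of the kind shown in the figure.

    #include <cstdio>

    // Hypothetical block-cyclic layout over a grid_rows x grid_cols worker grid.
    // dMath supports arbitrary layouts; this shows only one common convention.
    struct BlockCyclicLayout {
        int grid_rows, grid_cols;   // worker grid, e.g. 2 x 2 for four workers
        // Owner (worker rank) of block (bi, bj), with ranks laid out row-major.
        int owner(int bi, int bj) const {
            return (bi % grid_rows) * grid_cols + (bj % grid_cols);
        }
    };

    int main() {
        BlockCyclicLayout layout{2, 2};          // four workers
        const int block_rows = 4, block_cols = 4;
        for (int bi = 0; bi < block_rows; ++bi) {
            for (int bj = 0; bj < block_cols; ++bj)
                std::printf("%d ", layout.owner(bi, bj));
            std::printf("\n");
        }
        return 0;
    }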

Page 9:

Best Known Practices: Data Layout and Algorithm Efficiency

[Figure: five examples of the same operations under different block layouts, from efficient to very inefficient]
• Inefficient: off-diagonal blocks require communication between GPUs
• Efficient: no communication between GPUs
• Efficient: no communication between GPUs
• Less efficient, due to memory access patterns
• Very inefficient: requires temporary buffers and communication between GPUs

Page 10:

Best Known Practices: Replication Motivation

• Often desirable for workers to have identical copies of a matrix (abundant memory and rarely changing data)
• Useful for parameters in a network
• Caching of a distributed weight matrix: the forward pass caches it, so the backward pass requires no communication
• When a matrix is changed, the workers automatically redistribute the matrix
• Equivalent to the distribution of parameters in a parameter server

Page 11:

Best Known Practices: Replication Motivation (AlexNet Example)

[Figure: AlexNet, highlighting its three fully connected layers]

• The matrix multiplication in the three fully connected layers shown above requires the weight matrix to be shared between the workers in the forward pass

• We cache this matrix, allowing the backward data gradient to be computed without any communication between GPUs

Page 12:

Best Known Practices: Replication Implementation

• Each worker stores an entire copy of the matrix, in addition to the blocks it is responsible for

• Replication can be temporarily suspended while multiple updates are made to a matrix, then resumed

Page 13:

Best Known Practices: Asynchronous Replication

• Parameters are updated all at once and then replicated in SGD training

• Updated parameters often not needed until much deeper into the network

• Replication is frequently followed by forward convolution layers (high computation, zero communication)

• Replace synchronous replication with asynchronous replication, overlapping parameter redistribution with computation (see the sketch below)
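
A minimal sketch of the idea in plain MPI (illustrative only, not dMath's implementation): a non-blocking broadcast re-replicates the just-updated parameters while the early convolution layers run, and the consumer waits only when it actually needs the weights. The layer functions are hypothetical stand-ins.

    #include <mpi.h>
    #include <vector>

    // Stand-ins for real layer computation; names are illustrative only.
    void forward_convolutions() { /* heavy compute, does not touch fc_weights */ }
    void forward_fully_connected(const std::vector<float>&) { /* uses fc_weights */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        std::vector<float> fc_weights(1 << 20, rank == 0 ? 1.0f : 0.0f);

        // Root has just applied the SGD update; start re-replication asynchronously.
        MPI_Request req;
        MPI_Ibcast(fc_weights.data(), (int)fc_weights.size(), MPI_FLOAT, 0,
                   MPI_COMM_WORLD, &req);

        // Overlap: the early (convolutional) layers do not need fc_weights.
        forward_convolutions();

        // Only block when the weights are actually needed.
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        forward_fully_connected(fc_weights);

        MPI_Finalize();
        return 0;
    }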

Page 14:

Best Known Practices: Matrix Multiplication

• One of the most important algorithms in dMath
• Probably the most difficult to optimize
• Built on cuBLAS routines for the individual blocks
• At typical deep-learning sizes, communication between GPUs and servers is the bottleneck
• Multiple implementations are needed for different circumstances:
   • General purpose
   • Variant of Fox's algorithm [1987]
   • Fully local (replicated weight matrix, local data)
• Currently over 15 different distributed algorithms for GEMM

Page 15:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers

[Figure: C = A × B, with the matrices decomposed into one-dimensional blocks across the workers]

Page 16:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers

Page 17:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers
• Multiplication divided into multiple stages, overlapping computation with cyclic asynchronous transfers
• Darkly colored blocks are currently being used in a computation
• Dark orange indicates initiation of a data transfer

Page 18:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers
• Multiplication divided into multiple stages, overlapping computation with cyclic asynchronous transfers
• Darkly colored blocks are currently being used in a computation
• Dark orange indicates initiation of a data transfer
• Light orange is a transfer still in progress

Page 19:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers
• Multiplication divided into multiple stages, overlapping computation with cyclic asynchronous transfers
• Darkly colored blocks are currently being used in a computation
• Dark orange indicates initiation of a data transfer
• Light orange is a transfer still in progress
• Many transfers and GEMM calls happen concurrently

Page 20:

Best Known Practices: Cyclic GEMM Implementation

• One dimensional decomposition of the matrices across workers
• Multiplication divided into multiple stages, overlapping computation with cyclic asynchronous transfers
• Darkly colored blocks are currently being used in a computation
• Dark orange indicates initiation of a data transfer
• Light orange is a transfer still in progress
• Many transfers and GEMM calls happen concurrently
• The process repeats again in the next outer stage (a sketch of the ring-cyclic loop follows below)

Page 21:

Best Known Practices: Inner Product Layer

• In a neural network, the cyclic GEMM routine transfers the weight matrix through the workers

• We take advantage of this by caching it on the forward pass
• On the backward pass, all workers have a copy of the weight matrix and can calculate the input data gradients without costly communication

Page 22:

Best Known Practices: Matrix Multiplication Scaling

• Can only distribute so far, due to communication overhead and cuBLAS GEMM efficiency with smaller blocks

• As matrices get larger, communication becomes less of an issue compared to computation, and scaling is less difficult

• In practice we copy our data to a subset of workers, and perform our GEMM routines on this subset

• Shown below is the time in milliseconds for the same multiplication, distributed over a different number of GPUs

[Chart: runtime in milliseconds (0 to 15 ms scale) for the same multiplication on 1, 2, 4, and 8 GPUs]

Page 23:

Best Known Practices: Matrix Multiplication Performance

• Shown below are runtimes for square matrix multiplications of various sizes

• dMath performance on a variety of GPUs is compared with a cuBLAS 7.5 baseline and cuBLAS-XT

Page 24:

Best Known Practices: Reductions

• Important to be cognizant of what MPI is doing under the hood
• CUDA-aware MPI_Reduce was copying to the CPU to perform the reduction
• Instead, we implemented a custom reduction that is aware of the hardware:
   • Transfer within a root complex is fast
   • QPI between root complexes is slow
   • A reduction is done on each root complex individually, then between root complexes using CUDA-aware MPI_Send / MPI_Recv (a plain-MPI sketch of this two-level reduction follows below)
   • The MPI_Send / MPI_Recv pair is further accelerated via a custom branch of MPI that allows intranode GDR, for extremely low latency and high bandwidth via Mellanox EDR (100 Gb/s)… more to come shortly on that one
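
A host-side sketch of the two-level reduction in plain MPI, with the root-complex grouping approximated by an illustrative rank-parity split; real code would query the PCIe topology and reduce GPU buffers directly.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        std::vector<float> grad(1 << 20, 1.0f);       // per-worker gradient buffer
        std::vector<float> partial(grad.size());

        // Group ranks that share a PCIe root complex; here we simply assume two
        // groups as an illustration (real code would query the topology).
        int local_group = rank % 2;
        MPI_Comm complex_comm;
        MPI_Comm_split(MPI_COMM_WORLD, local_group, rank, &complex_comm);

        // Stage 1: fast reduction within each root complex.
        int local_rank;
        MPI_Comm_rank(complex_comm, &local_rank);
        MPI_Reduce(grad.data(), partial.data(), (int)grad.size(), MPI_FLOAT,
                   MPI_SUM, 0, complex_comm);

        // Stage 2: the group leaders combine their partial results across the
        // slower QPI / network path with a point-to-point exchange.
        if (local_rank == 0 && size > 1) {
            if (local_group == 1) {
                MPI_Send(partial.data(), (int)partial.size(), MPI_FLOAT,
                         /* group-0 leader is global rank 0 here */ 0, 0,
                         MPI_COMM_WORLD);
            } else {
                std::vector<float> other(partial.size());
                MPI_Recv(other.data(), (int)other.size(), MPI_FLOAT,
                         MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (size_t i = 0; i < partial.size(); ++i) partial[i] += other[i];
            }
        }
        MPI_Comm_free(&complex_comm);
        MPI_Finalize();
        return 0;
    }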

Page 25:

Best Known Practices: Data Loading

• Data loading quickly becomes a bottleneck when scaling to multiple GPUs
• Not practical to load from a single process
• A threaded image loader runs on each worker, supporting both raw and encoded images (see the prefetching sketch below)
• Extra cores are required to handle the processing, and fast data access to feed it
• Caching in RAM is the best option for the database
• Randomization of batches has next to zero cost when cached in RAM
• Caching of pages in Linux happens on a per-socket basis (NUMA); significant performance degradation can occur if QPI gets involved, i.e. when page migration between NUMA nodes takes place
• Each worker loads from a subset of the entire database (QPI-aware caching)
• Optional direct caching in memory of the DB on each worker
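
A minimal sketch of the per-worker prefetching idea (illustrative only; the real loader also decodes images, shards the database, and is NUMA-aware): a background thread keeps a bounded queue of batches full while the training loop consumes them.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Batch = std::vector<float>;   // stand-in for a decoded minibatch

    class PrefetchLoader {
    public:
        explicit PrefetchLoader(size_t depth)
            : depth_(depth), done_(false), worker_([this] { produce(); }) {}
        ~PrefetchLoader() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            worker_.join();
        }
        Batch next() {                                   // called by the training loop
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            Batch b = std::move(q_.front());
            q_.pop();
            cv_.notify_all();                            // wake the producer
            return b;
        }
    private:
        void produce() {                                 // background loading thread
            while (true) {
                Batch b(256 * 3 * 224 * 224, 0.5f);      // "load + decode" placeholder
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return q_.size() < depth_ || done_; });
                if (done_) return;
                q_.push(std::move(b));
                cv_.notify_all();                        // wake the consumer
            }
        }
        size_t depth_;
        bool done_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Batch> q_;
        std::thread worker_;
    };

    int main() {
        PrefetchLoader loader(4);                        // keep up to four batches ready
        for (int iter = 0; iter < 10; ++iter) {
            Batch batch = loader.next();                 // training step consumes a batch
            (void)batch;                                 // forward/backward would go here
        }
        return 0;
    }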

Page 26:

Best Known Practices: Logging, Debugging & Profiling

Most important, left for last: logging, debugging and profiling
• Flexible per-process logging is indispensable (master and workers)
• Inevitably, difficult bugs will arise, sometimes due to rare and subtle timing variations between processes
• Flexible and comprehensive logging records the sequence of events
• Scripting then analyzes those logs
• Logging affects timing in the application, and will easily hide bugs
• A separate logging thread can be used to minimize this effect (a minimal sketch follows below):
   • The main thread places data or a function object in a queue
   • The logging thread processes this and logs it
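
A minimal sketch of such a logging thread (not dMath's logger): callers enqueue cheap function objects, and a dedicated thread formats and writes them, keeping I/O off the timing-critical path.

    #include <condition_variable>
    #include <fstream>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class AsyncLogger {
    public:
        explicit AsyncLogger(const std::string& path)
            : out_(path), done_(false), worker_([this] { run(); }) {}
        ~AsyncLogger() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();                              // drains remaining records
        }
        // Enqueue a deferred log record; the lambda runs on the logging thread.
        void log(std::function<std::string()> make_record) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(make_record)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<std::string()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                    if (q_.empty()) return;              // done and drained
                    job = std::move(q_.front());
                    q_.pop();
                }
                out_ << job() << '\n';                   // formatting happens here
            }
        }
        std::ofstream out_;
        bool done_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<std::string()>> q_;
        std::thread worker_;
    };

    int main() {
        AsyncLogger logger("worker_3.log");              // per-process log file
        for (int job_id = 0; job_id < 5; ++job_id)
            logger.log([job_id] { return "dispatched job " + std::to_string(job_id); });
        return 0;                                        // destructor flushes the queue
    }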

Page 27:

Best Known Practices: Logging, Debugging & Profiling

• Profiling is trivial to implement in this framework; basic timing code and logging, combined with scripting, help locate bottlenecks
• Performance can be measured on a per-layer or per-layer-type basis, using a variety of metrics (computed as in the sketch below):
   • Average time
   • Speedup ratio
   • Percentage efficiency
   • Scaling residual (time that could be saved with perfect linear scaling)
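
One way to read those metrics, sketched below (an illustration, not dMath's profiler): with t1 the per-layer time on one GPU and tN the time on N GPUs, speedup = t1/tN, efficiency = speedup/N, and the scaling residual is tN minus the ideal t1/N.

    #include <cstdio>

    struct LayerScaling {
        double speedup;      // t1 / tN
        double efficiency;   // speedup / N, as a fraction
        double residual_ms;  // tN - t1/N: time that perfect linear scaling would save
    };

    LayerScaling analyze(double t1_ms, double tN_ms, int num_gpus) {
        LayerScaling s;
        s.speedup = t1_ms / tN_ms;
        s.efficiency = s.speedup / num_gpus;
        s.residual_ms = tN_ms - t1_ms / num_gpus;
        return s;
    }

    int main() {
        // Hypothetical per-layer averages: 12 ms on 1 GPU, 2.1 ms on 8 GPUs.
        LayerScaling s = analyze(12.0, 2.1, 8);
        std::printf("speedup %.2fx, efficiency %.0f%%, residual %.2f ms\n",
                    s.speedup, 100.0 * s.efficiency, s.residual_ms);
        return 0;
    }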

Page 28:

Best Known Practices: Logging, Debugging & Profiling

Sample profiling CSV for AlexNet:

Page 29:

Tips and Tricks

• Any serial portion of the program will hurt scaling (Amdahl's law)
• Dispatching jobs to the workers has a small overhead (dozens of microseconds)
• Additional time is spent broadcasting the metadata and setting up the data structures required to perform a distributed job
• For large jobs this overhead is not a problem
• Scaling to more GPUs without increasing batch size (strong scaling) presents significant issues (small convolutions in GoogLeNet)
• Can be alleviated by avoiding lazy solutions, e.g. fusing two jobs into one (see the equivalence sketch below):

  Before:
      history_data->scale(momentum);
      history_data->addMatrix(local_rate, *param_diff);

  After:
      history_data->addMatrix(momentum, *history_data, local_rate, *param_diff);
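
Both versions compute history = momentum * history + local_rate * param_diff; the fused form simply does it in one dispatched job instead of two. A scalar sketch of the equivalence (the method names above mirror the slide; exact dMath signatures are not shown here):

    #include <cassert>
    #include <cmath>

    int main() {
        const float momentum = 0.9f, local_rate = 0.01f;
        float history = 2.0f, param_diff = 4.0f;

        // "Before": two passes over the history buffer (two dispatched jobs).
        float before = history;
        before *= momentum;                        // history_data->scale(momentum)
        before += local_rate * param_diff;         // history_data->addMatrix(local_rate, diff)

        // "After": one fused pass (a single dispatched job),
        // history = momentum * history + local_rate * param_diff.
        float after = momentum * history + local_rate * param_diff;

        assert(std::fabs(before - after) < 1e-6f);
        return 0;
    }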

Page 30:

Tips and Tricks

• Combine common operations together into a single job
• Backward convolutions involve computing data, filter and bias gradients; these can be done all at once
• This allows multiple CUDA streams to be used
• Overlapping the filter and bias gradient reductions with the data gradient computation helps optimize performance
• Avoid multiple MPI calls when possible; instead copy the data into a single buffer and do a single call (e.g. filter and bias gradient reductions, as sketched below)
• Implement batched versions of jobs (e.g. the matrix addition used in the parameter update)
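
A host-side sketch of packing the filter and bias gradients into one reduction call (plain MPI_Allreduce on host buffers for illustration; the buffer names and sizes are hypothetical):

    #include <mpi.h>
    #include <cstring>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        std::vector<float> filter_grad(9 * 64 * 64, 1.0f);  // example sizes
        std::vector<float> bias_grad(64, 1.0f);

        // Pack both gradients into one contiguous buffer ...
        std::vector<float> packed(filter_grad.size() + bias_grad.size());
        std::memcpy(packed.data(), filter_grad.data(),
                    filter_grad.size() * sizeof(float));
        std::memcpy(packed.data() + filter_grad.size(), bias_grad.data(),
                    bias_grad.size() * sizeof(float));

        // ... so a single MPI call replaces two (one latency hit instead of two).
        MPI_Allreduce(MPI_IN_PLACE, packed.data(), (int)packed.size(),
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        // Unpack the reduced results.
        std::memcpy(filter_grad.data(), packed.data(),
                    filter_grad.size() * sizeof(float));
        std::memcpy(bias_grad.data(), packed.data() + filter_grad.size(),
                    bias_grad.size() * sizeof(float));

        MPI_Finalize();
        return 0;
    }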

Page 31:

Tips and Tricks

• Caching of job data structures and metadata can significantly reduce parallelization overhead and is incredibly useful for fixed pipelines:
   • No metadata is sent (other than the cached job id)
   • Only the computation is performed, no initialization or cleanup
   • Eliminates as much of the parallelization overhead as possible
• For the Nvidia Tesla K80, we have found much better performance by locking the clocks at 758 MHz and disabling auto-boost, e.g. ~5-10% for multi-GPU jobs
• Registering memory with the IB driver is costly; reuse your buffers to prevent repeated registration, i.e. use a memory manager of some sort, with various sized allocations that are then reused (see the pool sketch below)
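
A minimal sketch of such a memory manager (illustrative only): a size-bucketed pool that hands buffers back out instead of freeing them, so the costly IB registration would be paid only on the first allocation of each size.

    #include <cstddef>
    #include <map>
    #include <vector>

    // Buffers are returned to the pool instead of being freed, so (in the real
    // system) their InfiniBand registration is paid only once per buffer.
    class BufferPool {
    public:
        float* acquire(std::size_t count) {
            auto& bucket = free_[count];
            if (!bucket.empty()) {                 // reuse an existing buffer
                float* p = bucket.back();
                bucket.pop_back();
                return p;
            }
            // First use of this size: allocate (and, in the real system,
            // register the memory with the IB driver here).
            return new float[count];
        }
        void release(float* p, std::size_t count) {
            free_[count].push_back(p);             // keep it around for reuse
        }
        ~BufferPool() {
            for (auto& kv : free_)
                for (float* p : kv.second) delete[] p;
        }
    private:
        std::map<std::size_t, std::vector<float*>> free_;
    };

    int main() {
        BufferPool pool;
        for (int iter = 0; iter < 100; ++iter) {
            float* grads = pool.acquire(1 << 20);  // same size every iteration
            // ... fill, reduce, apply ...
            pool.release(grads, 1 << 20);          // no re-registration next time
        }
        return 0;
    }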

Page 32:

Real World: soumith/convnet-benchmarks -- 128 Batch, i.e. Strong Scaling

• Expresso: 121.8 ms average on one Titan
• Expresso: 64.5 ms average on two Titans
• Expresso: 48.6 ms average on four Titans
• Expresso: 39.1 ms average on eight Titans

Page 33:

Real World: Training AlexNet -- 256 Batch, i.e. Strong Scaling

[Chart: AlexNet 256-batch strong scaling, comparing Expresso, CNTK, Caffe, and Nvidia-Caffe (0.14) on 1, 4, 8, 16, 32, and 64 GPUs; vertical axis 0 to 2500]

Page 34:

Real World: Training Time

AlexNet 1024 Batch:

• Caffe -- 8 K80s -- 17h 20min

• NV-Caffe 0.14 -- 8 K80s -- 12h 15min

• Expresso -- 8 K80s -- 07h 40min

• Expresso -- 32 K80s -- 05h 30min

GoogLeNet v1 1024 Batch:

• Caffe -- 8 K80s -- 03d 20hrs

• Expresso -- 8 K80s -- 02d 12hrs

• Expresso -- 32 K80s -- 01d 06hrs

Scaling the batch from a single GPU to 64 GPUs results in the same accuracy, e.g. AlexNet 256: 58.59 ± 0.5% top-1. Expresso results are for our version of hybrid parallelism [Krizhevsky14].

Page 35:

H/W: Samsung Advanced Learning v1.0

Via: custom branch of OpenMPI & Mvapich2 - PCC

Page 36:

H/W: Samsung Advanced Learning v1.0 & v2.0

Two variants: dual socket, one Mellanox InfiniBand EDR (100 Gb/s) per socket, connected directly to a PCIe v3.0 96-lane switch, with
1. 8 Nvidia Tesla K80s per node, 8 nodes, 128 GPUs per rack, or
2. 8 Nvidia Tesla M40 24 GB per node, 12 nodes, 96 GPUs per rack.

Remember to have an adequate number of cores per socket to sufficiently power the GPUs: a minimum of 1.25 physical cores per GPU.

Page 37:

Recap:

• Current open-source offerings are incredible but have many shortcomings
• Data layout is very important; really consider the layout, and the optimal layout is often to distribute to only a subset of workers
• Cache data to prevent unnecessary communication, a huge benefit on the backward pass
• Async replication: overlap computation & communication
• GEMM: need an arsenal of algorithms with consideration of the underlying h/w architecture

Page 38:

Recap:

• Work first for strong scaling & secondly weak scaling
• Likely need to augment the chosen MPI version w/ custom routines for reductions, broadcasts, intranode GDR, etc.
• Distributed, multi-threaded data loading is important, but do not forget NUMA nodes
• Most importantly, have solid logging, profiling, and test suites from the beginning

Page 39:

Publications:

• dMath architecture paper available on arXiv: http://arxiv.org/abs/1604.01416
• A few more under submission, so expect a bunch from us in Q2-Q4

Page 40:

Employment Opportunities:

[email protected]

Page 41:

Acknowledgements & Additional Information

http://arxiv.org/abs/1604.01416

Page 42:

Q & A

?

http://arxiv.org/abs/1604.01416