Deep Learning on GPU Clusters
Bryan Catanzaro
Machine Learning
• ML runs many things these days
– Ad placement / Product Recommendations
– Web search / image search
– Speech recognition / machine translation
– Autonomous driving
– …
Machine learning in practice
[Diagram: input → Feature Extraction (built from prior knowledge) → Machine Learning classifier (trained from experience) → “Mug”]
How does ML work?
[Diagram: Pixels → Feature Extraction → Classifier → “Mug”?; training updates the classifier model.]
Learning features
• Can we learn “features” from data?
[Diagram: Pixels → Feature Extraction → Classifier → “Mug”?]
Update the entire stack to make better predictions. AKA: “deep learning”.
Learning features
• Deep learning: learn multiple stages of features to achieve end goal.
[Diagram: Pixels → Features → Features → Classifier → “Mug”?]
Progress in AI
[Diagram: Idea → Code → Train → Test loop]
Deep learning brings special opportunities and challenges: iteration latency gates progress, so what matters is the fastest time from idea to tested model.
The need for scaling
• DistBelief [Le et al., ICML 2012]: up to 1.7-billion-parameter networks.
• An unsupervised learning algorithm with >1 billion parameters was able to discover “objects” in high-res images.
• 1,000 machines for 1 week (16,000 cores).
[Figure: learned detectors for faces, bodies, and cats.]
[Also: Dean et al., NIPS 2012]
What will the rest of us do??
• Millions of $$ of hardware.
• Extensive engineering to handle node failures and network traffic.
• Hard to scale beyond the data center.
  – …if you had a data center.
Two ways to scale neural networks
• Simple solution: “data parallelism”.
  – Parallelize over images in a batch.
  – Need to synchronize the model across machines.
  – Difficult to fit big models on GPUs.
[Diagram: Machine 1 holds a model replica and processes Image 1; Machine 2 processes Image 2; the replicas periodically sync.]
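As a concrete illustration (not code from the talk), here is a minimal CUDA/MPI sketch of the data-parallel sync step: every rank holds a full model replica, computes gradients on its own slice of the batch, and the replicas average gradients with an all-reduce before each update. It assumes a CUDA-aware MPI build so device pointers can be passed to MPI directly.

```cuda
// Data-parallel gradient sync: one MPI process per model replica.
// Assumes CUDA-aware MPI (device pointers may be passed to MPI calls).
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void scale(float* g, size_t n, float s) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) g[i] *= s;
}

void sync_gradients(float* d_grad, size_t n, MPI_Comm comm) {
    int ranks;
    MPI_Comm_size(comm, &ranks);
    // Sum this replica's gradients with every other replica's, in place.
    MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)n, MPI_FLOAT, MPI_SUM, comm);
    // Average, so every replica applies the identical update and stays in sync.
    scale<<<(unsigned)((n + 255) / 256), 256>>>(d_grad, n, 1.0f / ranks);
}
```

The all-reduce is the “Sync.” arrow in the diagram; its cost grows with model size, which is exactly the problem quantified on the network-bottleneck slide below.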
Two ways to scale neural networks
• “Model parallelism”
  – Parallelize over neurons. (Relies on local connectivity.)
  – Scales to much larger models, but requires much more frequent synchronization.
[Diagram: one network, processing Image 1, split across Machine 1 and Machine 2.]
The network bottleneck
• On a typical Ethernet cluster:
  – Data parallelism: synchronizing a 1B-parameter model takes ~30 seconds.
  – Model parallelism: moving 1 MB of neuron responses for 100 images takes ~0.8 seconds, and this must be done for every layer; typically >>10x slower than the computation.
• Problem: communication makes distribution very inefficient for large neural nets.
  – How do we scale out efficiently?
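These figures are consistent with a back-of-envelope check (not from the slides), assuming 4-byte floats and gigabit Ethernet at ~125 MB/s: a 1B-parameter model is ~4 GB, and 4 GB ÷ 125 MB/s ≈ 32 s; likewise, 1 MB of responses × 100 images = 100 MB, and 100 MB ÷ 125 MB/s = 0.8 s.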
COTS HPC Hardware
• InfiniBand:
  – FDR InfiniBand switch.
  – 1 network adapter per server.
  – 56 Gbps; microsecond latency.
• GTX 680 GPUs:
  – 4 GPUs per server.
  – >1 TFLOPS each for an ideal workload.
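Continuing the rough arithmetic above (an estimate, not a figure from the talk): 56 Gbps is roughly 7 GB/s per link, so the ~4 GB model sync that took ~30 s over gigabit Ethernet drops to well under a second, before protocol overheads.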
Model parallelism in MPI
• MPI starts a single process for each GPU.
  – Enables message passing, but this is surprisingly unnatural.
[Diagram: Image 1's network split across GPU 1 and GPU 2, holding weight blocks W1 and W2.]
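A minimal sketch of the one-process-per-GPU pattern, with hypothetical sizes (an even number of ranks and CUDA-aware MPI are assumed). The slide's point shows up directly in the code: the activation exchange is explicit message passing, and it recurs at every layer.

```cuda
// Model parallelism with MPI: each rank drives one GPU and owns one
// weight block (W1 or W2); neuron activations are exchanged by message.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank % 4);                  // 4 GPUs per server, as in the talk

    const int n_local = 1 << 20;              // activations owned by this rank
    float *d_mine, *d_peer;
    cudaMalloc((void**)&d_mine, n_local * sizeof(float));
    cudaMalloc((void**)&d_peer, n_local * sizeof(float));
    // ... compute this rank's activations into d_mine ...

    // Swap activations with the partner rank; this happens at every layer,
    // which is why the interconnect has to be fast.
    int peer = rank ^ 1;                      // pair ranks (0,1), (2,3), ...
    MPI_Sendrecv(d_mine, n_local, MPI_FLOAT, peer, 0,
                 d_peer, n_local, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // ... apply this rank's weight block to [d_mine | d_peer] ...
    MPI_Finalize();
    return 0;
}
```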
HPC Software Infrastructure: Communication
• Moving neuron responses around is confusing.
  – Hide the communication inside a “distributed array”.
[Diagram: neuron responses partitioned across GPUs 1–4 as a single distributed array.]
HPC Software Infrastructure: Communication
• After some hidden communication, GPU 2 has all the input data it needs.
  – The GPU code is not much different from the 1-GPU case.
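A sketch of the distributed-array idea; the type and method names here are hypothetical, not the authors' actual API. The array owns the partitioning, and one gather call hides the MPI exchange, so the per-GPU layer code reads like single-GPU code.

```cuda
// Hypothetical "distributed array": shards of neuron responses live on
// different GPUs, and gather() hides the communication (CUDA-aware MPI).
#include <mpi.h>
#include <cuda_runtime.h>

struct DistArray {
    float*   d_local;   // this GPU's shard of the responses
    int      n_local;   // elements per shard (equal shards assumed)
    MPI_Comm comm;

    // Assemble every shard into one dense device buffer on each GPU.
    // After this returns, the layer kernel needs no knowledge of the cluster.
    void gather(float* d_full) const {
        MPI_Allgather(d_local, n_local, MPI_FLOAT,
                      d_full,  n_local, MPI_FLOAT, comm);
    }
};
```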
Results: Scaling
• Implemented the 9-layer network from Le et al., 2012.
  – 3 stacks of sparse auto-encoder, pooling, and LCN (local contrast normalization).
  – Compute the “fine-tuning” gradient.
• Up to 11.2B-parameter networks.
  – Update time similar to a 185M-parameter network on 1 GPU.
[Figure: iteration time (s) vs. number of GPUs (1–64) for networks from 185M to 11.2B parameters.]
Results: Scaling
• Up to 47x increase in throughput:
[Figure: speedup factor vs. number of GPUs (1–64) for networks from 185M to 11.2B parameters, with a linear-scaling reference.]
cuDNN
• Deep Neural Networks rely heavily on BLAS – Basic Linear Algebra Subroutines
• However, there are some kernels unique to DNNs – Such as convolutions
• cuDNN is a GPU library that provides these kernels
• Available at https://developer.nvidia.com/cudnn
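For flavor, a minimal sketch of one forward convolution through cuDNN, written against the cuDNN 6+ descriptor API (later than the version announced in this talk); the shapes are made up (AlexNet-like first layer), error checking is omitted, and a valid handle plus device buffers d_x, d_w, d_y are assumed to exist.

```cuda
// One convolution layer through cuDNN (descriptor setup + forward pass).
#include <cudnn.h>

void conv_forward(cudnnHandle_t h, const float* d_x, const float* d_w, float* d_y) {
    cudnnTensorDescriptor_t xd, yd;
    cudnnFilterDescriptor_t fd;
    cudnnConvolutionDescriptor_t cd;
    cudnnCreateTensorDescriptor(&xd);
    cudnnCreateTensorDescriptor(&yd);
    cudnnCreateFilterDescriptor(&fd);
    cudnnCreateConvolutionDescriptor(&cd);

    // Minibatch of 100 RGB images, 224x224 (hypothetical shapes).
    cudnnSetTensor4dDescriptor(xd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               100, 3, 224, 224);
    // 96 filters of size 3x11x11; stride 4, no padding, no dilation.
    cudnnSetFilter4dDescriptor(fd, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               96, 3, 11, 11);
    cudnnSetConvolution2dDescriptor(cd, 0, 0, 4, 4, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // Let cuDNN derive the output shape, then run the convolution.
    int n, c, ho, wo;
    cudnnGetConvolution2dForwardOutputDim(cd, xd, fd, &n, &c, &ho, &wo);
    cudnnSetTensor4dDescriptor(yd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, ho, wo);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(h, &alpha, xd, d_x, fd, d_w, cd,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                            /*workspace=*/nullptr, /*workspaceSize=*/0,
                            &beta, yd, d_y);
}
```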
Using Caffe with cuDNN
• Accelerates Caffe layer types by 1.2–3x.
• On average, 36% faster overall for training AlexNet.
• Integrated into the Caffe dev branch.
  – Official release soon.
[Figure: overall AlexNet training time, baseline Caffe vs. Caffe accelerated by cuDNN on K40: Caffe (CPU*) 1x, Caffe (GPU) 11x, Caffe (cuDNN) 14x. *CPU is a 24-core E5-2697v2 @ 2.4 GHz with Intel MKL 11.1.3.]
Conclusion
• Deep Learning is increasingly important to AI
• HPC is key to Deep Learning
• Interested in applying your HPC skills to AI? Talk to us!