GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server

TRANSCRIPT

[Slide 1]
GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server
Henggang Cui
Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing
Parallel Data Laboratory, Carnegie Mellon University
[Slide 2]
Image classification w/ deep learning

[Diagram: a machine learning program reads/updates params; training data: images w/ labels; deep neural network: interconnected neurons; output classes: Eagle, Vulture, Accipiter, Osprey; model parameters: connection weights (the solution).]
[Slide 3]
Distributed deep learning

[Diagram: distributed ML workers train on partitioned training data (classifying Eagle, Vulture, Accipiter, Osprey) and read/update shared model parameters through a parameter server.]
[Slide 4]
Distributed deep learning

[Diagram: the same setup with distributed GPU ML workers reading/updating shared model parameters through a parameter server for GPUs.]
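To make the read/update pattern concrete, here is a minimal sketch of a client-side parameter-server interface. The class and method names are illustrative assumptions, not GeePS's actual API.

```cpp
#include <vector>

// Hypothetical client-side interface for a data-parallel parameter
// server. Each worker trains on its own data partition and shares
// one global copy of the model through Read/Update calls.
class ParamServerClient {
 public:
  // Fetch the current values of one parameter table
  // (e.g., one layer's connection weights).
  virtual void Read(int table_id, std::vector<float>* values) = 0;
  // Apply an additive update (e.g., scaled gradients) to a table.
  virtual void Update(int table_id, const std::vector<float>& delta) = 0;
  virtual ~ParamServerClient() = default;
};
```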
[Slide 5]
Outline
• Background
  • Deep learning with GPUs
  • Parallel ML using parameter servers
• GeePS: GPU-specialized parameter server
• Experiment results
[Slide 6]
A machine with no GPU

[Diagram: CPU cores and DRAM (CPU memory), connected to local storage and to the network through a NIC.]
[Slide 7]
A machine with a GPU device

[Diagram: the same machine plus a GPU device with GPU cores and its own GPU memory (a few GB).]
• Small GPU memory
• Expensive to copy between GPU/CPU memory
[Slide 8]
Single GPU machine learning

[Diagram: the input data file (training data) on the CPU side feeds staging memory for each input data batch; GPU memory holds the input data (a mini-batch of training data), the intermediate data, and the parameter data.]
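As a concrete rendering of this single-GPU data flow, here is a minimal sketch; the loader and model functions are hypothetical stubs, not Caffe's actual code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Stubs standing in for the data loader and the model (hypothetical).
void load_next_batch(float* cpu_staging) { /* read from input data file */ }
void forward(const float* gpu_input)     { /* launch forward-pass kernels */ }
void backward()                          { /* launch backward-pass kernels */ }

// One training step: stage a mini-batch in CPU memory, copy it into
// GPU memory, then compute. Intermediate and parameter data stay
// resident in GPU memory the whole time.
void train_one_batch(float* gpu_input, float* cpu_staging,
                     size_t batch_bytes) {
  load_next_batch(cpu_staging);
  cudaMemcpy(gpu_input, cpu_staging, batch_bytes, cudaMemcpyHostToDevice);
  forward(gpu_input);
  backward();
}
```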
[Slide 9]
Multi-GPU ML via CPU param. serv.

[Diagram: CPU memory on each machine holds a parameter cache and staging memory for the input data batch, synchronized with parameter server shard 0 over the network; GPU memory holds the input data, intermediate data, and a param working copy.]
1. Expensive CPU/GPU data transfer
2. Only works when data fits in GPU memory
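A minimal sketch of this pattern makes problem 1 visible: every mini-batch pays foreground CPU/GPU copies of the parameter data. The PS calls are hypothetical stubs, not a real API.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical PS client calls, stubbed for illustration.
void ps_read(float* cpu_param_cache)   { /* refresh cache from PS shards */ }
void ps_update(const float* cpu_grads) { /* push updates to PS shards */ }

// With a CPU-based PS, every iteration pays two foreground CPU/GPU
// copies on top of the PS communication (problem 1 above), and the
// whole working set must still fit in GPU memory (problem 2).
void train_one_batch_cpu_ps(float* gpu_working_copy, float* gpu_grads,
                            float* cpu_param_cache, float* cpu_grads,
                            size_t param_bytes) {
  ps_read(cpu_param_cache);
  cudaMemcpy(gpu_working_copy, cpu_param_cache, param_bytes,
             cudaMemcpyHostToDevice);    // foreground copy in
  // ... forward and backward passes produce gradients on the GPU ...
  cudaMemcpy(cpu_grads, gpu_grads, param_bytes,
             cudaMemcpyDeviceToHost);    // foreground copy out
  ps_update(cpu_grads);
}
```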
[Slide 10]
Outline
• Background
  • Deep learning with GPUs
  • Parallel ML using parameter servers
• GeePS: GPU-specialized parameter server
  • Maintaining the parameter cache in GPU memory
  • Batch access with GPU cores for higher throughput
  • Managing limited GPU device memory
• Experiment results
[Slide 11]
Multi-GPU ML via GeePS

[Diagram: GeePS keeps the parameter cache in GPU memory alongside the input data, intermediate data, and param working copy; CPU memory holds only the staging memory for the parameter cache and for the input data batch, synchronized with parameter server shard 0 over the network. CPU/GPU transfer happens in the background.]
• PS access through GPU memory
• Higher PS throughput
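The sketch below gives one illustrative shape for this design; the struct and method names are assumptions, not GeePS's real API.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative GPU-resident parameter cache with background
// CPU/GPU transfer (names are hypothetical).
struct GpuParamCache {
  float* device_params;      // parameter cache, lives in GPU memory
  float* cpu_staging;        // staging memory for parameter cache (CPU)
  size_t param_bytes;
  cudaStream_t xfer_stream;  // separate stream: transfers overlap compute

  // Workers access the PS through GPU memory: no foreground copy,
  // kernels consume the device pointer directly.
  float* Read() { return device_params; }

  // Background sync with the PS shards via CPU staging memory.
  void BackgroundSync() {
    cudaMemcpyAsync(cpu_staging, device_params, param_bytes,
                    cudaMemcpyDeviceToHost, xfer_stream);
    // ... exchange cpu_staging with parameter server shard(s) ...
    cudaMemcpyAsync(device_params, cpu_staging, param_bytes,
                    cudaMemcpyHostToDevice, xfer_stream);
  }
};
```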
[Slide 12]
Outline
• Background
• GeePS: GPU-specialized parameter server
  • Maintaining the parameter cache in GPU memory
  • Batch access with GPU cores for higher throughput
  • Managing limited GPU device memory
• Experiment results
[Slide 13]
Layer-by-layer computation for DNN
• For each iteration (mini-batch)
  • A forward pass
  • Then a backward pass
• Each time only data of two layers are used

[Diagram: a stack of layers from training images at the bottom to class probabilities at the top.]
[Slides 14–17 repeat the Slide 13 diagram, stepping the active pair of layers up the network through the forward pass and back down through the backward pass.]
[Slide 18]
Layer-by-layer computation for DNN
• For each iteration (mini-batch)
  • A forward pass
  • Then a backward pass
• Each time only data of two layers are used
• Use GPU memory as a cache to keep actively used data
• Store the remaining in CPU memory (see the sketch below)

[Diagram: the layer stack from training images to class probabilities.]
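Because only two layers are active at once, the per-iteration pass structure gives a natural cache schedule. A minimal sketch, with the hypothetical cache operations left as comments:

```cpp
#include <vector>

struct Layer { /* weights, activations, gradients, current location */ };

// Per-iteration pass structure: at each step only layers l-1 and l
// need to be resident in GPU memory, so everything else can live in
// CPU memory (cache operations are hypothetical, shown as comments).
void run_iteration(std::vector<Layer>& layers) {
  if (layers.size() < 2) return;
  for (size_t l = 1; l < layers.size(); ++l) {       // forward pass
    // ensure_in_gpu(layers[l - 1]); ensure_in_gpu(layers[l]);
    // forward_compute(layers[l - 1], layers[l]);
    // evict_to_cpu(layers[l - 1]);  // leaves the active pair
  }
  for (size_t l = layers.size() - 1; l >= 1; --l) {  // backward pass
    // ensure_in_gpu(layers[l]); ensure_in_gpu(layers[l - 1]);
    // backward_compute(layers[l], layers[l - 1]);
    // evict_to_cpu(layers[l]);
  }
}
```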
[Slide 19]
GPU memory management

[Diagram: input data, intermediate data, the param working copy, and the parameter cache in GPU memory, with CPU-side staging memories and parameter server shard 0 across the network; all of this GPU memory is owned by the app.]
[Slide 20]
GeePS-managed buffers

[Diagram: an access buffer pool joins the parameter cache, input data, and intermediate data in GPU memory; the buffer pool and parameter cache are GeePS-managed buffers, while the staging memories remain in CPU memory.]
[Slide 21]
GeePS manages local data also

[Diagram: the input data and intermediate data become GeePS-managed local data, sitting alongside the access buffer pool and parameter cache in GPU memory; CPU memory keeps the staging memories and the input data file (training data).]
[Slide 22]
Use CPU memory when data does not fit

[Diagram: GPU memory, now owned by GeePS, holds the access buffer pool, parameter cache, and local data; CPU memory holds the staging memories, the local data (CPU part), and the parameter cache (CPU part), with parameter server shard 0 across the network.]
[Slide 23]
Use CPU memory when data does not fit

[Diagram: same layout; the access buffer pool in GPU memory is sized at 2x the size of the largest layer.]
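The 2x sizing enables double buffering: while the GPU computes on one buffer, the other is refilled from CPU memory in the background. A minimal sketch under that assumption (the compute call is a hypothetical placeholder):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Two buffers, each as large as the largest layer, let us fetch
// layer l+1 from CPU memory while the GPU computes on layer l.
void process_layers(float* const cpu_layers[], const size_t layer_bytes[],
                    int num_layers, size_t max_layer_bytes) {
  cudaStream_t xfer, compute;
  cudaStreamCreate(&xfer);
  cudaStreamCreate(&compute);
  float* buf[2];
  cudaMalloc(reinterpret_cast<void**>(&buf[0]), max_layer_bytes);
  cudaMalloc(reinterpret_cast<void**>(&buf[1]), max_layer_bytes);

  if (num_layers > 0)
    cudaMemcpyAsync(buf[0], cpu_layers[0], layer_bytes[0],
                    cudaMemcpyHostToDevice, xfer);
  for (int l = 0; l < num_layers; ++l) {
    cudaStreamSynchronize(xfer);         // layer l is now resident
    if (l + 1 < num_layers) {
      cudaStreamSynchronize(compute);    // its buffer (held l-1) is free
      cudaMemcpyAsync(buf[(l + 1) % 2], cpu_layers[l + 1],
                      layer_bytes[l + 1], cudaMemcpyHostToDevice, xfer);
    }
    // compute_on(buf[l % 2], compute);  // hypothetical async kernels
  }
  cudaStreamSynchronize(compute);
  cudaFree(buf[0]); cudaFree(buf[1]);
  cudaStreamDestroy(xfer); cudaStreamDestroy(compute);
}
```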
[Slide 24]
Use CPU memory when data does not fit

[Diagram: same layout, now marking pinned local data and a pinned param cache kept in GPU memory, with the remaining local data (CPU part) and parameter cache (CPU part) in CPU memory.]
[Slide 25]
Outline
• Background
• GeePS: GPU-specialized parameter server
  • Maintaining the parameter cache in GPU memory
  • Batch access with GPU cores for higher throughput
  • Managing limited GPU device memory
• Experiment results
[Slide 26]
Experimental setups
• Cluster information
  • Tesla K20c GPUs with 5 GB GPU memory
• Dataset and model
  • ImageNet: 7 million training images in 22,000 classes
  • Model: AlexNet
    – 25 layers, 2.4 billion connections
    – total memory consumption 4.5 GB
[Slide 27]
System setups
• GeePS-Caffe setup
  • Caffe: single-machine GPU deep learning system
  • GeePS-Caffe: Caffe linked with GeePS
• Baselines
  • The original unmodified Caffe
  • Caffe linked with a CPU-based PS (IterStore [Cui SoCC'14])
[Slide 28]
Training throughput

[Figure: training throughput results.]
[Slide 29]
Training throughput
• GeePS scales close to linearly with more machines
  • with 16 machines, it runs 13x faster than Caffe
  • only 8% GPU stall time
[Slide 30]
Training throughput
• GeePS is much faster than a CPU-based PS
  • 2.6x higher throughput
  • reduces GPU stall time from 65% to 8%
[Slide 31]
More results in the paper
• Good scalability and convergence speed for
  • the GoogLeNet network
  • an RNN for video classification
• Handles problems larger than GPU memory
  • Only 27% reduction in throughput with a 35% memory budget
    – 3x bigger problems with little overhead
• Handles models as large as 20 GB
• Supports 4x longer videos for video classification
[Slide 32]
Conclusion
• GPU-specialized parameter server for GPU ML
  • 13x throughput speedup using 16 machines
  • 2x faster compared to a CPU-based PS
• Managing limited GPU memory
  • by managing GPU memory inside GeePS as a cache
  • efficiently handles problems larger than GPU memory
• Enables use of the data-parallel PS model
[Slide 33]
References
• [IterStore] H. Cui, A. Tumanov, J. Wei, L. Xu, W. Dai, J. Haber-Kucharsky, Q. Ho, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing. Exploiting iterative-ness for parallel ML computations. In ACM SoCC, 2014.
• [Caffe] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
• [ImageNet] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE CVPR, 2009.
• [ProjectAdam] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In USENIX OSDI, 2014.
[Slide 34]
Additional related work
• T. Chen, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
• H. Zhang, et al. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216, 2015.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
• J. Dean, et al. Large scale distributed deep networks. In NIPS, 2012.
• C. Szegedy, et al. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
• R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
• A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew. Deep learning with COTS HPC systems. In ICML, 2013.
[Slide 35]
Backup Slides
[Slide 36]
Interface to GeePS-managed buffer
• Read
  – Buffer "allocated" by GeePS
  – Data copied to buffer
• PostRead
  – Buffer reclaimed
• PreUpdate
  – Buffer "allocated" by GeePS
• Update
  – Updates applied to data
  – Buffer reclaimed
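A usage sketch of this bracketed access pattern, with stubbed, hypothetical signatures (the real GeePS interface differs):

```cpp
// Stubs that mirror the four calls above, so the access pattern is
// visible; bodies and signatures are illustrative only.
static float stub_pool[1024];                          // stand-in buffer pool
float* Read(int table_id)      { return stub_pool; }   // buffer "allocated", data copied in
void PostRead(float* buf)      {}                      // buffer reclaimed
float* PreUpdate(int table_id) { return stub_pool; }   // buffer "allocated"
void Update(int table_id, float* buf) {}               // updates applied, buffer reclaimed

// Typical per-layer usage: Read/PostRead bracket a parameter read,
// PreUpdate/Update bracket writing that layer's updates.
void example_layer_step(int weights_table) {
  float* w = Read(weights_table);
  // ... kernels consume w during forward/backward computation ...
  PostRead(w);
  float* upd = PreUpdate(weights_table);
  // ... kernels write this layer's updates into upd ...
  Update(weights_table, upd);
}
```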
[Slide 37]
Data placement policy
• Pin as much local data as possible in GPU memory
• Select the local data that causes peak usage

[Figure: GPU memory usage over an iteration, with the access buffer sized to the peak.]
[Slide 38]
Data placement policy
• Pin as much local data as possible in GPU memory
• Select the local data that causes peak usage

[Figure: pinning the intermediate states in GPU memory shrinks the required access buffer.]
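One plausible rendering of this policy as a greedy algorithm; this is a guess at its spirit, not GeePS's actual code, and the names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each local datum: its size and whether it is live at the
// iteration's peak-usage point.
struct LocalDatum { size_t bytes; bool live_at_peak; };

// Pick data to pin in GPU memory within the budget, preferring data
// that contributes to the peak (pinning it shrinks the access buffer
// the policy must reserve).
std::vector<size_t> choose_pinned(std::vector<LocalDatum> data,
                                  size_t gpu_budget_bytes) {
  std::vector<size_t> order(data.size());
  for (size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return data[a].live_at_peak > data[b].live_at_peak;  // peak data first
  });
  std::vector<size_t> pinned;
  size_t used = 0;
  for (size_t i : order) {
    if (used + data[i].bytes <= gpu_budget_bytes) {
      pinned.push_back(i);
      used += data[i].bytes;
    }
  }
  return pinned;
}
```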
[Slide 39]
Image classification accuracy
• To reach 10% classification accuracy:
  • 6x faster with 8 machines
  • 8x faster with 16 machines
[Slide 40]
Training throughput (more)

[Figure: additional training throughput results.]
[Slide 41]
Per-layer memory usage

[Figure: per-layer memory usage.]
[Slide 42]
Throughput vs. memory budget

[Figure: throughput as the GPU memory budget shrinks, from all data in GPU memory down to only the buffer pool in GPU memory (twice the peak size, for double buffering).]
• Only 27% reduction in throughput with 35% of the memory
• Can do 3x bigger problems with little overhead
[Slide 43]
Larger models
• Models up to 20 GB
[Slide 44]
Layer-by-layer computation for DNN (repeat of Slide 13, for reference)
[Slide 45]
Computation vs. stall times
• Even with slack 0, the updates of one layer can be sent to other machines before the updates of the other layers finish
• CPU-PS incurs significant overhead from transferring data between GPU/CPU memory in the foreground
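A minimal sketch of that per-layer overlap, assuming a hypothetical non-blocking push call:

```cpp
// As the backward pass finishes layer l, that layer's updates are
// pushed immediately (in the background) instead of waiting for the
// whole pass; push_updates_async() is a hypothetical stub.
void push_updates_async(int table_id) { /* enqueue for background send */ }

void backward_pass(int num_layers) {
  for (int l = num_layers - 1; l >= 0; --l) {
    // ... compute layer l's gradients on the GPU ...
    push_updates_async(l);  // overlap sending with layers l-1, l-2, ...
  }
}
```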
[Slide 46]
Computation vs. stall times (more)
• GeePS and CPU-PS: updates of each layer are sent in distinct batches
• Single-table: updates of all layers are sent in a single batch
[Slide 47]
Convergence with data staleness
• The data staleness sweet spot is slack 0
  • because the GPUs perform a huge amount of computation every clock
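For background, the slack condition referenced here is the standard stale-synchronous bound: a worker may run at most `slack` clocks ahead of the slowest worker, and slack 0 degenerates to fully synchronous execution. The slides do not show this mechanism; the sketch below is only that common definition.

```cpp
#include <algorithm>
#include <vector>

// A worker at clock my_clock may proceed only if the slowest worker
// is within `slack` clocks behind it (slack 0 = fully synchronous).
bool may_proceed(const std::vector<int>& worker_clocks,
                 int my_clock, int slack) {
  int slowest = *std::min_element(worker_clocks.begin(),
                                  worker_clocks.end());
  return my_clock - slowest <= slack;
}
```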