HetPipe: Enabling Large DNN Training on (Whimpy)
Heterogeneous GPU Clusters through Integration of
Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi
Contents
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
▪ Conclusion
Motivation
▪ DNN (Deep Neural Network) models continue to grow
• Need more powerful GPUs for training!
Motivation
▪ Short release cycle of new GPU architectures
• Use of heterogeneous GPUs is inevitable!
• What to do with whimpy (older, less capable) GPUs?
DNN Training
▪ Minibatch 𝒊 (training data) goes through forward pass 𝒊 to produce a prediction (e.g., "Cat?") and a loss; backward pass 𝒊 then computes the update 𝒖𝒊
▪ Weight parameter 𝒘 is updated as 𝒘𝒊+𝟏 = 𝒘𝒊 − 𝜼 ∙ 𝒖𝒊
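The per-minibatch update rule above can be sketched in a few lines of Python. This is a toy illustration, not HetPipe code: the quadratic loss and the learning rate are made up for the example.

```python
def sgd_step(w, u, lr):
    """One training step: w_{i+1} = w_i - lr * u_i,
    where u_i is the update (gradient) from minibatch i."""
    return [wi - lr * ui for wi, ui in zip(w, u)]

# Toy example: the gradient of the squared loss 0.5 * ||w||^2 is w itself,
# so each step with lr = 0.5 halves the weights.
w = [1.0, 2.0]
for i in range(3):              # three "minibatch" steps
    u = w                       # gradient for this step
    w = sgd_step(w, u, lr=0.5)
print(w)                        # -> [0.125, 0.25]
```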
Parallelizing DNN Training
▪ Model parallelism (MP)
• Used when a model exceeds a single GPU's memory (GPU memory limitation)
• Low GPU utilization
▪ Data parallelism (DP)
• Each of Worker 1 … Worker 𝒏 runs forward and backward passes on its own minibatch
• Weights synchronized through a Parameter Server (PS) or AllReduce
Parallelizing DNN Training
▪ Attempts to improve MP utilization
• Pipelined model parallelism (PMP): a PMP worker overlaps the forward and backward passes of different minibatches
• PipeDream [SOSP'19], GPipe [NIPS'19]
• Designed for homogeneous GPUs
• Designed for a single PMP worker
HetPipe in a Nutshell
▪ Integrates PMP + DP
▪ Supports heterogeneous GPUs
▪ VW (virtual worker): a group of multiple GPUs; each VW runs PMP internally
▪ Virtual Worker (VW) 1 … VW 𝒏 run DP, synchronizing through the Parameter Server
▪ WSP (Wave Synchronous Parallel): the parameter synchronization model
Challenges in Integrating PMP + DP on Heterogeneous GPUs
• What weight version should be used by each VW to synchronize with other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper
HetPipe Contributions
▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enables large DNN training on heterogeneous GPUs
• Aggregates heterogeneous resources
• Reduces the straggler problem
▪ Proof of WSP convergence
HetPipe Workflow
▪ Inputs: the DNN model and the cluster configuration
▪ Resource allocator: assigns k GPUs to each virtual worker
▪ Model partitioner: divides the model into k partitions (P1–P4 for VW 1, …, P1'–P4' for VW 𝒏)
▪ Each VW executes its partitions as a pipeline over time; VW 1 … VW 𝒏 synchronize through the PS
HetPipe Workflow
▪ The resource allocator assigns k GPUs to each virtual worker, and the model partitioner divides the model into k partitions
▪ Pipelined execution introduces staleness at two levels: local staleness within each VW's pipeline, and global staleness across VW 1 … VW 𝒏 through the PS
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
• Pipelined Model Parallelism Within a VW
• Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion
Pipelined Model Parallelism Within a VW
▪ Execution of a virtual worker: the model is split across GPU1–GPU4, and minibatches 1, 2, 3, … proceed through forward and backward passes in a pipelined manner over time
▪ 𝑵𝒎 minibatches are processed concurrently in the pipeline
▪ 𝑾𝒍𝒐𝒄𝒂𝒍 is a consistent version of weights within a VW
Pipelined Model Parallelism Within a VW
▪ Weight management procedure
• Minibatches 1–4 all start from the initial weight version: 𝑤𝑙𝑜𝑐𝑎𝑙 = 𝑤0 = 𝑤1 = 𝑤2 = 𝑤3 = 𝑤4
• When the backward pass of minibatch 1 completes, its update 𝒖𝟏 is applied: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏
• Minibatch 5 then starts from the updated weights: 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍
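The weight management procedure can be sketched as follows. This is illustrative Python; the class and method names are hypothetical, not HetPipe's API.

```python
class VirtualWorkerWeights:
    """Local weight management inside one virtual worker's pipeline."""

    def __init__(self, w0):
        self.w_local = w0        # consistent local weight version
        self.start_version = {}  # minibatch id -> weights it started from

    def start_minibatch(self, i):
        # A minibatch begins its forward pass with the current w_local.
        self.start_version[i] = self.w_local

    def finish_minibatch(self, i, u_i):
        # Backward pass of minibatch i done: fold its update into w_local.
        self.w_local += u_i

vw = VirtualWorkerWeights(w0=0.0)
for i in (1, 2, 3, 4):            # minibatches 1-4 all start from w0
    vw.start_minibatch(i)
vw.finish_minibatch(1, u_i=0.5)   # u1 applied: w_local = w0 + u1
vw.start_minibatch(5)             # w5 <- w_local; misses u2, u3, u4
print(vw.start_version[5])        # -> 0.5
```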
Pipelined Model Parallelism Within a VW
▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): the maximum number of missing updates
• After 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏 and 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍, 𝒘𝟓 misses the updates of minibatches 2 to 4
• Hence 𝑺𝒍𝒐𝒄𝒂𝒍 = 3
Pipelined Model Parallelism Within a VW
▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): the maximum number of missing updates
• When the backward pass of minibatch 2 completes, update 𝒖𝟐 is applied: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟐 (so 𝒘𝒍𝒐𝒄𝒂𝒍 = 𝒘𝟎 + 𝒖𝟏 + 𝒖𝟐), and minibatch 6 starts with 𝒘𝟔 ← 𝒘𝒍𝒐𝒄𝒂𝒍
• 𝒘𝟔 misses the updates of minibatches 3 to 5; again 𝑺𝒍𝒐𝒄𝒂𝒍 = 3
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
• Pipelined Model Parallelism Within a VW
• Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion
Data Parallelism with Multiple VWs
▪ Each VW (VW 1 … VW 𝒏) processes minibatches 1, 2, 3, 4, then 5, 6, 7, 8, … as its clock advances (0, 1, 2, …)
▪ Wave: a sequence of 𝑵𝒎 concurrently executing minibatches (e.g., wave 0 = minibatches 1–4, wave 1 = minibatches 5–8)
▪ VWs push updates to and pull weights from the Parameter Server, which holds 𝒘𝒈𝒍𝒐𝒃𝒂𝒍
Data Parallelism with Multiple VWs
▪ Push occurs every clock
• At the end of clock 0, a VW pushes the aggregated updates of wave 0: 𝑢 = 𝑢1 + 𝑢2 + 𝑢3 + 𝑢4
• The Parameter Server applies them: 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 ← 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 + 𝑢
• Minibatch 8 is blocked until synchronization completes
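The per-clock push can be sketched as a toy Python snippet; `push_wave` is a made-up helper, not the real parameter-server interface.

```python
def push_wave(w_global, wave_updates):
    """At the end of each clock, a VW pushes the aggregated updates of the
    wave it just finished; the PS folds them into w_global."""
    u = sum(wave_updates)   # u = u1 + u2 + u3 + u4
    return w_global + u     # w_global <- w_global + u

w_global = 0.0
wave0 = [1.0, 2.0, 3.0, 4.0]   # updates of minibatches 1-4 (wave 0)
w_global = push_wave(w_global, wave0)
print(w_global)                 # -> 10.0
```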
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
• If D = 0, a pull occurs every clock
• Example: VW 1 finishes wave 0 first and waits before pulling until VW 2 pushes
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• VW 2 pushes its aggregated updates 𝑢: 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 + 𝑢
• VW 1 waits before its pull until VW 2 has pushed
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• The pull occurs after all VWs have pushed: each VW sets 𝑤𝑙𝑜𝑐𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙
Data Parallelism with Multiple VWs
▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• Minibatch 8 then starts with 𝑤8 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)vw1,vw2, i.e., the initial weights plus wave 0's updates from both VW 1 and VW 2
Data Parallelism with Multiple VWs
▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍) and global staleness (𝑺𝒈𝒍𝒐𝒃𝒂𝒍) with WSP
• After wave 0 is synchronized, 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)vw1,vw2
• In VW 1, minibatch 11 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)vw1,vw2 + (𝑢5 + 𝑢6 + 𝑢7)vw1
• 𝑤11 misses (𝑢8 + 𝑢9 + 𝑢10)vw1 (local staleness, 𝑺𝒍𝒐𝒄𝒂𝒍) and (𝑢5 + 𝑢6 + 𝑢7)vw2 (global staleness, 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)
Data Parallelism with Multiple VWs
▪ The faster VW's minibatch 12 has to wait until the slower VW pushes its wave to the Parameter Server
Data Parallelism with Multiple VWs
▪ Example of clock distance threshold D
• If D = 1, VW 1 can start minibatch 8 without a pull, running up to one clock ahead of VW 2
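The role of D can be sketched as a simple check. This is illustrative; the function name is made up, and the real system tracks clocks through the parameter server.

```python
def can_start_next_clock(my_clock, slowest_clock, D):
    """A VW that has finished a clock may proceed without pulling as long
    as it is no more than D clocks ahead of the slowest VW."""
    return my_clock - slowest_clock <= D

# VW 1 has finished clock 0 (it is at clock 1); VW 2 is still at clock 0.
print(can_start_next_clock(1, 0, D=0))  # -> False: must wait for VW 2's push
print(can_start_next_clock(1, 0, D=1))  # -> True: can start minibatch 8 without a pull
```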
Data Parallelism with Multiple VWs
▪ Example of clock distance threshold D (here D = 1)
• In VW 1, minibatch 11 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)vw1
• 𝑤11 misses (𝑢8 + 𝑢9 + 𝑢10)vw1 (local staleness, 𝑺𝒍𝒐𝒄𝒂𝒍) and (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)vw2 (global staleness, 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)
• Minibatch 12 has to wait until VW 2 catches up to within D clocks
Outline
▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
• Setup
• Resource Allocation for Virtual Workers
• Results
▪ Conclusion
Evaluation Setup
▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
• Node 1: TITAN V (V0–V3), Node 2: TITAN RTX (R0–R3), Node 3: GeForce RTX 2060 (G0–G3), Node 4: Quadro P4000 (Q0–Q3)
• Computation power: V > R > G > Q; memory size: R > V > Q > G
▪ Two DNN models

| | ResNet-152 | VGG-19 |
|---|---|---|
| Dataset, minibatch size | ImageNet, 32 | ImageNet, 32 |
| Model parameter size | 230 MB | 548 MB |
| Characteristic | Large activation output | Large parameter size |
Resource Allocation for Virtual Workers: NP, ED, HD
▪ NP (Node Partition): each VW takes all four GPUs of one node (VW 1 = V0–V3, VW 2 = Q0–Q3, VW 3 = R0–R3, VW 4 = G0–G3)
• Minimum communication overhead within a VW
• Performance of each virtual worker varies; a straggler may degrade performance with DP
Resource Allocation for Virtual Workers: NP, ED, HD
▪ ED (Equal Distribution): each VW takes one GPU from every node (e.g., VW 1 = V0, Q0, R0, G0)
• Performance is the same across the VWs, mitigating the straggler problem
• High communication overhead within each VW
Resource Allocation for Virtual Workers: NP, ED, HD
▪ HD (Hybrid Distribution): mixes NP and ED, balancing computation power (V > R > G > Q) and memory size (R > V > Q > G) across VWs
• Mitigates the straggler problem
• Reduces communication overhead within each VW
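The NP and ED policies can be sketched as follows, as toy Python over the 4×4 cluster above; HD's exact mixing rule is in the paper, so it appears only as a comment.

```python
# Four nodes, four GPUs each (V = TITAN V, R = TITAN RTX,
# G = GeForce RTX 2060, Q = Quadro P4000).
nodes = [["V"] * 4, ["Q"] * 4, ["R"] * 4, ["G"] * 4]

# NP (Node Partition): each VW takes all GPUs of one node.
np_vws = [list(gpus) for gpus in nodes]

# ED (Equal Distribution): each VW takes one GPU from every node,
# so every VW has an identical GPU mix and identical performance.
ed_vws = [[node[i] for node in nodes] for i in range(4)]

# HD (Hybrid Distribution) mixes the two, trading straggler mitigation
# against intra-VW communication overhead.

print(np_vws[0])  # -> ['V', 'V', 'V', 'V']
print(ed_vws[0])  # -> ['V', 'Q', 'R', 'G']
```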
Parameter Placement
▪ Round-robin policy (default): the parameters of each layer are spread across the nodes in round-robin order (example shown with ED)
• Can be used with all three policies: NP, ED, and HD
Parameter Placement
▪ Local placement policy (ED-local)
• With ED and round-robin placement, parameter communication occurs across nodes
• Placing each layer's parameters on the node that processes that layer significantly reduces communication overhead
Compare Throughput with Horovod
▪ Baseline: Horovod, the state-of-the-art DP system using AllReduce
▪ HetPipe improves throughput by up to 1.4X on ResNet-152 and 1.8X on VGG-19
• ED reduces the straggler problem
• ED-local significantly reduces communication overhead
• For ResNet-152, the whole model is too large to be loaded into a single G-type GPU (batch size = 32)
Performance Improvement of Adding Whimpy GPUs
▪ Whimpy GPUs are added incrementally to the V GPUs: VVVV, then +QQQQ, +RRRR, +GGGG
• With the additional GPUs, HetPipe achieves up to a 2.3X speedup
• Additional whimpy systems allow for faster training
Convergence Results
▪ ResNet-152 (target accuracy: 74%)
• With 12 GPUs, HetPipe converges up to 39% faster: it reduces the straggler problem in the heterogeneous environment
• Adding four more whimpy G GPUs (16 GPUs), performance improves even more (7% faster)
Convergence Results
▪ VGG-19 (target accuracy: 67%)
• HetPipe (D = 0) is 29% faster than Horovod; with D = 4, up to 49% faster
• Higher global staleness (i.e., D = 32) can degrade convergence performance (4.7% slower)
Not Presented But Discussed in the Paper
▪ Convergence proof of WSP
▪ Partitioning algorithm
▪ Performance of a single virtual worker
▪ Comparison to PipeDream
Conclusion
▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs
▪ Integrates pipelined model parallelism with data parallelism
▪ Proposes a novel parameter synchronization model: WSP
▪ DNN models converge up to 49% faster with HetPipe
Thank you!