efficient intra-sm slicing for gpu...
TRANSCRIPT
Efficient Intra-SM Slicing for GPU MultiprogrammingQiumin Xu, EE / Dr. Annavaram
[email protected], [email protected]
Intra-SM Slicing vs. Inter-SM SlicingApp 1 App 2 App N
…
CPU + GPU Servers
P(i, Ti) : normalized IPC of application iK : # of applications sharing the SM.RTi : the resource requirement of Ti CTAsRtot : the total resources available in an SMN: maximum # of CTAs per SM
How to assign kernels todifferent GPU SMs?
Inter-SM Slicing Intra-SM Slicing
A A AA A A
A A A A
A
Even Partitioning
A
A
A
A
A
A
A
A A
A
FCFS Allocation
Fragmentation!
A A
A A
A A
Left-Over Allocation
No A Left
A A AA AA
A A AA AA
A A
B B
B B
B B
B
B B
B B
B
B
B
B B B
B
(a) (b) (c) (d)Warped-Slicer Re-partitioning for the Third Kernel
(e)
A A A A A A
B B B
CA A
1. Assign a sequentially increasing # of CTAs from each kernel2. Employ a sampling phase to measure the IPC of each SM
3. Calculate the best partitioning based on Equ. (1)4. Continue to assign CTAs to SMs based on the decision
(1)
Application Pairs
30Speedup
1.23xIncreased Util.
15%Fairness Improved
14%Hardware Overhead
~0
*Cooperative Thread Array (CTA), a.k.a. GPU thread blocks
Warped-SlicerA BA A A A B B B
Kernel Aware Thread Block Scheduler
Kernel A’s CTAs Kernel B’s CTAs
*Streaming Multiprocessors (SM), a.k.a. GPU shader cores
Naïve Brute force algorithm: Ο(NK )Proposed waterfillingalgorithm:
Ο(NK )i. identify kernel #2 hasthe MIN performance
ii. assign 1 CTAfrom kernel #2
3
SM 7CTA 3CTA 6CTA 8CTA 9
SM 6
CTA 5CTA 7
CTA 2
CTA 15
SM 5
CTA 4CTA 1
CTA 12
CTA 14
SM 4
CTA CTA 10
CTA 11
CTA 13
SM 3
CTA 9CTA 8
CTA 3CTA 6
SM 2
CTA 7
CTA 2CTA 5
CTA 15
SM 1
CTA 1CTA 4CTA 12
CTA 14
SM 0
CTA CTA 10
CTA 11
CTA 13
Kernel 1 Kernel 2Kernel Aware Thread-block Scheduler
SM 0 SM 1 SM 2 SM 3
SM 4 SM 5 SM 6 SM 7
Kernel 1 Kernel 2Kernel Aware Thread-block Scheduler
Sam
pler
Cas
e2: C
hoos
e Sp
atia
l M
ultit
aski
ng
4
CTA 1CTA 4CTA 12
CTA 14
CTA 10
CTA 11
CTA 13
CTA 0 CTA 1CTA 4CTA 12
CTA 14
CTA 2CTA 5CTA 7CTA 15
CTA 3CTA 6CTA 8CTA 9
CTA 0CTA 10
CTA 11
CTA 13
CTA 2CTA 5CTA 7CTA 15
CTA 3CTA 6CTA 8CTA 9
Cas
e1: C
hoos
e In
tra-
SM
Slic
ing
at (2
,2)
Sam
pler
CTA Scheduling for Intra-SM Slicing
Dynamic Resource Partitioning
Experimental Results