

Efficient Intra-SM Slicing for GPU Multiprogramming

Qiumin Xu, EE / Dr. Annavaram

[email protected], [email protected]

Source: minghsiehece.usc.edu/wp-content/uploads/2017/08/PM-Qiumin-Xu.pdf

Intra-SM Slicing vs. Inter-SM Slicing

[Figure: App 1, App 2, ... App N; CPU + GPU Servers]

P(i, T_i): normalized IPC of application i
K: # of applications sharing the SM
R_{T_i}: the resource requirement of T_i CTAs
R_tot: the total resources available in an SM
N: maximum # of CTAs per SM

How to assign kernels to different GPU SMs?

[Figure: Inter-SM Slicing vs. Intra-SM Slicing, showing how CTAs of kernels A, B, and C occupy an SM. Panels (a) to (e) carry the labels: Even Partitioning; FCFS Allocation ("Fragmentation!"); Left-Over Allocation ("No A Left"); Warped-Slicer; Re-partitioning for the Third Kernel.]

1. Assign a sequentially increasing # of CTAs from each kernel
2. Employ a sampling phase to measure the IPC of each SM
3. Calculate the best partitioning based on Equ. (1)
4. Continue to assign CTAs to SMs based on the decision
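Step 1 can be illustrated with a small sketch. It assumes the SMs are split evenly among the kernels during sampling and that the j-th SM of each group runs j+1 CTAs of its kernel; the function name sampling_plan and its parameters are hypothetical and do not come from the poster.

```python
def sampling_plan(num_sms, num_kernels, max_ctas):
    """Return {sm_id: (kernel_id, cta_count)} for a sampling phase.

    Assumption (not from the poster): SMs are divided evenly among the
    kernels, and within each group the CTA count grows by one per SM.
    """
    sms_per_kernel = num_sms // num_kernels
    plan = {}
    for kernel in range(num_kernels):
        for j in range(sms_per_kernel):
            sm = kernel * sms_per_kernel + j
            # The j-th SM of the group runs j+1 CTAs, capped at the per-SM maximum N.
            plan[sm] = (kernel, min(j + 1, max_ctas))
    return plan


if __name__ == "__main__":
    # 8 SMs shared by 2 kernels, at most 4 CTAs per SM.
    for sm, (kernel, ctas) in sorted(sampling_plan(8, 2, 4).items()):
        print(f"SM {sm}: kernel {kernel}, {ctas} CTA(s)")
```

Measuring the IPC of each SM under such a plan would give one point of P(i, T_i) per SM, which is the input the partitioning step needs.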

[Equ. (1): equation body not captured in this transcript]
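The body of Equ. (1) did not survive extraction. A form consistent with the symbol definitions (P(i, T_i), K, R_{T_i}, R_tot, N) and with the water-filling steps, which repeatedly raise the kernel with the MIN performance, would be a max-min objective under a resource constraint. This is a reconstruction under those assumptions, not the poster's verbatim equation; in particular, the bound on T_i is a guess at how N enters.

```latex
\max_{T_1,\dots,T_K} \; \min_{1 \le i \le K} P(i, T_i)
\quad \text{subject to} \quad
\sum_{i=1}^{K} R_{T_i} \le R_{tot},
\qquad 1 \le T_i \le N
\tag{1}
```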

Application Pairs: 30
Speedup: 1.23x
Increased Util.: 15%
Fairness Improved: 14%
Hardware Overhead: ~0

*Cooperative Thread Array (CTA), a.k.a. GPU thread blocks

[Figure: Warped-Slicer with a Kernel Aware Thread Block Scheduler; legend: Kernel A's CTAs, Kernel B's CTAs]

*Streaming Multiprocessors (SM), a.k.a. GPU shader cores

Naïve brute-force algorithm: O(N^K)
Proposed waterfilling algorithm: O(NK)
  i. Identify that kernel #2 has the MIN performance
  ii. Assign 1 CTA from kernel #2
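The two steps above, repeated until no more CTAs fit, can be sketched as follows and compared against the brute-force search. This is a minimal illustration, not the authors' implementation: the names water_fill and brute_force, the choice to start every kernel at one CTA, the single scalar resource cost per CTA, and the assumption that each sampled performance curve is non-decreasing are all hypothetical.

```python
from itertools import product


def water_fill(perf, cost, r_tot):
    """Greedy max-min partitioning, O(N*K) loop iterations.

    perf[i][t-1]: normalized IPC of kernel i when it runs t CTAs
                  (assumed non-decreasing in t).
    cost[i]:      resources one CTA of kernel i needs (assumed scalar).
    r_tot:        total resources available in the SM.
    """
    k = len(perf)
    alloc = [1] * k          # assumption: every kernel starts with one CTA
    used = sum(cost)
    if used > r_tot:
        raise ValueError("one CTA per kernel does not fit in the SM")
    while True:
        # Step i: find the kernel with the minimum performance.
        worst = min(range(k), key=lambda j: perf[j][alloc[j] - 1])
        # Stop when that kernel has no more samples or no more room.
        if alloc[worst] == len(perf[worst]) or used + cost[worst] > r_tot:
            return alloc
        # Step ii: assign one more CTA from that kernel.
        alloc[worst] += 1
        used += cost[worst]


def brute_force(perf, cost, r_tot):
    """Exhaustive search over all CTA combinations, O(N^K)."""
    best_value, best_alloc = -1.0, None
    for alloc in product(*(range(1, len(p) + 1) for p in perf)):
        if sum(c * t for c, t in zip(cost, alloc)) > r_tot:
            continue  # violates the resource constraint
        value = min(p[t - 1] for p, t in zip(perf, alloc))
        if value > best_value:
            best_value, best_alloc = value, list(alloc)
    return best_alloc


if __name__ == "__main__":
    perf = [[0.5, 0.6, 0.9], [0.55, 0.7]]   # made-up sample curves
    print(water_fill(perf, cost=[1, 1], r_tot=4))
    print(brute_force(perf, cost=[1, 1], r_tot=4))
```

With non-decreasing curves, every increment the greedy loop makes is one that any better max-min allocation would also need, which is why the sketch can stop as soon as the worst-off kernel cannot grow.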

[Figure: Kernel Aware Thread-block Scheduler with Sampler, dispatching CTA 0 to CTA 15 of Kernel 1 and Kernel 2 onto SM 0 to SM 7. Case 1: Choose Intra-SM Slicing at (2,2). Case 2: Choose Spatial Multitasking.]

CTA Scheduling for Intra-SM Slicing

Dynamic Resource Partitioning

Experimental Results