efficient intra-sm slicing for gpu...

Efficient Intra-SM Slicing for GPU MultiprogrammingQiumin Xu, EE / Dr. Annavaram

[email protected], [email protected]

Intra-SM Slicing vs. Inter-SM SlicingApp 1 App 2 App N

…

CPU + GPU Servers

P(i, Ti) : normalized IPC of application iK : # of applications sharing the SM.RTi : the resource requirement of Ti CTAsRtot : the total resources available in an SMN: maximum # of CTAs per SM

How to assign kernels todifferent GPU SMs?

Inter-SM Slicing Intra-SM Slicing

A A AA A A

A A A A

A

Even Partitioning

A

A

A

A

A

A

A

A A

A

FCFS Allocation

Fragmentation!

A A

A A

A A

Left-Over Allocation

No A Left

A A AA AA

A A AA AA

A A

B B

B B

B B

B

B B

B B

B

B

B

B B B

B

(a) (b) (c) (d)Warped-Slicer Re-partitioning for the Third Kernel

(e)

A A A A A A

B B B

CA A

1. Assign a sequentially increasing # of CTAs from each kernel2. Employ a sampling phase to measure the IPC of each SM

3. Calculate the best partitioning based on Equ. (1)4. Continue to assign CTAs to SMs based on the decision

(1)

Application Pairs

30Speedup

1.23xIncreased Util.

15%Fairness Improved

14%Hardware Overhead

~0

*Cooperative Thread Array (CTA), a.k.a. GPU thread blocks

Warped-SlicerA BA A A A B B B

Kernel Aware Thread Block Scheduler

Kernel A’s CTAs Kernel B’s CTAs

*Streaming Multiprocessors (SM), a.k.a. GPU shader cores

Naïve Brute force algorithm: Ο(NK )Proposed waterfillingalgorithm:

Ο(NK )i. identify kernel #2 hasthe MIN performance

ii. assign 1 CTAfrom kernel #2

3

SM 7CTA 3CTA 6CTA 8CTA 9

SM 6

CTA 5CTA 7

CTA 2

CTA 15

SM 5

CTA 4CTA 1

CTA 12

CTA 14

SM 4

CTA CTA 10

CTA 11

CTA 13

SM 3

CTA 9CTA 8

CTA 3CTA 6

SM 2

CTA 7

CTA 2CTA 5

CTA 15

SM 1

CTA 1CTA 4CTA 12

CTA 14

SM 0

CTA CTA 10

CTA 11

CTA 13

Kernel 1 Kernel 2Kernel Aware Thread-block Scheduler

SM 0 SM 1 SM 2 SM 3

SM 4 SM 5 SM 6 SM 7

Kernel 1 Kernel 2Kernel Aware Thread-block Scheduler

Sam

pler

Cas

e2: C

hoos

e Sp

atia

l M

ultit

aski

ng

4

CTA 1CTA 4CTA 12

CTA 14

CTA 10

CTA 11

CTA 13

CTA 0 CTA 1CTA 4CTA 12

CTA 14

CTA 2CTA 5CTA 7CTA 15


CTA 0CTA 10

CTA 11

CTA 13



Cas

e1: C

hoos

e In

tra-

SM

Slic

ing

at (2

,2)

Sam

pler

CTA Scheduling for Intra-SM Slicing

Dynamic Resource Partitioning

Experimental Results

efficient intra-sm slicing for gpu...

Documents