Simultaneous Branch and Warp Interweaving for Sustained GPU Performance
ISCA'39, Portland, OR
June 11, 2012
Nicolas Brunie, Kalray and ENS Lyon, [email protected]
Sylvain Collange, Univ. Federal de Minas Gerais, [email protected]
Gregory Diamos, NVIDIA Research, [email protected]
Mitigating GPU branch divergence cost
GPUs rely on SIMD execution
Serialization of divergent branches → resource underutilization
Contribution: a second instruction scheduler to improve utilization
[Figure: control-flow graph (blocks 1-6) with divergent paths; execution schedule under baseline SIMD vs. SBI+SWI]
Outline
GPU microarchitecture
SIMT model
The divergence problem
Simultaneous branch interweaving
Simultaneous warp interweaving
Context: GPU microarchitecture
Software: graphics shaders, OpenCL, CUDA...
Hardware: GPU
Architecture: multi-thread SPMD programming model
GPU microarchitecture
Hardware datapaths: SIMD execution units
kernel void scale(float a, float *X) {
  X[tid] = a * X[tid];
}
1 program, many threads
[Figure: four SIMD lanes, each pairing a register file (RF) with an ALU]
Single-Instruction Multi-Threading (SIMT)
Optimized for regular workloads
Implicit SIMD execution model:
Fetch 1 instruction for a warp of lockstep threads
Execute on SIMD units
[Figure: one fetch unit at PC=17 broadcasts an add to threads T0-T3 of a warp; all four lanes execute add]
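To make this concrete, here is a minimal sketch (illustrative only, not the paper's code) of one SIMT step: a single fetch at the warp's shared PC, executed by every lane in lockstep.

#include <array>
#include <cstdio>

constexpr int WARP_SIZE = 4;

// A warp: one shared PC, one register context per lane.
struct Warp {
    int pc = 17;                          // shared program counter
    std::array<float, WARP_SIZE> regs{};  // one value per lane, for illustration
};

// One SIMT step: fetch a single instruction at the warp's PC and
// execute it on every lane of the warp (lockstep execution).
void simt_step(Warp& w) {
    int fetched = w.pc;                   // one fetch for the whole warp
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        w.regs[lane] += 1.0f;             // e.g. the "add" at PC=17
    w.pc = fetched + 1;                   // all threads advance together
    std::printf("fetched @%d, executed on %d lanes\n", fetched, WARP_SIZE);
}

int main() {
    Warp w;
    simt_step(w);
}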
The control divergence problem
Control divergence: conflict for the shared fetch unit
Serialize execution paths
Efficiency loss

1: if (!(tid % 2)) {
2:   a+b;
3: } else {
4:   a*b;
5: }

[Figure: the two paths are serialized: fetch @2 issues add, executed by T0 and T2 (PC=2) while T1 and T3 sit idle; then fetch @4 issues mul, executed by T1 and T3 (PC=4) while T0 and T2 sit idle]
Outline
The GPU divergence problem
Simultaneous branch interweaving
Double instruction fetch
Finding branch-level parallelism
Restoring lockstep execution
Implementation
Simultaneous warp interweaving
Simultaneous Branch Interweaving
Add a second fetch unit
Simultaneous execution of divergent branches
[Figure: two fetch units working on the same warp: fetch @2 issues add to T0 and T2 (PC=2) while fetch @4 issues mul to T1 and T3 (PC=4), executing both sides of the branch from the previous example simultaneously]
Standard divergence control: mask stack
if (tid < 2) {
  if (tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}
Mask stack (1 activity bit per thread, tid = 0..3 left to right):

(start)              1111
push (if tid < 2)    1111 1100
push (if tid == 0)   1111 1100 1000
pop                  1111 1100
push (else)          1111 1100 0100
pop                  1111 1100
pop                  1111
Problem: does not expose branch-level parallelism
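A minimal sketch of stack-based divergence control for the example above (illustrative, not the hardware implementation); the output reproduces the stack evolution in the table, and shows why the two sides of each branch can only run one after the other.

#include <bitset>
#include <iostream>
#include <vector>

constexpr int WARP = 4;
// Activity mask, printed tid0..tid3 left to right as on the slide.
using Mask = std::bitset<WARP>;

int main() {
    std::vector<Mask> stack{Mask("1111")};  // all 4 threads active

    auto top  = [&] { return stack.back(); };
    auto show = [&](const char* op) {
        std::cout << op << ":";
        for (auto& m : stack) std::cout << ' ' << m;
        std::cout << '\n';
    };

    stack.push_back(top() & Mask("1100")); show("push");  // if (tid < 2)
    stack.push_back(top() & Mask("1000")); show("push");  // if (tid == 0)
    stack.pop_back();                      show("pop ");
    stack.push_back(top() & Mask("0100")); show("push");  // else branch
    stack.pop_back();                      show("pop ");
    stack.pop_back();                      show("pop ");
    // Only the top mask executes at any time: branch sides run serially.
}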
Alternative to stack: 1 PC / thread
One Program Counter (PC) per thread, selected by a Master PC (MPC):
a thread whose PC matches the MPC is active; no match means inactive.
Policy: MPC = min(PCi)
Earliest reconvergence with code laid out in Thread Frontiers order
Stack-based divergence control implies serialization
[Figure: per-thread PCs PC0-PC3 for tid = 0..3; the MPC matches a subset of them, e.g. activity mask 1000]
G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
if (tid < 2) {
  if (tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}
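A minimal sketch of the min-PC policy (illustrative; the PC values are borrowed from the context-table example later in the deck): the master PC is the minimum of the per-thread PCs, and a thread is active exactly when its PC matches it.

#include <algorithm>
#include <array>
#include <cstdio>

constexpr int WARP = 4;

// min-PC policy: MPC = min(PCi); thread i is active iff PCi == MPC.
// With code laid out in Thread Frontiers order, this reconverges earliest.
int master_pc(const std::array<int, WARP>& pc) {
    return *std::min_element(pc.begin(), pc.end());
}

int main() {
    std::array<int, WARP> pc = {12, 17, 3, 17};  // per-thread PCs
    int mpc = master_pc(pc);
    std::printf("MPC = %d, active mask = ", mpc);
    for (int i = 0; i < WARP; ++i)
        std::printf("%d", pc[i] == mpc ? 1 : 0);  // match -> active
    std::printf("\n");  // prints: MPC = 3, active mask = 0010
}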
Run two branches simultaneously
MPC1 = min(PCi)
MPC2 = min(PCi) over the threads with PCi ≠ MPC1
[Figure: per-thread PCs PC0-PC3 for tid = 0..3; Master PC 1 selects the earliest path, Master PC 2 the earliest remaining path]
if (tid < 2) {
  if (tid == 0) {
    x = 2;
  }
  else {
    x = 3;
  }
}
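A sketch of SBI's two-master-PC selection, following the formulas above (illustrative): MPC1 is the earliest path, MPC2 the earliest of the remaining paths, so both can issue in the same cycle whenever a secondary path exists.

#include <algorithm>
#include <array>
#include <cstdio>
#include <limits>

constexpr int WARP = 4;
constexpr int NONE = std::numeric_limits<int>::max();

// SBI scheduling policy:
//   MPC1 = min(PCi)
//   MPC2 = min of the PCi that differ from MPC1 (NONE if no secondary path)
void master_pcs(const std::array<int, WARP>& pc, int& mpc1, int& mpc2) {
    mpc1 = NONE;
    for (int p : pc) mpc1 = std::min(mpc1, p);
    mpc2 = NONE;
    for (int p : pc)
        if (p != mpc1) mpc2 = std::min(mpc2, p);
}

int main() {
    std::array<int, WARP> pc = {2, 4, 2, 4};  // divergent if/else example
    int mpc1, mpc2;
    master_pcs(pc, mpc1, mpc2);
    std::printf("MPC1 = %d, MPC2 = %d\n", mpc1, mpc2);  // MPC1 = 2, MPC2 = 4
    // Both fetch units issue: threads at MPC1 run add, threads at MPC2 run mul.
}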
Restoring lockstep execution
Issue: unbalanced paths break lockstep execution
Power consumption, loss of memory locality
Solution: implicit partial synchronization barrier
Greedy scheduling: instructions 6 and 7 are broken down and issued twice
Earliest reconvergence: synchronize before instruction 6
[Figure: control-flow graph (block 1 branching to 2, 3, 4, then 5, 6, 7, reconverging at 8, 9) with the resulting per-thread schedules for T0-T3 under both policies]
Enforcing control-flow reconvergence
Wait for any thread of the warp between PCdiv and PCrec
Annotate reconvergence points with a pointer to the immediate dominator
Example: T0 and T2 (at F) wait for T1 (in D); T3 (in B) can proceed in parallel.
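A sketch of the reconvergence wait condition (illustrative; the block addresses below are hypothetical and assume the code layout places the whole divergent region between PCdiv and PCrec): a thread that has reached the reconvergence point waits while any thread of the warp is still inside the region.

#include <array>

constexpr int WARP = 4;

// A thread sitting at the reconvergence point pc_rec must wait while any
// thread of the warp is still between pc_div and pc_rec, i.e. inside the
// divergent region (the reconvergence point is annotated with a pointer to
// its immediate dominator, which gives pc_div). Threads outside the region,
// like T3 in block B, are unaffected and proceed in parallel.
bool must_wait(const std::array<int, WARP>& pc, int pc_div, int pc_rec) {
    for (int p : pc)
        if (p >= pc_div && p < pc_rec)  // still inside the divergent region
            return true;
    return false;
}

int main() {
    // Hypothetical addresses: B=10, D=30, F=40, divergent region = [20, 40).
    // T0, T2 are at F; T1 is in D; T3 is in B.
    std::array<int, WARP> pc = {40, 30, 40, 10};
    bool wait = must_wait(pc, /*pc_div=*/20, /*pc_rec=*/40);  // true: T1 in D
    return wait ? 0 : 1;
}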
Implementation: context table
Common case: few different PCs
Order stable in time
Keep common PCs (CPCs) and activity masks in a sorted heap
Per-thread PCs (T0..T7): 12 17 3 17 17 3 3 17

Sorted context table:
CPC1 =  3, activity mask 00100110
CPC2 = 12, activity mask 10000000
CPC3 = 17, activity mask 01011001
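A sketch of building the sorted context table from the per-thread PCs of the example above (illustrative; the hardware keeps a sorted heap, modeled here with an ordered map): each distinct PC becomes a common PC (CPC) with an activity mask.

#include <cstdint>
#include <cstdio>
#include <map>

constexpr int WARP = 8;

int main() {
    // Per-thread PCs from the example: T0..T7.
    int pc[WARP] = {12, 17, 3, 17, 17, 3, 3, 17};

    // Context table: distinct common PCs (CPCs) with activity masks,
    // kept sorted by PC (std::map is ordered; hardware uses a sorted heap).
    std::map<int, uint8_t> ctx;
    for (int t = 0; t < WARP; ++t)
        ctx[pc[t]] |= uint8_t(1u << t);

    // Few distinct PCs in the common case, and the order is stable in time.
    for (auto [cpc, mask] : ctx) {
        std::printf("CPC %2d mask ", cpc);
        for (int t = 0; t < WARP; ++t)
            std::printf("%d", (mask >> t) & 1);  // printed T0..T7 left to right
        std::printf("\n");
    }
    // CPC  3 mask 00100110
    // CPC 12 mask 10000000
    // CPC 17 mask 01011001
}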
Two-level context table
Cache top 2 entries in the Hot Context Table register
Constant-time access to MPCi = CPCi and activity masks
Other entries in the Cold Context Table linked list
Branch → incremental insertion in CCT
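A sketch of the two-level organization (an illustrative data structure, not RTL): the two smallest-CPC entries live in the Hot Context Table register pair for constant-time access, while the remaining entries form the Cold Context Table list into which branches insert incrementally.

#include <cstdint>
#include <list>
#include <utility>

struct Entry { int cpc; uint8_t mask; };

// Two-level context table: the two head (smallest-PC) entries sit in a
// register pair (HCT) read every cycle to get MPC1/MPC2; the rest live in
// a sorted linked list (CCT) touched only when a branch creates an entry.
struct ContextTable {
    Entry hct[2] = {};      // HCT: entries for MPC1 and MPC2
    int valid = 0;          // how many HCT entries are in use
    std::list<Entry> cct;   // CCT: remaining entries, sorted by cpc

    // Insert a new (cpc, mask) produced by a branch, keeping order.
    void insert(Entry e) {
        // Displace HCT entries so the two smallest CPCs stay in the HCT.
        for (int i = 0; i < valid; ++i)
            if (e.cpc < hct[i].cpc) std::swap(e, hct[i]);
        if (valid < 2) { hct[valid++] = e; return; }
        // Incremental insertion into the sorted CCT.
        auto it = cct.begin();
        while (it != cct.end() && it->cpc < e.cpc) ++it;
        cct.insert(it, e);
    }
};

int main() {
    ContextTable t;
    t.insert({17, 0b01011001});
    t.insert({3,  0b00100110});
    t.insert({12, 0b10000000});
    // HCT now holds CPC 3 and CPC 12 (the two master PCs); CPC 17 is in the CCT.
    return t.hct[0].cpc;  // 3
}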
Outline
The GPU divergence problem
Simultaneous branch interweaving
Simultaneous warp interweaving
Idea
Dealing with lane conflicts
Implementation
Results
Simultaneous Warp Interweaving
SBI limitation: often no secondary path
Single-sided ifs, loops…
SWI: opportunistically schedule instructions from other warps in divergence gaps
“SMT for SIMD”
[Figure: warp 0 (T0-T3, PC=17) and warp 1 (T4-T7, PC=42) issue together: fetch @17 executes add on warp 0's active lanes while fetch @42 executes warp 1's mul in the remaining divergence gaps]
Using divergence correlations
Issue: unbalanced divergence introduces conflicts
e.g., parallel reduction
Solution: static lane shuffling
Apply different lane permutation for each warp
Preserves inter-thread memory locality
[Figure: lane activity over time for warps 0-3 during a parallel reduction. Without shuffling, warp 0 is never compatible with warp 2: both conflict in lane 0. With a different lane permutation per warp, their threads 0 are mapped to different physical lanes: no conflict]
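A sketch of static lane shuffling; the XOR-with-warp-ID permutation below is an assumption for illustration (any fixed per-warp permutation with the same decorrelating effect would do).

#include <cstdio>

constexpr int WARP = 4;

// Static lane shuffling: warp w maps thread t to physical lane t XOR w.
// The permutation is fixed per warp, so it preserves inter-thread memory
// locality while decorrelating activity patterns across warps.
int physical_lane(int warp_id, int thread) {
    return thread ^ (warp_id % WARP);
}

int main() {
    // In a parallel reduction, thread 0 of every warp stays active longest.
    // Without shuffling, warps 0 and 2 would always conflict in lane 0.
    for (int w = 0; w < 4; ++w)
        std::printf("warp %d: thread 0 -> lane %d\n", w, physical_lane(w, 0));
    // Thread 0 of warps 0..3 lands in lanes 0, 1, 2, 3: no systematic conflict.
}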
Detecting a compatible secondary warp
Bitset inclusion test: Content-Addressable Memory (CAM)
Treat zeros as don't-care bits
Power-hungry!
Set-associative lookup: split warps into sets
Restrict lookup to one set
More power-efficient
[Figure: a full CAM matches the primary warp's gap mask against every warp's activity mask; the set-associative lookup probes only the warps in the same set]
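A sketch of finding a compatible secondary warp (illustrative; the mapping of warps to sets by warp ID modulo the set count is an assumption): a candidate fits if its active lanes fall entirely inside the primary warp's idle lanes, and the set-associative variant probes only warps in the same set instead of a full CAM.

#include <cstdint>
#include <vector>

constexpr int NUM_SETS = 3;  // split the warp pool into sets (illustrative)

// Candidate warp is compatible if its active lanes are included in the
// primary warp's idle lanes: no physical lane is requested by both.
bool compatible(uint32_t primary_active, uint32_t candidate_active) {
    return (primary_active & candidate_active) == 0;
}

// Set-associative lookup: instead of a CAM over all warps (power-hungry),
// probe only the warps that belong to the same set as the primary warp.
int find_secondary(const std::vector<uint32_t>& active, int primary) {
    for (int w = 0; w < (int)active.size(); ++w) {
        if (w == primary) continue;
        if (w % NUM_SETS != primary % NUM_SETS) continue;  // restrict to 1 set
        if (compatible(active[primary], active[w])) return w;
    }
    return -1;  // no compatible warp in this set this cycle
}

int main() {
    // Activity masks (bit = physical lane) for a few resident warps.
    std::vector<uint32_t> active = {0b0101, 0b0011, 0b1111, 0b1010};
    return find_secondary(active, /*primary=*/0);  // warp 3 fills the gaps
}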
Set-associative lookup is good enough
3-way: captures 66% of performance potential
Direct-mapped: 48%
Experimental configuration
Baseline: clustered SIMT architecture (Fermi, Kepler)
Tie both clusters together to form warps twice as wide
Direct both instructions to the same execution units
Baseline: warp size 32, 2 warps/clock, 1 instruction/warp
SBI/SWI: warp size 64, 1 warp/clock, 2 instructions/warp
[Figure: baseline clusters 1 and 2 with separate warp pools vs. a common warp pool feeding both clusters]
Fetch-unit / execute-unit ratio maintained
Performance results
Speedup    Regular   Irregular
SBI        +15%      +41%
SWI        +25%      +33%
SBI+SWI    +23%      +40%
Perspective: SMT-GPU µarch convergence
Converging trends in SMT and GPU architecture
Closing the micro-architectural space between Clustered Multi-Threading and SIMD
Explore new tradeoffs between power efficiency and flexibility?
[Figure: design space between SMT and SIMD/SIMT. From the SMT side, merging instructions from concurrent threads: Fetch-combining [Kumar04], Thread Fusion [González08], MMT [Long10], MIS [Dechene10]. From the SIMD side, loosening the constraints of SIMD execution: DWF [Fung07], DWS [Meng10], BC [Fung11], LW [Narasiman11], TF [Diamos11], CAPRI [Rhu12], iGPU [Menon12], and SBI/SWI (this work). Axes: efficiency on regular multi-threaded applications vs. flexibility]
Backups
References
R. Kumar et al. Conjoined-core chip multiprocessing. MICRO 37, 2004.
J. González et al. Thread fusion. ISLPED 13, 2008.
W. W. L. Fung et al. Dynamic warp formation: efficient MIMD control flow on SIMD graphics hardware. TACO, 2009.
G. Long et al. Minimal multi-threading: finding and removing redundant instructions in multithreaded processors. MICRO 43, 2010.
M. Dechene et al. Multi-threaded instruction sharing. Technical report, 2010.
J. Meng et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 37, 2010.
G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.
W. Fung et al. Thread block compaction for efficient SIMT control flow. HPCA 17, 2011.
SWI implies SMT
Heterogeneous execution units
SWI improves utilization with superscalar execution
[Figure: heterogeneous units: warp 0 (T0-T3, PC=17) issues add to the ALU while warp 1 (T4-T7, PC=42) issues load to the LSU in the same cycle]
SBI vs. DWF
SBI: dual fetch
Uses branch-level parallelism
Sensitive to branch unbalance
Preserves in-warp locality
DWF: single fetch
Uses warp-level parallelism
Sensitive to lane activity unbalance
[Figure: SBI runs both sides of one warp's branch at once (add on T0, T2 at PC=2; mul on T1, T3 at PC=4); DWF regroups threads from warps T0-T15 that share PC=2 into one dynamically formed warp under a single fetch unit]
SWI vs. DWF
SWI: dual fetch
Uses warp-level parallelism
Sensitive to lane conflicts
Preserves in-warp locality
DWF: single fetch
Uses warp-level parallelism
Sensitive to lane activity unbalance and low occupancy
[Figure: SWI co-issues a secondary warp's mul into the primary warp's idle lanes across T0-T15; DWF instead compacts active threads from several warps into one warp]
Simulation platform
Barra: functional GPU simulator
modeled after NVIDIA Tesla GPUs
Runs native Tesla SASS binaries
Reproduces SIMT execution
Timing-power model
Cycle-accurate execution pipeline
Constant-latency, bandwidth-bound memory
Calibration from GPU microbenchmarks
http://gpgpu.univ-perp.fr/index.php/Barra
Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.
SBI scoreboarding logic
Keep track of dependencies induced by thread divergence-reconvergence
Transitive closure of dependency graph
Goto considered harmful?
Control instructions in some CPU and GPU instruction sets:

MIPS: j, jal, jr, syscall
AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s
Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

In GPU-specific instruction sets, control-flow structure is explicit: there is no support for arbitrary control flow.
Flynn's taxonomy revisited
Vary the number of each resource shared among the threads: 1, 2, or M (one per thread).
Resource types: instruction fetch (F); memory port / address (M); computation / registers / data (X).
Instruction fetch: SIMT (1), DIMT (2), MIMT (M)
Memory access: SAMT (1), DAMT (2), MAMT (M)
Computation: SDMT (1), DDMT (2), MDMT (M)
[Figure: threads T0-T3 attached to one, two, or many units of each resource type]
A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.
Examples: conventional design points
F MXMulti-core
MIMD(MAMT) F MX
F MX
GPU
SI(MDSA)MT
Short-vector SIMD
SIMD(SAST)
T0
T1
T2
X
F MX
X
T0
X
F MX
X
T0
T1
T2
MI MD MA MT
SIMD
SA ST
SIMD
SA MT