
Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

ISCA 39, Portland, OR

June 11, 2012

Nicolas Brunie, Kalray and ENS Lyon, nicolas.brunie@kalray.eu

Sylvain Collange, Univ. Federal de Minas Gerais, sylvain.collange@dcc.ufmg.br

Gregory Diamos, NVIDIA Research, gdiamos@nvidia.com

2/25

Mitigating GPU branch divergence cost

GPUs rely on SIMD execution

Serialization of divergent branches → resource underutilization

Contribution: a second instruction scheduler to improve utilization

[Figure: control-flow graph with blocks 1-6, executed under the baseline SIMD scheme vs. under SBI+SWI]

3/25

Outline

GPU microarchitecture

SIMT model

The divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving

4/25

Context: GPU microarchitecture

Software: graphics shaders, OpenCL, CUDA...

Hardware: GPU

Architecture: multi-thread SPMD programming model

GPU microarchitecture

Hardware datapaths: SIMD execution units

kernel void scale(float a, float *X) {
    X[tid] = a * X[tid];
}

1 program, many threads

[Figure: SIMD datapath with one register file (RF) and ALU per lane]

5/25

Single-Instruction Multi-Threading (SIMT)

Optimized for regular workloads


Implicit SIMD execution model

Fetch 1 instruction for a warp of lockstepping threads

Execute on SIMD units

[Figure: one fetch at address 17 issues add to a warp of threads T0-T3, all at PC=17, executing in lockstep on the SIMD units]

6/25

The control divergence problem


Control divergence: conflict for shared fetch unit

Serialize execution paths

[Figure: fetch at address 2 issues add; only the threads on that path execute it while the other lanes are masked off (nop)]

1: if (!(tid % 2)) {
2:   a + b;
3: } else {
4:   a * b;
5: }

Efficiency loss
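To make the loss concrete, here is a small C++ sketch (my own arithmetic, not from the slides) that counts lane utilization when the 4-thread warp above serializes the two sides of the branch:

#include <bitset>
#include <cstdio>

int main() {
    const int warp_size = 4;
    std::bitset<4> even_mask("0101");  // threads with !(tid % 2): execute a + b
    std::bitset<4> odd_mask("1010");   // remaining threads: execute a * b

    int issue_slots = 2 * warp_size;   // two serialized issues of a full warp
    int useful = (int)(even_mask.count() + odd_mask.count());
    std::printf("utilization: %d/%d lanes (50%%)\n", useful, issue_slots);
    return 0;
}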

7/25

The control divergence problem


[Figure: fetch at address 4 issues mul; only the threads on that path execute it while the other lanes are masked off (nop)]

1: if (!(tid % 2)) {
2:   a + b;
3: } else {
4:   a * b;
5: }

Control divergence: conflict for shared fetch unit

Serialize execution paths

Efficiency loss

8/25

Outline

The GPU divergence problem

Simultaneous branch interweaving

Double instruction fetch

Finding branch-level parallelism

Restoring lockstep execution

Implementation

Simultaneous warp interweaving

9/25

Simultaneous Branch Interweaving

Add a second fetch unit

Simultaneous execution of divergent branches

[Figure: two fetch units serve the same warp: a fetch at address 2 issues add to the threads at PC=2 while a fetch at address 4 issues mul to the threads at PC=4, in the same cycle]

1: if (!(tid % 2)) {
2:   a + b;
3: } else {
4:   a * b;
5: }


10/25

Standard divergence control: mask stack

if (tid < 2) {
  if (tid == 0) {
    x = 2;
  } else {
    x = 3;
  }
}

Code

Mask stack (1 activity bit per thread, tid = 0 1 2 3):

(start)  1111
push     1111 1100
push     1111 1100 1000
pop      1111 1100
push     1111 1100 0100
pop      1111 1100
pop      1111

Problem: does not expose branch-level parallelism
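A minimal C++ sketch of the push/pop sequence above (my own illustration, not the paper's hardware): the current mask lives in an "active" register, the stack holds the saved outer masks, and bit t stands for thread tid t.

#include <cstdio>
#include <vector>

static void show(const char *op, unsigned mask) {
    std::printf("%-6s ", op);
    for (int t = 0; t < 4; ++t)              // print tid 0..3, as on the slide
        std::printf("%u", (mask >> t) & 1u);
    std::printf("\n");
}

int main() {
    std::vector<unsigned> stack;             // saved outer masks
    unsigned active = 0b1111;                // all four threads active
    show("start", active);

    stack.push_back(active);                 // if (tid < 2)
    active &= 0b0011;                        // threads 0,1 stay active -> 1100
    show("push", active);

    stack.push_back(active);                 // if (tid == 0)
    active &= 0b0001;                        // thread 0 only -> 1000; runs x = 2
    show("push", active);

    active = stack.back(); stack.pop_back(); // end of then-branch
    show("pop", active);

    stack.push_back(active);                 // else
    active &= 0b0010;                        // thread 1 only -> 0100; runs x = 3
    show("push", active);

    active = stack.back(); stack.pop_back(); // end of else
    show("pop", active);
    active = stack.back(); stack.pop_back(); // end of outer if
    show("pop", active);
    return 0;
}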

11/25

Alternative to stack: 1 PC / thread

[Figure: the Master PC is compared against the per-thread PCs (tid = 0..3); a thread whose PC matches is active, the others are inactive]

Policy: MPC = min(PC_i)

Earliest reconvergence with code laid out in Thread Frontiers order

Stack-based divergence control implies serialization

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

if (tid < 2) {
  if (tid == 0) {
    x = 2;
  } else {
    x = 3;
  }
}

Code

12/25

Run two branches simultaneously

MPC1 = min(PC_i)
MPC2 = min(PC_i such that PC_i ≠ MPC1)

[Figure: per-thread PCs (tid = 0..3); Master PC 1 selects the minimum PC, Master PC 2 selects the minimum among the remaining PCs]

if (tid < 2) {
  if (tid == 0) {
    x = 2;
  } else {
    x = 3;
  }
}

Code
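A small sketch (my illustration of the policy above, with hypothetical PC values) of how the two master PCs and their activity masks could be computed from the per-thread PCs:

#include <algorithm>
#include <climits>
#include <cstdio>

int main() {
    // Hypothetical per-thread PCs after divergence: thread 0 on the then-path,
    // thread 1 on the else-path, threads 2 and 3 already past the branch.
    unsigned pc[4] = {2, 4, 5, 5};

    unsigned mpc1 = *std::min_element(pc, pc + 4);   // MPC1 = min(PC_i)
    unsigned mpc2 = UINT_MAX;                        // MPC2 = min over PC_i != MPC1
    for (unsigned p : pc)
        if (p != mpc1 && p < mpc2) mpc2 = p;

    unsigned mask1 = 0, mask2 = 0;                   // per-path activity masks
    for (int t = 0; t < 4; ++t) {
        if (pc[t] == mpc1) mask1 |= 1u << t;
        if (pc[t] == mpc2) mask2 |= 1u << t;
    }
    std::printf("MPC1=%u mask=%x  MPC2=%u mask=%x\n", mpc1, mask1, mpc2, mask2);
    return 0;
}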

13/25


Restoring lockstep execution

Issue: unbalanced paths break lockstep execution

Power consumption, loss of memory locality

Solution: implicit partial synchronization barrier

Greedy scheduling: instructions 6 and 7 are broken down and issued twice.

Earliest reconvergence: synchronize before instruction 6.

[Figure: control-flow graph with blocks 1-7 and the resulting per-thread schedules for T0-T3 under greedy scheduling vs. earliest reconvergence]

14/25

Enforcing control-flow reconvergence

T0 and T2 (at F) wait for T1 (in D).

T3 (in B) can proceed in parallel.

Wait for any thread of the warp between PC_div and PC_rec

Annotate reconvergence points with pointer to immediate dominator
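A minimal sketch of the rule above (my reading of it, with hypothetical addresses): threads stopped at the reconvergence point PC_rec may only proceed once no thread of the warp has a PC in [PC_div, PC_rec).

#include <cstdio>

// pc_div is the divergence point annotated at the reconvergence point pc_rec
// (its immediate dominator); warp threads parked at pc_rec wait while any
// thread is still inside the divergent region.
bool may_proceed(const unsigned pc[], int warp_size,
                 unsigned pc_div, unsigned pc_rec) {
    for (int t = 0; t < warp_size; ++t)
        if (pc[t] >= pc_div && pc[t] < pc_rec)
            return false;
    return true;
}

int main() {
    // Hypothetical layout: T0 and T2 sit at F (= pc_rec), T1 is still in D,
    // T3 executes an unrelated block B laid out past the region.
    unsigned pc[4] = {60, 40, 60, 90};
    unsigned pc_div = 30, pc_rec = 60;
    std::printf("%s\n", may_proceed(pc, 4, pc_div, pc_rec) ? "proceed" : "wait");
    return 0;
}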

15/25


Implementation: context table

Common case: few different PCs

Ordering is stable over time

Keep common PCs + activity masks in a sorted heap

Per-thread PCs (T0..T7): 12 17 3 17 17 3 3 17

Sorted context table (common PC, activity mask):
CPC1 = 17, mask 0 1 0 1 1 0 0 1
CPC2 = 3,  mask 0 0 1 0 0 1 1 0
CPC3 = 12, mask 1 0 0 0 0 0 0 0
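A minimal sketch (assumed data layout, not the paper's RTL) of building such a table from the per-thread PCs of the example above:

#include <cstdio>
#include <map>

int main() {
    unsigned pc[8] = {12, 17, 3, 17, 17, 3, 3, 17};   // per-thread PCs T0..T7

    std::map<unsigned, unsigned> table;               // common PC -> activity mask
    for (int t = 0; t < 8; ++t)
        table[pc[t]] |= 1u << t;

    // Iteration is ordered by PC value here; the hardware table keeps its own
    // priority order.
    for (const auto &e : table) {
        std::printf("CPC=%2u mask=", e.first);
        for (int t = 0; t < 8; ++t)
            std::printf("%u", (e.second >> t) & 1u);
        std::printf("\n");
    }
    return 0;
}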

17/25

Two-level context table

Cache top 2 entries in the Hot Context Table register

Constant-time access to MPC_i = CPC_i and activity masks

Other entries in the Cold Context Table linked list

Branch → incremental insertion in CCT
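A structural sketch of the two levels (names and layout are my assumptions; the slide only fixes the split into a 2-entry register and a linked list):

#include <cstdint>
#include <list>
#include <utility>

struct Entry { uint32_t cpc; uint32_t mask; };   // common PC + activity bits

struct TwoLevelContextTable {
    Entry hct[2] = {};            // Hot Context Table: constant-time MPC1/MPC2
    std::list<Entry> cct;         // Cold Context Table: remaining entries

    // On a branch, the newly created path's (PC, mask) is inserted incrementally.
    void spill(Entry e) { cct.push_back(e); }

    // Bring a cold entry back into a hot slot, e.g. when it becomes the
    // minimum PC after the current hot entry retires.
    void promote(std::list<Entry>::iterator it, int slot) { std::swap(hct[slot], *it); }
};

int main() {
    TwoLevelContextTable t;
    t.hct[0] = {17, 0b10011010};  // bit t = thread t; values from the slide's example
    t.hct[1] = {3,  0b01100100};
    t.spill({12, 0b00000001});
    return 0;
}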

18/25

Outline

The GPU divergence problem

Simultaneous branch interweaving

Simultaneous warp interweaving

Idea

Dealing with lane conflicts

Implementation

Results

19/25

Simultaneous Warp Interweaving

SBI limitation: often no secondary path

Single-sided ifs, loops…

SWI: opportunistically schedule instructions from other warps in divergence gaps

“SMT for SIMD”

[Figure: Warp 0 (threads T0-T3) fetches at address 17 and executes add; Warp 1 (threads T4-T7) fetches at address 42 and its mul fills the lanes Warp 0 leaves idle, in the same cycle]

20/25


Using divergence correlations

Issue: unbalanced divergence introduces conflicts

e.g. Parallel reduction

Solution: static lane shuffling

Apply a different lane permutation to each warp

Preserves inter-thread memory locality

[Figure: lane activity over time for warps 0-3 during a parallel reduction. Without shuffling, warp 0 always conflicts with warp 2 in lane 0; with a per-warp lane permutation, their threads 0 map to different physical lanes and the conflict disappears]
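A minimal sketch of static lane shuffling; the slide does not fix the actual permutation, so a simple per-warp rotation is assumed here purely for illustration:

#include <cstdio>

// Map logical lane 'lane' of warp 'warp_id' to a physical SIMD lane.
unsigned physical_lane(unsigned warp_id, unsigned lane, unsigned warp_size) {
    return (lane + warp_id) % warp_size;     // assumed permutation: rotate by warp id
}

int main() {
    const unsigned warp_size = 4;
    // Late in a parallel reduction only logical lane 0 of each warp is active;
    // without shuffling every warp would compete for physical lane 0.
    for (unsigned w = 0; w < 4; ++w)
        std::printf("warp %u: logical lane 0 -> physical lane %u\n",
                    w, physical_lane(w, 0, warp_size));
    return 0;
}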

21/25

Detecting a compatible secondary warp

Bitset inclusion test: content-addressable memory (CAM)

Treat zeros as don't care bits

Power-hungry!

[Figure: fully-associative lookup: the idle-lane mask of the primary warp is matched against the activity masks of all warps W0-W6, treating idle lanes as don't-care bits]

Set-associative lookup: split warps into sets

Restrict lookup to 1 set

More power-efficient

[Figure: set-associative lookup: the same match restricted to the warps in the primary warp's set]
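A minimal sketch of the lookup (function names and the set mapping are assumptions): a secondary warp is compatible when its active lanes fall entirely into the primary warp's idle lanes, and the set-associative variant only probes warps in the primary warp's set.

#include <cstdio>
#include <vector>

// No lane conflict: the secondary warp only uses lanes the primary leaves idle.
bool compatible(unsigned primary_active, unsigned secondary_active) {
    return (primary_active & secondary_active) == 0;
}

int find_secondary(unsigned primary_warp, unsigned primary_active,
                   const std::vector<unsigned> &warp_active, unsigned num_sets) {
    unsigned set = primary_warp % num_sets;          // assumed set mapping
    for (unsigned w = 0; w < warp_active.size(); ++w)
        if (w != primary_warp && w % num_sets == set &&
            compatible(primary_active, warp_active[w]))
            return (int)w;
    return -1;                                       // nothing compatible this cycle
}

int main() {
    std::vector<unsigned> active = {0b0110, 0b1001, 0b0110, 0b1001};  // per-warp masks
    // Warp 0 vs warp 2: both want lanes 1 and 2 -> never compatible, as on the slide.
    int w = find_secondary(0, active[0], active, 2);   // sets {0,2} and {1,3}
    std::printf("secondary warp for warp 0: %d\n", w); // prints -1
    return 0;
}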

22/25

Set-associative lookup is good enough

3-way: captures 66% of performance potential

Direct-mapped: 48%

23/25

Experimental configuration

Baseline: clustered SIMT architecture (Fermi, Kepler)

Tie both clusters together to form warps twice as wide

Direct both instructions to the same execution units

Baseline: warp size 32, 2 warps / clock, 1 instruction / warp

SBI/SWI: warp size 64, 1 warp / clock, 2 instructions / warp

[Figure: baseline with two clusters, each with its own warp pool for threads T0-T7, vs. a common warp pool shared by both clusters under SBI/SWI]

Fetch-unit / execute-unit ratio maintained

24/25

Performance results

Speedup over baseline:

          Regular   Irregular
SBI       +15%      +41%
SWI       +25%      +33%
SBI+SWI   +23%      +40%

25/25

...

Perspective: SMT-GPU µarch convergence

Converging trends in SMT and GPU architecture

Closing the micro-architectural space between Clustered Multi-Threading and SIMD

Explore new tradeoffs between power efficiency and flexibility?

[Figure: design space between SMT, SIMT, and SIMD, plotted as efficiency on regular multi-threaded applications vs. flexibility. Arrows: merge instructions from concurrent threads (from SMT) and loosen the constraints of SIMD execution (from SIMD). Design points: Fetch-combining [Kumar04], Thread Fusion [González08], MMT [Long10], MIS [Dechene10], DWF [Fung07], DWS [Meng10], BC [Fung11], LW [Narasiman11], TF [Diamos11], CAPRI [Rhu12], iGPU [Menon12], and SBI/SWI (this work)]

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

ISCA 39, Portland, OR

June 11, 2012

Nicolas Brunie, Kalray and ENS Lyon, nicolas.brunie@kalray.eu

Sylvain Collange, Univ. Federal de Minas Gerais, sylvain.collange@dcc.ufmg.br

Gregory Diamos, NVIDIA Research, gdiamos@nvidia.com

Backups

28/25

References

R. Kumar et al. Conjoined-core chip multiprocessing. MICRO 37, 2004.

J. González et al. Thread fusion. ISLPED 13, 2008.

W. W. L. Fung et al. Dynamic warp formation: efficient MIMD control flow on SIMD graphics hardware. TACO, 2009.

G. Long et al. Minimal multi-threading: finding and removing redundant instructions in multithreaded processors. MICRO 43, 2010.

M. Dechene et al. Multi-threaded instruction sharing. Technical report, 2010.

J. Meng et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance. ISCA 37, 2010.

G. Diamos et al. SIMD re-convergence at thread frontiers. MICRO 44, 2011.

W. Fung et al. Thread block compaction for efficient SIMT control flow. HPCA 17, 2011.

29/25

SWI implies SMT

Heterogeneous execution units

SWI improves utilization with superscalar execution

[Figure: Warp 0 (threads T0-T3, PC=17) issues add to the ALU while Warp 1 (threads T4-T7, PC=42) issues load to the LSU in the same cycle]

30/25

SBI vs. DWF

Dual fetch

Uses branch-level parallelism

Sensitive to branch imbalance

Preserves in-warp locality

[Figure: SBI: within one warp, one fetch unit issues add to the threads at PC=2 while a second fetch unit issues mul to the threads at PC=4, on complementary lanes]

Single fetch

Uses warp-level parallelism

Sensitive to lane activity imbalance

[Figure: DWF: a single fetch at PC=2; threads at that PC are regrouped from several warps (T0-T15) into a new warp that executes add]

31/25

SWI vs. DWF

Dual fetch

Uses warp-level parallelism

Sensitive to lane conflicts

Preserves in-warp locality

Single fetch

Uses warp-level parallelism

Sensitive to lane activity imbalance, low occupancy

[Figure: DWF regrouping threads from four warps (T0-T15) at PC=2 into one warp executing add, vs. SWI co-issuing add from one warp and mul from another onto complementary lanes]

32/25

Simulation platform

Barra: functional GPU simulator

modeled after NVIDIA Tesla GPUs

Runs native Tesla SASS binaries

Reproduces SIMT execution

Timing-power model

Cycle-accurate execution pipeline

Constant-latency, bandwidth-bound memory

Calibration from GPU microbenchmarks

http://gpgpu.univ-perp.fr/index.php/Barra

Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: a parallel functional simulator for GPGPU. MASCOTS 2010.

SBI scoreboarding logic

Keep track of dependencies induced by thread divergence-reconvergence

Transitive closure of dependency graph
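A minimal sketch (assumed bitset representation) of the transitive-closure step: close the direct-dependency bitsets so an instruction also waits on everything its predecessors wait on.

#include <cstdint>
#include <cstdio>

int main() {
    const int n = 4;
    // dep[i] has bit j set when instruction i directly depends on instruction j,
    // e.g. across a divergence/reconvergence point.
    uint32_t dep[n] = {0b0000, 0b0001, 0b0010, 0b0100};   // 1<-0, 2<-1, 3<-2

    for (int k = 0; k < n; ++k)        // Warshall-style closure over bitsets
        for (int i = 0; i < n; ++i)
            if (dep[i] & (1u << k))
                dep[i] |= dep[k];

    for (int i = 0; i < n; ++i)
        std::printf("instr %d waits on mask 0x%x\n", i, dep[i]);  // instr 3 -> 0x7
    return 0;
}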

34/25

Goto considered harmful?

MIPS: j, jal, jr, syscall

Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop

Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork

AMD Cayman (2011): push, push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after

AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue

NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s

NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmp, jmx, longjmp, pbk, pcnt, plongjmp, pret, ret, ssy, .s

Control instructions in some CPU and GPU instruction sets

Control flow structure is explicit

GPU-specific instruction sets

No support for arbitrary control flow

35/25

Flynn's taxonomy revisited

Resource type \ resource count      1       2       M
Instruction fetch                   SIMT    DIMT    MIMT
Memory port (address)               SAMT    DAMT    MAMT
Computation / registers (data)      SDMT    DDMT    MDMT

[Figure: for each taxonomy point, the shared or replicated resource (F, M, or X) serving threads T0-T3]

A. Glew. Coherent vector lane threading. Berkeley ParLab Seminar, 2009.

36/25

Examples: conventional design points

Multi-core: MIMD (MAMT)

GPU: SIMD (SAMT)

Short-vector SIMD: SIMD (SAST)

[Figure: each design point drawn with its fetch (F), memory (M), and execution (X) resources, either shared by or replicated across threads T0-T2]
