Idempotent Code Generation: Implementation, Analysis, and Evaluation


Idempotent Code Generation: Implementation, Analysis, and Evaluation

Marc de Kruijf, Karthikeyan Sankaralingam

CGO 2013, Shenzhen

Example

source code:

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Example

assembly code:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

[figure: faults, exceptions, and mis-speculations can interrupt execution at arbitrary points]

Example

assembly code:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

BAD STUFF HAPPENS!

R0 and R1 are unmodified

Example

assembly code (unchanged): just re-execute!

(the convention: use checkpoints/buffers)

It’s Idempotent!

idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}

Re-executing the function produces the same result: running it twice equals running it once.
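As an illustration (mine, not from the slides), a minimal C sketch contrasting an idempotent function with a non-idempotent one: the second overwrites its own input in place, so re-executing it after a fault changes the result.

#include <stdio.h>

/* Idempotent: only reads its inputs; re-execution always gives the same answer. */
int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

/* NOT idempotent: destroys its input in place, so executing it twice
   is not the same as executing it once. */
void scale_in_place(int *array, int len) {
  for (int i = 0; i < len; ++i)
    array[i] *= 2;
}

int main(void) {
  int a[3] = {1, 2, 3};
  printf("%d\n", sum(a, 3));   /* 6, no matter how many times sum() re-runs */
  scale_in_place(a, 3);        /* a = {2, 4, 6} */
  scale_in_place(a, 3);        /* re-execution changes the state again: a = {4, 8, 12} */
  printf("%d\n", sum(a, 3));   /* 24 */
  return 0;
}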

Idempotent Region Construction (previously… in PLDI ’12)

idempotent regions, ALL THE TIME

[figure: before/after view of the program partitioned into idempotent regions]

Idempotent Code Generation (now… in CGO ’13)

how do we get from here...

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Idempotent Code Generation (now… in CGO ’13)

to here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Idempotent Code Generation (now… in CGO ’13)

not here (this is not idempotent) ...

      R2 = load [R1]
      R1 = 0
LOOP: R4 = load [R0 + R2]
      R1 = add R1, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R1

(the live-in R1 is overwritten, so re-executing the region reads a corrupted input)

Idempotent Code Generation (now… in CGO ’13)

and not here (this is slow) ...

      R3 = R1
      R2 = load [R3]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Idempotent Code Generation (now… in CGO ’13)

here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

[figure: faults, exceptions, and mis-speculations interrupting execution]

Idempotent Code Generation (applications to prior work)

Hampton & Asanović, ICS ’06; De Kruijf & Sankaralingam, MICRO ’11; Menon et al., ISCA ’12
Kim et al., TOPLAS ’06; Zhang et al., ASPLOS ’13
De Kruijf et al., ISCA ’10; Feng et al., MICRO ’11; De Kruijf et al., PLDI ’12

Idempotent Code Generation (executive summary)

(1) how do we generate efficient idempotent code?
    (not covered in this talk; algorithms made available in source code form:
    http://research.cs.wisc.edu/vertical/iCompiler)

(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation

(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Analysis
(a) idempotent region size

[figure: overhead vs. region size curve]

as region size grows:
- number of inputs increasing; likelihood of spills growing
- eventually the maximum spill cost is reached and is amortized over more instructions
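A rough sketch of the amortization effect (my illustration, not from the paper): once the per-region spill cost saturates at some maximum, the relative overhead is simply that cost divided by the region's instruction count, so it shrinks as regions grow. The constant below is hypothetical.

#include <stdio.h>

/* Illustrative model only: assume each region eventually incurs at most S_MAX
   extra (spill-related) instructions; the overhead is S_MAX amortized over the
   region's N instructions. */
#define S_MAX 4.0

static double modeled_overhead(double n_instructions) {
  return S_MAX / n_instructions;
}

int main(void) {
  for (int n = 10; n <= 1000; n *= 10)
    printf("region size %4d instructions -> modeled overhead %.1f%%\n",
           n, 100.0 * modeled_overhead(n));
  return 0;
}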

Analysis
(b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 = R3     idempotent? YES!
    (see the C-level sketch below)

(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)

(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
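As a C-level analogy (my sketch, not from the slides): a destructive update mirrors the two-address form and clobbers the value it reads, so preserving that value costs an extra copy, which the three-address form provides for free by writing to a fresh destination.

#include <stdio.h>

/* Analogy only: 'a' stands for a region input held in a register. */
int three_address_style(int a, int b) {
  int z = a + b;     /* like ADD Ra, Rb = Rz: the input 'a' is preserved */
  return z - a;      /* 'a' is still available afterwards */
}

int two_address_style(int a, int b) {
  int saved = a;     /* extra copy, needed only because the next update clobbers 'a' */
  a = a + b;         /* like ADD Ra, Rb -> Ra: the original input is destroyed */
  return a - saved;  /* without the copy, the original 'a' would be gone */
}

int main(void) {
  printf("%d %d\n", three_address_style(5, 7), two_address_style(5, 7));  /* both print 7 */
  return 0;
}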

Analysis
(c) control flow side-effects

[figure: the code "x = ...; ... = f(x); y = ..." with region boundaries marked, showing x's live interval and x's "shadow interval" given no side-effects]

Analysis
(c) control flow side-effects

[figure: the same code and region boundaries, now showing x's live interval and x's "shadow interval" given side-effects]

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation
    (a) idempotent region size
    (b) ISA characteristics
    (c) control flow side-effects

Evaluation

methodology

measurements
 – performance overhead: dynamic instruction count
   – for x86, using Pin
   – for ARM, using gem5
 – region size: instructions between boundaries (path length)

benchmarks
 – SPEC 2006, PARSEC, and Parboil suites
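For concreteness, a small sketch (mine, not the paper's tooling) of how per-benchmark overhead and an overall geometric mean could be computed from dynamic instruction counts; the counts below are placeholders, not measured data.

#include <math.h>
#include <stdio.h>

/* Sketch only: overhead = (idempotent-code dynamic instructions / baseline) - 1,
   summarized with a geometric mean of the per-benchmark ratios. */
struct result { const char *name; double base_insns; double idem_insns; };

int main(void) {
  struct result results[] = {
    { "bench_a", 1.00e9, 1.12e9 },   /* hypothetical counts */
    { "bench_b", 2.50e9, 2.80e9 },
    { "bench_c", 0.80e9, 0.86e9 },
  };
  int n = sizeof(results) / sizeof(results[0]);
  double log_sum = 0.0;
  for (int i = 0; i < n; ++i) {
    double ratio = results[i].idem_insns / results[i].base_insns;
    printf("%s: %.1f%% overhead\n", results[i].name, 100.0 * (ratio - 1.0));
    log_sum += log(ratio);
  }
  double geomean = exp(log_sum / n);
  printf("geometric mean overhead: %.1f%%\n", 100.0 * (geomean - 1.0));
  return 0;
}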

Evaluation
(a) idempotent region size

[figure: overhead (0-50%) vs. region size; "YOU ARE HERE" marks the baseline, typically 10-30 instructions]

regions of 10+ instructions: 13.1% overhead (geometric mean)

Evaluation
(a) idempotent region size

[figure: the same overhead vs. region size plot, now with detection latency marked; the 13.1% baseline overhead is shown]

Evaluation
(a) idempotent region size

[figure: overhead vs. region size with detection latency and re-execution time marked; the annotated overheads are 13.1%, 11.1%, and 0.06%]

Evaluation
(b) ISA characteristics

[figure: percentage overhead (0-20%) for x86-64 vs. ARMv7 across SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL]

Three-address support matters more for FP benchmarks.
Register-memory matters more for integer benchmarks.

Evaluation
(c) control flow side-effects

[figure: percentage overhead (0-40%) with and without side-effect support across SPEC INT, SPEC FP, PARSEC, Parboil, namd, libquantum, and OVERALL]

substantial only in two cases; insubstantial otherwise
intuition: typically the compiler already spills for control flow divergence

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation


Conclusions

(a) region size – matters a lot; large regions are ideal if recovery is infrequent
    – overheads approach zero as regions grow
(b) instruction set – matters when region sizes must be small
    – overheads drop below 10% only with careful co-design
(c) control flow side-effects – generally does not matter
    – supporting control flow side-effects is not expensive

Conclusions

code generation and static analysis algorithms: http://research.cs.wisc.edu/vertical/iCompiler

applicability not limited to architecture design
(see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”)

thank you!

Back-up Slides

ISA Characteristics

more registers isn’t always enough

C code:

x = 0;
if (y > 0)
  x = 1;
z = x + y;

machine code:

R0 = 0
if (R1 > 0)
  R0 = 1
R2 = R0 + R1

ISA Characteristics

more registers isn’t always enough

C code:

x = 0;
if (y > 0)
  x = 1;
z = x + y;

machine code:

R0 = 0
if (R1 > 0)
  then: R3 = 1
  else: R3 = R0
R2 = R3 + R1

need an extra instruction no matter what

ISA Characteristics
idempotence vs. fewer registers

[figure: percentage overhead (0-14%) for 14-GPR, 12-GPR, and 10-GPR configurations (no idempotence, #GPRs reduced from 16) vs. the baseline; data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)]

Very Large Regions

how do we get there?

Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; hurts loops

Problem #2: loop optimizations – boundaries in loops are bad for everyone (next slides) – loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help (see the sketch after this list)

Problem #3: large array structures – awareness of array access patterns can help (next slides)

Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
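A small sketch of the loop-unrolling point (my illustration; the rationale is hedged, not from the slides): if code generation must place a region boundary inside the loop, unrolling means that boundary cost is paid once per several elements rather than once per element.

/* Original loop: any per-iteration boundary cost is paid once per element. */
void accumulate(int *a, int n, int (*foo)(int)) {
  for (int i = 0; i < n; ++i)
    a[i] += foo(i);
}

/* Unrolled by 4 (assumes n is a multiple of 4, for brevity): a per-iteration
   boundary cost is now amortized over four elements. */
void accumulate_unrolled(int *a, int n, int (*foo)(int)) {
  for (int i = 0; i < n; i += 4) {
    a[i]     += foo(i);
    a[i + 1] += foo(i + 1);
    a[i + 2] += foo(i + 2);
    a[i + 3] += foo(i + 3);
  }
}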

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

CFG + SSA:

i0 = φ(0, i1)
...
i1 = i0 + 1
if (i1 < X)

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

machine code:

R0 = 0
...
R0 = R0 + 1
if (R0 < X)

NO BOUNDARIES = NO PROBLEM

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

machine code, once a boundary is placed inside the loop:

R1 = 0
R0 = R1
...
R1 = R0 + 1
if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

Very Large Regions

Re: Problem #3 (array access patterns)

the PLDI ’12 algorithm makes this simplifying assumption:

[x] = a;          [x] = a;
b = [x];    =>    b = a;
[x] = c;          [x] = c;

non-clobber antidependences… GONE!

cheap for scalars, expensive for arrays

Very Large Regions

Re: Problem #3 (array access patterns)

not really practical for large arrays, but if we don’t do it, non-clobber antidependences remain

solution: handle potential non-clobbers in a post-pass
(the same way we deal with loop clobbers in static analysis)

// initialize:
int array[100];
memset(&array, 0, 100 * 4);
// accumulate:
for (...)
  array[i] += foo(i);
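One way to read this example as compilable C (my illustration): the read-modify-write of array[i] in the loop is an antidependence, but it is a non-clobber as long as the write that defines array[i] (the memset) sits in the same region, because re-execution from the region start first restores the zeros.

#include <string.h>

static int foo(int i) { return i * i; }   /* stand-in for the slide's foo() */

void accumulate(int out[100]) {
  int array[100];

  /* One region containing both the initialization and the loop: re-executing
     from here re-zeroes the array, so re-running the loop reproduces exactly
     the same final contents. */
  memset(array, 0, sizeof array);
  for (int i = 0; i < 100; ++i)
    array[i] += foo(i);     /* reads a value defined earlier in the same region */

  /* By contrast, if a region boundary sat between the memset and the loop, the
     loop alone would not be idempotent: re-executing it would add foo(i) on
     top of the already-accumulated values. */
  memcpy(out, array, sizeof array);
}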
