Idempotent Code Generation: Implementation, Analysis, and Evaluation


Idempotent Code Generation: Implementation, Analysis, and Evaluation

Marc de Kruijf, Karthikeyan Sankaralingam

CGO 2013, Shenzhen

Example

source code:

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Example

assembly code:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

[figure: faults, exceptions, and mis-speculations can interrupt execution at arbitrary points]

Example

assembly code:

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

BAD STUFF HAPPENS!

R0 and R1 are unmodified

Example

assembly code (unchanged): just re-execute!

(the convention: use checkpoints/buffers)

It’s Idempotent!

idempoh… what…?

int sum(int *data, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += data[i];
  return x;
}

Re-executing the function produces the same result: running it twice equals running it once.
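As an illustration (mine, not from the slides), a minimal C sketch contrasting an idempotent function with a non-idempotent one: the second overwrites its own input in place, so re-executing it after a fault changes the result.

#include <stdio.h>

/* Idempotent: only reads its inputs; re-execution always gives the same answer. */
int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

/* NOT idempotent: destroys its input in place, so executing it twice
   is not the same as executing it once. */
void scale_in_place(int *array, int len) {
  for (int i = 0; i < len; ++i)
    array[i] *= 2;
}

int main(void) {
  int a[3] = {1, 2, 3};
  printf("%d\n", sum(a, 3));   /* 6, no matter how many times sum() re-runs */
  scale_in_place(a, 3);        /* a = {2, 4, 6} */
  scale_in_place(a, 3);        /* re-execution changes the state again: a = {4, 8, 12} */
  printf("%d\n", sum(a, 3));   /* 24 */
  return 0;
}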

Idempotent Region Construction (previously… in PLDI ’12)

idempotent regions, ALL THE TIME

[figure: before/after view of the program partitioned into idempotent regions]

Idempotent Code Generation (now… in CGO ’13)

how do we get from here...

int sum(int *array, int len) {
  int x = 0;
  for (int i = 0; i < len; ++i)
    x += array[i];
  return x;
}

Idempotent Code Generation (now… in CGO ’13)

to here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Idempotent Code Generation (now… in CGO ’13)

not here (this is not idempotent) ...

      R2 = load [R1]
      R1 = 0
LOOP: R4 = load [R0 + R2]
      R1 = add R1, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R1

(the live-in R1 is overwritten, so re-executing the region reads a corrupted input)

Idempotent Code Generation (now… in CGO ’13)

and not here (this is slow) ...

      R3 = R1
      R2 = load [R3]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

Idempotent Code Generation (now… in CGO ’13)

here...

      R2 = load [R1]
      R3 = 0
LOOP: R4 = load [R0 + R2]
      R3 = add R3, R4
      R2 = sub R2, 1
      bnez R2, LOOP
EXIT: return R3

[figure: faults, exceptions, and mis-speculations interrupting execution]

Idempotent Code Generation (applications to prior work)

Hampton & Asanović, ICS ’06; De Kruijf & Sankaralingam, MICRO ’11; Menon et al., ISCA ’12
Kim et al., TOPLAS ’06; Zhang et al., ASPLOS ’13
De Kruijf et al., ISCA ’10; Feng et al., MICRO ’11; De Kruijf et al., PLDI ’12

Idempotent Code Generation (executive summary)

(1) how do we generate efficient idempotent code?
    (not covered in this talk; algorithms made available in source code form:
    http://research.cs.wisc.edu/vertical/iCompiler)

(2) how do external factors affect overhead?
    (a) idempotent region size
    (b) instruction set (ISA) characteristics
    (c) control flow side-effects
    each can affect overheads by 10% or more

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation

(a) idempotent region size
(b) ISA characteristics
(c) control flow side-effects

Analysis
(a) idempotent region size

[figure: overhead vs. region size curve]

as region size grows:
- number of inputs increasing; likelihood of spills growing
- eventually the maximum spill cost is reached and is amortized over more instructions
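A rough sketch of the amortization effect (my illustration, not from the paper): once the per-region spill cost saturates at some maximum, the relative overhead is simply that cost divided by the region's instruction count, so it shrinks as regions grow. The constant below is hypothetical.

#include <stdio.h>

/* Illustrative model only: assume each region eventually incurs at most S_MAX
   extra (spill-related) instructions; the overhead is S_MAX amortized over the
   region's N instructions. */
#define S_MAX 4.0

static double modeled_overhead(double n_instructions) {
  return S_MAX / n_instructions;
}

int main(void) {
  for (int n = 10; n <= 1000; n *= 10)
    printf("region size %4d instructions -> modeled overhead %.1f%%\n",
           n, 100.0 * modeled_overhead(n));
  return 0;
}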

Analysis
(b) ISA characteristics

(1) two-address (e.g. x86) vs. three-address (e.g. ARM)
    ADD R1, R2 -> R1    idempotent? NO!
    ADD R1, R2 = R3     idempotent? YES!
    (see the C-level sketch below)

(2) register-memory (e.g. x86) vs. register-register (e.g. ARM)
    for register-memory, register spills may be less costly (microarchitecture dependent)

(3) number of available registers
    impact is obvious, but… more registers is not always enough (see back-up slide)
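As a C-level analogy (my sketch, not from the slides): a destructive update mirrors the two-address form and clobbers the value it reads, so preserving that value costs an extra copy, which the three-address form provides for free by writing to a fresh destination.

#include <stdio.h>

/* Analogy only: 'a' stands for a region input held in a register. */
int three_address_style(int a, int b) {
  int z = a + b;     /* like ADD Ra, Rb = Rz: the input 'a' is preserved */
  return z - a;      /* 'a' is still available afterwards */
}

int two_address_style(int a, int b) {
  int saved = a;     /* extra copy, needed only because the next update clobbers 'a' */
  a = a + b;         /* like ADD Ra, Rb -> Ra: the original input is destroyed */
  return a - saved;  /* without the copy, the original 'a' would be gone */
}

int main(void) {
  printf("%d %d\n", three_address_style(5, 7), two_address_style(5, 7));  /* both print 7 */
  return 0;
}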

Analysis
(c) control flow side-effects

[figure: the code "x = ...; ... = f(x); y = ..." with region boundaries marked, showing x's live interval and x's "shadow interval" given no side-effects]

Analysis
(c) control flow side-effects

[figure: the same code and region boundaries, now showing x's live interval and x's "shadow interval" given side-effects]

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation
    (a) idempotent region size
    (b) ISA characteristics
    (c) control flow side-effects

Evaluation

methodology

measurements
 – performance overhead: dynamic instruction count
   – for x86, using Pin
   – for ARM, using gem5
 – region size: instructions between boundaries (path length)

benchmarks
 – SPEC 2006, PARSEC, and Parboil suites
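For concreteness, a small sketch (mine, not the paper's tooling) of how per-benchmark overhead and an overall geometric mean could be computed from dynamic instruction counts; the counts below are placeholders, not measured data.

#include <math.h>
#include <stdio.h>

/* Sketch only: overhead = (idempotent-code dynamic instructions / baseline) - 1,
   summarized with a geometric mean of the per-benchmark ratios. */
struct result { const char *name; double base_insns; double idem_insns; };

int main(void) {
  struct result results[] = {
    { "bench_a", 1.00e9, 1.12e9 },   /* hypothetical counts */
    { "bench_b", 2.50e9, 2.80e9 },
    { "bench_c", 0.80e9, 0.86e9 },
  };
  int n = sizeof(results) / sizeof(results[0]);
  double log_sum = 0.0;
  for (int i = 0; i < n; ++i) {
    double ratio = results[i].idem_insns / results[i].base_insns;
    printf("%s: %.1f%% overhead\n", results[i].name, 100.0 * (ratio - 1.0));
    log_sum += log(ratio);
  }
  double geomean = exp(log_sum / n);
  printf("geometric mean overhead: %.1f%%\n", 100.0 * (geomean - 1.0));
  return 0;
}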

Evaluation
(a) idempotent region size

[figure: overhead (0-50%) vs. region size; "YOU ARE HERE" marks the baseline, typically 10-30 instructions]

regions of 10+ instructions: 13.1% overhead (geometric mean)

Evaluation
(a) idempotent region size

[figure: the same overhead vs. region size plot, now with detection latency marked; the 13.1% baseline overhead is shown]

Evaluation
(a) idempotent region size

[figure: overhead vs. region size with detection latency and re-execution time marked; the annotated overheads are 13.1%, 11.1%, and 0.06%]

Evaluation
(b) ISA characteristics

[figure: percentage overhead (0-20%) for x86-64 vs. ARMv7 across SPEC INT, SPEC FP, PARSEC, Parboil, and OVERALL]

Three-address support matters more for FP benchmarks.
Register-memory matters more for integer benchmarks.

Evaluation
(c) control flow side-effects

[figure: percentage overhead (0-40%) with and without side-effect support across SPEC INT, SPEC FP, PARSEC, Parboil, namd, libquantum, and OVERALL]

substantial only in two cases; insubstantial otherwise
intuition: typically the compiler already spills for control flow divergence

Presentation Overview

❶ Introduction

❷ Analysis

❸ Evaluation


Conclusions

(a) region size – matters a lot; large regions are ideal if recovery is infrequent
    – overheads approach zero as regions grow
(b) instruction set – matters when region sizes must be small
    – overheads drop below 10% only with careful co-design
(c) control flow side-effects – generally does not matter
    – supporting control flow side-effects is not expensive

Conclusions

code generation and static analysis algorithms: http://research.cs.wisc.edu/vertical/iCompiler

applicability not limited to architecture design
(see Zhang et al., ASPLOS ’13: “ConAir: Featherweight Concurrency Bug Recovery [...]”)

thank you!

Back-up Slides

ISA Characteristics

more registers isn’t always enough

C code:

x = 0;
if (y > 0)
  x = 1;
z = x + y;

machine code:

R0 = 0
if (R1 > 0)
  R0 = 1
R2 = R0 + R1

ISA Characteristics

more registers isn’t always enough

C code:

x = 0;
if (y > 0)
  x = 1;
z = x + y;

machine code:

R0 = 0
if (R1 > 0)
  then: R3 = 1
  else: R3 = R0
R2 = R3 + R1

need an extra instruction no matter what

ISA Characteristics
idempotence vs. fewer registers

[figure: percentage overhead (0-14%) for 14-GPR, 12-GPR, and 10-GPR configurations (no idempotence, #GPRs reduced from 16) vs. the baseline; data from SPEC INT only (SPEC INT uses General Purpose Registers (GPRs) only)]

Very Large Regions

how do we get there?

Problem #1: aliasing analysis – no flow-sensitive analysis in LLVM; hurts loops

Problem #2: loop optimizations – boundaries in loops are bad for everyone (next slides) – loop blocking, fission/fusion, interchange, peeling, unrolling, scalarization, etc. can all help (see the sketch after this list)

Problem #3: large array structures – awareness of array access patterns can help (next slides)

Problem #4: intra-procedural scope – limited scope aggravates all effects listed above
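A small sketch of the loop-unrolling point (my illustration; the rationale is hedged, not from the slides): if code generation must place a region boundary inside the loop, unrolling means that boundary cost is paid once per several elements rather than once per element.

/* Original loop: any per-iteration boundary cost is paid once per element. */
void accumulate(int *a, int n, int (*foo)(int)) {
  for (int i = 0; i < n; ++i)
    a[i] += foo(i);
}

/* Unrolled by 4 (assumes n is a multiple of 4, for brevity): a per-iteration
   boundary cost is now amortized over four elements. */
void accumulate_unrolled(int *a, int n, int (*foo)(int)) {
  for (int i = 0; i < n; i += 4) {
    a[i]     += foo(i);
    a[i + 1] += foo(i + 1);
    a[i + 2] += foo(i + 2);
    a[i + 3] += foo(i + 3);
  }
}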

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

CFG + SSA:

i0 = φ(0, i1)
...
i1 = i0 + 1
if (i1 < X)

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

machine code:

R0 = 0
...
R0 = R0 + 1
if (R0 < X)

NO BOUNDARIES = NO PROBLEM

Very Large Regions

Re: Problem #2 (cuts in loops are bad)

C code:

for (i = 0; i < X; i++) {
  ...
}

machine code, once a boundary is placed inside the loop:

R1 = 0
R0 = R1
...
R1 = R0 + 1
if (R1 < X)

– “redundant” copy
– extra boundary (pressure)

Very Large Regions

Re: Problem #3 (array access patterns)

the PLDI ’12 algorithm makes this simplifying assumption:

[x] = a;          [x] = a;
b = [x];    =>    b = a;
[x] = c;          [x] = c;

non-clobber antidependences… GONE!

cheap for scalars, expensive for arrays

Very Large Regions

Re: Problem #3 (array access patterns)

not really practical for large arrays, but if we don’t do it, non-clobber antidependences remain

solution: handle potential non-clobbers in a post-pass
(the same way we deal with loop clobbers in static analysis)

// initialize:
int array[100];
memset(&array, 0, 100 * 4);
// accumulate:
for (...)
  array[i] += foo(i);
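One way to read this example as compilable C (my illustration): the read-modify-write of array[i] in the loop is an antidependence, but it is a non-clobber as long as the write that defines array[i] (the memset) sits in the same region, because re-execution from the region start first restores the zeros.

#include <string.h>

static int foo(int i) { return i * i; }   /* stand-in for the slide's foo() */

void accumulate(int out[100]) {
  int array[100];

  /* One region containing both the initialization and the loop: re-executing
     from here re-zeroes the array, so re-running the loop reproduces exactly
     the same final contents. */
  memset(array, 0, sizeof array);
  for (int i = 0; i < 100; ++i)
    array[i] += foo(i);     /* reads a value defined earlier in the same region */

  /* By contrast, if a region boundary sat between the memset and the loop, the
     loop alone would not be idempotent: re-executing it would add foo(i) on
     top of the already-accumulated values. */
  memcpy(out, array, sizeof array);
}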
