CERE: LLVM-based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization
Chadi Akel, P. de Oliveira Castro, M. Popov, E. Petit, W. Jalby
University of Versailles – Exascale Computing Research
EuroLLVM 2016
Motivation

- Finding the best application parameters is a costly, iterative process.

CERE

- Codelet Extractor and REplayer
- Breaks an application into standalone codelets
- Makes costly analysis affordable:
  - focus on single regions instead of whole applications
  - run a single representative obtained by clustering similar codelets
Codelet Extraction

- Extract codelets as standalone microbenchmarks.

A region of a C or Fortran application, for example

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

is extracted (region + IR) into a codelet wrapper, together with a capture of the machine state (state dump). Codelets can then be recompiled and run independently.
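To make the wrapper concrete, here is a minimal sketch of what a generated replay harness could look like (an assumption: the helper cere_load_dump() and the outlined symbol __cere__codelet are hypothetical placeholders, not CERE's actual interface):

    /* Minimal replay-wrapper sketch; simplified, not generated by CERE. */
    #include <stdio.h>

    /* outlined region, compiled from the captured LLVM IR */
    extern void __cere__codelet(void);
    /* hypothetical helper: maps the dumped pages back at their original addresses */
    extern void cere_load_dump(const char *dump_dir);

    int main(void) {
        cere_load_dump("dumps/codelet/1");   /* restore the memory state dump */
        __cere__codelet();                   /* replay the captured region    */
        return 0;
    }

Because the wrapper depends only on the dump and the outlined IR, the codelet can be rebuilt with different compilers, flags, or target architectures without touching the original application.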
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline

- Extracting and Replaying Codelets: Faithful, Retargetable
- Applications: Architecture selection, Compiler flags tuning, Scalability prediction
- Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT
- Conclusion
Capturing codelets at Intermediate Representation

- Faithful: behaves similarly to the original region.
- Retargetable: runtime and compilation parameters can be modified.

  Assembly-level capture               Source-level capture
  same assembly                        ≠ assembly [Akel et al. 2013]
  hard to retarget (compiler, ISA)     easy to retarget
  costly: support various ISAs         costly: support various languages

- The LLVM Intermediate Representation is a good tradeoff between the two.
Faithful capture

- Required for a semantically accurate replay:
  - register state
  - memory state
  - OS state: locks, file descriptors, sockets
- No support for OS state except for locks. CERE captures fully from userland: no kernel modules required.
- Required for a performance-accurate replay:
  - preserve code generation
  - cache state
  - NUMA ownership
  - other warmup state (e.g. branch predictor)
Faithful capture: memory

Capture accesses at page granularity: coarse but fast. Before the region to capture runs, CERE protects the static and currently allocated process memory (/proc/self/maps) and intercepts the memory allocation functions with LD_PRELOAD.

- On a memory allocation (e.g. a = malloc(256)): 1. allocate the memory, 2. protect it and return to the user program.
- On a memory access (e.g. a[i]++), the segmentation fault handler: 1. dumps the accessed memory to disk, 2. unlocks the accessed page and returns to the user program.

- Small dump footprint: only touched pages are saved.
- Warmup cache at replay: trace of the most recently touched pages.
- NUMA: detect the first touch of each page.
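As a concrete illustration of this mechanism, here is a self-contained sketch (an assumption: heavily simplified, not CERE's code) that protects a region, dumps each page on its first touch from the SIGSEGV handler, and lets the program continue:

    /* Page-granularity capture sketch: protect memory, dump pages on first touch. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static FILE *dump_file;                 /* where touched pages are saved */

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        /* unlock the page first so that its contents can be read and the access replayed */
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
        /* dump address + contents of the touched page (fwrite is not
         * async-signal-safe; kept simple for the sketch) */
        fwrite(&page, sizeof page, 1, dump_file);
        fwrite(page, 1, pagesz, dump_file);
        (void)sig; (void)ctx;
    }

    int main(void) {
        long pagesz = sysconf(_SC_PAGESIZE);
        dump_file = fopen("capture.dump", "wb");

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* stand-in for "protect the currently allocated process memory" */
        char *a = mmap(NULL, 4 * pagesz, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        a[0] += 1;          /* faults: page 0 is dumped, then unlocked */
        a[pagesz] += 1;     /* faults: page 1 is dumped, then unlocked */
        a[1] += 1;          /* no fault: page 0 was already unlocked   */

        fclose(dump_file);
        return 0;
    }

The warmup page trace and the NUMA first-touch detection mentioned above come from the same page-fault information.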
Selecting Representative Codelets

- Key idea: applications have redundancies:
  - the same codelet is called multiple times
  - codelets share similar performance signatures
- Detect the redundancies and keep only one representative.

Figure: execution trace (cycles per invocation) of the SPEC tonto codelet make_ft (shell2.F90, line 1133), with the replayed invocations highlighted. 90% of NAS codelets can be reduced to four or fewer representatives.
Performance Signature Clustering

Step A: perform static and dynamic profiling (MAQAO, Likwid) on a reference architecture to capture each codelet's feature vector (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: using the proximity between feature vectors, cluster similar codelets (for example across BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets; measuring them on the target machines (Atom, Core 2, Sandy Bridge) feeds a model that extrapolates the full benchmark results.

[Oliveira Castro et al. 2014]
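The slides do not spell out the Step C model; one plausible extrapolation (an assumption, with w_c the fraction of application time spent in cluster c on the reference machine, r_c the cluster representative, and s_c its measured replay speedup on the target) is a coverage-weighted harmonic mean:

    S_{\mathrm{pred}} = \Bigl(\sum_{c \in \mathrm{clusters}} \frac{w_c}{s_c}\Bigr)^{-1},
    \qquad
    s_c = \frac{t_{\mathrm{ref}}(r_c)}{t_{\mathrm{target}}(r_c)}

Every codelet in a cluster is assumed to behave like its representative, so only the representatives have to be replayed on each target.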
Codelet Based Architecture Selection

Figure: benchmarking NAS serial on three architectures, with Nehalem as the reference. Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15, Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- Real speedup: measured by benchmarking the original applications.
- Predicted speedup: predicted with the representative codelets.
- CERE is 31× cheaper than running the full benchmarks.
Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes.
- Codelet replay enables fast per-region optimization tuning.

Figure: NAS SP ysolve codelet, cycles on Ivy Bridge at 3.4 GHz for 1000 schedules of random pass combinations derived from the O3 passes; the original and O3-replay performance are marked for reference.

CERE is 149× cheaper than running the full benchmark (about 27× cheaper when tuning codelets covering 75% of SP).
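For illustration, a rough driver in the same spirit (a sketch under assumptions: the pass list, file names and use of system() are illustrative, the passes are common legacy opt flags, and the timing of each replayed schedule is left out):

    /* Explore random combinations of middle-end passes on an extracted codelet. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const char *passes[] = { "-sroa", "-early-cse", "-instcombine",
                                 "-simplifycfg", "-gvn", "-licm",
                                 "-indvars", "-loop-unroll" };
        const int npasses = sizeof passes / sizeof passes[0];
        srand((unsigned)time(NULL));

        for (int trial = 0; trial < 1000; trial++) {
            char cmd[1024];
            int len = snprintf(cmd, sizeof cmd, "opt -S codelet.ll -o tuned.ll");
            for (int i = 0; i < npasses; i++)       /* pick a random subset */
                if (rand() & 1)
                    len += snprintf(cmd + len, sizeof cmd - len, " %s", passes[i]);
            if (system(cmd) != 0)
                continue;                           /* skip failing schedules */
            /* rebuild the codelet from tuned.ll, replay it, and record the
             * measured cycles for this schedule (omitted here).            */
        }
        return 0;
    }

Replaying only the codelet for each candidate schedule is what makes exploring hundreds of combinations affordable.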
Fast Scalability Benchmarking with OpenMP Codelets

Figure: real vs. predicted runtime (cycles) of SP compute_rhs on Sandy Bridge when the thread count is varied from 1 to 32 at replay, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

                  Core 2   Nehalem   Sandy Bridge   Ivy Bridge
    Accuracy      98.2%    97.1%     92.6%          97.2%
    Acceleration  ×25.2    ×27.4     ×23.7          ×23.7
Conclusion

- CERE breaks an application into faithful and retargetable codelets.
- Piece-wise autotuning:
  - different architectures
  - compiler optimizations
  - scalability
  - other exploration of costly analyses
- Limitations:
  - no support for codelets performing I/O (OS state is not captured)
  - cannot explore source-level optimizations
  - tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cere (Reports).
Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3.
Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.
Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Proceedings of the IEEE International Workload Characterization Symposium. IEEE, pp. 46–55.
Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.
Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.
Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
Retargetable replay: register state

- Issue: the register state is not portable between architectures.
- Solution: capture at a function call boundary:
  - no state is shared through registers except the function arguments
  - the arguments are obtained directly through portable IR code
- The capture is therefore register agnostic.
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem and Sandy Bridge.
- Preliminary portability tests between x86 and 32-bit ARM.
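Read at the C level (an illustrative analogue, not code CERE emits; start_capture mirrors the runtime call visible in the IR on the next slide), the outlined region receives every live-in location as an explicit argument, so the capture only has to record memory reachable from those arguments, never machine registers:

    /* Hypothetical C analogue of an outlined region. All live-in state
     * arrives through the argument list, so recording it does not depend
     * on any architecture-specific register layout. */
    void start_capture(int *i, int *s_addr, int *a_addr);   /* CERE runtime call */

    static void outlined(int *i, int *s_addr, int *a_addr) {
        start_capture(i, s_addr, a_addr);
        for (*i = 0; *i < *s_addr; (*i)++)    /* original loop body */
            a_addr[*i] += 1;
    }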
Retargetable replay: outlining regions

Step 1: outline the region to capture using the CodeExtractor pass.

Original function (loop branch in the region to capture):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub

After outlining, the original function simply calls the extracted region:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

Step 2: call start_capture just after the function call.
Step 3: at replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context stays close to the original.
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary: NAS SER

Figure: coverage and accurate replay for the NAS class A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, as a percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error below 15%.
Accuracy Summary: SPEC FP 2006

Figure: coverage and accurate replay on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD and specrand, as a percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error below 15%.

- Low coverage (sphinx3, wrf, povray): regions shorter than 2000 cycles, or I/O.
- Low matching (soplex, calculix, gamess): warmup "bugs".
CERE page capture: dump size

Figure: comparison between the page-granularity dump (CERE) and the full memory dump (Codelet Finder), in MB, on the NAS class A benchmarks (bt, cg, ep, ft, is, lu, mg, sp). CERE's page-granularity dump only contains the pages accessed by a codelet; it is therefore much smaller than a full memory dump.
CERE page capture: overhead

CERE page capture is coarser but faster.

           CERE    ATOM      PIN       Dyninst
  cg.A     1.94    98.82     222.67    896.86
  ft.A     2.41    44.22     127.64    1054.70
  lu.A     6.24    80.72     153.46    30.14
  mg.A     0.89    107.69    168.61    989.53
  sp.A     7.32    67.56     93.04     203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). The comparison is against the overhead of the memory tracing tools reported by [Gao et al. 2005].
CERE cache warmup

Example region:

    for (i = 0; i < size; i++)
        a[i] += b[i];

During capture, CERE keeps a FIFO of the most recently unprotected pages (in the diagram, pages holding the a[] and b[] arrays; when the FIFO advances, the oldest page, here page 20, is re-protected). The successive FIFO contents form the warmup page trace; at replay, touching these pages before the measured run warms the cache as in the original execution.
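A sketch of how such a trace can be maintained during capture (an assumption: simplified bookkeeping with illustrative names and sizes; the real CERE runtime differs):

    /* Keep only the most recently touched pages unprotected; the push
     * order doubles as the warmup page trace. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FIFO_CAPACITY 64                 /* pages kept unprotected at once */

    static void *fifo[FIFO_CAPACITY];
    static int head, count;                  /* index of oldest entry, fill level */

    /* called from the fault handler after a page has just been unprotected */
    static void warmup_trace_push(void *page, FILE *trace) {
        fwrite(&page, sizeof page, 1, trace);          /* append to the trace  */
        if (count == FIFO_CAPACITY) {
            void *oldest = fifo[head];
            mprotect(oldest, sysconf(_SC_PAGESIZE), PROT_NONE);  /* re-protect */
            fifo[head] = page;                          /* newest replaces it  */
            head = (head + 1) % FIFO_CAPACITY;
        } else {
            fifo[(head + count) % FIFO_CAPACITY] = page;
            count++;
        }
    }

Re-protecting evicted pages keeps the trace biased toward the most recently touched pages: a page touched again after eviction simply faults once more and re-enters the FIFO.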
Test architectures

                     Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
  CPU                D510      E7500     L5609      E3-1240        i7-3770      i7-4770
  Frequency (GHz)    1.66      2.93      1.86       3.30           3.40         3.40
  Cores              2         2         4          4              4            4
  L1 cache (KB)      2×56      2×64      4×64       4×64           4×64         4×64
  L2 cache           2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
  L3 cache (MB)      -         -         12         8              8            8
  RAM (GB)           4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.
Clustering NR Codelets (cut for K = 14)

  Codelet      Computation pattern
  toeplz_1     DP: 2 simultaneous reductions
  rstrct_29    DP: MG Laplacian fine to coarse mesh transition
  mprove_8     MP: dense matrix x vector product
  toeplz_4     DP: vector multiply in asc/desc order
  realft_4     DP: FFT butterfly computation
  toeplz_3     DP: 3 simultaneous reductions
  svbksb_3     SP: dense matrix x vector product
  lop_13       DP: Laplacian finite difference, constant coefficients
  toeplz_2     DP: vector multiply element wise in asc/desc order
  four1_2      MP: first step FFT
  tridag_2     DP: first order recurrence
  tridag_1     DP: first order recurrence
  ludcmp_4     SP: dot product over the lower half of a square matrix
  hqr_15       SP: addition on the diagonal elements of a matrix
  relax2_26    DP: red-black sweeps, Laplacian operator
  svdcmp_14    DP: vector divide, element wise
  svdcmp_13    DP: norm + vector divide
  hqr_13       DP: sum of the absolute values of a matrix column
  hqr_12_sq    SP: sum of a square matrix
  jacobi_5     SP: sum of the upper half of a square matrix
  hqr_12       SP: sum of the lower half of a square matrix
  svdcmp_11    DP: multiplying a matrix row by a scalar
  elmhes_11    DP: linear combination of matrix rows
  mprove_9     DP: subtracting a vector from a vector
  matadd_16    DP: sum of two square matrices, element wise
  svdcmp_6     DP: sum of the absolute values of a matrix row
  elmhes_10    DP: linear combination of matrix columns
  balanc_3     DP: vector multiply, element wise
Clustering NR Codelets (cut for K = 14): similar computation patterns

Codelets placed in the same cluster share similar computation patterns. For example, svdcmp_14 (DP vector divide, element wise) and svdcmp_13 (DP norm + vector divide) group together as long-latency operations, while hqr_13, hqr_12_sq, jacobi_5 and hqr_12 (sums over a matrix column, a square matrix, and its upper and lower halves) group together as reduction sums.
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection

- Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge.
- Validated on NAS + Core 2.
- The selected feature set is still among the best on NAS.

Figure: median prediction error versus the number of clusters on Atom, Core 2 and Sandy Bridge, comparing the GA-selected features against the worst, median and best feature sets.
Reduction Factor Breakdown

  Reduction      Total    Reduced invocations   Clustering
  Atom           ×44.3    ×12                   ×3.7
  Core 2         ×24.7    ×8.7                  ×2.8
  Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: benchmarking reduction factor breakdown with 18 representatives. The total factor is approximately the product of the invocation reduction and the clustering reduction (e.g. on Atom, 12 × 3.7 ≈ 44.3).
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
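As a concrete picture (an illustrative layout, not CERE's actual data structures; field names and comments are assumptions), the per-codelet signature can be viewed as a flat vector of these twelve features:

    /* Hypothetical per-codelet feature vector built from the Likwid and
     * MAQAO measurements listed above. */
    struct codelet_signature {
        /* 4 dynamic features (Likwid performance counters) */
        double flops;
        double l2_bandwidth;
        double l3_miss_rate;
        double mem_bandwidth;
        /* 8 static features (MAQAO disassembly and analysis) */
        double bytes_stored_per_cycle;
        double stalls;
        double estimated_ipc;
        double div_count;
        double sd_count;              /* number of SD instructions */
        double p1_pressure;           /* pressure in port P1 */
        double add_sub_mul_ratio;     /* ratio ADD+SUB / MUL */
        double vectorization;         /* FP / FP+INT / INT ratio, one value here */
    };

Clustering then operates on the distances between these vectors, as in Step B of the clustering workflow above.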
Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

LLVM simplified IR: the parallel region becomes a microtask function, invoked through the OpenMP runtime fork call, which materializes the thread execution model:

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(..., %1)
    }
Motivation
I Finding best application parameters is a costly iterativeprocess
C E R E
I Codelet Extractor and REplayerI Break an application into standalone codeletsI Make costly analysis affordable
I Focus on single regions instead of whole applicationsI Run a single representative by clustering similar codelets
1 16
Codelet Extraction
I Extract codelets as standalone microbenchmarks
C or Fortran application
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
codelet wrapper
capture machine state state dump
codelets can be recompiledand run independently
Reg
ion
+IR
2 16
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Codelet Extraction
I Extract codelets as standalone microbenchmarks
C or Fortran application
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
codelet wrapper
capture machine state state dump
codelets can be recompiledand run independently
Reg
ion
+IR
2 16
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
Four NAS codelets grouped into two clusters:
Cluster A (triple-nested loops dominated by high latency operations, div and exp): LU erhs.f:49, FT appft.f:45
Cluster B (stencil on five planes, memory bound): BT rhs.f:266, SP rhs.f:275

Reference architecture: Nehalem (1.86 GHz, 12 MB LLC). Targets: Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC).

[Figure: per-codelet execution time in ms on the reference and on each target, log scale from roughly 40 ms to 3000 ms, each point annotated slower or faster than the reference.]
28 16
Same Cluster = Same Speedup
Same codelets, clusters, and architectures as the previous slide. Codelets belonging to the same cluster exhibit the same speedup (or slowdown) when moving from the reference to a target; replaying one representative per cluster is therefore enough to predict the behavior of the others.

[Figure: same execution-time plot as above, highlighting that the two codelets of each cluster move together.]
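Read operationally: once a cluster representative has been replayed on the target, every codelet of that cluster inherits its speedup, and the whole-benchmark time is extrapolated from per-codelet coverage. A sketch of that extrapolation follows; the codelet names are taken from the figure, while the coverages, speedups, and function names are purely illustrative:

#include <stdio.h>

struct codelet {
    const char *name;
    double coverage;   /* fraction of benchmark time on the reference machine */
    int cluster;       /* cluster id; one representative is replayed per id */
};

/* rep_speedup[c] = (reference time / target time) measured by replaying the
   representative of cluster c on the target architecture. */
static double predicted_time(const struct codelet *c, int n,
                             const double *rep_speedup, double ref_time) {
    double t = 0.0, covered = 0.0;
    for (int i = 0; i < n; i++) {
        t += ref_time * c[i].coverage / rep_speedup[c[i].cluster];
        covered += c[i].coverage;
    }
    /* execution time not covered by any codelet is assumed unchanged */
    return t + ref_time * (1.0 - covered);
}

int main(void) {
    struct codelet cs[] = {
        { "LU erhs.f:49",  0.05, 0 },   /* cluster A */
        { "FT appft.f:45", 0.05, 0 },
        { "BT rhs.f:266",  0.27, 1 },   /* cluster B */
        { "SP rhs.f:275",  0.27, 1 },
    };
    double rep_speedup[2] = { 0.5, 0.8 };   /* illustrative per-cluster speedups */
    printf("predicted target time: %.1f s\n",
           predicted_time(cs, 4, rep_speedup, 100.0));
    return 0;
}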
29 16
Feature Selection
I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS
[Figure: median prediction error (%) as a function of the number of clusters (0 to 25), one panel per architecture (Atom, Core 2, Sandy Bridge), comparing the worst, median, and best feature sets against the GA-selected features.]
30 16
Reduction Factor Breakdown
Reduction       Total    Reduced invocations   Clustering
Atom            44.3     ×12                   ×3.7
Core 2          24.7     ×8.7                  ×2.8
Sandy Bridge    22.5     ×6.3                  ×3.6
Table: Benchmarking reduction factor breakdown with 18 representatives. The total reduction is roughly the product of replaying fewer invocations per codelet and of keeping a single representative per cluster (for example on Atom, 12 × 3.7 ≈ 44.3).
31 16
Profiling Features
SP codelets are profiled with Likwid (performance counters per codelet, giving 4 dynamic features) and Maqao (static disassembly and analysis, giving 8 static features).

8 static features (Maqao): bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization ratios (FP, FP+INT, INT).

4 dynamic features (Likwid): FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.
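Taken together, the twelve features can be pictured as one vector per codelet; the struct below is only an illustrative layout (the field names are mine, not CERE's):

struct codelet_features {
    /* 8 static features (Maqao disassembly) */
    double bytes_stored_per_cycle;
    double stalls;
    double estimated_ipc;
    double nb_div;            /* number of DIV instructions */
    double nb_sd;             /* number of SD instructions */
    double pressure_p1;       /* pressure in port P1 */
    double ratio_add_sub_mul; /* (ADD + SUB) / MUL */
    double vectorization;     /* FP / FP+INT / INT ratios in the original */
    /* 4 dynamic features (Likwid counters) */
    double flops;
    double l2_bandwidth;
    double l3_miss_rate;
    double mem_bandwidth;
};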
32 16
Clang OpenMP front-end
void main() {
  #pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}
C code
Clang OpenMP front end
define i32 @main() {
entry:
  call void @__kmpc_fork_call(... @omp_microtask ...)
  ...
}

define internal void @omp_microtask(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32, i32* %p, align 4
  call @printf(... %1 ...)
  ...
}
LLVM simplified IR (thread execution model)
33 16
Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46-55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308-322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
Fast Scalability Benchmarking with OpenMP Codelets
(Figure: real versus predicted runtime in cycles for SP compute_rhs on Sandy Bridge, from 1 to 32 threads.)

                Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy         98.2%    97.1%     92.6%          97.2%
Acceleration    ×25.2    ×27.4     ×23.7          ×23.7

Figure: Varying thread number at replay in SP, and average results over the OMP NAS benchmarks [Popov et al 2015]
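A minimal sketch of what varying the thread count at replay amounts to, assuming a hypothetical run_codelet() entry point that re-executes the captured parallel region from its memory dump (this is not CERE's actual replay driver):

    #include <omp.h>
    #include <stdio.h>

    extern void run_codelet(void);      /* hypothetical replayed OpenMP region */

    int main(void) {
        for (int t = 1; t <= 32; t *= 2) {
            omp_set_num_threads(t);     /* retarget the replay to t threads */
            double start = omp_get_wtime();
            run_codelet();
            printf("%2d threads: %.3f s\n", t, omp_get_wtime() - start);
        }
        return 0;
    }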
13 16
Outline
Extracting and Replaying Codelets: Faithful, Retargetable
Applications: Architecture selection, Compiler flags tuning, Scalability prediction
Demo: Capture and replay in NAS BT, Simple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetable codelets
I Piece-wise autotuning
  I Different architectures
  I Compiler optimizations
  I Scalability
  I Other costly exploration and analysis
I Limitations
  I No support for codelets performing IO (OS state not captured)
  I Cannot explore source-level optimizations
  I Tied to LLVM
I Full accuracy reports on NAS and SPEC'06 FP available at benchmark-subsetting.github.io/cereReports
15 16
Thanks for your attention
C E R E
https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3
16 16
Bibliography I
Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.
Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces." In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.
Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization." In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.
Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection." In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.
Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction." In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.
17 16
Retargetable replay register state
I Issue: register state is not portable between architectures
I Solution: capture at a function call boundary
  I No shared state through registers except the function arguments
  I Get the arguments directly through portable IR code
I Register-agnostic capture
  I Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
  I Preliminary portability tests between x86 and 32-bit ARM
18 16
Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

Original region (in the enclosing function):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub   ; loop branch here

Outlined version:

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

    ; in the original function the region is replaced by
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al 2010] steps ensure that the compilation context is close to the original.
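For intuition, a source-level sketch of what the outlining amounts to (CERE performs it on LLVM IR, not on C; the names and the start_capture stub are illustrative):

    #define N 1024

    /* In CERE this hook is provided by the capture runtime; stubbed here. */
    static void start_capture(int *i, int *s, int *a) { (void)i; (void)s; (void)a; }

    /* The extracted region: live-in variables become function arguments and
       start_capture records the state before the first access. */
    static void outlined(int *i, int *s, int a[N]) {
        start_capture(i, s, a);
        for (*i = 0; *i < *s; (*i)++)
            a[*i] += 1;                 /* original region body */
    }

    void original_function(int *i, int *s, int a[N]) {
        outlined(i, s, a);              /* the region is replaced by this call */
    }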
19 16
Comparison to other Code Isolating tools
                   CERE             Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language         C(++), Fortran   Fortran         C, Fortran   C(++), Fortran   assembly
  Extraction       IR               source          source       source           assembly
  Indirections     yes              no              no           yes              yes
Replay
  Simulator        yes              yes             yes          yes              yes
  Hardware         yes              yes             yes          yes              no
Reduction
  Capture size     reduced          reduced         reduced      full             -
  Instances        yes              manual          manual       manual           yes
  Code signature   yes              no              no           no               yes
20 16
Accuracy Summary NAS SER
(Figure: per-benchmark bars for the NAS-A serial suite (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell.)
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
21 16
Accuracy Summary SPEC FP 2006
(Figure: per-benchmark bars for SPEC FP 2006 (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand) on Haswell.)
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
I Low coverage (sphinx3, wrf, povray): regions < 2000 cycles or doing IO
I Low matching (soplex, calculix, gamess): warmup "bugs"
22 16
CERE page capture dump size
(Figure: dump size in MB for each NAS-A benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump of Codelet Finder to the page-granularity dump of CERE.)
Figure: Comparison between the page capture and the full dump size on NAS-A benchmarks. The CERE page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
23 16
CERE page capture overhead
CERE page capture is coarser but faster
        CERE    ATOM 325   PIN 171   Dyninst 40
cg.A    194     9882       22267     89686
ft.A    241     4422       12764     105470
lu.A    624     8072       15346     3014
mg.A    89      10769      16861     98953
sp.A    732     6756       9304      20366

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk and of tracing the page accesses during the whole execution). We compare to the overhead of the memory tracing tools reported by [Gao et al 2005].
24 16
CERE cache warmup
    for (i = 0; i < size; i++)
        a[i] += b[i];

(Diagram: the loop touches pages of array a[] (21, 22, 23, ...) and of array b[] (50, 51, 52, ...). CERE keeps a FIFO of the most recently unprotected pages; when a newly touched page enters the FIFO, the oldest page (20 in the example) is re-protected. The resulting sequence of page addresses is saved as the warmup page trace.)
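A minimal sketch of this page-protection mechanism under the scheme described above (illustrative only: the real implementation also dumps page contents, intercepts allocations, and replays the FIFO as the warmup trace):

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FIFO_SIZE 128
    static void *fifo[FIFO_SIZE];       /* most recently unprotected pages */
    static int fifo_next;

    /* First touch of a protected page raises SIGSEGV: unprotect the page,
       remember it in the FIFO, and re-protect the page it evicts. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        long pagesize = sysconf(_SC_PAGESIZE);
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
        /* ... dump the page contents to disk here ... */
        mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
        if (fifo[fifo_next])            /* evict: re-protect the oldest page */
            mprotect(fifo[fifo_next], (size_t)pagesize, PROT_NONE);
        fifo[fifo_next] = page;
        fifo_next = (fifo_next + 1) % FIFO_SIZE;
    }

    static void install_page_capture(void) {
        struct sigaction sa = {0};
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = segv_handler;
        sigaction(SIGSEGV, &sa, NULL);
    }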
25 16
Test architectures
                   Atom     Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU                D510     E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)    1.66     2.93      1.86       3.30           3.40         3.40
Cores              2        2         4          4              4            4
L1 cache (KB)      2×56     2×64      4×64       4×64           4×64         4×64
L2 cache           2×512 KB 3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)      -        -         12         8              8            8
RAM (GB)           4        4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
[Bar chart: percentage of execution time per NAS A benchmark (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, showing accurate replay and codelet coverage.]
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error of less than 15%.
21 16
Accuracy Summary SPEC FP 2006
[Bar chart: percentage of execution time per SPEC FP 2006 benchmark (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand) on Haswell, showing accurate replay and codelet coverage.]
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error of less than 15%.
I low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O
I low matching (soplex, calculix, gamess): warmup "bugs"
22 16
CERE page capture dump size
[Bar chart: dump size in MB (0 to 200) per NAS A benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump (Codelet Finder) with the page granularity dump (CERE).]
Figure: Comparison between the page capture and full dump size on NAS A benchmarks. The CERE page granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
23 16
CERE page capture overhead
CERE page capture is coarser but faster
           CERE    ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a       1.94    98.82       222.67     896.86
ft.a       2.41    44.22       127.64     1054.70
lu.a       6.24    80.72       153.46     30.14
mg.a       0.89    107.69      168.61     989.53
sp.a       7.32    67.56       93.04      203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al 2005].
24 16
CERE cache warmup
for (i = 0; i < size; i++) a[i] += b[i];
[Diagram: array a[] spans pages 21, 22, 23, ... and array b[] spans pages 50, 51, 52, ... Each page touched by the codelet is unprotected and its address enters a FIFO of the most recently unprotected pages; when the FIFO is full, the oldest page (page 20 in the example) is re-protected. The FIFO contents form the warmup page trace.]
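A self-contained sketch of this mechanism is given below (the FIFO depth, names, and overall structure are assumptions, not the actual CERE runtime): each first touch of a protected page faults, the handler unprotects the page and records it in a FIFO of the most recently touched pages, and the page evicted from a full FIFO is re-protected so the trace stays recent:

  #include <csignal>
  #include <cstdint>
  #include <cstdio>
  #include <sys/mman.h>
  #include <unistd.h>

  static const std::size_t kFifoDepth = 16;   // assumed depth of the warmup FIFO
  static void *fifo[kFifoDepth];
  static std::size_t fifo_head = 0, fifo_len = 0;
  static long page_size;

  static void onFault(int, siginfo_t *si, void *) {
    void *page = reinterpret_cast<void *>(
        reinterpret_cast<std::uintptr_t>(si->si_addr) &
        ~std::uintptr_t(page_size - 1));
    if (fifo_len == kFifoDepth) {             // FIFO full: re-protect oldest page
      mprotect(fifo[fifo_head], page_size, PROT_NONE);
      fifo_head = (fifo_head + 1) % kFifoDepth;
      --fifo_len;
    }
    fifo[(fifo_head + fifo_len) % kFifoDepth] = page;  // warmup page trace entry
    ++fifo_len;
    mprotect(page, page_size, PROT_READ | PROT_WRITE); // let the access proceed
  }

  int main() {
    page_size = sysconf(_SC_PAGESIZE);
    struct sigaction sa = {};
    sa.sa_sigaction = onFault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    // Two arrays standing in for the codelet's working set, protected up front.
    static char a[1 << 20] __attribute__((aligned(4096)));
    static char b[1 << 20] __attribute__((aligned(4096)));
    mprotect(a, sizeof a, PROT_NONE);
    mprotect(b, sizeof b, PROT_NONE);

    for (std::size_t i = 0; i < sizeof a; ++i)  // the codelet: a[i] += b[i]
      a[i] += b[i];
    std::printf("pages currently in the warmup FIFO: %zu\n", fifo_len);
    return 0;
  }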
25 16
Test architectures
                 Atom     Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510     E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66     2.93     1.86      3.30           3.40         3.40
Cores            2        2        4         4              4            4
L1 cache (KB)    2×56     2×64     4×64      4×64           4×64         4×64
L2 cache (KB)    2×512    3 MB     4×256     4×256          4×256        4×256
L3 cache (MB)    -        -        12        8              8            8
RAM (GB)         4        4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise
27 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-