CERE: LLVM-based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization
Chadi Akel, P. de Oliveira Castro, M. Popov, E. Petit, W. Jalby
University of Versailles – Exascale Computing Research
EuroLLVM 2016
Motivation

- Finding the best application parameters is a costly, iterative process.

CERE

- Codelet Extractor and REplayer
- Breaks an application into standalone codelets
- Makes costly analysis affordable:
  - focus on single regions instead of whole applications
  - run a single representative obtained by clustering similar codelets
Codelet Extraction

- Extract codelets as standalone microbenchmarks.

A region of a C or Fortran application, for example

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += b[i] * c[j];

is extracted (region + IR) into a codelet wrapper, together with a capture of the machine state (state dump). Codelets can then be recompiled and run independently.
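To make the wrapper concrete, here is a minimal sketch of what a generated replay harness could look like (an assumption: the helper cere_load_dump() and the outlined symbol __cere__codelet are hypothetical placeholders, not CERE's actual interface):

    /* Minimal replay-wrapper sketch; simplified, not generated by CERE. */
    #include <stdio.h>

    /* outlined region, compiled from the captured LLVM IR */
    extern void __cere__codelet(void);
    /* hypothetical helper: maps the dumped pages back at their original addresses */
    extern void cere_load_dump(const char *dump_dir);

    int main(void) {
        cere_load_dump("dumps/codelet/1");   /* restore the memory state dump */
        __cere__codelet();                   /* replay the captured region    */
        return 0;
    }

Because the wrapper depends only on the dump and the outlined IR, the codelet can be rebuilt with different compilers, flags, or target architectures without touching the original application.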
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline

- Extracting and Replaying Codelets: Faithful, Retargetable
- Applications: Architecture selection, Compiler flags tuning, Scalability prediction
- Demo: Capture and replay in NAS BT; Simple flag replay for NAS FT
- Conclusion
Capturing codelets at Intermediate Representation

- Faithful: behaves similarly to the original region.
- Retargetable: runtime and compilation parameters can be modified.

  Assembly-level capture               Source-level capture
  same assembly                        ≠ assembly [Akel et al. 2013]
  hard to retarget (compiler, ISA)     easy to retarget
  costly: support various ISAs         costly: support various languages

- The LLVM Intermediate Representation is a good tradeoff between the two.
Faithful capture

- Required for a semantically accurate replay:
  - register state
  - memory state
  - OS state: locks, file descriptors, sockets
- No support for OS state except for locks. CERE captures fully from userland: no kernel modules required.
- Required for a performance-accurate replay:
  - preserve code generation
  - cache state
  - NUMA ownership
  - other warmup state (e.g. branch predictor)
Faithful capture: memory

Capture accesses at page granularity: coarse but fast. Before the region to capture runs, CERE protects the static and currently allocated process memory (/proc/self/maps) and intercepts the memory allocation functions with LD_PRELOAD.

- On a memory allocation (e.g. a = malloc(256)): 1. allocate the memory, 2. protect it and return to the user program.
- On a memory access (e.g. a[i]++), the segmentation fault handler: 1. dumps the accessed memory to disk, 2. unlocks the accessed page and returns to the user program.

- Small dump footprint: only touched pages are saved.
- Warmup cache at replay: trace of the most recently touched pages.
- NUMA: detect the first touch of each page.
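As a concrete illustration of this mechanism, here is a self-contained sketch (an assumption: heavily simplified, not CERE's code) that protects a region, dumps each page on its first touch from the SIGSEGV handler, and lets the program continue:

    /* Page-granularity capture sketch: protect memory, dump pages on first touch. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static FILE *dump_file;                 /* where touched pages are saved */

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1));
        /* unlock the page first so that its contents can be read and the access replayed */
        mprotect(page, pagesz, PROT_READ | PROT_WRITE);
        /* dump address + contents of the touched page (fwrite is not
         * async-signal-safe; kept simple for the sketch) */
        fwrite(&page, sizeof page, 1, dump_file);
        fwrite(page, 1, pagesz, dump_file);
        (void)sig; (void)ctx;
    }

    int main(void) {
        long pagesz = sysconf(_SC_PAGESIZE);
        dump_file = fopen("capture.dump", "wb");

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* stand-in for "protect the currently allocated process memory" */
        char *a = mmap(NULL, 4 * pagesz, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        a[0] += 1;          /* faults: page 0 is dumped, then unlocked */
        a[pagesz] += 1;     /* faults: page 1 is dumped, then unlocked */
        a[1] += 1;          /* no fault: page 0 was already unlocked   */

        fclose(dump_file);
        return 0;
    }

The warmup page trace and the NUMA first-touch detection mentioned above come from the same page-fault information.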
Selecting Representative Codelets

- Key idea: applications have redundancies:
  - the same codelet is called multiple times
  - codelets share similar performance signatures
- Detect the redundancies and keep only one representative.

Figure: execution trace (cycles per invocation) of the SPEC tonto codelet make_ft (shell2.F90, line 1133), with the replayed invocations highlighted. 90% of NAS codelets can be reduced to four or fewer representatives.
Performance Signature Clustering

Step A: perform static and dynamic profiling (MAQAO, Likwid) on a reference architecture to capture each codelet's feature vector (vectorization ratio, FLOPS/s, cache misses, ...).

Step B: using the proximity between feature vectors, cluster similar codelets (for example across BT and SP) and select one representative per cluster.

Step C: CERE extracts the representatives as standalone codelets; measuring them on the target machines (Atom, Core 2, Sandy Bridge) feeds a model that extrapolates the full benchmark results.

[Oliveira Castro et al. 2014]
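The slides do not spell out the Step C model; one plausible extrapolation (an assumption, with w_c the fraction of application time spent in cluster c on the reference machine, r_c the cluster representative, and s_c its measured replay speedup on the target) is a coverage-weighted harmonic mean:

    S_{\mathrm{pred}} = \Bigl(\sum_{c \in \mathrm{clusters}} \frac{w_c}{s_c}\Bigr)^{-1},
    \qquad
    s_c = \frac{t_{\mathrm{ref}}(r_c)}{t_{\mathrm{target}}(r_c)}

Every codelet in a cluster is assumed to behave like its representative, so only the representatives have to be replayed on each target.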
Codelet Based Architecture Selection

Figure: benchmarking NAS serial on three architectures, with Nehalem as the reference. Geometric mean speedup, real vs. predicted: Atom 0.12 vs. 0.15, Core 2 0.83 vs. 0.83, Sandy Bridge 1.55 vs. 1.59.

- Real speedup: measured by benchmarking the original applications.
- Predicted speedup: predicted with the representative codelets.
- CERE is 31× cheaper than running the full benchmarks.
Autotuning LLVM middle-end optimizations

- The LLVM middle-end offers more than 50 optimization passes.
- Codelet replay enables fast per-region optimization tuning.

Figure: NAS SP ysolve codelet, cycles on Ivy Bridge at 3.4 GHz for 1000 schedules of random pass combinations derived from the O3 passes; the original and O3-replay performance are marked for reference.

CERE is 149× cheaper than running the full benchmark (about 27× cheaper when tuning codelets covering 75% of SP).
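For illustration, a rough driver in the same spirit (a sketch under assumptions: the pass list, file names and use of system() are illustrative, the passes are common legacy opt flags, and the timing of each replayed schedule is left out):

    /* Explore random combinations of middle-end passes on an extracted codelet. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const char *passes[] = { "-sroa", "-early-cse", "-instcombine",
                                 "-simplifycfg", "-gvn", "-licm",
                                 "-indvars", "-loop-unroll" };
        const int npasses = sizeof passes / sizeof passes[0];
        srand((unsigned)time(NULL));

        for (int trial = 0; trial < 1000; trial++) {
            char cmd[1024];
            int len = snprintf(cmd, sizeof cmd, "opt -S codelet.ll -o tuned.ll");
            for (int i = 0; i < npasses; i++)       /* pick a random subset */
                if (rand() & 1)
                    len += snprintf(cmd + len, sizeof cmd - len, " %s", passes[i]);
            if (system(cmd) != 0)
                continue;                           /* skip failing schedules */
            /* rebuild the codelet from tuned.ll, replay it, and record the
             * measured cycles for this schedule (omitted here).            */
        }
        return 0;
    }

Replaying only the codelet for each candidate schedule is what makes exploring hundreds of combinations affordable.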
Fast Scalability Benchmarking with OpenMP Codelets

Figure: real vs. predicted runtime (cycles) of SP compute_rhs on Sandy Bridge when the thread count is varied from 1 to 32 at replay, and average results over the OpenMP NAS benchmarks [Popov et al. 2015]:

                  Core 2   Nehalem   Sandy Bridge   Ivy Bridge
    Accuracy      98.2%    97.1%     92.6%          97.2%
    Acceleration  ×25.2    ×27.4     ×23.7          ×23.7
Conclusion

- CERE breaks an application into faithful and retargetable codelets.
- Piece-wise autotuning:
  - different architectures
  - compiler optimizations
  - scalability
  - other exploration of costly analyses
- Limitations:
  - no support for codelets performing I/O (OS state is not captured)
  - cannot explore source-level optimizations
  - tied to LLVM
- Full accuracy reports on NAS and SPEC'06 FP are available at https://benchmark-subsetting.github.io/cere (Reports).
Thanks for your attention!

CERE: https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3.
Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.
Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Proceedings of the IEEE International Workload Characterization Symposium. IEEE, pp. 46–55.
Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.
Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.
Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
Retargetable replay: register state

- Issue: the register state is not portable between architectures.
- Solution: capture at a function call boundary:
  - no state is shared through registers except the function arguments
  - the arguments are obtained directly through portable IR code
- The capture is therefore register agnostic.
- Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem and Sandy Bridge.
- Preliminary portability tests between x86 and 32-bit ARM.
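Read at the C level (an illustrative analogue, not code CERE emits; start_capture mirrors the runtime call visible in the IR on the next slide), the outlined region receives every live-in location as an explicit argument, so the capture only has to record memory reachable from those arguments, never machine registers:

    /* Hypothetical C analogue of an outlined region. All live-in state
     * arrives through the argument list, so recording it does not depend
     * on any architecture-specific register layout. */
    void start_capture(int *i, int *s_addr, int *a_addr);   /* CERE runtime call */

    static void outlined(int *i, int *s_addr, int *a_addr) {
        start_capture(i, s_addr, a_addr);
        for (*i = 0; *i < *s_addr; (*i)++)    /* original loop body */
            a_addr[*i] += 1;
    }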
Retargetable replay: outlining regions

Step 1: outline the region to capture using the CodeExtractor pass.

Original function (loop branch in the region to capture):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub

After outlining, the original function simply calls the extracted region:

    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

Step 2: call start_capture just after the function call.
Step 3: at replay, re-inlining and variable cloning [Liao et al. 2010] steps ensure that the compilation context stays close to the original.
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary: NAS SER

Figure: coverage and accurate replay for the NAS class A benchmarks (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, as a percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error below 15%.
Accuracy Summary: SPEC FP 2006

Figure: coverage and accurate replay on Haswell for gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD and specrand, as a percentage of execution time. The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error below 15%.

- Low coverage (sphinx3, wrf, povray): regions shorter than 2000 cycles, or I/O.
- Low matching (soplex, calculix, gamess): warmup "bugs".
CERE page capture: dump size

Figure: comparison between the page-granularity dump (CERE) and the full memory dump (Codelet Finder), in MB, on the NAS class A benchmarks (bt, cg, ep, ft, is, lu, mg, sp). CERE's page-granularity dump only contains the pages accessed by a codelet; it is therefore much smaller than a full memory dump.
CERE page capture: overhead

CERE page capture is coarser but faster.

           CERE    ATOM      PIN       Dyninst
  cg.A     1.94    98.82     222.67    896.86
  ft.A     2.41    44.22     127.64    1054.70
  lu.A     6.24    80.72     153.46    30.14
  mg.A     0.89    107.69    168.61    989.53
  sp.A     7.32    67.56     93.04     203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk, and of tracing the page accesses during the whole execution). The comparison is against the overhead of the memory tracing tools reported by [Gao et al. 2005].
CERE cache warmup

Example region:

    for (i = 0; i < size; i++)
        a[i] += b[i];

During capture, CERE keeps a FIFO of the most recently unprotected pages (in the diagram, pages holding the a[] and b[] arrays; when the FIFO advances, the oldest page, here page 20, is re-protected). The successive FIFO contents form the warmup page trace; at replay, touching these pages before the measured run warms the cache as in the original execution.
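A sketch of how such a trace can be maintained during capture (an assumption: simplified bookkeeping with illustrative names and sizes; the real CERE runtime differs):

    /* Keep only the most recently touched pages unprotected; the push
     * order doubles as the warmup page trace. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FIFO_CAPACITY 64                 /* pages kept unprotected at once */

    static void *fifo[FIFO_CAPACITY];
    static int head, count;                  /* index of oldest entry, fill level */

    /* called from the fault handler after a page has just been unprotected */
    static void warmup_trace_push(void *page, FILE *trace) {
        fwrite(&page, sizeof page, 1, trace);          /* append to the trace  */
        if (count == FIFO_CAPACITY) {
            void *oldest = fifo[head];
            mprotect(oldest, sysconf(_SC_PAGESIZE), PROT_NONE);  /* re-protect */
            fifo[head] = page;                          /* newest replaces it  */
            head = (head + 1) % FIFO_CAPACITY;
        } else {
            fifo[(head + count) % FIFO_CAPACITY] = page;
            count++;
        }
    }

Re-protecting evicted pages keeps the trace biased toward the most recently touched pages: a page touched again after eviction simply faults once more and re-enters the FIFO.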
Test architectures

                     Atom      Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
  CPU                D510      E7500     L5609      E3-1240        i7-3770      i7-4770
  Frequency (GHz)    1.66      2.93      1.86       3.30           3.40         3.40
  Cores              2         2         4          4              4            4
  L1 cache (KB)      2×56      2×64      4×64       4×64           4×64         4×64
  L2 cache           2×512 KB  3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
  L3 cache (MB)      -         -         12         8              8            8
  RAM (GB)           4         4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+.
Clustering NR Codelets (cut for K = 14)

  Codelet      Computation pattern
  toeplz_1     DP: 2 simultaneous reductions
  rstrct_29    DP: MG Laplacian fine to coarse mesh transition
  mprove_8     MP: dense matrix x vector product
  toeplz_4     DP: vector multiply in asc/desc order
  realft_4     DP: FFT butterfly computation
  toeplz_3     DP: 3 simultaneous reductions
  svbksb_3     SP: dense matrix x vector product
  lop_13       DP: Laplacian finite difference, constant coefficients
  toeplz_2     DP: vector multiply element wise in asc/desc order
  four1_2      MP: first step FFT
  tridag_2     DP: first order recurrence
  tridag_1     DP: first order recurrence
  ludcmp_4     SP: dot product over the lower half of a square matrix
  hqr_15       SP: addition on the diagonal elements of a matrix
  relax2_26    DP: red-black sweeps, Laplacian operator
  svdcmp_14    DP: vector divide, element wise
  svdcmp_13    DP: norm + vector divide
  hqr_13       DP: sum of the absolute values of a matrix column
  hqr_12_sq    SP: sum of a square matrix
  jacobi_5     SP: sum of the upper half of a square matrix
  hqr_12       SP: sum of the lower half of a square matrix
  svdcmp_11    DP: multiplying a matrix row by a scalar
  elmhes_11    DP: linear combination of matrix rows
  mprove_9     DP: subtracting a vector from a vector
  matadd_16    DP: sum of two square matrices, element wise
  svdcmp_6     DP: sum of the absolute values of a matrix row
  elmhes_10    DP: linear combination of matrix columns
  balanc_3     DP: vector multiply, element wise
Clustering NR Codelets (cut for K = 14): similar computation patterns

Codelets placed in the same cluster share similar computation patterns. For example, svdcmp_14 (DP vector divide, element wise) and svdcmp_13 (DP norm + vector divide) group together as long-latency operations, while hqr_13, hqr_12_sq, jacobi_5 and hqr_12 (sums over a matrix column, a square matrix, and its upper and lower halves) group together as reduction sums.
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection

- Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge.
- Validated on NAS + Core 2.
- The selected feature set is still among the best on NAS.

Figure: median prediction error versus the number of clusters on Atom, Core 2 and Sandy Bridge, comparing the GA-selected features against the worst, median and best feature sets.
Reduction Factor Breakdown

  Reduction      Total    Reduced invocations   Clustering
  Atom           ×44.3    ×12                   ×3.7
  Core 2         ×24.7    ×8.7                  ×2.8
  Sandy Bridge   ×22.5    ×6.3                  ×3.6

Table: benchmarking reduction factor breakdown with 18 representatives. The total factor is approximately the product of the invocation reduction and the clustering reduction (e.g. on Atom, 12 × 3.7 ≈ 44.3).
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
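As a concrete picture (an illustrative layout, not CERE's actual data structures; field names and comments are assumptions), the per-codelet signature can be viewed as a flat vector of these twelve features:

    /* Hypothetical per-codelet feature vector built from the Likwid and
     * MAQAO measurements listed above. */
    struct codelet_signature {
        /* 4 dynamic features (Likwid performance counters) */
        double flops;
        double l2_bandwidth;
        double l3_miss_rate;
        double mem_bandwidth;
        /* 8 static features (MAQAO disassembly and analysis) */
        double bytes_stored_per_cycle;
        double stalls;
        double estimated_ipc;
        double div_count;
        double sd_count;              /* number of SD instructions */
        double p1_pressure;           /* pressure in port P1 */
        double add_sub_mul_ratio;     /* ratio ADD+SUB / MUL */
        double vectorization;         /* FP / FP+INT / INT ratio, one value here */
    };

Clustering then operates on the distances between these vectors, as in Step B of the clustering workflow above.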
Clang OpenMP front-end

C code:

    void main() {
        #pragma omp parallel
        {
            int p = omp_get_thread_num();
            printf("%d", p);
        }
    }

LLVM simplified IR: the parallel region becomes a microtask function, invoked through the OpenMP runtime fork call, which materializes the thread execution model:

    define i32 @main() {
    entry:
      call @__kmpc_fork_call(@omp_microtask, ...)
    }

    define internal void @omp_microtask(...) {
    entry:
      %p = alloca i32, align 4
      %call = call i32 @omp_get_thread_num()
      store i32 %call, i32* %p, align 4
      %1 = load i32* %p, align 4
      call @printf(..., %1)
    }
Motivation
I Finding best application parameters is a costly iterativeprocess
C E R E
I Codelet Extractor and REplayerI Break an application into standalone codeletsI Make costly analysis affordable
I Focus on single regions instead of whole applicationsI Run a single representative by clustering similar codelets
1 16
Codelet Extraction
I Extract codelets as standalone microbenchmarks
C or Fortran application
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
codelet wrapper
capture machine state state dump
codelets can be recompiledand run independently
Reg
ion
+IR
2 16
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Codelet Extraction
I Extract codelets as standalone microbenchmarks
C or Fortran application
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
for (i = 0 i lt N i ++) for (j = 0 j lt N j ++) a[i][j] += b[i]c[j]
codelet wrapper
capture machine state state dump
codelets can be recompiledand run independently
Reg
ion
+IR
2 16
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
CERE Workflow
ApplicationsLLVM IR Region
outlining
RegionCapture
Fast performance
prediction
Retarget for different architecture different optimizations
Change number of threads affinityruntime parameters
Warmup+
Replay
Working set and cache
capture
Generatecodelets wrapper
Working setsmemory dump
CodeletReplay
Invocationamp
Codeletsubsetting
CERE can extract codelets from
I Hot Loops
I OpenMP non-nested parallel regions [Popov et al 2015]
3 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
4 16
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
Four NAS codelets grouped into two clusters:
Cluster A (triple-nested loops dominated by high latency operations, div and exp): LU erhs.f:49, FT appft.f:45
Cluster B (stencil on five planes, memory bound): BT rhs.f:266, SP rhs.f:275

Reference architecture: Nehalem (1.86 GHz, 12 MB LLC). Targets: Core 2 (2.93 GHz, 3 MB LLC) and Atom (1.66 GHz, 1 MB LLC).

[Figure: per-codelet execution time in ms on the reference and on each target, log scale from roughly 40 ms to 3000 ms, each point annotated slower or faster than the reference.]
28 16
Same Cluster = Same Speedup
Same codelets, clusters, and architectures as the previous slide. Codelets belonging to the same cluster exhibit the same speedup (or slowdown) when moving from the reference to a target; replaying one representative per cluster is therefore enough to predict the behavior of the others.

[Figure: same execution-time plot as above, highlighting that the two codelets of each cluster move together.]
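Read operationally: once a cluster representative has been replayed on the target, every codelet of that cluster inherits its speedup, and the whole-benchmark time is extrapolated from per-codelet coverage. A sketch of that extrapolation follows; the codelet names are taken from the figure, while the coverages, speedups, and function names are purely illustrative:

#include <stdio.h>

struct codelet {
    const char *name;
    double coverage;   /* fraction of benchmark time on the reference machine */
    int cluster;       /* cluster id; one representative is replayed per id */
};

/* rep_speedup[c] = (reference time / target time) measured by replaying the
   representative of cluster c on the target architecture. */
static double predicted_time(const struct codelet *c, int n,
                             const double *rep_speedup, double ref_time) {
    double t = 0.0, covered = 0.0;
    for (int i = 0; i < n; i++) {
        t += ref_time * c[i].coverage / rep_speedup[c[i].cluster];
        covered += c[i].coverage;
    }
    /* execution time not covered by any codelet is assumed unchanged */
    return t + ref_time * (1.0 - covered);
}

int main(void) {
    struct codelet cs[] = {
        { "LU erhs.f:49",  0.05, 0 },   /* cluster A */
        { "FT appft.f:45", 0.05, 0 },
        { "BT rhs.f:266",  0.27, 1 },   /* cluster B */
        { "SP rhs.f:275",  0.27, 1 },
    };
    double rep_speedup[2] = { 0.5, 0.8 };   /* illustrative per-cluster speedups */
    printf("predicted target time: %.1f s\n",
           predicted_time(cs, 4, rep_speedup, 100.0));
    return 0;
}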
29 16
Feature Selection
I Genetic Algorithm: trained on Numerical Recipes + Atom + Sandy Bridge
I Validated on NAS + Core 2
I The feature set is still among the best on NAS
[Figure: median prediction error (%) as a function of the number of clusters (0 to 25), one panel per architecture (Atom, Core 2, Sandy Bridge), comparing the worst, median, and best feature sets against the GA-selected features.]
30 16
Reduction Factor Breakdown
Reduction       Total    Reduced invocations   Clustering
Atom            44.3     ×12                   ×3.7
Core 2          24.7     ×8.7                  ×2.8
Sandy Bridge    22.5     ×6.3                  ×3.6
Table: Benchmarking reduction factor breakdown with 18 representatives. The total reduction is roughly the product of replaying fewer invocations per codelet and of keeping a single representative per cluster (for example on Atom, 12 × 3.7 ≈ 44.3).
31 16
Profiling Features
SP codelets are profiled with Likwid (performance counters per codelet, giving 4 dynamic features) and Maqao (static disassembly and analysis, giving 8 static features).

8 static features (Maqao): bytes stored per cycle, stalls, estimated IPC, number of DIV, number of SD, pressure in P1, ratio (ADD+SUB)/MUL, vectorization ratios (FP, FP+INT, INT).

4 dynamic features (Likwid): FLOPS, L2 bandwidth, L3 miss rate, memory bandwidth.
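Taken together, the twelve features can be pictured as one vector per codelet; the struct below is only an illustrative layout (the field names are mine, not CERE's):

struct codelet_features {
    /* 8 static features (Maqao disassembly) */
    double bytes_stored_per_cycle;
    double stalls;
    double estimated_ipc;
    double nb_div;            /* number of DIV instructions */
    double nb_sd;             /* number of SD instructions */
    double pressure_p1;       /* pressure in port P1 */
    double ratio_add_sub_mul; /* (ADD + SUB) / MUL */
    double vectorization;     /* FP / FP+INT / INT ratios in the original */
    /* 4 dynamic features (Likwid counters) */
    double flops;
    double l2_bandwidth;
    double l3_miss_rate;
    double mem_bandwidth;
};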
32 16
Clang OpenMP front-end
void main() {
  #pragma omp parallel
  {
    int p = omp_get_thread_num();
    printf("%d", p);
  }
}
C code
Clang OpenMP front end
define i32 @main() {
entry:
  call void @__kmpc_fork_call(... @omp_microtask ...)
  ...
}

define internal void @omp_microtask(...) {
entry:
  %p = alloca i32, align 4
  %call = call i32 @omp_get_thread_num()
  store i32 %call, i32* %p, align 4
  %1 = load i32, i32* %p, align 4
  call @printf(... %1 ...)
  ...
}
LLVM simplified IR (thread execution model)
33 16
Bibliography

Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.

Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces". In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46-55.

Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization". In: Languages and Compilers for Parallel Computing. Springer, pp. 308-322.

Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection". In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.

Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction". In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Capturing codelets at Intermediate Representation
I Faithful behaves similarly to the original region
I Retargetable modify runtime and compilation parameters
same assembly 6= assembly [Akel et al 2013]hard to retarget (compiler ISA) easy to retargetcostly support various ISA costly support various languages
I LLVM Intermediate Representation is a good tradeoff
5 16
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Faithful capture
I Required for semantically accurate replayI Register stateI Memory stateI OS state locks file descriptors sockets
I No support for OS state except for locks CERE captures fullyfrom userland no kernel modules required
I Required for performance accurate replayI Preserve code generationI Cache stateI NUMA ownershipI Other warmup state (eg branch predictor)
6 16
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Faithful capture memory
Capture access at page granularity coarse but fast
regionto capture
protect static and currently allocatedprocess memory (procselfmaps)
intercept memory allocation functionswith LD_PRELOAD
1 allocate memory
2 protect memory and returnto user program
segmentationfault handler
1 dump accessed memory to disk
2 unlock accessed page and return to user program
a[i]++
memoryaccess
a = malloc(256)
memoryallocation
I Small dump footprint only touched pages are savedI Warmup cache replay trace of most recently touched pagesI NUMA detect first touch of each page
7 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
8 16
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Selecting Representative Codelets
I Key Idea Applications have redundanciesI Same codelet called multiple timesI Codelets sharing similar performance signatures
I Detect redundancies and keep only one representative
0e+00
2e+07
4e+07
6e+07
8e+07
0 1000 2000 3000invocation
Cyc
les
replay
Figure SPEC tonto make ftshell2F901133 execution trace 90of NAS codelets can be reduced to four or less representatives
9 16
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Performance Signature Clustering
Maqao
Likwid
Static amp DynamicProfiling Vectorization ratio
FLOPSsCache Misses[ ] Clustering
f1f2f3
Step A Perform static and dynamic analysis on a reference architecture to capture codelets feature vectors
BT
SP
Step B Using the proximity between feature vectors we cluster similar codelets and select one representative per cluster
Step C CERE extracts the representatives as standalone codelets A model extrapolates full benchmark results
Model
Sandy Bridge
Atom
Core2
FullBenchmarks
Results
[Oliveira Castro et al 2014]
10 16
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Codelet Based Architecture Selection
012 015
083 083
155 159
00
05
10
15
Atom Core 2 Sandy Bridge
Geo
met
ricm
ean
spee
dup
Real Speedup
Predicted Speedup
Refe
renc
e N
ehal
em
Figure Benchmarking NAS serial on three architectures
I real speedup when benchmarking original applicationsI predicted speedup predicted with representative codeletsI CERE 31times cheaper than running the full benchmarks
11 16
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Autotuning LLVM middle-end optimizations
I LLVM middle-end offers more than 50 optimization passesI Codelet replay enable per-region fast optimization tuning
Id of LLVM middleminusend optimization passes combination
Cyc
les
(Ivy
brid
ge 3
4G
Hz)
60e+07
80e+07
10e+08
12e+08
0 200 400 600
originalreplayO3
Figure NAS SP ysolve codelet 1000 schedules of random passescombinations explored based on O3 passes
CERE 149times cheaper than running the full benchmark( 27times cheaper when tuning codelets covering 75 of SP) 12 16
Fast Scalability Benchmarking with OpenMP Codelets
1 2 4 8 16 32Threads
0
1
2
3
4
5
6
Runti
me c
ycl
es
1e8 SP compute rhs on Sandy Bridge
RealPredicted
Core2 Nehalem Sandy Bridge Ivy Bridge
Accuracy 982 971 926 972Acceleration times 252 times 274 times 237 times 237
Figure Varying thread number at replay in SP and average results overOMP NAS [Popov et al 2015]
13 16
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
Fast Scalability Benchmarking with OpenMP Codelets
(Figure: real versus predicted runtime in cycles for SP compute_rhs on Sandy Bridge, from 1 to 32 threads.)

                Core 2   Nehalem   Sandy Bridge   Ivy Bridge
Accuracy         98.2%    97.1%     92.6%          97.2%
Acceleration    ×25.2    ×27.4     ×23.7          ×23.7

Figure: Varying thread number at replay in SP, and average results over the OMP NAS benchmarks [Popov et al 2015]
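A minimal sketch of what varying the thread count at replay amounts to, assuming a hypothetical run_codelet() entry point that re-executes the captured parallel region from its memory dump (this is not CERE's actual replay driver):

    #include <omp.h>
    #include <stdio.h>

    extern void run_codelet(void);      /* hypothetical replayed OpenMP region */

    int main(void) {
        for (int t = 1; t <= 32; t *= 2) {
            omp_set_num_threads(t);     /* retarget the replay to t threads */
            double start = omp_get_wtime();
            run_codelet();
            printf("%2d threads: %.3f s\n", t, omp_get_wtime() - start);
        }
        return 0;
    }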
13 16
Outline
Extracting and Replaying Codelets: Faithful, Retargetable
Applications: Architecture selection, Compiler flags tuning, Scalability prediction
Demo: Capture and replay in NAS BT, Simple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetable codelets
I Piece-wise autotuning
  I Different architectures
  I Compiler optimizations
  I Scalability
  I Other costly exploration and analysis
I Limitations
  I No support for codelets performing IO (OS state not captured)
  I Cannot explore source-level optimizations
  I Tied to LLVM
I Full accuracy reports on NAS and SPEC'06 FP available at benchmark-subsetting.github.io/cereReports
15 16
Thanks for your attention
C E R E
https://benchmark-subsetting.github.io/cere
Distributed under the LGPLv3
16 16
Bibliography I
Akel, Chadi et al. (2013). "Is Source-code Isolation Viable for Performance Characterization?" In: 42nd International Conference on Parallel Processing Workshops. IEEE.
Gao, Xiaofeng et al. (2005). "Reducing overheads for acquiring dynamic memory traces." In: Workload Characterization Symposium, 2005. Proceedings of the IEEE International. IEEE, pp. 46–55.
Liao, Chunhua et al. (2010). "Effective source-to-source outlining to support whole program empirical optimization." In: Languages and Compilers for Parallel Computing. Springer, pp. 308–322.
Oliveira Castro, Pablo de et al. (2014). "Fine-grained Benchmark Subsetting for System Selection." In: Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, p. 132.
Popov, Mihail et al. (2015). "PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction." In: Proceedings of the 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE.
17 16
Retargetable replay register state
I Issue: register state is not portable between architectures
I Solution: capture at a function call boundary
  I No shared state through registers except the function arguments
  I Get the arguments directly through portable IR code
I Register-agnostic capture
  I Portable across Atom, Core 2, Haswell, Ivy Bridge, Nehalem, Sandy Bridge
  I Preliminary portability tests between x86 and 32-bit ARM
18 16
Retargetable replay: outlining regions

Step 1: Outline the region to capture using the CodeExtractor pass.

Original region (in the enclosing function):

    %0 = load i32* %i, align 4
    %1 = load i32* %s.addr, align 4
    %cmp = icmp slt i32 %0, %1
    br i1 %cmp, label %for.body, label %for.exitStub   ; loop branch here

Outlined version:

    define internal void @outlined(i32* %i, i32* %s.addr, i32* %a.addr) {
      call void @start_capture(i32* %i, i32* %s.addr, i32* %a.addr)
      %0 = load i32* %i, align 4
      ...
      ret void
    }

    ; in the original function the region is replaced by
    call void @outlined(i32* %i, i32* %s.addr, i32* %a.addr)

Step 2: Call start_capture just after the function call.
Step 3: At replay, re-inlining and variable cloning [Liao et al 2010] steps ensure that the compilation context is close to the original.
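For intuition, a source-level sketch of what the outlining amounts to (CERE performs it on LLVM IR, not on C; the names and the start_capture stub are illustrative):

    #define N 1024

    /* In CERE this hook is provided by the capture runtime; stubbed here. */
    static void start_capture(int *i, int *s, int *a) { (void)i; (void)s; (void)a; }

    /* The extracted region: live-in variables become function arguments and
       start_capture records the state before the first access. */
    static void outlined(int *i, int *s, int a[N]) {
        start_capture(i, s, a);
        for (*i = 0; *i < *s; (*i)++)
            a[*i] += 1;                 /* original region body */
    }

    void original_function(int *i, int *s, int a[N]) {
        outlined(i, s, a);              /* the region is replaced by this call */
    }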
19 16
Comparison to other Code Isolating tools
                   CERE             Code Isolator   Astex        Codelet Finder   SimPoint
Support
  Language         C(++), Fortran   Fortran         C, Fortran   C(++), Fortran   assembly
  Extraction       IR               source          source       source           assembly
  Indirections     yes              no              no           yes              yes
Replay
  Simulator        yes              yes             yes          yes              yes
  Hardware         yes              yes             yes          yes              no
Reduction
  Capture size     reduced          reduced         reduced      full             -
  Instances        yes              manual          manual       manual           yes
  Code signature   yes              no              no           no               yes
20 16
Accuracy Summary NAS SER
(Figure: per-benchmark bars for the NAS-A serial suite (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell.)
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
21 16
Accuracy Summary SPEC FP 2006
(Figure: per-benchmark bars for SPEC FP 2006 (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, gemsfdtd, specrand) on Haswell.)
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error less than 15%.
I Low coverage (sphinx3, wrf, povray): regions < 2000 cycles or doing IO
I Low matching (soplex, calculix, gamess): warmup "bugs"
22 16
CERE page capture dump size
(Figure: dump size in MB for each NAS-A benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump of Codelet Finder to the page-granularity dump of CERE.)
Figure: Comparison between the page capture and the full dump size on NAS-A benchmarks. The CERE page-granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
23 16
CERE page capture overhead
CERE page capture is coarser but faster
        CERE    ATOM 325   PIN 171   Dyninst 40
cg.A    194     9882       22267     89686
ft.A    241     4422       12764     105470
lu.A    624     8072       15346     3014
mg.A    89      10769      16861     98953
sp.A    732     6756       9304      20366

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk and of tracing the page accesses during the whole execution). We compare to the overhead of the memory tracing tools reported by [Gao et al 2005].
24 16
CERE cache warmup
    for (i = 0; i < size; i++)
        a[i] += b[i];

(Diagram: the loop touches pages of array a[] (21, 22, 23, ...) and of array b[] (50, 51, 52, ...). CERE keeps a FIFO of the most recently unprotected pages; when a newly touched page enters the FIFO, the oldest page (20 in the example) is re-protected. The resulting sequence of page addresses is saved as the warmup page trace.)
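A minimal sketch of this page-protection mechanism under the scheme described above (illustrative only: the real implementation also dumps page contents, intercepts allocations, and replays the FIFO as the warmup trace):

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FIFO_SIZE 128
    static void *fifo[FIFO_SIZE];       /* most recently unprotected pages */
    static int fifo_next;

    /* First touch of a protected page raises SIGSEGV: unprotect the page,
       remember it in the FIFO, and re-protect the page it evicts. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        long pagesize = sysconf(_SC_PAGESIZE);
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1));
        /* ... dump the page contents to disk here ... */
        mprotect(page, (size_t)pagesize, PROT_READ | PROT_WRITE);
        if (fifo[fifo_next])            /* evict: re-protect the oldest page */
            mprotect(fifo[fifo_next], (size_t)pagesize, PROT_NONE);
        fifo[fifo_next] = page;
        fifo_next = (fifo_next + 1) % FIFO_SIZE;
    }

    static void install_page_capture(void) {
        struct sigaction sa = {0};
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = segv_handler;
        sigaction(SIGSEGV, &sa, NULL);
    }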
25 16
Test architectures
                   Atom     Core 2    Nehalem    Sandy Bridge   Ivy Bridge   Haswell
CPU                D510     E7500     L5609      E3-1240        i7-3770      i7-4770
Frequency (GHz)    1.66     2.93      1.86       3.30           3.40         3.40
Cores              2        2         4          4              4            4
L1 cache (KB)      2×56     2×64      4×64       4×64           4×64         4×64
L2 cache           2×512 KB 3 MB      4×256 KB   4×256 KB       4×256 KB     4×256 KB
L3 cache (MB)      -        -         12         8              8            8
RAM (GB)           4        4         8          6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Outline
Extracting and Replaying CodeletsFaithfulRetargetable
ApplicationsArchitecture selectionCompiler flags tuningScalability prediction
DemoCapture and replay in NAS BTSimple flag replay for NAS FT
Conclusion
14 16
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Conclusion
I CERE breaks an application into faithful and retargetablecodelets
I Piece-wise autotuningI Different architectureI Compiler optimizationsI ScalabilityI Other exploration costly analysis
I LimitationsI No support for codelets performing IO (OS state not captured)I Cannot explore source-level optimizationsI Tied to LLVM
I Full accuracy reports on NAS and SPECrsquo06 FP available atbenchmark-subsettinggithubiocereReports
15 16
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Thanks for your attention
C E R E
httpsbenchmark-subsettinggithubiocere
distributed under the LGPLv3
16 16
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Bibliography I
Akel Chadi et al (2013) ldquoIs Source-code Isolation Viable for PerformanceCharacterizationrdquo In 42nd International Conference on Parallel ProcessingWorkshops IEEE
Gao Xiaofeng et al (2005) ldquoReducing overheads for acquiring dynamicmemory tracesrdquo In Workload Characterization Symposium 2005Proceedings of the IEEE International IEEE pp 46ndash55
Liao Chunhua et al (2010) ldquoEffective source-to-source outlining to supportwhole program empirical optimizationrdquo In Languages and Compilers forParallel Computing Springer pp 308ndash322
Oliveira Castro Pablo de et al (2014) ldquoFine-grained Benchmark Subsettingfor System Selectionrdquo In Proceedings of Annual IEEEACM InternationalSymposium on Code Generation and Optimization ACM p 132
Popov Mihail et al (2015) ldquoPCERE Fine-grained Parallel BenchmarkDecomposition for Scalability Predictionrdquo In Proceedings of the 29th IEEEInternational Parallel and Distributed Processing Symposium IPDPS IEEE
17 16
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Retargetable replay register state
I Issue Register state is non-portable between architectures
I Solution capture at a function call boundaryI No shared state through registers except function argumentsI Get arguments directly through portable IR code
I Register agnostic capture
I Portable across Atom Core 2 Haswell Ivybridge NehalemSandybridge
I Preliminar portability tests between x86 and ARM 32 bits
18 16
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
Core2 Haswell
0
25
50
75
100
NASA
lu cg mg sp ft bt is ep lu cg mg sp ft bt is ep
o
f Exe
c T
ime
accurate replay codelet coverage
Figure The Coverage the percentage of the execution time captured bycodelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
21 16
Accuracy Summary SPEC FP 2006
Haswell
0
25
50
75
100SPEC
FP06
games
s
sphin
x3de
alII
wrf
povra
y
calcu
lixso
plex
leslie3
dton
to
gromac
s
zeus
mplbm milc
cactu
sADM
namd
bwave
s
gemsfd
td
spec
rand
o
f Exe
c T
ime
Figure The Coverage is the percentage of the execution time capturedby codelets The Accurate Replay is the percentage of execution timereplayed with an error less than 15
I low coverage (sphinx3 wrf povray) lt 2000 cycles or IO
I low matching (soplex calculix gamess) warmup ldquobugsrdquo
22 16
CERE page capture dump size
94
27
106
2651
1
367
117131
46
89
8
472
101 96
22
0
50
100
150
200
bt cg ep ft is lu mg sp
Size
(MB)
full dump (Codelet Finder) page granularity dump (CERE)
Figure Comparison between the page capture and full dump size onNASA benchmarks CERE page granularity dump only contains thepages accessed by a codelet Therefore it is much smaller than a fullmemory dump
23 16
CERE page capture overhead
CERE page capture is coarser but faster
CERE ATOM 325 PIN 171 Dyninst 40
cga 194 9882 22267 89686fta 241 4422 12764 105470lua 624 8072 15346 3014mga 89 10769 16861 98953spa 732 6756 9304 20366
Slowdown of a full capture run against the original application run(takes into account the cost of writing the memory dumps and logsto disk and of tracing the page accesses during the wholeexecution) We compare to the overhead of memory tracing toolsas reported by [Gao et al 2005]
24 16
CERE cache warmup
for (i=0 i lt size i++) a[i] += b[i]
array a[] pages array b[] pages
21 22 23 50 51 52
pages addresses
21 5051 20
46 17 47 18 48 19 49
22
Reprotect 20
memory
FIFO (most recently unprotected)
warmup page trace
25 16
Test architectures
Atom Core 2 Nehalem Sandy Bridge Ivy Bridge Haswell
CPU D510 E7500 L5609 E31240 i7-3770 i7-4770Frequency (GHz) 166 293 186 330 340 340Cores 2 2 4 4 4 4L1 cache (KB) 2times56 2times64 4times64 4times64 4times64 4times64L2 cache (KB) 2times512 3 MB 4times256 4times256 4times256 4times256L3 cache (MB) - - 12 8 8 8Ram (GB) 4 4 8 6 16 16
32 bits portability test ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation PatternDP 2 simultaneous reductions
DP MG Laplacian fine to coarse mesh transitionMP Dense Matrix x vector product
DP Vector multiply in ascdesc orderDP FFT butterfly computationDP 3 simultaneous reductions
SP Dense Matrix x vector productDP Laplacian finite difference constant coefficientsDP Vector multiply element wise in ascdesc order
MP First step FFTDP First order recurrenceDP First order recurrence
SP Dot product over lower half square matrixSP Addition on the diagonal elements of a matrix
DP Red Black Sweeps Laplacian operatorDP Vector divide element wise
DP Norm + Vector divideDP Sum of the absolute values of a matrix column
SP Sum of a square matrixSP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
DP Multiplying a matrix row by a scalarDP Linear combination of matrix rowsDP Substracting a vector with a vector
DP Sum of two square matrices element wiseDP Sum of the absolute values of a matrix row
DP Linear combination of matrix columnsDP Vector multiply element wise
27 16
Clustering NR Codelets
cut for K = 14
Codelettoeplz_1rstrct_29mprove_8toeplz_4realft_4toeplz_3svbksb_3
lop_13toeplz_2four1_2tridag_2tridag_1
ludcmp_4hqr_15
relax2_26svdcmp_14svdcmp_13
hqr_13hqr_12_sqjacobi_5hqr_12
svdcmp_11elmhes_11mprove_9
matadd_16svdcmp_6elmhes_10balanc_3
Computation Pattern
DP Vector divide element wiseDP Norm + Vector divide
DP Sum of the absolute values of a matrix columnSP Sum of a square matrix
SP Sum of the upper half of a square matrixSP Sum of the lower half of a square matrix
ReductionSums
Long LatencyOperations
Similar computation patterns
27 16
Capturing Architecture Change
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
28 16
Same Cluster = Same Speedup
LUerhsf 49
FTappftf 45
BTrhsf 266
SPrhsf 275
Cluster A triple-nestedhigh latency operations(div and exp)
Cluster B stencil on fiveplanes (memory bound)
Core 2rarr 293 GHzrarr 3 MB
Atomrarr 166 GHzrarr 1 MB
Nehalem (Ref)Freq 186 GHzLLC 12 MB
50
100
200
400
slower
50
100
200
400
8001500
3000
slower
40 50 60
slower
50
100
200
400faster
Reference Target
ms
ms
29 16
Feature Selection
I Genetic Algorithm train on Numerical Recipes + Atom +Sandy Bridge
I Validated on NAS + Core 2I The feature set is still among the best on NAS
Atom
Core 2
Sandy Bridge
1020304050
10203040
10203040
0 5 10 15 20 25Number of clusters
Med
ian
e
rror
Worst
Median
Best
GA features
30 16
Reduction Factor Breakdown
Reduction Total Reduced invocations Clustering
Atom 443 times12 times37Core 2 247 times87 times28
Sandy Bridge 225 times63 times36
Table Benchmarking reduction factor breakdown with 18representatives
31 16
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Retargetable replay outlining regions
Step 1 Outline the region to capture using CodeExtractor pass
original
0 = load i32 i align 4
1 = load i32 saddr align 4
cmp = icmp slt i32 0 1
br i1 cmp loop branch here
label forbody
label forexitStub
define internal void outlined(
i32 i i32 saddr
i32 aaddr)
call void start_capture(i32 i
i32 saddr i32 aaddr)
0 = load i32 i align 4
ret void
original
call void outlined(i32 i
i32 saddr i32 aaddr)
Step 2 Call start capture just after the function callStep 3 At replay reinlining and variable cloning [Liao et al 2010]steps ensure that the compilation context is close to original
19 16
Comparison to other Code Isolating tools
CERECodeIsolator
Astex CodeletFinder
SimPoint
SupportLanguage C(++)
Fortran Fortran C Fortran C(++)
Fortranassembly
Extraction IR source source source assemblyIndirections yes no no yes yes
ReplaySimulator yes yes yes yes yesHardware yes yes yes yes no
ReductionCapture size reduced reduced reduced full -
Instances yes manual manual manual yesCode sign yes no no no yes
20 16
Accuracy Summary NAS SER
[Bar chart: percentage of execution time per NAS A benchmark (lu, cg, mg, sp, ft, bt, is, ep) on Core 2 and Haswell, showing accurate replay and codelet coverage.]
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error of less than 15%.
21 16
Accuracy Summary SPEC FP 2006
[Bar chart: percentage of execution time per SPEC FP 2006 benchmark (gamess, sphinx3, dealII, wrf, povray, calculix, soplex, leslie3d, tonto, gromacs, zeusmp, lbm, milc, cactusADM, namd, bwaves, GemsFDTD, specrand) on Haswell, showing accurate replay and codelet coverage.]
Figure: The Coverage is the percentage of the execution time captured by codelets. The Accurate Replay is the percentage of execution time replayed with an error of less than 15%.
I low coverage (sphinx3, wrf, povray): < 2000 cycles or I/O
I low matching (soplex, calculix, gamess): warmup "bugs"
22 16
CERE page capture dump size
[Bar chart: dump size in MB (0 to 200) per NAS A benchmark (bt, cg, ep, ft, is, lu, mg, sp), comparing the full dump (Codelet Finder) with the page granularity dump (CERE).]
Figure: Comparison between the page capture and full dump size on NAS A benchmarks. The CERE page granularity dump only contains the pages accessed by a codelet; therefore it is much smaller than a full memory dump.
23 16
CERE page capture overhead
CERE page capture is coarser but faster
           CERE    ATOM 3.25   PIN 1.71   Dyninst 4.0
cg.a       1.94    98.82       222.67     896.86
ft.a       2.41    44.22       127.64     1054.70
lu.a       6.24    80.72       153.46     30.14
mg.a       0.89    107.69      168.61     989.53
sp.a       7.32    67.56       93.04      203.66

Slowdown of a full capture run against the original application run (this takes into account the cost of writing the memory dumps and logs to disk and of tracing the page accesses during the whole execution). We compare to the overhead of memory tracing tools as reported by [Gao et al 2005].
24 16
CERE cache warmup
for (i = 0; i < size; i++) a[i] += b[i];
[Diagram: array a[] spans pages 21, 22, 23, ... and array b[] spans pages 50, 51, 52, ... Each page touched by the codelet is unprotected and its address enters a FIFO of the most recently unprotected pages; when the FIFO is full, the oldest page (page 20 in the example) is re-protected. The FIFO contents form the warmup page trace.]
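A self-contained sketch of this mechanism is given below (the FIFO depth, names, and overall structure are assumptions, not the actual CERE runtime): each first touch of a protected page faults, the handler unprotects the page and records it in a FIFO of the most recently touched pages, and the page evicted from a full FIFO is re-protected so the trace stays recent:

  #include <csignal>
  #include <cstdint>
  #include <cstdio>
  #include <sys/mman.h>
  #include <unistd.h>

  static const std::size_t kFifoDepth = 16;   // assumed depth of the warmup FIFO
  static void *fifo[kFifoDepth];
  static std::size_t fifo_head = 0, fifo_len = 0;
  static long page_size;

  static void onFault(int, siginfo_t *si, void *) {
    void *page = reinterpret_cast<void *>(
        reinterpret_cast<std::uintptr_t>(si->si_addr) &
        ~std::uintptr_t(page_size - 1));
    if (fifo_len == kFifoDepth) {             // FIFO full: re-protect oldest page
      mprotect(fifo[fifo_head], page_size, PROT_NONE);
      fifo_head = (fifo_head + 1) % kFifoDepth;
      --fifo_len;
    }
    fifo[(fifo_head + fifo_len) % kFifoDepth] = page;  // warmup page trace entry
    ++fifo_len;
    mprotect(page, page_size, PROT_READ | PROT_WRITE); // let the access proceed
  }

  int main() {
    page_size = sysconf(_SC_PAGESIZE);
    struct sigaction sa = {};
    sa.sa_sigaction = onFault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);

    // Two arrays standing in for the codelet's working set, protected up front.
    static char a[1 << 20] __attribute__((aligned(4096)));
    static char b[1 << 20] __attribute__((aligned(4096)));
    mprotect(a, sizeof a, PROT_NONE);
    mprotect(b, sizeof b, PROT_NONE);

    for (std::size_t i = 0; i < sizeof a; ++i)  // the codelet: a[i] += b[i]
      a[i] += b[i];
    std::printf("pages currently in the warmup FIFO: %zu\n", fifo_len);
    return 0;
  }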
25 16
Test architectures
                 Atom     Core 2   Nehalem   Sandy Bridge   Ivy Bridge   Haswell
CPU              D510     E7500    L5609     E3-1240        i7-3770      i7-4770
Frequency (GHz)  1.66     2.93     1.86      3.30           3.40         3.40
Cores            2        2        4         4              4            4
L1 cache (KB)    2×56     2×64     4×64      4×64           4×64         4×64
L2 cache (KB)    2×512    3 MB     4×256     4×256          4×256        4×256
L3 cache (MB)    -        -        12        8              8            8
RAM (GB)         4        4        8         6              16           16

32-bit portability test: ARM1176JZF-S on a Raspberry Pi Model B+
26 16
Clustering NR Codelets
cut for K = 14
Codelet      Computation Pattern
toeplz_1     DP: 2 simultaneous reductions
rstrct_29    DP: MG Laplacian fine to coarse mesh transition
mprove_8     MP: Dense matrix x vector product
toeplz_4     DP: Vector multiply in asc/desc order
realft_4     DP: FFT butterfly computation
toeplz_3     DP: 3 simultaneous reductions
svbksb_3     SP: Dense matrix x vector product
lop_13       DP: Laplacian finite difference, constant coefficients
toeplz_2     DP: Vector multiply element-wise in asc/desc order
four1_2      MP: First step FFT
tridag_2     DP: First order recurrence
tridag_1     DP: First order recurrence
ludcmp_4     SP: Dot product over lower half square matrix
hqr_15       SP: Addition on the diagonal elements of a matrix
relax2_26    DP: Red-black sweeps Laplacian operator
svdcmp_14    DP: Vector divide element-wise
svdcmp_13    DP: Norm + vector divide
hqr_13       DP: Sum of the absolute values of a matrix column
hqr_12_sq    SP: Sum of a square matrix
jacobi_5     SP: Sum of the upper half of a square matrix
hqr_12       SP: Sum of the lower half of a square matrix
svdcmp_11    DP: Multiplying a matrix row by a scalar
elmhes_11    DP: Linear combination of matrix rows
mprove_9     DP: Subtracting a vector from a vector
matadd_16    DP: Sum of two square matrices element-wise
svdcmp_6     DP: Sum of the absolute values of a matrix row
elmhes_10    DP: Linear combination of matrix columns
balanc_3     DP: Vector multiply element-wise
27 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Profiling Features
SP
codelets
Likwid
Maqao
Performance counters per codelet
Static dissassembly and analysis
4 dynamic features
8 static features
Bytes Stored cycleStallsEstimated IPCNumber of DIVNumber of SDPressure in P1Ratio ADD+SUBMULVectorization (FPFP+INTINT)
FLOPSL2 BandwidthL3 Miss RateMem Bandwidth
32 16
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-
Clang OpenMP front-end
void main() pragma omp parallel int p = omp_get_thread_num() printf(dp)
C code
Clang OpenMPfront end
define i32 main() entrycall __kmpc_fork_call omp_microtask()
define internal void omp_microtask() entry p = alloca i32 align 4 call = call i32 omp_get_thread_num() store i32 call i32 p align 4 1 = load i32 p align 4 call printf(1)
LLVM simplified IR Thread execution model
33 16
- Extracting and Replaying Codelets
-
- Faithful
- Retargetable
-
- Applications
-
- Architecture selection
- Compiler flags tuning
- Scalability prediction
-
- Demo
-
- Capture and replay in NAS BT
- Simple flag replay for NAS FT
-
- Conclusion
- Appendix
-