The ECM (Execution-Cache-Memory)
Performance Model
J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop “Memory issues on Multi- and Manycore Platforms” at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science Volume 6067, 2010, pp 615-624. DOI: 10.1007/978-3-642-14390-8_64.
G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908
H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: arXiv:1410.5010
Assumptions and shortcomings of the roofline model
Assumes one of two bottlenecks
1. In-core execution
2. Bandwidth of a single hierarchy level
Latency effects are not modeled: pure data streaming is assumed
In-core execution is sometimes hard to model
Saturation effects in multicore chips are not explained
ECM model gives more insight
A(:)=B(:)+C(:)*D(:)
Roofline predicts full socket BW
ECM model (c) RRZE 2014 2
The Execution-Cache-Memory (ECM) model
ECM Model
ECM = “Execution-Cache-Memory”
Observations:
Single-core execution time is not the maximum of
1. In-core execution
2. Data transfers through a single bottleneck
Data transfers may or may not overlap with each other or with in-core execution
Scaling is linear until the relevant bottleneck is reached
ECM model Input:
Same as for Roofline
+ data transfer times in hierarchy
Example: Schönauer Vector Triad in L2 cache
REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision
Analysis for Sandy Bridge core w/ AVX (unit of work: 1 cache line)
Machine characteristics:
  Registers - L1: 1 LD/cy + 0.5 ST/cy
  L1 - L2: 32 B/cy (2 cy/CL)
  Arithmetic: 1 ADD/cy + 1 MULT/cy
Triad analysis (per CL):
  Registers - L1: 6 cy/CL
  L1 - L2 (LD, WA and ST transfers): 10 cy/CL
  Arithmetic (AVX): 2 cy/CL
Timeline: [figure: LD, ST, ADD, MULT and L1-L2 transfer slots per cycle]
Work: 16 F/CL (AVX)
Roofline prediction: 16/10 F/cy
Measurement: 16 F / ≈17 cy
Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX
[Figure: data transfer diagram. Legend: CL transfer, write-allocate CL transfer]
Testing different overlap hypotheses
Results suggest no overlap!
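The comparison of hypotheses comes down to a few lines of arithmetic. The sketch below is only an illustration, not the authors' tooling: it takes the per-cache-line cycle counts from the vector triad analysis (6 cy for L1 load/store, 2 cy for arithmetic, 10 cy for L1-L2 transfers) and evaluates a full-overlap and a no-overlap hypothesis. The function and macro names are made up for this example.

```c
/* Cycles per cache line (CL) for the vector triad on one Sandy Bridge core,
   taken from the triad analysis (data in L2, AVX). */
#define T_CORE_LDST 6.0   /* registers <-> L1 loads/stores */
#define T_ARITH     2.0   /* AVX ADD and MULT              */
#define T_L1L2     10.0   /* L1 <-> L2 cache line transfers */

static double max2(double a, double b) { return a > b ? a : b; }

/* Hypothesis 1: all contributions overlap (the Roofline assumption). */
double triad_l2_full_overlap(void)
{
    return max2(max2(T_CORE_LDST, T_ARITH), T_L1L2);
}

/* Hypothesis 2: L1 load/store time and L1-L2 transfers are serialized;
   only arithmetic overlaps with data movement. */
double triad_l2_no_overlap(void)
{
    return max2(T_ARITH, T_CORE_LDST + T_L1L2);
}
```

With these inputs the full-overlap hypothesis yields 10 cy/CL and the no-overlap hypothesis yields 16 cy/CL. The measured value of about 17 cy/CL is much closer to the latter.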
Multicore scaling in the ECM model
Identify relevant bandwidth bottlenecks
L3 cache
Memory interface
Scale single-thread performance until first bottleneck is hit:
$n$ threads: $P(n) = \min(n P_0, I \cdot b_S)$
Example: Scalable L3 on Sandy Bridge
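The scaling rule can be written down directly in code. This is a minimal sketch under the assumption that the single-core performance P0, the intensity I and the saturated bandwidth bS are already known; the function names are hypothetical and the units only need to be consistent (for example flop/s, flop/byte and byte/s).

```c
/* ECM multicore scaling: P(n) = min(n * P0, I * bS).
   p0        : single-core performance
   intensity : work per byte of traffic over the bottleneck (I)
   bs        : saturated bandwidth of the bottleneck (bS) */
double ecm_multicore_perf(int n, double p0, double intensity, double bs)
{
    double linear = n * p0;          /* perfect scaling            */
    double roof = intensity * bs;    /* bandwidth-limited ceiling  */
    return linear < roof ? linear : roof;
}

/* Smallest core count at which the bandwidth ceiling is reached.
   Assumes p0 > 0. */
int ecm_saturation_cores(double p0, double intensity, double bs)
{
    int n = 1;
    while (n * p0 < intensity * bs)
        ++n;
    return n;
}
```

For illustrative inputs P0 = 1.0, I = 0.25 and bS = 10.0, performance grows as 1.0, 2.0 and then stays at 2.5 from three cores onward.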
ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:) on a Sandy Bridge socket (no-overlap assumption)
Model: Scales until saturation sets in
Saturation point (# cores) well predicted
Measurement: scaling not perfect
Caveat: This is specific for this architecture and this benchmark!
Check: Use “overlappable” kernel code
ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:) on a Sandy Bridge socket (full overlap assumption)
In-core execution is dominated by the divide operation (44 cycles with AVX, 22 scalar)
Almost perfect agreement with the ECM model
General observation:
If the L1 cache is 100% occupied by LD, there is no overlap throughout the hierarchy
If there is “slack” at the L1, there is overlap in the hierarchy
Example 1: A 2D Jacobi stencil in DP with SSE2 on Sandy Bridge
Example 1: 2D Jacobi in DP with SSE2 on SNB
Instruction count: 13 LOAD, 4 STORE, 12 ADD, 4 MUL
4-way unrolling: 8 LUP / iteration
Code characteristics
(SSE instructions per iteration)
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
Processor characteristics
(SSE instructions per cycle)
- 2 LOAD || (1 LOAD + 1 STORE)
- 1 ADD
- 1 MUL
Pipeline schedule (one column per cycle):
LOAD:  LD LD LD LD 2LD 2LD 2LD 2LD L
STORE: ST ST ST ST
ADD:   + + + + + + + + + + + +
MUL:   * * * *
core execution: 12 cy
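The core time follows from the instruction counts and the per-cycle issue limits listed above. The following sketch assumes a simplified port model in which every cycle issues either two loads or one load plus one store, alongside one ADD and one MUL; the function names are hypothetical.

```c
/* Cycles spent on loads and stores when each cycle can issue either
   2 LOADs or 1 LOAD + 1 STORE. Every store cycle carries one load;
   the remaining loads are issued two per cycle. */
double sse_ldst_cycles(int loads, int stores)
{
    if (loads <= stores)
        return (double)stores;
    return stores + (loads - stores) / 2.0;
}

/* Core time: the slowest of the load/store, ADD and MUL pipelines,
   assuming 1 ADD and 1 MUL per cycle. */
double sse_core_cycles(int loads, int stores, int adds, int muls)
{
    double t = sse_ldst_cycles(loads, stores);
    if (adds > t) t = adds;
    if (muls > t) t = muls;
    return t;
}
```

For the 2D Jacobi instruction mix (13 LOAD, 4 STORE, 12 ADD, 4 MUL) this gives 8.5 cy for loads and stores and 12 cy overall, so the ADD pipeline limits in-core execution.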
Situation 1: Data set fits into L1 cache
ECM prediction:
(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s
Measurement: 2.2 GLUP/s
Situation 2: Data set fits into L2 cache (not into L1)
3 additional transfer streams from L2 to L1 (data delay)
Prediction:
(8 LUP / (12+6) cy) * 3.5 GHz ≈ 1.56 GLUP/s
Measurement: 1.9 GLUP/s
Overlap?
[Figure: timeline with 12 cy core execution and 6 cy L2-L1 transfers (streams t0, RFO, t1)]
core execution: 12 cycles
L1 is “single ported”: no overlap during LD/ST
LOAD bottleneck: 8.5 cy
L2 delay: 6 cycles
ECM prediction w/ overlap:
(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s
Measurement: 1.9 GLUP/s
“If the model fails, we learn something”
ECM model – the rules
1. LOADs in the L1 cache do not overlap with any other data transfer in the memory hierarchy
2. Everything else in the core overlaps perfectly with data transfers
3. The scaling limit is set by the ratio (# cycles per CL overall) / (# cycles per CL at the bottleneck)
4. The Roofline Model is recovered when assuming full overlap of all contributions
Contributions (time [cy]):
LOAD: 6 cy
L2-L1: 9 cy
L3-L2: 9 cy
MEM-L3: 19 cy
Overlapping with data transfers: STORE, ADD, MULT, …
Example:
Single-core (data in L1): 8 cy (ADD)
Single-core (data in memory):
6+9+9+19 cy = 43 cy
Scaling limit: 43 / 19 = 2.3 cores
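The rules translate into a short calculation. The sketch below is a minimal illustration with hypothetical function names; it treats LOADs as the only non-overlapping core contribution (rule 1) and everything else in the core as overlapping (rule 2).

```c
/* Single-core ECM prediction in cycles per unit of work.
   t_ol      : core time that overlaps with data transfers (ADD, MULT, ...)
   t_nol     : non-overlapping core time (LOADs in L1)
   transfers : transfer times between adjacent levels, innermost first
   n_levels  : number of transfers the data has to pass through */
double ecm_predict(double t_ol, double t_nol,
                   const double *transfers, int n_levels)
{
    double t_data = t_nol;
    for (int i = 0; i < n_levels; ++i)
        t_data += transfers[i];
    return t_ol > t_data ? t_ol : t_data;
}

/* Rule 3: whole number of cores needed to reach the bottleneck,
   i.e. the ratio t_total / t_bottleneck rounded up. */
int ecm_cores_to_saturate(double t_total, double t_bottleneck)
{
    int n = 1;
    while (n * t_bottleneck < t_total)
        ++n;
    return n;
}
```

With t_ol = 8 cy, t_nol = 6 cy and transfers {9, 9, 19} cy, the prediction is 8 cy for data in L1 and 43 cy for data in memory. The ratio 43 / 19 = 2.3 means that three cores are needed to saturate the memory interface.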
[Timeline figure labels: 8 cy, 3 cy, 43 cy, 4 cy]
ECM model – notation
Core time = overlapping and non-overlapping contributions
ECM prediction = maximum of overlapping time and sum of all other contributions
Convenient shorthand notation for contributions:
Example from prev. slide:
Predictions for data in different memory hierarchy levels:
Experimental data (measured) notation:
Saturation assumption for memory bottleneck:
ECM Model for DAXPY (AVX) on SNB 2.7 GHz (phinally)
Loop:
Contributions:
Predictions:
ECM Model and measurements for array sum on SNB 2.7 GHz (phinally)
Loop:
Naive = scalar, no unrolling (full 3 cy penalty per ADD)
ECM Model and measurements for 2D Jacobi (AVX) on SNB 2.7 GHz (phinally)
Loop:
LC = layer condition satisfied in
Jacobi 2D impact of inner loop blocking on SNB (phinally)
Jacobi 2D: Why outer loop blocking?
Extra data prefetched from memory at block boundaries
Kahan dot product
Goal: Compute large sums (many operands) with controlled numerical error
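/* GCC-specific attribute: keeps the compiler from vectorizing this scalar version. */
__attribute__((optimize("no-tree-vectorize")))
void ddot_kahan_scalar_comp(
    int N, const double* a, const double* b, double* r)
{
    int i;
    double sum = 0.0;   /* running sum */
    double c = 0.0;     /* running compensation for lost low-order digits */
    for (i=0; i<N; ++i) {
        double prod = a[i]*b[i];
        double y = prod-c;      /* apply the correction from the previous step */
        double t = sum+y;       /* low-order digits of y may be lost here */
        c = (t-sum)-y;          /* recover the lost part; must be evaluated as written */
        sum = t;
    }
    (*r) = sum;
}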
Example (from Wikipedia)
6-digit FP, initial sum = 10000.0, adding 3.14159 and 2.71828
y = 3.14159 - 0                      y = input[i] - c
t = 10000.0 + 3.14159
  = 10003.1                          Many digits have been lost!
c = (10003.1 - 10000.0) - 3.14159    This must be evaluated as written!
  = 3.10000 - 3.14159                Assimilated part of y recovered, vs. full y.
  = -.0415900
sum = 10003.1                        Inaccurate result

On the next step, c gives the error.
y = 2.71828 - (-.0415900)            Shortfall from previous stage included.
  = 2.75987                          It is of a size similar to y: most digits meet.
t = 10003.1 + 2.75987                But few meet the digits of sum.
  = 10005.85987, rounds to 10005.9
c = (10005.9 - 10003.1) - 2.75987    This extracts whatever went in.
  = 2.80000 - 2.75987                In this case, too much.
  = .040130                          The excess would be subtracted off next time.
sum = 10005.9                        Exact result is 10005.85987; this is correctly rounded to 6 digits.
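The same effect can be reproduced in double precision. The sketch below is an illustration with made-up test data (one term of 1.0 followed by 1000 terms of 1e-16) and hypothetical function names; it assumes IEEE 754 double arithmetic and a build without value-changing floating-point optimizations such as -ffast-math, since those may remove the compensation.

```c
#include <stddef.h>

/* Plain dot product: terms far below the running sum are rounded away. */
double dot_naive(size_t n, const double *a, const double *b)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Kahan-compensated dot product, same recurrence as the scalar kernel. */
double dot_kahan(size_t n, const double *a, const double *b)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double y = a[i] * b[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   /* must be evaluated as written */
        sum = t;
    }
    return sum;
}

#define DEMO_N 1001

/* Made-up test data: a = {1.0, 1e-16, 1e-16, ...}, b = {1, 1, ...}. */
static void fill_demo(double *a, double *b)
{
    a[0] = 1.0;
    b[0] = 1.0;
    for (size_t i = 1; i < DEMO_N; ++i) {
        a[i] = 1e-16;
        b[i] = 1.0;
    }
}

double demo_naive(void)
{
    double a[DEMO_N], b[DEMO_N];
    fill_demo(a, b);
    return dot_naive(DEMO_N, a, b);
}

double demo_kahan(void)
{
    double a[DEMO_N], b[DEMO_N];
    fill_demo(a, b);
    return dot_kahan(DEMO_N, a, b);
}
```

Each 1e-16 term is smaller than half the spacing of doubles near 1.0, so the naive sum never moves away from 1.0. The compensated version accumulates the lost parts in c and ends up close to 1.0 + 1e-13.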
ECM Model and measurements on Emmy (IVB 2.2 GHz, 3 cy/CL from memory)
Standard DP ddot:
Scalar:
AVX:
Kahan ddot:
Scalar:
AVX:
Conclusion: DP Kahan ddot saturates even in scalar mode
SP Kahan will not saturate
Performance Modeling of Stencil Codes
Applying the ECM model to stencil updates:
- 3D Jacobi smoother (DP, AVX)
- Long-range stencil (SP, AVX)
(H. Stengel, RRZE)
Example 2: A 3D Jacobi smoother with AVX vectorization on an Intel Ivy Bridge processor
Jacobi 3D Manual Analysis
Operation   Operation count (1 LUP)   Cycle count (4x unroll + AVX = 16 LUP)
MUL         1                         4
ADD         5                         20
LOAD        6                         24
STORE       1                         8
Interlude: Intel Architecture Code Analyzer (IACA)
Performs architecture-specific code analysis
Prerequisite: Mark start and end of dominant work loop
In high-level code (documented)
In assembly code (see iacaMarks.h)
Does not influence code optimization (e.g. vectorization)
Assembly loop might perform multiple updates per iteration (unrolling, SIMD)
Important reports (throughput mode):
Block throughput: runtime of one loop iteration (→ core time)
Throughput bottleneck: limiting resource for code execution
Port pressure: dominant pipeline port
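Marking a loop in high-level code could look roughly like the sketch below. It assumes that iacaMarks.h from the IACA distribution provides the IACA_START and IACA_END macros and that the start marker goes at the top of the loop body with the end marker after the loop. The USE_IACA switch, the empty fallback macros and the jacobi_row kernel are inventions of this example so that the code also compiles without IACA installed.

```c
#ifdef USE_IACA
#include "iacaMarks.h"      /* ships with IACA; defines IACA_START / IACA_END */
#else
#define IACA_START          /* fallback for this sketch: markers expand to nothing */
#define IACA_END
#endif

/* One row of a 2D five-point Jacobi sweep (hypothetical example kernel).
   Updates the interior points 1 .. n-2 of the row. */
void jacobi_row(int n, const double *up, const double *mid,
                const double *down, double *out)
{
    for (int i = 1; i < n - 1; ++i) {
        IACA_START          /* start of the analyzed loop body */
        out[i] = 0.25 * (mid[i - 1] + mid[i + 1] + up[i] + down[i]);
    }
    IACA_END                /* end of the analyzed loop */
}
```

The object file built with the markers enabled is then passed to the analyzer. Since the compiler may unroll and vectorize the loop, the reported cycles refer to one assembly iteration, which can cover several updates.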
16 updates (4x unroll + AVX) = 2 cache lines per loop iteration
#pragma vector aligned
Jacobi 3D ECM
Times [cy] for 8 LUP (DP) = 1 CL update = 0.5 loop iterations (ASM) = 0.5 * IACA output
IACA throughput: 24.1 cy / 16 LUP
Non-LD/ST time:
  ADD: 10 cy
  MUL: 2 cy
  Reg-Reg: 6 cy
  FrontEnd stalls: 0.5 * (24.1 - 24) = 0.05 cy
Stores: 4 cy
Data transfers:
  L1-REG (LD): 12 cy
  L2-L1: 10 cy
  L3-L2: 10 cy
  M-L3: 12 cy
  Sum: 44 cy
Single-core performance: 3.0 GHz / (44 cy / 8 LUP) = 545 MLUP/s
Measurement (N=400): 542 MLUP/s (~44 cy)
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz Memory Bandwidth 47 GB/s
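The conversion from IACA cycle counts to a performance number can be sketched as follows. This is an illustration with hypothetical helper names; the inputs are the values quoted on this slide (24 cy of LOAD time per 16 LUP, the four data transfer contributions, 3.0 GHz clock).

```c
/* Scale a cycle count reported per assembly loop iteration to the chosen
   unit of work (here: one cache line of updates). */
double cycles_per_cl(double cycles_per_iter, double lups_per_iter,
                     double lups_per_cl)
{
    return cycles_per_iter * lups_per_cl / lups_per_iter;
}

/* Sum of the non-overlapping data transfer contributions in cycles:
   L1-REG + L2-L1 + L3-L2 + M-L3 for the 3D Jacobi smoother. */
double jacobi3d_ecm_cycles(void)
{
    return 12.0 + 10.0 + 10.0 + 12.0;
}

/* Performance in MLUP/s for a clock frequency given in GHz. */
double mlups(double f_ghz, double cycles, double lups)
{
    return f_ghz * 1000.0 * lups / cycles;
}
```

Here cycles_per_cl(24.0, 16.0, 8.0) reproduces the 12 cy of LOAD time per cache line, and mlups(3.0, jacobi3d_ecm_cycles(), 8.0) gives about 545 MLUP/s, in line with the measured 542 MLUP/s.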
Socket Scaling
Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz Memory Bandwidth 47 GB/s
Example 3: 3D long-range stencil in single precision with AVX on Sandy Bridge
Example 3: 3D long-range stencil in SP with AVX on SNB
Core execution
4 neighbors per direction
Operations per update (code)
27 LOAD (25 V, 1 ROC, 1 U)
1 STORE (U)
26 ADD
15 MUL
Core time & actual LOAD count: IACA
Collaboration with D. Keyes & T. Malas (KAUST)
IACA example output – Core execution
AVX vectorization, no unrolling: one iteration updates 8 SP (float) elements
Multiply all numbers by 2 to get the time for updating 1 cache line (16 floats)
128-bit loads
Data transfer: LOAD ports REG – L1: 2 * 30.5 cy = 61 cy
Core execution time (16 LUP) = 2 * 34.25 cy = 68.5 cy
Example 3: Data delay
Problem size: 260^3 (single precision); all times in cy/CL
Spatial blocking: layer condition at L3 and row condition in L1: OK
L1-REG: 61 cy (from IACA analysis)
L2-L1: 24 cy
L3-L2: 24 cy
  8 LOADS to V can be served directly by L3 cache + 1 from main memory
M-L3: 17 cy
  Minimum data transfer to main memory: 4 WORD/LUP (LD: U, V, ROC; ST: U)
  MemBW = 40 GB/s
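The 17 cy memory contribution can be checked from the minimum traffic and the memory bandwidth. The sketch below is a hedged illustration with a hypothetical function name; it assumes 4-byte single-precision words, 16 LUP per cache line of results and the 2.7 GHz clock used in the single-core performance estimate.

```c
/* Cycles needed to move the memory traffic for one cache line of updates.
   bytes_per_lup : memory traffic per lattice update in bytes
   lups_per_cl   : updates per cache line of results
   bw_gbs        : memory bandwidth in GB/s (10^9 bytes per second)
   f_ghz         : core clock in GHz */
double mem_cycles_per_cl(double bytes_per_lup, double lups_per_cl,
                         double bw_gbs, double f_ghz)
{
    double bytes = bytes_per_lup * lups_per_cl;   /* traffic per CL of updates */
    double bytes_per_cycle = bw_gbs / f_ghz;      /* bandwidth in bytes/cycle  */
    return bytes / bytes_per_cycle;
}
```

For 4 words of 4 bytes per update, mem_cycles_per_cl(16.0, 16.0, 40.0, 2.7) evaluates to roughly 17.3 cy, consistent with the 17 cy shown for the M-L3 transfer.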
Example 3: Putting it all together
Core execution (Non-LD/ST cycles):
  ADD: 52 cy
  MULT: 38 cy
  Reg-Reg transfers: 48 cy
  FrontEnd stalls overlap: (68.5 - 61) cy = 7.5 cy
Stores: 4 cy
Data delay:
  L1-REG (Load): 61 cy
  L2-L1: 24 cy
  L3-L2: 24 cy
  M-L3: 17 cy
  Sum: 126 cy
IACA throughput: 68.5 cy / CL (SP)
Single-core performance (ECM Model): 2.7 GHz / (126 cy / 16 LUP) = 343 MLUP/s
Measurement: 320 MLUP/s
optimization target!
temporal blocking useless!
Socket scaling
memory bandwidth limit
ECM model: Conclusions & outlook
Saturation effects are ubiquitous; understanding them gives us the opportunity to
Find out about optimization opportunities
Save energy by letting cores idle (see power model later on)
Put idle cores to better use: communication, functional decomposition
Simple models work best. Do not try to complicate things unless it is really necessary!
Possible extensions to the ECM model
Accommodate latency effects
Model simple “architectural hazards”