
The ECM (Execution-Cache-Memory) Performance Model

J. Treibig and G. Hager: Introducing a Performance Model for Bandwidth-Limited Loop Kernels. Proceedings of the Workshop “Memory issues on Multi- and Manycore Platforms” at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics, Wroclaw, Poland, September 13-16, 2009. Lecture Notes in Computer Science Volume 6067, 2010, pp 615-624. DOI: 10.1007/978-3-642-14390-8_64

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908

H. Stengel, J. Treibig, G. Hager, and G. Wellein: Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model. Submitted. Preprint: arXiv:1410.5010


Assumptions and shortcomings of the roofline model

Assumes one of two bottlenecks

1. In-core execution

2. Bandwidth of a single hierarchy level

Latency effects are not modeled; pure data streaming is assumed

In-core execution is sometimes hard to model

Saturation effects in multicore chips are not explained

ECM model gives more insight

A(:)=B(:)+C(:)*D(:)

Roofline predicts full socket BW

ECM model (c) RRZE 2014 2

The Execution-Cache-Memory (ECM) model

ECM Model

ECM = “Execution-Cache-Memory”

Observations:

Single-core execution time is not the maximum of

1. In-core execution

2. Data transfers through a single bottleneck

Data transfers may or may not overlap with each other or with in-core execution

Scaling is linear until the relevant bottleneck is reached

Input:

Same as for Roofline

+ data transfer times in hierarchy


Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision

Analysis for Sandy Bridge core w/ AVX (unit of work: 1 cache line)

Machine characteristics:

Registers <-> L1: 1 LD/cy + 0.5 ST/cy

L1 <-> L2: 32 B/cy (2 cy/CL)

Arithmetic: 1 ADD/cy + 1 MULT/cy

Triad analysis (per CL):

Registers <-> L1: 6 cy/CL

L1 <-> L2: 10 cy/CL

Arithmetic: AVX: 2 cy/CL

[Timeline figure: LD and ST/2 slots at the L1, ADD and MULT slots in the core, and the L1 <-> L2 cache line transfers (LD, WA, ST)]

16 F/CL (AVX)

Roofline prediction: 16/10 F/cy

Measurement: 16F / ≈17cy

Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX

[Figure legend: CL transfer; write-allocate CL transfer]

Testing different overlap hypotheses


Results suggest no overlap!
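The comparison behind this statement can be reproduced with a few lines of arithmetic. The following is a minimal sketch in C that uses the per-cache-line cycle counts from the triad analysis (2 cy arithmetic, 6 cy at the L1, 10 cy between L1 and L2). The function names are illustrative only and do not belong to any tool, and treating arithmetic as hidden behind the data transfers in the no-overlap case is an assumption of this sketch.

```c
/* Largest of three cycle counts. */
static double max3(double a, double b, double c)
{
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* Full-overlap hypothesis (what Roofline assumes):
   the slowest single contribution sets the runtime per cache line. */
double cycles_full_overlap(double t_arith, double t_l1, double t_l2)
{
    return max3(t_arith, t_l1, t_l2);
}

/* No-overlap hypothesis: the L1 and L2 transfer times add up.
   Assumption of this sketch: arithmetic hides behind the transfers. */
double cycles_no_overlap(double t_arith, double t_l1, double t_l2)
{
    double t_data = t_l1 + t_l2;
    return t_arith > t_data ? t_arith : t_data;
}
```

With the triad numbers, `cycles_full_overlap(2, 6, 10)` gives 10 cy/CL and `cycles_no_overlap(2, 6, 10)` gives 16 cy/CL. The measured ≈17 cy/CL is much closer to the second value.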


Multicore scaling in the ECM model

Identify relevant bandwidth bottlenecks

L3 cache

Memory interface

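Scale single-thread performance until first bottleneck is hit:

$n$ threads: $P(n) = \min(n P_0,\; I \cdot b_S)$

($P_0$: single-thread performance, $I$: computational intensity, $b_S$: saturated bandwidth of the bottleneck)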

Example: Scalable L3 on Sandy Bridge
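The scaling rule on this slide can be sketched in C as follows. This is only an illustration: the function names are hypothetical, and the numbers in the usage note are made up for the example rather than measured on Sandy Bridge.

```c
/* Performance of n threads: linear scaling of the single-thread
   performance p0 until the bandwidth ceiling intensity * b_s is hit.
   Units are up to the caller, e.g. p0 in GF/s, intensity in F/B,
   b_s in GB/s. */
double ecm_multicore_perf(int n, double p0, double intensity, double b_s)
{
    double linear = n * p0;
    double ceiling = intensity * b_s;
    return linear < ceiling ? linear : ceiling;
}

/* Smallest thread count at which the ceiling is reached.
   Returns -1 for a non-positive p0, which would never saturate. */
int ecm_saturation_threads(double p0, double intensity, double b_s)
{
    if (p0 <= 0.0)
        return -1;
    int n = 1;
    while (ecm_multicore_perf(n, p0, intensity, b_s) < intensity * b_s)
        ++n;
    return n;
}
```

For a hypothetical kernel with `p0 = 1.0`, `intensity = 0.125` and `b_s = 32.0`, the ceiling is 4.0, two threads deliver 2.0, and saturation is reached at four threads.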

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:) on a Sandy Bridge socket (no-overlap assumption)

Model: Scales until saturation sets in

Saturation point (# cores) well predicted

Measurement: scaling not perfect

Caveat: This is specific for this architecture and this benchmark!

Check: Use “overlappable” kernel code


ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:) on a Sandy Bridge socket (full overlap assumption)

In-core execution is dominated by divide operation (44 cycles with AVX, 22 scalar)

Almost perfect agreement with ECM model

General observation:

If the L1 cache is 100% occupied by LD, there is no overlap throughout the hierarchy

If there is “slack” at the L1, there is overlap in the hierarchy

Example 1: A 2D Jacobi stencil in DP with SSE2 on Sandy Bridge


Example 1: 2D Jacobi in DP with SSE2 on SNB

Instruction count

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

4-way unrolling: 8 LUP / iteration



Code characteristics

(SSE instructions per iteration)

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Processor characteristics

(SSE instructions per cycle)

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

Timeline (one row per unit, one column per cycle):

LOAD:  LD LD LD LD 2LD 2LD 2LD 2LD L
STORE: ST ST ST ST
ADD:   + + + + + + + + + + + +
MUL:   * * * *

Core execution: 12 cy



Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache (not into L1)

3 additional transfer streams from L2 to L1 (data delay)

Prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s


Overlap?

[Timeline figure: 12 cy core execution and 6 cy L2-L1 transfers (t0, RFO, t1)]


LOAD:  LD LD LD LD 2LD 2LD 2LD 2LD L
STORE: ST ST ST ST
ADD:   + + + + + + + + + + + +
MUL:   * * * *

Core execution: 12 cycles

LOAD bottleneck: 8.5 cy

L1 is “single ported”: no overlap during LD/ST

L2 delay: 6 cycles (t0, RFO, t1)

ECM prediction w/ overlap:

(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s

“If the model fails, we learn something”
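The three 2D Jacobi predictions follow from two small calculations, sketched here in C. The function names are hypothetical. The port rule in `l1_port_cycles` is one reading of the timeline (each cycle issues either 2 LOADs or 1 LOAD + 1 STORE), not a statement about the hardware beyond what the slide gives.

```c
/* Cycles the L1 load/store ports are busy, assuming each cycle can issue
   either 2 LOADs or 1 LOAD + 1 STORE. */
double l1_port_cycles(int loads, int stores)
{
    /* Each STORE shares its cycle with at most one LOAD. */
    int paired = stores < loads ? stores : loads;
    double store_cycles = stores;               /* one STORE per cycle */
    double rest = (loads - paired) / 2.0;       /* remaining LOADs, 2 per cycle */
    return store_cycles + rest;
}

/* Lattice updates per second in GLUP/s for a loop that performs
   lups_per_iter updates in cycles_per_iter cycles at clock_ghz GHz. */
double glups(double lups_per_iter, double cycles_per_iter, double clock_ghz)
{
    return lups_per_iter / cycles_per_iter * clock_ghz;
}
```

`l1_port_cycles(13, 4)` yields the 8.5 cy LOAD bottleneck. `glups(8, 12, 3.5)`, `glups(8, 12 + 6, 3.5)` and `glups(8, 8.5 + 6, 3.5)` evaluate to about 2.33, 1.56 and 1.93 GLUP/s, which the slides quote as 2.3, 1.5 and 1.9.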

ECM model – the rules

1. LOADs in the L1 cache do not overlap with any other data transfer in the memory hierarchy

2. Everything else in the core overlaps perfectly with data transfers

3. The scaling limit is set by the ratio (# cycles per CL overall) / (# cycles per CL at the bottleneck)

4. The Roofline Model is recovered when assuming full overlap of all contributions

[Figure: timeline, time in cy. Non-overlapping contributions: LOAD 6 cy, L2-L1 9 cy, L3-L2 9 cy, MEM-L3 19 cy. Overlapping contributions: STORE, ADD, MULT, …]

Example:

Single-core (data in L1): 8 cy (ADD)

Single-core (data in memory):

6+9+9+19 cy = 43 cy

Scaling limit: 43 / 19 = 2.3 cores
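The rules and the example numbers can be combined into a short C sketch. It assumes exactly three transfer stages (L2-L1, L3-L2, MEM-L3); the names `ecm_cycles` and `ecm_saturation_cores` are hypothetical helpers, not part of an existing library.

```c
#define ECM_LEVELS 3  /* transfer stages: L2-L1, L3-L2, MEM-L3 */

/* ECM runtime in cycles per unit of work when the data comes from
   `level`: 0 = L1, 1 = L2, 2 = L3, 3 = memory.
   t_ol:       in-core time that overlaps with data transfers (rule 2)
   t_nol:      in-core time that does not overlap, i.e. L1 LOADs (rule 1)
   t_transfer: transfer time between adjacent hierarchy levels */
double ecm_cycles(double t_ol, double t_nol,
                  const double t_transfer[ECM_LEVELS], int level)
{
    double t_data = t_nol;
    for (int i = 0; i < level && i < ECM_LEVELS; ++i)
        t_data += t_transfer[i];
    return t_ol > t_data ? t_ol : t_data;
}

/* Rule 3: number of cores needed to saturate the bottleneck, i.e. the
   ratio t_total / t_bottleneck rounded up to a whole core. */
int ecm_saturation_cores(double t_total, double t_bottleneck)
{
    int n = (int)(t_total / t_bottleneck);
    if (n * t_bottleneck < t_total)
        ++n;
    return n;
}
```

With `t_ol = 8`, `t_nol = 6` and transfers `{9, 9, 19}`, the sketch returns 8 cy for data in L1 and 43 cy for data in memory. The ratio 43 / 19 ≈ 2.3 means that three cores are needed to reach the memory bottleneck.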


ECM model – notation

Core time = overlapping and non-overlapping contributions

ECM prediction = maximum of overlapping time and sum of all other contributions

Convenient shorthand notation for contributions:

Example from prev. slide:

Predictions for data in different memory hierarchy levels:

Experimental data (measured) notation:

Saturation assumption for memory bottleneck:

ECM Model for DAXPY (AVX) on SNB 2.7 GHz (phinally)

Loop:

Contributions:

Predictions:


ECM Model and measurements for array sum on SNB 2.7 GHz (phinally)

Loop:

Naive = scalar, no unrolling (full 3 cy penalty per ADD)


ECM Model and measurements for 2D Jacobi (AVX) on SNB 2.7 GHz (phinally)

Loop:

LC = layer condition satisfied in


Jacobi 2D impact of inner loop blocking on SNB (phinally)



Jacobi 2D: Why outer loop blocking?


Extra data prefetched from memory at block boundaries

Kahan dot product

Goal: Compute large sums (many operands) with controlled numerical error


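/* Kahan-compensated dot product, scalar version.
   The attribute keeps GCC from auto-vectorizing the loop. */
__attribute__((optimize("no-tree-vectorize")))
void ddot_kahan_scalar_comp(
    int N, const double* a, const double* b, double* r)
{
    int i;
    double sum = 0.0;
    double c = 0.0;                /* running compensation */
    for (i=0; i<N; ++i) {
        double prod = a[i]*b[i];
        double y = prod-c;         /* next term, corrected by previous error */
        double t = sum+y;          /* low-order digits of y may be lost here */
        c = (t-sum)-y;             /* recover what was lost */
        sum = t;
    }
    (*r) = sum;
}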

Example (from Wikipedia)

6-digit FP, initial sum = 10000.0, adding 3.14159 and 2.71828


y = 3.14159 - 0                      y = input[i] - c
t = 10000.0 + 3.14159
  = 10003.1                          Many digits have been lost!
c = (10003.1 - 10000.0) - 3.14159    This must be evaluated as written!
  = 3.10000 - 3.14159                Assimilated part of y recovered, vs. full y.
  = -.0415900
sum = 10003.1                        Inaccurate result

On the next step, c gives the error.

y = 2.71828 - -.0415900              Shortfall from previous stage included.
  = 2.75987                          It is of a size similar to y: most digits meet.
t = 10003.1 + 2.75987                But few meet the digits of sum.
  = 10005.85987, rounds to 10005.9
c = (10005.9 - 10003.1) - 2.75987    This extracts whatever went in.
  = 2.80000 - 2.75987                In this case, too much.
  = .040130                          The excess would be subtracted off next time.
sum = 10005.9                        Exact result is 10005.85987, this is correctly rounded to 6 digits.
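The same effect can be shown in IEEE double precision. The sketch below is a minimal illustration, not the benchmark code: `ddot_naive`, `ddot_kahan` and `fill_hard_case` are hypothetical names, and the test data (one operand of 2^53 followed by ones) is an assumption chosen so that every plain addition of 1.0 is rounded away.

```c
/* Plain dot product: each rounding error stays in the result. */
double ddot_naive(int n, const double *a, const double *b)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Kahan-compensated dot product, same recurrence as the scalar kernel. */
double ddot_kahan(int n, const double *a, const double *b)
{
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; ++i) {
        double y = a[i] * b[i] - c;  /* term corrected by previous error */
        double t = sum + y;          /* low-order digits of y may be lost */
        c = (t - sum) - y;           /* must be evaluated as written */
        sum = t;
    }
    return sum;
}

/* Test data (n >= 1): the exact dot product is 2^53 + (n - 1).
   Above 2^53 the spacing of doubles is 2, so adding a single 1.0 to the
   running sum is lost in plain summation. */
void fill_hard_case(int n, double *a, double *b)
{
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0;
        b[i] = 1.0;
    }
    a[0] = 9007199254740992.0;  /* 2^53 */
}
```

For n = 1001 and round-to-nearest arithmetic, the naive loop stays at 2^53 while the compensated loop returns 2^53 + 1000. This only holds if the compiler keeps the expression order, so value-changing floating-point optimizations such as `-ffast-math` have to stay off.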

ECM Model and measurements on Emmy (IVB 2.2 GHz, 3 cy/CL from memory)

Standard DP ddot:

Scalar:

AVX:

Kahan ddot:

Scalar:

AVX:

Conclusion: DP Kahan ddot saturates even in scalar mode

SP Kahan will not saturate


Performance Modeling of Stencil Codes

Applying the ECM model to stencil updates:

- 3D Jacobi smoother (DP, AVX)

- Long-range stencil (SP, AVX)

(H. Stengel, RRZE)

Example 2: A 3D Jacobi smoother with AVX vectorization on an Intel Ivy Bridge processor


Jacobi 3D Manual Analysis

Instruction   Operation count (1 LUP)   Cycle count (4x unroll + AVX = 16 LUP)
MUL           1                         4
ADD           5                         20
LOAD          6                         24
STORE         1                         8
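The step from operation counts per update to cycle counts per unrolled iteration can be written down as a small C helper. This is a sketch under assumptions: the name `unit_cycles` is hypothetical, and the throughputs used in the usage note (1 AVX LOAD, 1 ADD and 1 MUL per cycle, 0.5 AVX STORE per cycle) are carried over from the Sandy Bridge slide rather than stated for this example.

```c
/* Cycles one execution unit is busy per assembly loop iteration.
   ops_per_lup:     operations of this type per lattice update (source code)
   lups_per_iter:   updates per loop iteration (unroll factor * SIMD width)
   simd_width:      updates handled by one SIMD instruction
   instr_per_cycle: throughput of the unit in instructions per cycle */
double unit_cycles(int ops_per_lup, int lups_per_iter, int simd_width,
                   double instr_per_cycle)
{
    double instructions = (double)ops_per_lup * lups_per_iter / simd_width;
    return instructions / instr_per_cycle;
}
```

With 16 LUP per iteration and 4 doubles per AVX register, `unit_cycles(1, 16, 4, 1.0)` gives 4 cy for MUL, `unit_cycles(5, 16, 4, 1.0)` gives 20 cy for ADD, `unit_cycles(6, 16, 4, 1.0)` gives 24 cy for LOAD and `unit_cycles(1, 16, 4, 0.5)` gives 8 cy for STORE, matching the cycle counts of the manual analysis.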

Interlude: Intel Architecture Code Analyzer (IACA)

Performs architecture-specific code analysis

Prerequisite: Mark start and end of dominant work loop

In high-level code (documented)

In assembly code (see iacaMarks.h)

Does not influence code optimization (e.g. vectorization)

Assembly loop might perform multiple updates per iteration (unrolling, SIMD)

Important reports (throughput mode):

Block throughput: runtime of one loop iteration (= core time)

Throughput bottleneck: limiting resource for code execution

Port pressure: dominant pipeline port


16 updates (4x unroll + AVX) = 2 cache lines per loop iteration

#pragma vector aligned


Jacobi 3D ECM

Times [cy] for 8 LUP (DP) = 1 CL update = 0.5 loop iterations (ASM) = 0.5 * IACA output

Non-LD/ST time: ADD (10 cy), MUL (2 cy), Reg-Reg (6 cy)

Stores (4 cy)

FrontEnd stalls: 0.5*(24.1 - 24) = 0.05 cy

IACA throughput: 24.1 cy / 16 LUP

Data transfers: L1-REG (LD) 12 cy, L2-L1 10 cy, L3-L2 10 cy, M-L3 12 cy

Total: 44 cy

Single-core performance: 3.0 GHz / (44 cy / 8 LUP) = 545 MLUP/s

Measurement (N=400): 542 MLUP/s (~44 cy)

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, memory bandwidth 47 GB/s

#pragma vector aligned
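The M-L3 contribution can be cross-checked from the memory bandwidth and the clock frequency. The C sketch below uses a hypothetical helper, `mem_transfer_cycles`. The figure of three 64-byte cache lines of memory traffic per cache line update (load, write-allocate, evict) is an assumption made for this check and is not stated on the slide.

```c
/* Cycles needed to move `cache_lines` cache lines of `cl_bytes` bytes over
   a memory interface that sustains bw_gbs GB/s, at a core clock of
   clock_ghz GHz. GB/s divided by GHz is bytes per cycle. */
double mem_transfer_cycles(double cache_lines, double cl_bytes,
                           double clock_ghz, double bw_gbs)
{
    double bytes_per_cycle = bw_gbs / clock_ghz;
    return cache_lines * cl_bytes / bytes_per_cycle;
}
```

`mem_transfer_cycles(3, 64, 3.0, 47.0)` evaluates to roughly 12.3 cy, in line with the 12 cy used for M-L3. The same arithmetic with 4 cache lines (4 words per update, 4 bytes each, 16 updates), 2.7 GHz and 40 GB/s gives roughly 17.3 cy, which agrees with the 17 cy M-L3 value of Example 3.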

Socket Scaling

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, memory bandwidth 47 GB/s

Example 3: 3D long-range stencil in single precision with AVX on Sandy Bridge


Example 3: 3D long-range stencil in SP with AVX on SNB

Core execution

4 neighbors per direction

Operations per update (code)

27 LOAD (25 V, 1 ROC, 1 U)

1 STORE (U)

26 ADD

15 MUL

Core time & actual LOAD count: IACA

Collaboration with D. Keyes & T. Malas (KAUST)

IACA example output – Core execution

AVX vectorization, no unrolling: one iteration updates 8 SP (float) elements

Multiply all numbers by 2 to get the time for updating 1 cache line (16 floats)

128-bit loads

Data transfer: LOAD ports REG – L1: 2*30.5 cy = 61 cy

Core execution time (16 LUP) = 2*34.25 cy = 68.5 cy

Example 3: Data delay

Problem size: 260^3 (single precision); all times in cy/CL

Spatial blocking: layer condition at L3 and row condition in L1: OK

L1-REG (LOAD): 61 cy (from IACA analysis)

L2-L1: 24 cy

L3-L2: 24 cy

M-L3: 17 cy

8 LOADs to V can be served directly by L3 cache + 1 from main memory

Minimum data transfer to main memory: 4 WORD/LUP (LD: U, V, ROC – ST: U), MemBW = 40 GB/s

Example 3: Putting it all together

Core execution (non-LD/ST cycles): ADD 52 cy, MULT 38 cy, Reg-Reg transfers 48 cy

Stores: 4 cy

FrontEnd stalls overlap: (68.5 - 61) cy = 7.5 cy

IACA throughput: 68.5 cy / CL (SP)

Data delay: L1-REG (Load) 61 cy, L2-L1 24 cy, L3-L2 24 cy, M-L3 17 cy

Total: 126 cy

Single-core performance (ECM model): 2.7 GHz / (126 cy / 16 LUP) = 343 MLUP/s

Measurement: 320 MLUP/s


optimization target!

temporal blocking useless!

Socket scaling

[Figure: socket scaling, with the memory bandwidth limit marked]

ECM model: Conclusions & outlook

Saturation effects are ubiquitous; understanding them gives us the opportunity to

Find out about optimization opportunities

Save energy by letting cores idle (see power model later on)

Put idle cores to better use: communication, functional decomposition

Simple models work best. Do not try to complicate things unless it is really necessary!

Possible extensions to the ECM model

Accommodate latency effects

Model simple “architectural hazards”

