How to get peak FLOPS (CPU) — What I wish I knew when I was twenty about CPU — Kenjiro Taura 1 / 54


Page 1: How to get peak FLOPS (CPU) What I wish I knew when I ...Contents 1 Introduction 2 An endeavor to nearly peak FLOPS 3 Latency 4 Instruction Level Parallelism (ILP) 5 Analyzing throughput

How to get peak FLOPS (CPU) — What I wish I knew when I was twenty about CPU

Kenjiro Taura


1 / 54

Page 2

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

2 / 54

Page 3

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

3 / 54

Page 4

What you need to know to get nearly peak FLOPS

so you now know how to use multicores and SIMD instructions

they are two key elements to get nearly peak FLOPS

the last key element: Instruction Level Parallelism (ILP) of superscalar processors

4 / 54

Page 5

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

5 / 54

Page 6

An endeavor to nearly peak FLOPS

let’s run the simplest code you can think of:

#if __AVX512F__
const int vwidth = 64;
#elif __AVX__
const int vwidth = 32;
#else
#error "you’d better have a better machine"
#endif

const int valign = sizeof(float);
typedef float floatv __attribute__((vector_size(vwidth), aligned(valign)));
/* SIMD lanes */
const int L = sizeof(floatv) / sizeof(float);

floatv a, x, c;
for (i = 0; i < n; i++) {
  x = a * x + c;
}

the code performs L × n FMAs and almost nothing else

6 / 54

Page 7

Notes on experiments

the source code for the following experiments is in the 06axpy directory

the computation is trivial but the measurement part is Linux-specific

perf_event_open to get CPU clocks (not reference clocks); clock_gettime to get time in nanosecond resolution

it will compile on MacOS too, but the results are inaccurate

reference clocks substitute for CPU clocks; gettimeofday (microsecond granularity) substitutes for clock_gettime

7 / 54

Page 8

Notes on experiments

on Linux, you need to allow user processes to get performance events by:

$ sudo sysctl -w kernel.perf_event_paranoid=-1

exact results depend on the CPU microarchitecture and ISA

$ cat /proc/cpuinfo

and google the model name (e.g., “Xeon Gold 6126”); the following experiments show results on a Skylake-X CPU

Skylake-X is a variant of Skylake supporting AVX-512 (the login node or the big partition of the IST cluster)

there is a Skylake, which has the same microarchitecture but does not support AVX-512 (marketing conspiracy?); your newest laptop will be Kaby Lake, even newer than Skylake but without AVX-512. Its limit is:

2 × 256-bit FMAs/cycle = 32 flops/cycle

8 / 54

Page 9

Let’s run it!

compile:

$ gcc -o axpy -march=native -O3 axpy.c

and run!

$ ./axpy simd
algo = simd
m = 8
n = 1000000000
flops = 32000000000
4000530984 CPU clocks, 2967749142 REF clocks, 1144168403 ns
4.000531 CPU clocks/iter, 2.967749 REF clocks/iter, 1.144168 ns/iter
7.998938 flops/CPU clock, 10.782583 flops/REF clock, 27.967911 GFLOPS

took ≈ 4 CPU clocks/iteration ≈ 8 flops/clock

≈ 1/8 of the single core peak (64 flops/clock)

9 / 54


Page 13

How to investigate

put a landmark in the assembly code:

asm volatile ("# axpy_simd: ax+c loop begin");
for (i = 0; i < n; i++) {
  x = a * x + c;
}
asm volatile ("# axpy_simd: ax+c loop end");

compile into assembly:

$ gcc -S -march=native -O3 axpy.c

and see axpy.s in your editor

10 / 54


Page 15

Assembly

# axpy_simd: ax+c loop begin
# 0 "" 2
#NO_APP
        testq   %rdi, %rdi
        jle     .L659
        xorl    %edx, %edx
        .p2align 4,,10
        .p2align 3
.L660:
        addq    $1, %rdx
        vfmadd132ps %zmm0, %zmm1, %zmm2
        cmpq    %rdx, %rdi
        jne     .L660
.L659:
#APP
# 63 "axpy.cc" 1
# axpy_simd: ax+c loop end

why?

11 / 54

Page 16

Suspect looping overhead?

if you suspect the overhead of the other instructions, here is an unrolled version that has far fewer overhead instructions

its performance is identical:

#pragma GCC optimize("unroll-loops", 8)
long axpy_simd(long n, floatv a,
               floatv* X, floatv c) {
  ...
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
}

.L1662:
        addq        $8, %rdx
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        cmpq        %rdx, %rdi
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        jne         .L1662

12 / 54

Page 17

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

13 / 54

Page 18

Latency and throughput

a (Skylake-X) core can execute two vfmaddps instructions every cycle

yet, it does not mean the result of the vfmaddps at line 3 below is available in the next cycle for the vfmaddps at the next line:

1 .L1662:

2 addq $8, %rdx

3 vfmadd132ps %zmm0,%zmm1,%zmm2

4 vfmadd132ps %zmm0,%zmm1,%zmm2

5 cmpq %rdx,%rdi

6 vfmadd132ps %zmm0,%zmm1,%zmm2

7 ...

8 vfmadd132ps %zmm0,%zmm1,%zmm2

9 jne .L1662

what you need to know: “two vfmadd132ps instructions every cycle” refers to the throughput; each instruction has a specific latency (> 1 cycle)

14 / 54


Page 20

Latencies

instruction           Haswell  Broadwell  Skylake
fp add                   3         3         4
fp mul                   5         3         4
fp fmadd                 5         5         4
typical integer ops      1         1         1
...

http://www.agner.org/optimize/ is an invaluable resource

put the following two docs under your pillow

3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs

15 / 54

Page 21

Our code in light of latencies

in our code, a vfmadd uses the result of the immediately preceding vfmadd

that was obvious from the source code too:

.L1662:
        addq        $8, %rdx
        vfmadd132ps %zmm0, %zmm1, %zmm2
        vfmadd132ps %zmm0, %zmm1, %zmm2
        cmpq        %rdx, %rdi
        vfmadd132ps %zmm0, %zmm1, %zmm2
        ...
        vfmadd132ps %zmm0, %zmm1, %zmm2
        jne         .L1662

for (i = 0; i < n; i++) {
  x = a * x + c;
}

Conclusion:

the loop can’t run faster than 4 cycles/iteration

[figure: the chain of vfmaddps instructions, each waiting for zmm2 from the previous one]

16 / 54

Page 22

CPU clocks vs. reference clocks

CPU changes clock frequency depending on the load (DVFS)

the reference clock runs at the same frequency (it is always proportional to the absolute time)

an instruction takes a specified number of CPU clocks, not reference clocks

the CPU clock is more predictable and thus more convenient for precise reasoning about the code

[figure: vfmaddps executions shown against the CPU clock, the reference clock, and absolute time]

17 / 54

Page 23

How to overcome latencies?

increase parallelism (no other ways)!

you can’t make a serial chain of computation run faster (change the algorithm if you want to)

[figure: the chain of vfmaddps instructions serialized through zmm2]

you can only increase throughput, by running multiple independent chains

we expect the following to finish in the same number of cycles as the original one, even though it performs twice as many flops:

for (i = 0; i < n; i++) {
  x0 = a * x0 + c;
  x1 = a * x1 + c;
}

18 / 54


Page 27

Increase the number of chains further . . .

we expect to reach peak FLOPS with ≥ 2/(1/4) = 8 chains:

template<int nv>
long axpy_simd_c( ... ) {
  for (long i = 0; i < n; i++) {
    for (long j = 0; j < nv; j++) {
      X[j] = a * X[j] + c;
    }
  }
}

19 / 54

Page 28

Results

[figure: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

chains  clocks/iter  flops/clock
 1      4.0           7.999
 2      4.001        15.998
 3      4.001        23.996
 4      4.001        31.995
 5      4.049        39.517
 6      4.001        47.994
 7      4.113        54.458
 8      4.001        63.991
 9      4.567        63.059
10      5.001        63.982
11      5.501        63.991
12      6.001        63.988
13      6.502        63.981
14      7.001        63.989
15      7.501        63.99

20 / 54

Page 29

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

21 / 54

Page 30

Superscalar processors

What you need to know:

instructions are decoded in the program order,

but the execution order is not constrained by it (out-of-order execution)

⇒ as a crude approximation, performance is constrained only by

latency: the length of the dependent chain of instructions

throughput: the number of instructions of a particular type it can execute per cycle (e.g., two fmadds/cycle)

22 / 54


Page 36

A general theory of workload performance on aggressive superscalar machines

dependencies constrain how fast a computation can proceed, even if there is an infinite number of execution resources

increase the number of independent computations and you increase throughput, until it hits the limit of the execution resources

[figure: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

23 / 54

Page 37

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

24 / 54

Page 38

What if the number of chains is a variable?

purpose: illustrate the same concept with a slightly more complex/common case

let’s try the following code, identical to the one we successfully achieved nearly peak performance with, except that the number of variables (m in the code below) is now a runtime variable (not a compile-time constant):

void axpy_simd_m(..., long m) {
  for (long i = 0; i < n; i++) {
    for (long j = 0; j < m; j++) {
      X[j] = a * X[j] + c;
    }
  }
}

25 / 54

Page 39

When we experiment . . .

chains  clocks/iter  flops/clock
 1      11.002        2.909
 2      11.037        5.799
 3      11.028        8.705
 4      11.131       11.499
 5      12.021       13.31
 6      14.018       13.696
 7      16.013       13.989
 8      18.006       14.218
 9      20.57        14.001
10      22.017       14.535
11      24.008       14.662
12      26.024       14.756
13      28.011       14.851
14      30.022       14.923
15      32.653       14.7

[figure: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

the pattern is similar, but there are two differences:

the latency of a single update became ≈ 11 cycles

the throughput hits a plateau at ≈ 2 cycles/iter

26 / 54

Page 40

Take a look at assembly

it looks like:

.L1811:
        vmovaps     %zmm0, %zmm2
        addq        $64, %rcx
        vfmadd132ps -64(%rcx), %zmm1, %zmm2
        vmovups     %zmm2, -64(%rcx)
        cmpq        %rcx, %r8
        jne         .L1811

what’s the difference from the code we have seen before (whose latency = 4 cycles)?

.L1800:
        addq        $1, %rdx
        vfmadd132ps %zmm0, %zmm1, %zmm2
        cmpq        %rdx, %rdi
        jne         .L1800

27 / 54

Page 41

The reason for the latency (11 cycles)

both differences stem from the fact that the code now involves loads/stores

what you need to know:

just like FMA, each instruction has its own latency

28 / 54

Page 42

Latencies of various instructions

instruction           Haswell  Broadwell  Skylake
fp add                   3         3         4
fp mul                   5         3         4
fp fmadd                 5         5         4
typical integer ops      1         1         1
load                     3         3         3
store (*)                4         4         3
...

(*) 256 bit (not sure for a 512-bit store)

3 + 4 + 3 = 10 ≠ 11; I could not get information that confirms the extra cycle, but a simpler experiment shows the same result (maybe a 512-bit store takes 4 cycles?)

29 / 54

Page 43

The reason for the throughput

what you need to know:

Just like FMA, all instructions have their own throughput limits, due to execution resources (dispatch ports and execution units)

“two vfmadds per cycle” is just an example of it

30 / 54

Page 44

Some representative throughput limits

Throughput = the number of instances of that instruction that can be executed per cycle

instruction           Haswell  Broadwell  Skylake
fp add/mul/fmadd         2         2         2
load                     2         2         2
store                    1         1         1
typical integer ops      4         4         4
...

31 / 54

Page 45

A back of envelope calculation

instruction                         type           1/throughput
vmovaps %zmm0,%zmm2                 register move  0.33
addq $64,%rcx                       int op         0.25
vfmadd132ps -64(%rcx),%zmm1,%zmm2   load + FMA     0.5, 0.5
vmovups %zmm2,-64(%rcx)             store          1.0
cmpq %rcx,%r8                       compare        0.25
jne .L1811                          jump           1-2

I don’t know what 1-2 means

we can conclude that the throughput is ≤ 1 iteration/cycle due to the store

more general cases require understanding dispatch ports

32 / 54

Page 46

Dispatch ports

each instruction (µ-operation) is dispatched to a specific port

Skylake-X ports:

fmadd → port 0 or 1
load → port 2 or 3
store → port 4
load/store address generation → port 7
int → port 0, 1, 5, or 6
etc.

source:

https://en.wikichip.org/wiki/intel/microarchitectures/skylake (server)

each port can take only a single operation per cycle; this determines the aggregate throughput of all instructions that go to that port

with the destination ports of instructions, one can calculate the throughput limit of a given loop

33 / 54

Page 47

Intel Architecture Code Analyzer

a great tool to analyze the throughput (and, to some extent, latency) limit

https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

34 / 54

Page 48

How to overcome the throughput limit?

the goal is two iterations/cycle (throughput limit of FMA)

the bottleneck is a store instruction (1/cycle)

we obviously need to quit loading/storing data for every single fmadd:

for (i = 0; i < n; i++) {
  for (j = 0; j < nv; j++) {
    x[j] = a * x[j] + c;  // load; fmadd; store
  }
}

the minimum “unit” of a computation should look like:

load x[j] to a register;
do “a * x + c” several times on the register;
store the result to x[j];

and run multiple independent units

35 / 54

Page 49

Several ways to arrange computation

take a variable at a time and run it until the end (suffer from latency)

advance all variables one step at a time (suffer from the store throughput)

strategy 1: take a few variables and run them until the end (a small and constant number of variables stay on registers)

strategy 2: advance all variables, a few steps at a time (load, update on a register a few times, store)

[figure: the four traversal orders of the (i, j) iteration space]

36 / 54

Page 50

Implementing strategy 1

say we advance ten elements at a time:

1 for (j = 0; j < nv; j += b) {
2   /* run b variables until the end */
3   for (i = 0; i < n; i++) {
4     for (jj = j; jj < j + b; jj++) {
5       xx[jj] = a * xx[jj] + c;
6     }
7   }
8 }

we hope it loads/stores each variable only once through the i loop (line 3)!

this coding depends on the compiler’s smartness we have witnessed:

promote fixed-sized arrays into registers

37 / 54

Page 51

Implementing strategy 2

advance all variables, say three steps at a time:

for (i = 0; i < n; i += 3) {
  /* run all variables 3 steps */
  for (j = 0; j < m; j++) {
    for (ii = 0; ii < 3; ii++) {
      x[j] = a * x[j] + c;
    }
  }
}

again, we hope the compiler’s smartness eliminates the intermediate load/stores

the latency of a single j iteration increases, but we hope the j loop exposes lots of independent computations

38 / 54

Page 52

Results of strategy I

chains  clocks/iter  flops/clock
 1      4.0           7.999
 2      4.0          15.998
 3      4.001        23.996
 4      4.001        31.996
 5      4.018        39.818
 6      4.001        47.991
 7      4.114        54.447
 8      4.001        63.99
 9      4.583        62.834
10      5.001        63.986
11      5.501        63.991
12      6.001        63.993
13      6.501        63.988
14      7.001        63.991
15      7.501        63.987

[figure: CPU cycles/iter and flops/CPU cycle vs. the number of variables]

39 / 54

Page 53

Contents

1 Introduction

2 An endeavor to nearly peak FLOPS

3 Latency

4 Instruction Level Parallelism (ILP)

5 Analyzing throughput

6 A simple yet fairly fast single-core matrix multiply

40 / 54

Page 54

Developing near peak FLOPS matrix multiply

let’s develop a (single-core) matrix multiply that runs at fairly good FLOPS on Skylake-X

it is a great application of the concepts you have just learned

C += A * B

[Figure: C (M × N) += A (M × K) * B (K × N)]

41 / 54



A few simplifying assumptions

we add assumptions that M, N, and K are multiples of certain numbers along the way (forget about how to process “remainder” rows/columns in this slide)

we assume matrix sizes are known at compile time and are“convenient” (e.g., they are small)

multiplication of larger (and unknown-size) matrices can be built on top of this

42 / 54


Step 1: Baseline code

[Figure: the i, j, k traversal over C (M × N), A (M × K), B (K × N)]

$ ./mmc00
M = 12, N = 32, K = 192
A : 12 x 192 (ld=192)  9216 bytes
B : 192 x 32 (ld=32)  24576 bytes
C : 12 x 32  (ld=32)   1536 bytes
total = 35328 bytes
...
3.456 CPU clocks/iter
2.520 REF clocks/iter
0.579 flops/CPU clock
0.794 flops/REF clock
2.058 GFLOPS

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < K; k++)
      C(i,j) += A(i,k) * B(k,j);

it runs at ≈ 3.5 clocks / innermost loop iteration: the latency of fmadd on C(i,j), minus a fraction
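The C(i,j), A(i,k), B(k,j) notation can be realized as macros over flat row-major arrays with leading dimensions, which is one plausible reading of the mmc00 setup (the macro definitions and function name below are assumptions, not taken from the original source):

```c
#include <assert.h>

/* Row-major layout with leading dimensions: element (i,j) of C
   lives at C[i * ldc + j]. (Assumed layout; mmc00 may differ.) */
#define A(i,k) A[(i) * lda + (k)]
#define B(k,j) B[(k) * ldb + (j)]
#define C(i,j) C[(i) * ldc + (j)]

void gemm_baseline(long M, long N, long K,
                   const float *A, long lda, const float *B, long ldb,
                   float *C, long ldc) {
  for (long i = 0; i < M; i++)
    for (long j = 0; j < N; j++)
      for (long k = 0; k < K; k++)
        C(i,j) += A(i,k) * B(k,j);  /* one fmadd per innermost iteration */
}
```

Every innermost iteration accumulates into the same C(i,j), so the fmadds form one long dependency chain; that is why the latency, not the throughput, limits this version.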

43 / 54


Step 1: analysis

latency limit : latency of FMA

I don’t know why it’s slightly smaller than 4, but note that the true dependence occurs only through the additions

throughput limit : not important

→ ≈ 2 flops / 4 cycles = 0.5 flops/cycle

44 / 54


Step 2: Vectorization

[Figure: the vectorized traversal — C(i, j:j+L) updated L elements at a time]

M = 12, N = 32, K = 192
A : 12 x 192 (ld=192)  9216 bytes
B : 192 x 32 (ld=32)  24576 bytes
C : 12 x 32  (ld=32)   1536 bytes
total = 35328 bytes
repeat : 100000 times
...
3.555 CPU clocks/iter
2.681 REF clocks/iter
9.002 flops/CPU clock
11.936 flops/REF clock
30.960 GFLOPS

for (i = 0; i < M; i++)
  for (j = 0; j < N; j += L)
    for (k = 0; k < K; k++)
      C(i,j:j+L) += A(i,k) * B(k,j:j+L);

assumption: N is a multiple of SIMD lanes (L)
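One way to realize the C(i,j:j+L) slice notation in plain C is with GCC/Clang vector extensions. The sketch below uses 4 lanes for portability (the slides' Skylake-X numbers correspond to 16 lanes); the helper names and the row-major layout are assumptions, not the original mmc01:

```c
#include <assert.h>
#include <string.h>

/* 4-lane float vector via GCC/Clang vector extensions (portable;
   a 512-bit build would use 16 lanes). Illustrative sketch. */
typedef float floatv __attribute__((vector_size(4 * sizeof(float))));
enum { L = 4 };

static floatv loadv(const float *p) { floatv v; memcpy(&v, p, sizeof(v)); return v; }
static void storev(float *p, floatv v) { memcpy(p, &v, sizeof(v)); }

/* C(i,j:j+L) += A(i,k) * B(k,j:j+L), with N a multiple of L */
void gemm_simd(long M, long N, long K,
               const float *A, long lda, const float *B, long ldb,
               float *C, long ldc) {
  for (long i = 0; i < M; i++)
    for (long j = 0; j < N; j += L)
      for (long k = 0; k < K; k++) {
        float a = A[i * lda + k];
        floatv av = { a, a, a, a };        /* broadcast A(i,k) */
        floatv c = loadv(&C[i * ldc + j]);
        c += av * loadv(&B[k * ldb + j]);  /* L fused multiply-adds */
        storev(&C[i * ldc + j], c);
      }
}
```

Note the dependency-chain structure is unchanged from the baseline: each vector of C is still updated serially over k, so the latency still binds; only the work per iteration grows by a factor of L.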

45 / 54



Step 2: analysis

almost the same as step 1, except that each iteration now does 16 FMAs (as opposed to a single FMA)

→ ≈ 32 flops / 4 cycles = 8 flops/cycle

46 / 54


Step 3: increase parallelism!

[Figure: a block of bM rows of C updated concurrently]

login000:07mm$ ./mmc02
M = 8, N = 32, K = 192
A : 8 x 192 (ld=192)  6144 bytes
B : 192 x 32 (ld=32) 24576 bytes
C : 8 x 32  (ld=32)   1024 bytes
...
5.451 CPU clocks/iter
4.387 REF clocks/iter
46.966 flops/CPU clock
58.349 flops/REF clock
151.341 GFLOPS

update bM vector elements of C concurrently:

1 for (i = 0; i < M; i += bM)
2   for (j = 0; j < N; j += L)
3     for (k = 0; k < K; k++)
4       for (di = 0; di < bM; di++)
5         C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);

Skylake requires bM ≥ 8 to reach peak FLOPS

47 / 54
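The di loop above reloads and restores each C(i+di, j:j+L) every k iteration and relies on the compiler to hoist those accesses out of the k loop. Making that hoisting explicit shows the bM independent chains directly; a scalar stand-in with bM = 4 for brevity (illustrative sketch, not the original mmc02):

```c
#include <assert.h>

/* Step 3 with the hoped-for register promotion made explicit:
   bM accumulators (independent dependency chains) are kept live
   across the whole k loop. Scalar stand-in for the vector code.
   (Illustrative sketch, not the original mmc02.) */
enum { bM = 4 };

void gemm_ilp(long M, long N, long K,
              const float *A, long lda, const float *B, long ldb,
              float *C, long ldc) {
  for (long i = 0; i < M; i += bM)      /* M assumed a multiple of bM */
    for (long j = 0; j < N; j++) {
      float acc[bM];                    /* bM independent chains */
      for (long di = 0; di < bM; di++) acc[di] = C[(i + di) * ldc + j];
      for (long k = 0; k < K; k++) {
        float b = B[k * ldb + j];       /* loaded once, fed to bM FMAs */
        for (long di = 0; di < bM; di++)
          acc[di] += A[(i + di) * lda + k] * b;
      }
      for (long di = 0; di < bM; di++) C[(i + di) * ldc + j] = acc[di];
    }
}
```

The k loop now issues bM independent FMAs per iteration, so the FMA latency is hidden as long as bM covers the latency × throughput product.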



Step 3: analysis


1 for (i = 0; i < M; i += bM)
2   for (j = 0; j < N; j += L)
3     for (k = 0; k < K; k++)
4       for (di = 0; di < bM; di++)
5         C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);

the for loop at line 4 performs
  bM loads (broadcasts) for A(i+di,k)
  1 load for B(k,j:j+L)
  bM FMAs

remember the load throughput = 2 loads/cycle; to achieve 2 FMAs/cycle, we must have

the number of loads ≤ the number of FMAs

we need to remove an extra load instruction

48 / 54


Step 4: Reuse an element of A

[Figure: a bM’ × (bN · L) block of C updated concurrently]

$ ./mmc03
M = 8, N = 32, K = 192
A : 8 x 192 (ld=192)  6144 bytes
B : 192 x 32 (ld=32) 24576 bytes
C : 8 x 32  (ld=32)   1024 bytes
...
4.926 CPU clocks/iter
4.938 REF clocks/iter
51.969 flops/CPU clock
51.846 flops/REF clock
134.474 GFLOPS

update a bM’ × bN block rather than bM × 1:

1 for (i = 0; i < M; i += bM’)
2   for (j = 0; j < N; j += bN * L)
3     for (k = 0; k < K; k++)
4       for (di = 0; di < bM’; di++)
5         for (dj = 0; dj < bN * L; dj += L)
6           C(i+di,j+dj:j+dj+L) += A(i+di,k) * B(k,j+dj:j+dj+L);

49 / 54


Step 4: Analysis

the for loop at line 4 performs

  bM’ loads (broadcasts) for A(i+di,k)
  bN loads for B(k,j:j+L)
  bM’ × bN FMAs

the minimum requirement for it to achieve the peak FLOPS is bM’ × bN ≥ 8

in the experiments, when we set bM’ = 6 and bN = 2, it gets 59 flops/cycle (92% of the peak)

we need to note that this happens only when the matrix is small (M = 12, N = 32, K = 160) and we repeat it many times

the issue for large matrices will be the next topic
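The counting argument behind the bM’ = 6, bN = 2 choice can be checked mechanically: per k iteration the kernel issues bM’ + bN loads against bM’ × bN FMAs, and with 2 load ports vs. 2 FMA ports, the loads must not outnumber the FMAs; independently, bM’ × bN must cover the FMA latency × throughput product (≥ 8). A toy checker (the port, latency, and throughput figures are the Skylake-X numbers quoted in these slides):

```c
#include <assert.h>

/* Per k iteration of the step-4 kernel:
     loads = bM' (broadcasts of A) + bN (vector loads of B)
     fmas  = bM' * bN
   Peak needs loads <= fmas (2 load ports vs 2 FMA ports) and
   fmas >= 8 independent chains (latency 4 x throughput 2). */
int reaches_peak(int bMp, int bN) {
  int loads = bMp + bN;
  int fmas  = bMp * bN;
  return loads <= fmas && fmas >= 8;
}
```

For example, bM’ = 6, bN = 2 gives 8 loads against 12 FMAs and satisfies both conditions, while step 3's bM = 8, bN = 1 shape issues 9 loads against 8 FMAs and fails the load-throughput condition.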

50 / 54


Simultaneous Multithreading (SMT)

each physical core has several hardware threads, or virtual cores

  recent Xeon processors (including Skylake-X): 2
  Knights Landing/Mill: 4
  IBM Power: 8

a.k.a. Hyperthreading®

virtual cores on a single physical core

  are concurrently executed by the hardware
  have their own registers (switching between them has almost no overhead)
  share most execution resources (dispatch ports, floating point units, L1/L2 caches, etc.)

[Figure: a chip (socket, node, CPU) contains physical cores; each core has its own L1/L2 caches and hardware threads (virtual cores, CPUs); cores share the L3 cache and memory controller]

51 / 54


Performance implications of virtual cores

having as many threads as virtual cores on a physical core

helps improve throughput of latency-bound applications, butdoes not help throughput-bound applications

note: if you have more threads than virtual cores, operating systems get involved to switch between them (in a much coarser granularity, say 1 ms (10⁻³ sec), rather than every cycle ∼ 1 ns (10⁻⁹ sec))

it never helps mitigate the latency of arithmetic operations
it helps mitigate much bigger I/O latencies (say, when accessing an HDD)

52 / 54


Takeaways (1)

peak FLOPS of recent Intel CPUs = “execute two fmadds every cycle” (no other combinations)

other processors have different limits, but the basics are the same

no, single-core performance is not about reducing the number of instructions

it’s about how to increase parallelism

  SIMD
  ILP

53 / 54


Takeaways (2)

dependent instructions incur latencies and hinder parallelism

independent instructions are executed in parallel, up tothroughput limits

throughput limits are determined by dispatch ports

54 / 54