How to get peak FLOPS (CPU)
— What I wish I knew when I was twenty
about CPU —
Kenjiro Taura
1 / 54
Contents
1 Introduction
2 An endeavor to nearly peak FLOPS
3 Latency
4 Instruction Level Parallelism (ILP)
5 Analyzing throughput
6 A simple yet fairly fast single-core matrix multiply
2 / 54
1 Introduction
3 / 54
What you need to know to get nearly peak FLOPS
so you now know how to use multicores and SIMD instructions
they are two of the key elements for getting nearly peak FLOPS
the last key element: Instruction Level Parallelism (ILP) of superscalar processors
4 / 54
2 An endeavor to nearly peak FLOPS
5 / 54
An endeavor to nearly peak FLOPS
let's run the simplest code you can think of

  #if __AVX512F__
  const int vwidth = 64;
  #elif __AVX__
  const int vwidth = 32;
  #else
  #error "you'd better have a better machine"
  #endif

  const int valign = sizeof(float);
  typedef float floatv __attribute__((vector_size(vwidth), aligned(valign)));
  /* SIMD lanes */
  const int L = sizeof(floatv) / sizeof(float);

  floatv a, x, c;
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
the code performs L × n FMAs and almost nothing else
6 / 54
Notes on experiments
the source code for the following experiments is in the 06axpy directory
the computation is trivial but the measurement part is Linux-specific
perf_event_open to get CPU clocks (not reference clocks)
clock_gettime to get time in nanosecond resolution
it will compile on macOS too, but the results are inaccurate
reference clocks substitute for CPU clocks
gettimeofday (microsecond granularity) substitutes for clock_gettime
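for concreteness, here is a minimal sketch of these two Linux measurement primitives; this is not the course's 06axpy harness, just an illustration of the calls it relies on (perf_event_open has no glibc wrapper, so it is invoked via syscall):

  #define _GNU_SOURCE
  #include <string.h>
  #include <unistd.h>
  #include <time.h>
  #include <sys/syscall.h>
  #include <linux/perf_event.h>

  /* open a counter that counts CPU core cycles (not reference cycles) */
  int open_cycle_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;               /* count user-mode cycles only */
    return syscall(SYS_perf_event_open, &attr,
                   0 /* this thread */, -1 /* any cpu */,
                   -1 /* no group */, 0);
  }

  long long read_cycles(int fd) {          /* current cycle-counter value */
    long long c = 0;
    read(fd, &c, sizeof(c));
    return c;
  }

  long long nsec_now(void) {               /* wall clock in ns via clock_gettime */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

reading the cycle counter and nsec_now() before and after the measured loop and taking the differences gives the CPU clocks and nanoseconds reported in the runs below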
7 / 54
Notes on experiments
on Linux, you need to allow user processes to get performance events by

  $ sudo sysctl -w kernel.perf_event_paranoid=-1

exact results depend on the CPU microarchitecture and ISA

  $ cat /proc/cpuinfo

and google the model name (e.g., "Xeon Gold 6126")
the following experiments show results on a Skylake-X CPU
Skylake-X is a variant of Skylake supporting AVX-512 (the login node or the big partition of the IST cluster)
there is also a (plain) Skylake, which has the same microarchitecture but does not support AVX-512 (marketing conspiracy?)
your newest laptop will be Kaby Lake, even newer than Skylake, but it does not have AVX-512 either. Its limit is:
2 × 256-bit FMAs/cycle = 32 flops/cycle
8 / 54
Let’s run it!
compile

  $ gcc -o axpy -march=native -O3 axpy.c

and run!

  $ ./axpy simd
  algo = simd
  m = 8
  n = 1000000000
  flops = 32000000000
  4000530984 CPU clocks, 2967749142 REF clocks, 1144168403 ns
  4.000531 CPU clocks/iter, 2.967749 REF clocks/iter, 1.144168 ns/iter
  7.998938 flops/CPU clock, 10.782583 flops/REF clock, 27.967911 GFLOPS
took ≈ 4 CPU clocks/iteration ≈ 8 flops/clock
≈ 1/8 of the single core peak (64 flops/clock)
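a quick check of where these numbers come from, assuming the two-FMAs-per-cycle limit discussed later and 16 single-precision lanes per 512-bit vector:

  peak = 2 FMAs/cycle × 16 lanes × 2 flops/FMA = 64 flops/cycle
  measured = 32 flops/iter ÷ 4.0 clocks/iter ≈ 8 flops/clock ≈ 1/8 of peak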
9 / 54
How to investigate
put a landmark in the assembly code

  asm volatile ("# axpy_simd: ax+c loop begin");
  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
  asm volatile ("# axpy_simd: ax+c loop end");

compile into assembly

  $ gcc -S -march=native -O3 axpy.c

and see axpy.s in your editor
10 / 54
Assembly
  # axpy_simd: ax+c loop begin
  # 0 "" 2
  #NO_APP
  testq %rdi, %rdi
  jle .L659
  xorl %edx, %edx
  .p2align 4,,10
  .p2align 3
  .L660:
  addq $1,%rdx
  vfmadd132ps %zmm0,%zmm1,%zmm2
  cmpq %rdx,%rdi
  jne .L660
  .L659:
  #APP
  # 63 "axpy.cc" 1
  # axpy_simd: ax+c loop end
why only ≈ 8 flops/clock?
11 / 54
Suspect looping overhead?
if you suspect the overhead of the other instructions, here is an unrolled version that has far fewer overhead instructions
its performance is identical

  #pragma GCC optimize("unroll-loops", 8)
  long axpy_simd(long n, floatv a,
                 floatv* X, floatv c) {
    ...
    for (i = 0; i < n; i++) {
      x = a * x + c;
    }
  }

⇒

  .L1662:
  addq $8, %rdx
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  cmpq %rdx,%rdi
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  jne .L1662
12 / 54
3 Latency
13 / 54
Latency and throughput
a (Skylake-X) core can execute two vfmaddps instructions every cycle
yet, it does not mean the result of the vfmaddps at line 3 below is available in the next cycle for the vfmaddps at the next line
1 .L1662:
2 addq $8, %rdx
3 vfmadd132ps %zmm0,%zmm1,%zmm2
4 vfmadd132ps %zmm0,%zmm1,%zmm2
5 cmpq %rdx,%rdi
6 vfmadd132ps %zmm0,%zmm1,%zmm2
7 ...
8 vfmadd132ps %zmm0,%zmm1,%zmm2
9 jne .L1662
what you need to know:
"two vfmadd132ps instructions every cycle" refers to the throughput
each instruction has a specific latency (> 1 cycle)
14 / 54
Latencies
instruction            Haswell  Broadwell  Skylake
fp add                    3         3         4
fp mul                    5         3         4
fp fmadd                  5         5         4
typical integer ops       1         1         1
...                      ...       ...       ...
http://www.agner.org/optimize/ is an invaluable resource
put the following two docs under your pillow
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs
15 / 54
Our code in light of latencies
in our code, a vfmadd uses the result of the immediately preceding vfmadd
that was obvious from the source code too

  .L1662:
  addq $8, %rdx
  vfmadd132ps %zmm0,%zmm1,%zmm2
  vfmadd132ps %zmm0,%zmm1,%zmm2
  cmpq %rdx,%rdi
  vfmadd132ps %zmm0,%zmm1,%zmm2
  ...
  vfmadd132ps %zmm0,%zmm1,%zmm2
  jne .L1662

  for (i = 0; i < n; i++) {
    x = a * x + c;
  }
Conclusion:
the loop can’t run faster than 4 cycles/iteration
[figure: a chain of vfmaddps instructions, each writing zmm2 and feeding the next, so they must execute serially]
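in other words, a minimal estimate from the latency alone:

  time ≈ n iterations × 4 cycles (FMA latency) = 4n cycles

which matches the measured 4.000531 CPU clocks/iter almost exactly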
16 / 54
CPU clocks vs. reference clocks
CPU changes clock frequency depending on the load (DVFS)
the reference clock runs at a constant frequency (it is always proportional to absolute time)
an instruction takes a specified number of CPU clocks, not reference clocks
the CPU clock is therefore more predictable and more convenient for precise reasoning about the code
[figure: timeline showing vfmaddps instructions evenly spaced in CPU clocks, while reference clocks and absolute time advance at a different rate]
17 / 54
How to overcome latencies?
increase parallelism (there is no other way)!
you can't make a serial chain of computation run faster (change the algorithm if you want to)
[figure: the same serial chain of vfmaddps instructions through zmm2]
you can only increase throughput, by running multiple independent chains
we expect the following to finish in the same number of cycles as the original one, even though it performs twice as many flops

  for (i = 0; i < n; i++) {
    x0 = a * x0 + c;
    x1 = a * x1 + c;
  }
18 / 54
Increase the number of chains further . . .
we expect to reach peak FLOPS with ≥ 2/(1/4) = 8 chains

  template<int nv>
  long axpy_simd_c( ... ) {
    for (long i = 0; i < n; i++) {
      for (long j = 0; j < nv; j++) {
        X[j] = a * X[j] + c;
      } } }
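the "8" is just latency × throughput; spelled out with the Skylake-X numbers used in these slides:

  each chain completes 1 FMA every 4 cycles = 1/4 FMA/cycle
  chains needed ≥ (2 FMAs/cycle) / (1/4 FMA/cycle per chain) = 4 × 2 = 8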
19 / 54
Results
[graph: CPU cycles/iter and flops/CPU cycle vs. the number of variables (chains)]

chains  clocks/iter  flops/clock
  1       4.0          7.999
  2       4.001       15.998
  3       4.001       23.996
  4       4.001       31.995
  5       4.049       39.517
  6       4.001       47.994
  7       4.113       54.458
  8       4.001       63.991
  9       4.567       63.059
 10       5.001       63.982
 11       5.501       63.991
 12       6.001       63.988
 13       6.502       63.981
 14       7.001       63.989
 15       7.501       63.99
20 / 54
4 Instruction Level Parallelism (ILP)
21 / 54
Superscalar processors
What you need to know:
instructions are decoded in the program order,
but the execution order is not constrained by it (out-of-order execution)
⇒ as a crude approximation, performance is constrained only by
latency: the length of the dependent chain of instructions
throughput: the number of instructions of a particular type it can execute per cycle (e.g., two fmadds/cycle)
22 / 54
A general theory of workload performance on
aggressive superscalar machines
dependency constrains how fast a computation can proceed, even if there were an infinite number of execution resources
increase the number of independent computations and you increase throughput, until it hits the limit of execution resources
[graph: the same plot as before — CPU cycles/iter and flops/CPU cycle vs. the number of variables]
23 / 54
5 Analyzing throughput
24 / 54
What if the number of chains is a variable?
purpose: illustrate the same concept with a slightly more complex/common case
let's try the following code, identical to the one with which we successfully achieved nearly peak performance, except that the number of variables (nv above, m below) is now a runtime variable, not a compile-time constant

  void axpy_simd_m(..., long m) {
    for (long i = 0; i < n; i++) {
      for (long j = 0; j < m; j++) {
        X[j] = a * X[j] + c;
      } } }
25 / 54
When we experiment . . .
chains  clocks/iter  flops/clock
  1      11.002        2.909
  2      11.037        5.799
  3      11.028        8.705
  4      11.131       11.499
  5      12.021       13.31
  6      14.018       13.696
  7      16.013       13.989
  8      18.006       14.218
  9      20.57        14.001
 10      22.017       14.535
 11      24.008       14.662
 12      26.024       14.756
 13      28.011       14.851
 14      30.022       14.923
 15      32.653       14.7

[graph: CPU cycles/iter and flops/CPU cycle vs. the number of variables]
the pattern is similar, but there are two differences:
the latency of a single update became ≈ 11 cycles
the throughput hits a plateau at ≈ 2 cycles/iter
26 / 54
Take a look at assembly
it looks like:

  .L1811:
  vmovaps %zmm0, %zmm2
  addq $64, %rcx
  vfmadd132ps -64(%rcx), %zmm1, %zmm2
  vmovups %zmm2, -64(%rcx)
  cmpq %rcx, %r8
  jne .L1811

what's the difference from the code we have seen before (whose latency = 4 cycles)?

  .L1800:
  addq $1, %rdx
  vfmadd132ps %zmm0, %zmm1, %zmm2
  cmpq %rdx, %rdi
  jne .L1800
27 / 54
The reason for the latency (11 cycles)
both differences stem from the fact that the code now involves loads/stores
what you need to know:
just like FMA, each instruction has its own latency
28 / 54
Latencies of various instructions
instruction            Haswell  Broadwell  Skylake
fp add                    3         3         4
fp mul                    5         3         4
fp fmadd                  5         5         4
typical integer ops       1         1         1
load                      3         3         3
store(*)                  4         4         3
...                      ...       ...       ...
(*) 256 bit (not sure for 512 bit store)
3 + 4 + 3 = 10 ≠ 11; I could not find information that confirms the extra cycle, but a simpler experiment shows the same result (maybe a 512-bit store takes 4 cycles?)
29 / 54
The reason for the throughput
what you need to know:
Just like FMA, all instructions have their own throughput limits, due to execution resources (dispatch ports and execution units)
“two vfmadds per cycle” is just an example of it
30 / 54
Some representative throughput limits
Throughput = the number of instructions of that kind that can be executed per cycle
instruction            Haswell  Broadwell  Skylake
fp add/mul/fmadd          2         2         2
load                      2         2         2
store                     1         1         1
typical integer ops       4         4         4
...                      ...       ...       ...
31 / 54
A back of envelope calculation
instruction                          type           1/throughput
vmovaps %zmm0,%zmm2                  register move      0.33
addq $64,%rcx                        int op             0.25
vfmadd132ps -64(%rcx),%zmm1,%zmm2    load + FMA         0.5, 0.5
vmovups %zmm2,-64(%rcx)              store              1.0
cmpq %rcx,%r8                        compare            0.25
jne .L1811                           jump               1-2
I don’t know what 1-2 means
we can conclude that the throughput is ≤ 1 iteration/cycle, due to the store
more general cases require understanding dispatch ports
32 / 54
Dispatch ports
each instruction (µ-operation) is dispatched to a specific port
Skylake-X ports:
fmadd → port 0 or 1
load → port 2 or 3
store → port 4
load/store address generation → port 7
int → port 0, 1, 5, or 6
etc.
source:
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
each port can take only a single operation per cycle
this determines the aggregate throughput of all instructions that go to that port
knowing the destination ports of the instructions, one can calculate the throughput limit of a given loop
33 / 54
Intel Architecture Code Analyzer
a great tool to analyze the throughput limit (and, to some extent, the latency)
https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
34 / 54
How to overcome the throughput limit?
the goal is two iterations/cycle (throughput limit of FMA)
the bottleneck is a store instruction (1/cycle)
we obviously need to quit loading/storing data for every single fmadd

  for (i = 0; i < n; i++) {
    for (j = 0; j < nv; j++) {
      x[j] = a * x[j] + c; // load; fmadd; store
    }
  }

the minimum "unit" of a computation should look like:

  load x[j] to a register;
  do "a * x + c" several times on the register;
  store the result to x[j];
and run multiple independent units
35 / 54
Several ways to arrange computation
take a variable at a time and run it until the end (suffers from latency)
advance all variables one step at a time (suffers from the store throughput)
strategy 1: take a few (a small and constant number of) variables, keep them on registers, and run them until the end
strategy 2: advance all variables, a few steps at a time, keeping each on a register between its load and its store
[figures: i (steps) × j (variables) diagrams illustrating each of the four arrangements]
36 / 54
Implementing strategy 1
say we take b (e.g., ten) variables at a time and run them until the end

  1 for (j = 0; j < nv; j += b) {
  2   /* run b variables until the end */
  3   for (i = 0; i < n; i++) {
  4     for (jj = j; jj < j + b; jj++) {
  5       xx[jj] = a * xx[jj] + c;
  6     }
  7   }
  8 }

we hope it loads/stores each variable only once through the i loop (line 2)!
this coding depends on the compiler's smartness we have witnessed:
promote fixed-size arrays into registers
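if one prefers not to rely on that, the same structure can be written with explicit local variables; a minimal sketch with b = 2, using the floatv type and the xx array from the slides (the variable names x0, x1 are illustrative, not taken from 06axpy):

  for (j = 0; j + 2 <= nv; j += 2) {
    floatv x0 = xx[j], x1 = xx[j+1];   /* load each variable once */
    for (i = 0; i < n; i++) {
      x0 = a * x0 + c;                 /* two independent chains per iteration */
      x1 = a * x1 + c;
    }
    xx[j] = x0;                        /* store each variable once */
    xx[j+1] = x1;
  }

per the earlier results, b needs to be ≥ 8 for this to reach peak FLOPS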
37 / 54
Implementing strategy 2
advance all variables, say, three steps at a time

  for (i = 0; i < n; i += 3) {
    /* run all variables 3 steps */
    for (j = 0; j < m; j++) {
      for (ii = 0; ii < 3; ii++) {
        x[j] = a * x[j] + c;
      }
    }
  }

again, we hope the compiler's smartness eliminates the intermediate loads/stores (the parts shown in purple on the original slide)
the latency of a single j iteration increases, but we hope the j loop exposes lots of independent computations
38 / 54
Results of strategy 1

chains  clocks/iter  flops/clock
  1       4.0          7.999
  2       4.0         15.998
  3       4.001       23.996
  4       4.001       31.996
  5       4.018       39.818
  6       4.001       47.991
  7       4.114       54.447
  8       4.001       63.99
  9       4.583       62.834
 10       5.001       63.986
 11       5.501       63.991
 12       6.001       63.993
 13       6.501       63.988
 14       7.001       63.991
 15       7.501       63.987

[graph: CPU cycles/iter and flops/CPU cycle vs. the number of variables]
39 / 54
6 A simple yet fairly fast single-core matrix multiply
40 / 54
Developing near peak FLOPS matrix multiply
let's develop a (single core) matrix multiply that runs at fairly good FLOPS on Skylake-X
it is a great application of the concept you have just learned
C += A * B
[figure: C (M × N) += A (M × K) * B (K × N)]
41 / 54
A few simplifying assumptions
we add assumptions that M, N, and K are multiples of certain numbers along the way (forget about how to process "remainder" rows/columns in these slides)
we assume matrix sizes are known at compile time and are "convenient" (e.g., they are small)
multiplication of larger (and unknown-size) matrices can be built on top of this
42 / 54
Step 1: Baseline code
  $ ./mmc00
  M = 12, N = 32, K = 192
  A : 12 x 192 (ld=192) 9216 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 12 x 32 (ld=32) 1536 bytes
  total = 35328 bytes
  ...
  3.456 CPU clocks/iter
  2.520 REF clocks/iter
  0.579 flops/CPU clock
  0.794 flops/REF clock
  2.058 GFLOPS

  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < K; k++)
        C(i,j) += A(i,k) * B(k,j);
it runs at ≈ 3.5 clocks / innermost iteration ≈ the latency of the fmadd on C(i,j), minus a fraction
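the C(i,j), A(i,k), B(k,j) notations are accessors whose definitions are not shown in these slides; given the row-major layouts and leading dimensions printed above, a plausible minimal sketch is (the macro form and the names lda, ldb, ldc are assumptions):

  /* hypothetical accessors: row-major arrays with leading dimensions lda, ldb, ldc */
  #define A(i,k) A[(i) * lda + (k)]
  #define B(k,j) B[(k) * ldb + (j)]
  #define C(i,j) C[(i) * ldc + (j)]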
43 / 54
Step 1: analysis
latency limit : latency of FMA
I don't know why it is slightly smaller than 4, but note that the true dependence occurs only through the additions
throughput limit : not important
→ ≈ 2 flops / 4 cycles = 0.5 flops/cycle
44 / 54
Step 2: Vectorization
  M = 12, N = 32, K = 192
  A : 12 x 192 (ld=192) 9216 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 12 x 32 (ld=32) 1536 bytes
  total = 35328 bytes
  repeat : 100000 times
  ...
  3.555 CPU clocks/iter
  2.681 REF clocks/iter
  9.002 flops/CPU clock
  11.936 flops/REF clock
  30.960 GFLOPS

  for (i = 0; i < M; i++)
    for (j = 0; j < N; j += L)
      for (k = 0; k < K; k++)
        C(i,j:j+L) += A(i,k) * B(k,j:j+L);
assumption: N is a multiple of SIMD lanes (L)
45 / 54
Step 2: analysis
almost the same as step 1, except that each iteration now does 16 FMAs (as opposed to one FMA)
→ ≈ 32 flops / 4 cycles = 8 flops/cycle
46 / 54
Step 3: increase parallelism!
  login000:07mm$ ./mmc02
  M = 8, N = 32, K = 192
  A : 8 x 192 (ld=192) 6144 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 8 x 32 (ld=32) 1024 bytes
  ...
  5.451 CPU clocks/iter
  4.387 REF clocks/iter
  46.966 flops/CPU clock
  58.349 flops/REF clock
  151.341 GFLOPS

update bM vector elements of C concurrently

  for (i = 0; i < M; i += bM)
    for (j = 0; j < N; j += L)
      for (k = 0; k < K; k++)
        for (di = 0; di < bM; di++)
          C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);
Skylake requires bM ≥ 8 to reach peak FLOPS (8 = FMA latency 4 cycles × 2 FMAs/cycle)
47 / 54
Step 3: analysis
�1 for (i = 0; i < M; i += bM)
2 for (j = 0; j < N; j += L)
3 for (k = 0; k < K; k++)
4 for (di = 0; di < bM; di++)
5 C(i+di,j:j+L) += A(i+di,k) * B(k,j:j+L);
the for loop at line 4 performs:
bM loads (broadcasts) for A(i+di,k)
1 load for B(k,j:j+L)
bM FMAs
remember the load throughput = 2 loads/cycle
to achieve 2 FMAs/cycle, we must have
the number of loads ≤ the number of FMAs
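plugging in the counts above (a rough estimate that only counts loads and FMAs):

  loads per k iteration = bM + 1, FMAs = bM
  cycles for loads ≥ (bM + 1)/2  >  cycles for FMAs = bM/2   (e.g., 4.5 vs 4 for bM = 8)

so with one broadcast per FMA plus the load of B, the loads always need more cycles than the FMAs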
we need to remove an extra load instruction
48 / 54
Step 4: Reuse an element of A
  $ ./mmc03
  M = 8, N = 32, K = 192
  A : 8 x 192 (ld=192) 6144 bytes
  B : 192 x 32 (ld=32) 24576 bytes
  C : 8 x 32 (ld=32) 1024 bytes
  ...
  4.926 CPU clocks/iter
  4.938 REF clocks/iter
  51.969 flops/CPU clock
  51.846 flops/REF clock
  134.474 GFLOPS
update a bM’ × bN block (of vector elements) rather than bM × 1

  1 for (i = 0; i < M; i += bM’)
  2   for (j = 0; j < N; j += bN * L)
  3     for (k = 0; k < K; k++)
  4       for (di = 0; di < bM’; di++)
  5         for (dj = 0; dj < bN * L; dj += L)
  6           C(i+di,j+dj:j+dj+L) += A(i+di,k) * B(k,j+dj:j+dj+L);
49 / 54
Step 4: Analysis
the for loop at line 4 performs:
bM’ loads (broadcast) for A(i+di,k)
bN loads for B(k,j:j+L)
bM’ × bN FMAs
the minimum requirement for it to achieve the peak FLOPS is bM’ × bN ≥ 8
in the experiments, when we set bM’ = 6 and bN = 2, it gets 59 flops/cycle (92% of the peak)
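as a check of that configuration, the same rough count as before:

  loads per k iteration = bM’ + bN = 6 + 2 = 8   →  ≥ 4 cycles at 2 loads/cycle
  FMAs per k iteration = bM’ × bN = 12           →  ≥ 6 cycles at 2 FMAs/cycle

so the FMAs, not the loads, are now the limiting resource, and the 12 independent FMAs also satisfy the ≥ 8 needed to cover the FMA latency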
we need to note that this happens only when the matrix is small (M = 12, N = 32, K = 160) and we repeat it many times
the issue for large matrices will be the next topic
50 / 54
Simultaneous Multithreading (SMT)
each physical core has several hardware threads, or virtual cores
recent Xeon processors (including Skylake-X): 2
Knights Landing/Mill: 4
IBM POWER: 8
a.k.a. Hyper-Threading®
virtual cores on a single physical core
are concurrently executed by the hardware
have their own registers (switching between them has almost no overhead)
share most execution resources (dispatch ports, floating point units, L1/L2 caches, etc.)
[figure: a chip (socket, node) contains physical cores, each with its own L1/L2 caches and multiple hardware threads (virtual cores, CPUs), sharing the L3 cache and memory controller]
51 / 54
Performance implications of virtual cores
having as many threads as virtual cores on a physical core
helps improve the throughput of latency-bound applications, but does not help throughput-bound applications
note: if you have more threads than virtual cores, operating systems get involved to switch between them (at a much coarser granularity, say 1 ms (10⁻³ s) rather than every cycle ∼ 1 ns (10⁻⁹ s))
it never helps mitigate the latency of arithmetic operations
it helps mitigate much bigger latencies, such as I/O (say, when accessing an HDD)
52 / 54
Takeaways (1)
peak FLOPS of recent Intel CPUs = "execute two fmadds every cycle" (no other combinations)
other processors have different limits, but the basics are the same
no, single-core performance is not about reducing the number of instructions
it's about how to increase parallelism
SIMD
ILP
53 / 54
Takeaways (2)
dependent instructions incur latencies and hinder parallelism
independent instructions are executed in parallel, up to throughput limits
throughput limits are determined by dispatch ports
54 / 54