LAWRENCE BERKELEY NATIONAL LABORATORY
FUTURE TECHNOLOGIES GROUP

The Roofline Model
Samuel Williams
Lawrence Berkeley National Laboratory
Outline

Challenges / Goals
Fundamentals
Roofline Performance Model
Example: Heat Equation
Example: SpMV
Alternate Roofline Formulations
Summary
Challenges / Goals
Four Architectures
[Figure: block diagrams of four architectures — AMD Barcelona (dual-socket, quad-core Opterons, 2MB shared quasi-victim cache per socket, 2×10.66 GB/s DDR2, HyperTransport between sockets); Sun Victoria Falls (dual-socket, 8 MT SPARC cores per socket, 4MB shared L2, 21.33 GB/s read / 10.66 GB/s write FBDIMM per socket); NVIDIA G80 (8 thread clusters, 192KB texture-only L2, 86.4 GB/s GDDR3); IBM Cell Blade (two sockets of PPE + 8 SPEs with 256K local stores, 25.6 GB/s XDR per socket).]
Challenges / Goals
We have extremely varied architectures; moreover, the characteristics of numerical methods can vary dramatically.
The result is that performance and the benefit of optimization can vary significantly from one architecture × kernel combination to the next.
We wish to understand whether or not we've attained good performance (a high fraction of some theoretical peak).
We wish to identify performance bottlenecks and enumerate potential remediation strategies.
Fundamentals
Little’s Law
Little's Law:

    Concurrency = Latency × Bandwidth
        - or -
    Effective Throughput = Expressed Concurrency / Latency

Applied to memory: "bandwidth" is conventional memory bandwidth, "latency" is memory latency, and "concurrency" is the number of bytes expressed to the memory subsystem.
Applied to compute: "bandwidth" is the number of floating-point units, "latency" is the functional-unit latency, and "concurrency" is the number of concurrent (parallel) operations.

For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP (2 × 4) to fully utilize the machine.
Little’s Law Examples
Applied to FPUs: consider a CPU with 2 FPUs, each with a 4-cycle latency. Little's Law states that we must express 8-way ILP to fully utilize the machine.
Solution: unroll/jam the code to express 8 independent FP operations.
Note that simply unrolling dependent operations (e.g., a reduction) does not increase ILP; it merely amortizes loop overhead.

Applied to memory: consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little's Law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.
On today's superscalar processors, hardware stream prefetchers speculatively load consecutive elements.
Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.
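The unroll/jam advice can be sketched in C. This hypothetical reduction (`sum8` is my name, not from the deck) keeps 8 independent accumulators so 8 FP adds are in flight at once, matching 2 FPUs × 4-cycle latency; a single-accumulator unroll would only amortize loop overhead:

```c
#include <stddef.h>

/* Sketch: a sum reduction with 8 independent dependence chains.
   Each accumulator forms its own chain, so up to 8 adds are in
   flight; a single accumulator would serialize every add. */
double sum8(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {   /* unrolled, 8 independent adds */
        s0 += x[i];     s1 += x[i + 1];
        s2 += x[i + 2]; s3 += x[i + 3];
        s4 += x[i + 4]; s5 += x[i + 5];
        s6 += x[i + 6]; s7 += x[i + 7];
    }
    for (; i < n; i++)             /* remainder loop */
        s0 += x[i];
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```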
Three Classes of Locality
Temporal locality: reusing data (in registers or cache) multiple times amortizes the impact of limited bandwidth. Transform loops or algorithms to maximize reuse.
Spatial locality: data is transferred from cache to registers in words, but from memory to cache in 64-128 byte lines. Using every word in a line maximizes spatial locality. Transform data structures into a structure-of-arrays (SoA) layout.
Sequential locality: many memory access patterns touch cache lines sequentially. CPU hardware stream prefetchers exploit this observation to speculatively load data and hide memory latency. Transform loops to generate (a few) long, unit-stride accesses.
Arithmetic Intensity
True arithmetic intensity (AI) ~ Total Flops / Total DRAM Bytes.
Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); others have constant intensity.
Arithmetic intensity is ultimately limited by compulsory traffic, and is diminished by conflict or capacity misses.

[Figure: spectrum of arithmetic intensity — O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.]
NUMA
Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth with which a given address can be read varies dramatically between cores.
Exploit NUMA (affinity + first touch) when you malloc and initialize data. The concept is similar to data decomposition for distributed memory.
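A minimal sketch of the "affinity + first touch" advice (function name is mine; the OpenMP pragma is one common way to pin the touching threads — without `-fopenmp` the pragma is ignored and the loop runs serially):

```c
#include <stdlib.h>

/* Sketch: parallel first-touch initialization.  Under a first-touch
   page-placement policy, each page is mapped near the thread that
   first writes it, so initialize with the same thread decomposition
   the compute loops will later use. */
double *numa_aware_alloc(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;                /* the first touch maps the page */
    return a;
}
```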
Roofline Model
Overlap of Communication
Consider a simple example in which a floating-point kernel maintains a working set in DRAM.
We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating-point operations:

    Time = max( Bytes / STREAM Bandwidth, Flops / Flop/s )
Roofline Model: Basic Concept
Synthesize communication, computation, and locality into a single, visually intuitive performance figure using bound-and-bottleneck analysis:

    AttainablePerformance_ij = min( FLOP/s with Optimizations 1..i,
                                    AI × Bandwidth with Optimizations 1..j )

where optimization i can be SIMDize, or unroll, or SW prefetch, etc.
Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
Moreover, it provides insight into which optimizations will potentially be beneficial.
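The bound itself is trivial to compute; a minimal sketch (function name is mine):

```c
/* Sketch of the Roofline bound: attainable GFLOP/s is the minimum of
   the in-core ceiling and (arithmetic intensity x bandwidth ceiling). */
double roofline_gflops(double peak_gflops, double bw_gbs, double ai)
{
    double bw_bound = ai * bw_gbs;   /* memory-bound regime */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For instance, a kernel with AI = 1/3 on a machine with 73.6 GFLOP/s peak and 21.33 GB/s STREAM bandwidth (the Barcelona numbers used later in this deck) is bandwidth-bound at roughly 7.1 GFLOP/s.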
Example
Consider the Opteron 2356 (Barcelona):
  dual socket (NUMA)
  limited HW stream prefetchers
  quad-core (8 cores total), 2.3 GHz
  2-way SIMD (double precision)
  separate FPMUL and FPADD datapaths
  4-cycle FP latency
Assuming expression of parallelism is the challenge on this architecture, what would the Roofline model look like?
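As a worked check (the arrangement is mine, using the figures on this slide), peak double-precision throughput is just the product of these parallelism factors; each Roofline ceiling will later remove one of them:

```c
/* Peak DP GFLOP/s for the Opteron 2356 (Barcelona):
   sockets x cores x GHz x SIMD width x FP pipes (FPMUL + FPADD). */
double barcelona_peak(void)
{
    double sockets = 2.0, cores = 4.0, ghz = 2.3;
    double simd    = 2.0;   /* 2-way DP SIMD             */
    double pipes   = 2.0;   /* separate FPMUL and FPADD  */
    return sockets * cores * ghz * simd * pipes;   /* 73.6 GFLOP/s */
}
```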
Roofline Model: Basic Concept

Plot on a log-log scale. Given the AI, we can easily bound performance. But architectures are much more complicated.
We will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: Roofline for the Opteron 2356 (Barcelona) — attainable GFLOP/s (0.5 to 256) vs. actual FLOP:byte ratio (1/8 to 16), bounded by the Stream bandwidth diagonal and peak DP.]
Roofline Model: Computational Ceilings

Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak.
We call these "ceilings". They act like constraints on performance.

[Figure: Barcelona roofline with a "mul/add imbalance" ceiling below peak DP.]
Roofline Model: Computational Ceilings

Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.

[Figure: Barcelona roofline adding a "w/out SIMD" ceiling below "mul/add imbalance".]
Roofline Model: Computational Ceilings

On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4×.

[Figure: Barcelona roofline adding a "w/out ILP" ceiling below "w/out SIMD".]
Roofline Model: Communication Ceilings

We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: Barcelona roofline with the Stream bandwidth diagonal and peak DP.]
Roofline Model: Communication Ceilings

Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: Barcelona roofline adding a "w/out SW prefetch" bandwidth ceiling.]
Roofline Model: Communication Ceilings

Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
We could continue this by examining strided or random memory access patterns.

[Figure: Barcelona roofline adding a "w/out NUMA" bandwidth ceiling below "w/out SW prefetch".]
Roofline Model: Computation + Communication Ceilings

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: Barcelona roofline with all computational ceilings (mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (w/out SW prefetch, w/out NUMA).]
Roofline Model: Locality Walls

Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower.
Walls are unique to the architecture-kernel combination.

    AI = FLOPs / Compulsory Misses

[Figure: Barcelona roofline with a vertical "only compulsory miss traffic" wall.]
Cache Behavior
Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines, and lines are organized into sets and ways (associativity). Thus, we must mimic the effect of Mark Hill's 3 C's of caches.
The impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent. Ultimately, they reduce the actual flop:byte ratio.
Moreover, many caches are write-allocate: a write-allocate cache reads in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio).
Because programs access data in words, but hardware transfers it in 64- or 128-byte cache lines, spatial locality is key. Array-of-structures data layouts can lead to dramatically lower flop:byte ratios; e.g., if a program only operates on the "red" field of a pixel, bandwidth is wasted.
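The pixel example can be made concrete. This hypothetical sketch (all names are mine) contrasts the two layouts; both functions compute the same result, but the AoS version drags every 16-byte struct through the cache to use only 4 bytes of it:

```c
#include <stddef.h>

/* Array-of-structures: 16 bytes per pixel. */
struct PixelAoS { float r, g, b, a; };

float sum_red_aos(const struct PixelAoS *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].r;      /* 4 useful bytes out of every 16 moved */
    return s;
}

/* Structure-of-arrays: separate r[], g[], b[], a[] arrays. */
float sum_red_soa(const float *r, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += r[i];        /* every byte of every cache line is used */
    return s;
}
```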
Roofline Model: Locality Walls

Write-allocate traffic diminishes arithmetic intensity further:

    AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: Barcelona roofline adding a "+write allocation traffic" wall left of the compulsory-only wall.]
Roofline Model: Locality Walls

Capacity misses diminish it further:

    AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+capacity miss traffic" wall.]
Roofline Model: Locality Walls

Conflict misses diminish it further still:

    AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+conflict miss traffic" wall.]
Roofline Model: Locality Walls

Optimizations remove these walls and ceilings, which act to constrain performance.

[Figure: Barcelona roofline with all ceilings and all locality walls.]
Instruction Issue Bandwidth
On a superscalar processor, there is likely ample instruction issue bandwidth. This allows loads, integer, and FP instructions to be issued simultaneously.
As such, we assumed that expression of parallelism was the underlying in-core challenge.
However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.
Roofline Model: Instruction Mix

As the instruction mix shifts away from floating point, finite issue bandwidth begins to limit in-core performance.
On Niagara2, with dual issue units but only 1 FPU, FP instructions must constitute ≥50% of the mix to attain peak performance.
A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges.

[Figure: UltraSparc T2+ T5140 (Niagara2) roofline with ceilings at 25% FP and 12% FP below "peak DP, ≥50% FP", plus "w/out SW prefetch" and "w/out NUMA" bandwidth ceilings.]
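The 50% / 25% / 12% ceilings follow from a simple issue-bandwidth model (this formulation is mine, constructed to be consistent with the figure): with `issue_per_cycle` total slots and `fpu_per_cycle` FP slots, the attainable FP rate is capped by the FP fraction of the mix.

```c
/* Sketch: FP throughput limited by instruction mix.
   fp_fraction is the fraction of issued instructions that are FP. */
double issue_limited_flops(double peak_flops, double issue_per_cycle,
                           double fpu_per_cycle, double fp_fraction)
{
    double mix_rate = fp_fraction * issue_per_cycle;  /* FP issues/cycle */
    double fp_rate  = mix_rate < fpu_per_cycle ? mix_rate : fpu_per_cycle;
    return peak_flops * (fp_rate / fpu_per_cycle);
}
```

With dual issue (2) and 1 FPU, a 50% FP mix sustains peak, 25% gives half, and 12.5% gives a quarter — matching the ceilings in the figure.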
Optimization Categorization
Three categories of optimization:

Maximizing (attained) in-core performance
  • Exploit in-core parallelism (ILP, DLP, etc.)
  • Good (enough) floating-point balance
  Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches

Maximizing (attained) memory bandwidth
  • Exploit NUMA
  • Hide memory latency
  • Satisfy Little's Law
  Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Minimizing (total) memory traffic
  • Eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior
  Techniques: cache blocking, array padding, data compression, streaming stores
Examples
Multicore SMPs Used
[Figure: block diagrams of the four multicore SMPs used — AMD Opteron 2356 (Barcelona): dual-socket, quad-core, 2×10.66 GB/s DDR2; Intel Xeon E5345 (Clovertown): dual-socket, quad-core, FSBs to a chipset with 21.33 GB/s (read) / 10.66 GB/s (write) FBDIMM bandwidth; Sun T2+ T5140 (Victoria Falls): dual-socket, 8 MT SPARC cores per socket, 21.33 GB/s (read) / 10.66 GB/s (write) per socket; IBM QS20 Cell Blade: two Cell processors (PPE + 8 SPEs with 256K local stores each), 25.6 GB/s XDR per socket.]
Heat Equation
7-point Stencil
[Figure: 3D PDE grid (+X, +Y, +Z) and the 7-point heat-equation stencil touching (x,y,z) and its six neighbors x±1, y±1, z±1.]

The simplest discretization of the Laplacian operator results in a constant-coefficient 7-point stencil:

for all x,y,z:
  u(x,y,z,t+1) = alpha*u(x,y,z,t) +
                 beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) +
                        u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

Clearly, each stencil performs:
  8 floating-point operations
  8 memory references, all but 2 of which should be filtered by an ideal cache
  6 memory streams, all but 2 of which should be filtered (fewer than the # of HW prefetchers)
Ideally, AI = 0.5; however, write-allocate traffic bounds it to 0.33.
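A runnable C version of the stencil above, on an n×n×n grid with a one-point halo (padded dimension n+2; function and macro names are mine):

```c
#include <stddef.h>

/* Index into a flattened (m x m x m) grid. */
#define IDX(x, y, z, m) ((size_t)(z) * (m) * (m) + (size_t)(y) * (m) + (x))

/* One sweep of the constant-coefficient 7-point stencil over the
   interior of a padded grid.  u and u_next are (n+2)^3 arrays. */
void stencil7(double *u_next, const double *u, int n,
              double alpha, double beta)
{
    int m = n + 2;                       /* padded dimension */
    for (int z = 1; z <= n; z++)
        for (int y = 1; y <= n; y++)
            for (int x = 1; x <= n; x++)
                u_next[IDX(x, y, z, m)] =
                    alpha * u[IDX(x, y, z, m)] +
                    beta  * (u[IDX(x, y, z - 1, m)] + u[IDX(x, y - 1, z, m)] +
                             u[IDX(x - 1, y, z, m)] + u[IDX(x + 1, y, z, m)] +
                             u[IDX(x, y + 1, z, m)] + u[IDX(x, y, z + 1, m)]);
}
```

Per interior point, this performs the 8 flops and 8 references discussed above.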
Roofline model for Stencil (out-of-the-box code)

Large datasets; 2 unit-stride streams; no NUMA; little ILP; no DLP; far more adds than multiplies (imbalance); ideal flop:byte ratio of 1/3.
High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio.

[Figure: rooflines for the Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM Cell Blade, each with its in-core and bandwidth ceilings; no naive Cell implementation.]
Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3.
Clovertown has huge caches but is pinned to the lower bandwidth ceiling.
Cache management is essential when capacity per thread is low.

[Figure: rooflines for the four SMPs with stencil performance after NUMA, cache blocking, unrolling, and prefetch optimizations; no naive Cell implementation.]
Roofline model for Stencil (SIMDization + cache bypass)

Make SIMDization explicit. Technically, this swaps the ILP and SIMD ceilings.
Use the cache-bypass (streaming store) instruction movntpd. This increases the flop:byte ratio to ~0.5 on x86/Cell.

[Figure: rooflines for the four SMPs with stencil performance after SIMDization and cache bypass.]
SpMV
Sparse Matrix-Vector Multiplication
What's a sparse matrix?
  Most entries are 0.0.
  There is a performance advantage in only storing/operating on the nonzeros.
  Significant metadata is required to reconstruct the matrix structure.
What's SpMV?
  Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
Challenges:
  Very low arithmetic intensity (often < 0.166 flops/byte).
  Difficult to exploit ILP (bad for pipelined or superscalar cores).
  Difficult to exploit DLP (bad for SIMD).

CSR reference code:

for (r = 0; r < A.rows; r++) {
  double y0 = 0.0;
  for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];
  }
  y[r] = y0;
}

[Figure: (a) algebraic conceptualization of y = Ax; (b) the CSR data structure — A.val[], A.rowStart[], A.col[].]
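The reference loop can be wrapped into a self-contained function (the signature is mine, following the slide's val / rowStart / col naming; rowStart has rows+1 entries):

```c
/* Sparse matrix-vector multiply y = Ax in CSR format.
   val/col hold the nonzeros; rowStart[r]..rowStart[r+1] delimits row r. */
void spmv_csr(int rows, const int *rowStart, const int *col,
              const double *val, const double *x, double *y)
{
    for (int r = 0; r < rows; r++) {
        double y0 = 0.0;
        for (int i = rowStart[r]; i < rowStart[r + 1]; i++)
            y0 += val[i] * x[col[i]];   /* indirect access to x */
        y[r] = y0;
    }
}
```

Note the indirection through `col[i]`: x is gathered, which is exactly what makes ILP and DLP hard to exploit here.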
Roofline model for SpMV

Double-precision roofline models, with in-core optimizations 1..i and DRAM optimizations 1..j:

    GFlops_i,j(AI) = min( InCoreGFlops_i, StreamBW_j × AI )

FMA is inherent in SpMV (so the FMA ceiling is placed at the bottom).

[Figure: rooflines for the Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with per-architecture ceilings such as "dataset fits in snoop filter" and "bank conflicts".]
Roofline model for SpMV (overlay arithmetic intensity)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naive compulsory flop:byte < 0.166.

[Figure: the four rooflines with SpMV's arithmetic intensity overlaid; no naive SPE implementation.]
Roofline model for SpMV (out-of-the-box parallel)

Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions; naive compulsory flop:byte < 0.166.
For simplicity: a dense matrix stored in sparse format.

[Figure: the four rooflines with out-of-the-box parallel SpMV performance; no naive SPE implementation.]
Roofline model for SpMV (NUMA & SW prefetch)

Compulsory flop:byte ~ 0.166. Utilize all memory channels.

[Figure: the four rooflines with SpMV performance after NUMA and software-prefetch optimizations; no naive SPE implementation.]
Roofline model for SpMV (matrix compression)

Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions.

[Figure: the four rooflines with SpMV performance after matrix compression.]
54
Various Kernels
We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.
We observe that for most of them, performance is highly correlated with DRAM bandwidth – particularly on the GPU.
Note that GTC has a strong scatter/gather component that skews STREAM-based rooflines.
[Figure: two kernel rooflines (attainable Gflop/s, 2–1024, vs. arithmetic intensity, 1/32–32): a Xeon X5550 (Nehalem) with a STREAM bandwidth diagonal, and an NVIDIA C2050 (Fermi) with a device bandwidth diagonal. Ceilings: single-precision peak, double-precision peak, and DP add-only. Plotted kernels: DGEMM, RTM/wave eqn., 27pt stencil, 7pt stencil, SpMV, GTC/pushi, and GTC/chargei.]
55
Alternate Rooflines
56
No overlap of communication and computation
Previously, we assumed perfect overlap of communication and computation.
What happens if there is a dependency (either inherent or from a lack of optimization) that serializes communication and computation?
[Diagram: two execution timelines. With overlap, time = max(bytes / STREAM bandwidth, flops / flop/s); without overlap, the two intervals are laid end to end and time = bytes / STREAM bandwidth + flops / flop/s.]
Time is now the sum of communication time and computation time. The result is that attainable flop/s only approaches its bound asymptotically as arithmetic intensity grows.
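The two timing models are easy to compare numerically. A minimal sketch, normalized to one flop, with an illustrative 100 Gflop/s machine and 10 GB/s of bandwidth (placeholder numbers, not any machine from the slides):

```python
# Attainable Gflop/s under the two timing models:
#   overlapped:  time = max(bytes/BW, flops/peak)
#   serialized:  time = bytes/BW + flops/peak
def attainable(ai, peak_gflops, bw_gbs, overlap=True):
    t_comm = (1.0 / ai) / bw_gbs   # bytes moved per flop = 1/AI
    t_comp = 1.0 / peak_gflops
    t = max(t_comm, t_comp) if overlap else (t_comm + t_comp)
    return 1.0 / t

ridge = 100.0 / 10.0  # AI where bandwidth time equals compute time
print(attainable(ridge, 100.0, 10.0, overlap=True))   # ≈ 100.0: full peak
print(attainable(ridge, 100.0, 10.0, overlap=False))  # ≈ 50.0: the 2x penalty
```

The serialized penalty is worst exactly at the ridge point, where the two times are equal, and shrinks toward zero for strongly bandwidth- or compute-dominated kernels.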
57
No overlap of communication and computation
Consider a generic machine. If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.
However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2.
58
Alternate Bandwidths
Thus far, we have assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark).
STREAM is NOT a good proxy for short-stanza/random cache-line access patterns, as memory latency (instead of just bandwidth) is being exposed.
Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling).
Similarly, if data primarily resides in the last-level cache (LLC), one should construct rooflines based on LLC bandwidth and flop:LLC-byte ratios.
For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe-byte ratio.
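The substitution is mechanical: the same min() form works for any bandwidth ceiling, provided the arithmetic-intensity denominator counts the matching bytes (DRAM, LLC, or PCIe traffic). A sketch with hypothetical placeholder numbers:

```python
# Roofline with interchangeable bandwidth ceilings. The bandwidth passed in
# must correspond to the byte counter used in the AI denominator.
def roofline(ai, peak_gflops, bw_gbs):
    return min(peak_gflops, bw_gbs * ai)

peak = 100.0  # Gflop/s, illustrative
ceilings = {"DRAM": 25.0, "LLC": 200.0, "PCIe": 8.0}  # GB/s, hypothetical
for name, bw in ceilings.items():
    print(name, roofline(0.5, peak, bw))  # DRAM 12.5, LLC 100.0, PCIe 4.0
```

Note how the same kernel (AI = 0.5) is bandwidth-bound against DRAM or PCIe but compute-bound against the LLC: each ceiling yields a different diagnosis.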
59
Alternate Computations
Arising from HPC kernels, it is no surprise that rooflines use DP flop/s. Of course, one could equally use
SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc.
60
Time-based roofline
In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).
We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flops.
Additionally, we could change the horizontal axis from locality to some more appealing metric.
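The inversion is straightforward: seconds per flop is the reciprocal of attainable flop/s, so time-to-solution is the larger of the compute and memory times. A sketch with placeholder machine numbers:

```python
# Time-based roofline: predicted time-to-solution for a kernel, assuming
# perfect overlap of communication and computation.
def time_to_solution(flops, bytes_moved, peak_gflops, bw_gbs):
    t_compute = flops / (peak_gflops * 1e9)
    t_memory = bytes_moved / (bw_gbs * 1e9)
    return max(t_compute, t_memory)

# 1 Gflop kernel moving 8 GB on a 100 Gflop/s, 10 GB/s machine (illustrative):
print(time_to_solution(1e9, 8e9, 100.0, 10.0))  # 0.8 s, memory-bound
```

Plotted against problem size rather than arithmetic intensity, this form often communicates more directly with application scientists, who think in seconds rather than Gflop/s.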
61
Empirical Roofline
Thus far, all in-core estimates have been based on a first-principles analysis of the underlying computer architecture (frequency, SIMD width, latency, etc.).
Conceivably, one could design a series of compiled benchmarks that would extract the relevant roofline parameters.
Similarly, one could use performance counters to extract application characteristics so one could accurately determine application coordinates.
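The flavor of such a benchmark can be sketched even in pure Python: time a large memory copy and report sustained GB/s. This is only a toy stand-in for a compiled, vectorized STREAM-like probe, and the numbers it reports reflect the interpreter as much as the memory system:

```python
import time

# Toy empirical bandwidth probe: best-of-N timing of a large copy.
# bytes(src) reads n_bytes and writes n_bytes, hence the factor of 2.
def measured_bandwidth_gbs(n_bytes=1 << 26, trials=5):
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        _ = bytes(src)
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best / 1e9

print(round(measured_bandwidth_gbs(), 1), "GB/s (copy)")
```

Taking the minimum over trials, rather than the mean, is the standard trick for filtering out OS noise when estimating a hardware upper bound.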
62
Questions?
Acknowledgments
Research supported by the DOE Office of Science under contract number DE-AC02-05CH11231.
63
BACKUP SLIDES