TRANSCRIPT

PARALLEL COMPUTING LABORATORY
EECS Electrical Engineering and Computer Sciences
BERKELEY PAR LAB
1
PERI: Auto-tuning Memory Intensive Kernels for Multicore
Samuel Williams1,2, Kaushik Datta1, Jonathan Carter2, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, David Bailey2
1University of California, Berkeley; 2Lawrence Berkeley National Laboratory
2
Motivation
- Multicore is the de facto solution for increased peak performance for the next decade
- However, given the diversity of architectures, multicore guarantees neither good scalability nor good (attained) performance
- We need a solution that provides performance portability
3
What's a Memory Intensive Kernel?
4
Arithmetic Intensity in HPC
- True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), but on others it remains constant
- Arithmetic intensity is ultimately limited by compulsory traffic
- Arithmetic intensity is diminished by conflict or capacity misses
[Arithmetic intensity spectrum, from O(1) to O(N):
- O(1): SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods
- O(log N): FFTs
- O(N): Dense Linear Algebra (BLAS3); Particle Methods]
5
Memory Intensive
- Let us define a kernel as memory intensive when its arithmetic intensity is less than the machine's balance (flop:byte ratio)
- Performance ~ Stream BW * Arithmetic Intensity
- Technology allows peak flops to improve faster than bandwidth, so more and more kernels will be considered memory intensive
6
Outline
- Motivation
- Memory Intensive Kernels
- Multicore SMPs of Interest
- Software Optimizations
- Introduction to Auto-tuning
- Auto-tuning Memory Intensive Kernels
  - Sparse Matrix Vector Multiplication (SpMV)
  - Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
  - Heat Equation Stencil (3D Laplacian)
- Summary
7
Multicore SMPs of Interest
(used throughout the rest of the talk)
8
Multicore SMPs Used
[Architecture diagrams of the four machines:
- Intel Xeon E5345 (Clovertown): 2 sockets x 4 cores, with each pair of cores sharing a 4MB L2; FSBs (10.66 GB/s each) to a chipset with 4x64b controllers driving 667MHz FBDIMMs at 21.33 GB/s (read), 10.66 GB/s (write).
- AMD Opteron 2356 (Barcelona): 2 sockets x 4 cores, 512KB victim cache per core, 2MB shared quasi-victim L3 (32 way), SRI / crossbar; 2x64b memory controllers per socket driving 667MHz DDR2 DIMMs at 10.66 GB/s; sockets linked by HyperTransport at 4GB/s (each direction).
- Sun T2+ T5140 (Victoria Falls): 2 sockets x 8 multithreaded SPARC cores behind a crossbar (179 GB/s / 90 GB/s), 4MB shared L2 (16 way, 64b interleaved), 4 coherency hubs, 2x128b controllers driving 667MHz FBDIMMs at 21.33 GB/s (read), 10.66 GB/s (write); 8 x 6.4 GB/s links (1 per hub per direction).
- IBM QS20 Cell Blade: 2 sockets, each a VMT PPE (512K L2) plus 8 SPEs (256K local store + MFC each) on the EIB (ring network); XDR memory controllers to 512MB XDR DRAM at 25.6 GB/s per socket; sockets linked by BIF at <20GB/s (each direction).]
9
Multicore SMPs Used
[The same four architecture diagrams, annotated: Clovertown, Barcelona, and Victoria Falls have a conventional cache-based memory hierarchy; the Cell Blade has a disjoint local store memory hierarchy.]
10
Multicore SMPs Used
[The same diagrams, annotated: the cache-based machines use a Pthreads implementation; the Cell Blade uses a local store-based libspe implementation.]
11
Multicore SMPs Used
[The same diagrams, highlighting the multithreaded cores.]
12
Multicore SMPs Used (threads)
[The same diagrams, annotated with thread counts: Clovertown: 8 threads; Barcelona: 8 threads; Victoria Falls: 128 threads; Cell Blade: 16* threads (*SPEs only).]
13
Multicore SMPs Used (peak double precision flops)
[The same diagrams, annotated with peak DP performance: Clovertown: 75 GFlop/s; Barcelona: 74 GFlop/s; Victoria Falls: 19 GFlop/s; Cell Blade: 29* GFlop/s (*SPEs only).]
14
Multicore SMPs Used (total DRAM bandwidth)
[The same diagrams, annotated with total DRAM bandwidth: Clovertown: 21 GB/s (read), 10 GB/s (write); Barcelona: 21 GB/s; Victoria Falls: 42 GB/s (read), 21 GB/s (write); Cell Blade: 51 GB/s.]
15
Categorization of Software Optimizations
16
Optimization Categorization
- Maximizing (attained) In-core Performance
- Minimizing (total) Memory Traffic
- Maximizing (attained) Memory Bandwidth
17
Optimization Categorization
(Slides 17-22 progressively build up the same three-category diagram; the complete version is:)

Maximizing In-core Performance
- Exploit in-core parallelism (ILP, DLP, etc.)
- Good (enough) floating-point balance
- Techniques: unroll & jam, explicit SIMD, reordering, branch elimination

Minimizing Memory Traffic
- Eliminate: capacity misses, conflict misses, compulsory misses, write allocate behavior
- Techniques: cache blocking, array padding, data compression, streaming stores

Maximizing Memory Bandwidth
- Exploit NUMA
- Hide memory latency
- Satisfy Little's Law
- Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Each optimization has a large parameter space.
What are the optimal parameters?
23
Introduction to Auto-tuning
24
Out-of-the-box Code Problem
- Out-of-the-box code has (unintentional) assumptions about:
  - cache sizes (>10MB)
  - functional unit latencies (~1 cycle)
  - etc.
- These assumptions may result in poor performance when they exceed the machine's characteristics
25
Auto-tuning?
- Provides performance portability across the existing breadth and evolution of microprocessors
- One-time, up-front productivity cost is amortized over the number of machines it's used on
- Auto-tuning does not invent new optimizations
- Auto-tuning automates the exploration of the optimization and parameter space
- Two components:
  - a parameterized code generator (we wrote ours in Perl)
  - an auto-tuning exploration benchmark (combination of heuristics and exhaustive search)
- Can be extended with ISA-specific optimizations (e.g., DMA, SIMD)
26
Auto-tuning Memory Intensive Kernels
- Sparse Matrix Vector Multiplication (SpMV): SC'07
- Lattice-Boltzmann Magneto-hydrodynamics (LBMHD): IPDPS'08
- Explicit Heat Equation (Stencil): SC'08
27
Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
28
Sparse Matrix Vector Multiplication
- What's a Sparse Matrix?
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta data to reconstruct the matrix structure
- What's SpMV?
  - Evaluate y = Ax
  - A is a sparse matrix; x & y are dense vectors
- Challenges
  - Very low arithmetic intensity (often <0.166 flops/byte)
  - Difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD)

(a) algebra conceptualization: y = Ax
(b) CSR data structure: A.val[], A.rowStart[], A.col[]
(c) CSR reference code:

```c
for (r = 0; r < A.rows; r++) {
    double y0 = 0.0;
    for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
        y0 += A.val[i] * x[A.col[i]];
    }
    y[r] = y0;
}
```
29
The Dataset (matrices)
- Unlike dense BLAS, performance is dictated by sparsity
- Suite of 14 matrices
- All bigger than the caches of our SMPs
- We'll also include a median performance number

Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP
- Dense: 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row)
- Poorly structured hodgepodge
- Extreme aspect ratio (linear programming)
30
SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices
- Scalability isn't great
- Is this performance good?
[Bar charts per architecture; legend: Naïve Pthreads, Naïve]
31
Auto-tuned SpMV Performance (portable C)
- Fully auto-tuned SpMV performance across the suite of matrices
- Why do some optimizations work better on some architectures?
[Stacked bars per architecture; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve]
32
Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices
- Included SPE/local store optimized version
- Why do some optimizations work better on some architectures?
[Stacked bars per architecture; same legend as the previous slide]
Slide 33: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local store optimized version
- Auto-tuning resulted in better performance, but did it result in good performance?
[Stacked bar chart; same series as slide 31]
Slide 34: Roofline Model
[Figure: empty roofline axes — attainable Gflop/s (2 to 128) vs. flop:DRAM byte ratio (1/8 to 16); both axes are log scale]
Slide 35: Naïve Roofline Model
- Unrealistically optimistic model
- Hand-optimized Stream BW benchmark
[Figure: naïve rooflines for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade; each shows a flat "peak DP" ceiling and a diagonal "hand-optimized Stream BW" line]

Gflop/s(AI) = min(peak DP Gflop/s, StreamBW × AI)
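This min of a compute ceiling and a bandwidth diagonal is easy to evaluate directly. A minimal sketch in C; the peak and Stream bandwidth numbers used in the example values below are illustrative placeholders, not measurements from these four platforms.

```c
#include <assert.h>

/* Naive roofline bound: attainable Gflop/s is capped either by
 * in-core peak or by Stream bandwidth times arithmetic intensity
 * (AI, in flops per DRAM byte), whichever is lower. */
static double roofline_gflops(double peak_gflops,
                              double stream_bw_gbs,
                              double ai)
{
    double bw_bound = stream_bw_gbs * ai;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For a memory-bound kernel the bandwidth term dominates: at an AI of 0.166 flops/byte, even ~20 GB/s of sustained bandwidth bounds performance at a few Gflop/s, far below peak.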
Slide 36: Roofline Model for SpMV
- Double-precision roofline models
- In-core optimizations 1..i; DRAM optimizations 1..j
- FMA is inherent in SpMV (place its ceiling at the bottom)
[Figure: per-architecture rooflines with compute ceilings (peak DP; w/out SIMD; w/out ILP; w/out FMA or mul/add imbalance; 25% FP; 12% FP) and bandwidth ceilings (w/out SW prefetch; w/out NUMA; bank conflicts; dataset fits in snoop filter)]

GFlops_i,j(AI) = min(InCoreGFlops_i, StreamBW_j × AI)
Slide 37: Roofline Model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams
- Inherent FMA
- No ILP, no DLP
- FP is 12-25% of instructions
- Naïve compulsory flop:byte < 0.166
- No naïve SPE implementation
[Figure: same rooflines with SpMV's arithmetic intensity overlaid]
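The low compulsory intensity follows from the kernel itself: in plain CSR, each nonzero moves at least 12 bytes of matrix data (an 8-byte double plus a 4-byte column index) for 2 flops, so the flop:byte ratio is at most 2/12 ≈ 0.166 before counting vector traffic. A sketch of the naïve serial kernel (illustrative, not the tuned code from the study):

```c
#include <assert.h>

/* Naive CSR sparse matrix-vector multiply: y = A*x.
 * rowptr has nrows+1 entries; colidx/values hold the nonzeros.
 * Each nonzero costs 2 flops against >= 12 bytes of matrix data. */
static void spmv_csr(int nrows, const int *rowptr, const int *colidx,
                     const double *values, const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += values[k] * x[colidx[k]];   /* one multiply-add */
        y[r] = sum;
    }
}
```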
Slide 38: Roofline Model for SpMV (out-of-the-box parallel)
- Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions
- Naïve compulsory flop:byte < 0.166
- For simplicity: dense matrix in sparse format
- No naïve SPE implementation
[Figure: same rooflines with out-of-the-box parallel performance plotted]
Slide 39: Roofline Model for SpMV (NUMA & SW prefetch)
- Compulsory flop:byte ~0.166
- Utilize all memory channels
- No naïve SPE implementation
[Figure: performance points move up toward the bandwidth ceilings]
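On NUMA systems the +NUMA/Affinity step relies on first-touch page placement: whichever thread writes a page first determines which socket's memory holds it, so initializing data with the same parallel decomposition as the compute loop keeps accesses local. A hedged sketch (OpenMP is used here only for brevity — the study itself used Pthreads — and first-touch placement is an OS policy assumption):

```c
#include <stdlib.h>

/* First-touch initialization: pages are physically allocated on the
 * NUMA node of the thread that first writes them, so a parallel init
 * with the same static schedule as the compute loop keeps each
 * thread's portion of the array on its local memory controller. */
static double *numa_aware_alloc(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;            /* first touch places the page */
    return a;
}
```

Without OpenMP enabled the pragma is ignored and the function degenerates to a plain zero-initializing allocator.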
Slide 40: Roofline Model for SpMV (matrix compression)
- Inherent FMA
- Register blocking improves ILP, DLP, flop:byte ratio, and FP% of instructions
[Figure: performance points move up and to the right on the rooflines]
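Register blocking, the core of matrix compression, stores small dense blocks so that one column index is amortized over several values and the inner loop unrolls into independent multiply-adds. A 2x2 BCSR sketch of the idea (the layout and names are illustrative, not the auto-tuner's generated code):

```c
#include <assert.h>

/* 2x2 register-blocked (BCSR) SpMV: one column index per 2x2 block
 * of values, which raises the flop:byte ratio and exposes ILP/SIMD
 * in the unrolled multiply-adds. nbrows counts block rows. */
static void spmv_bcsr2x2(int nbrows, const int *browptr, const int *bcolidx,
                         const double *blocks, const double *x, double *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = browptr[br]; k < browptr[br + 1]; k++) {
            const double *b = &blocks[4 * k];          /* row-major 2x2 */
            double x0 = x[2 * bcolidx[k]];
            double x1 = x[2 * bcolidx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;               /* independent */
            y1 += b[2] * x0 + b[3] * x1;               /* FMA chains  */
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```

The trade-off: explicit zeros are stored when the sparsity pattern does not fill a block, which is why the auto-tuner searches over block sizes per matrix.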
Slide 41: Roofline Model for SpMV (matrix compression, continued)
- Register blocking improves ILP, DLP, flop:byte ratio, and FP% of instructions
- Performance is bandwidth limited
[Figure: tuned SpMV sits on the bandwidth diagonal on all four architectures]
Slide 42: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
Slide 43: LBMHD
- Plasma turbulence simulation via the Lattice Boltzmann Method
- Two distributions: momentum distribution (27 scalar components); magnetic distribution (15 vector components)
- Three macroscopic quantities: density; momentum (vector); magnetic field (vector)
- Arithmetic intensity: must read 73 doubles and update 79 doubles per lattice update (1216 bytes); requires about 1300 floating-point operations per lattice update; just over 1.0 flops/byte (ideal)
- Cache capacity requirements are independent of problem size
- Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB)
- Periodic boundary conditions
[Figure: lattice diagrams of the momentum distribution, the magnetic distribution, and the macroscopic variables]
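The stated intensity is a direct calculation: (73 + 79) doubles × 8 bytes = 1216 bytes of compulsory traffic per lattice update, against roughly 1300 flops.

```c
#include <assert.h>

/* Ideal arithmetic intensity of one LBMHD lattice update:
 * 73 doubles read + 79 doubles written = 152 * 8 = 1216 bytes,
 * paired with about 1300 floating-point operations. */
static double lbmhd_ideal_ai(void)
{
    const double bytes = (73 + 79) * 8.0;  /* 1216 bytes */
    const double flops = 1300.0;           /* per lattice update */
    return flops / bytes;                  /* just over 1.0 flops/byte */
}
```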
Slide 44: Initial LBMHD Performance
- Generally, scalability looks good
- Scalability is good, but is performance good?
- *collision() only
[Bar chart; series: Naïve+NUMA]
Slide 45: Auto-tuned LBMHD Performance (portable C)
- Auto-tuning avoids cache conflict and TLB capacity misses
- *collision() only
[Stacked bar chart; series: Naïve+NUMA, +Padding, +Vectorization]
Slide 46: Auto-tuned LBMHD Performance (architecture specific optimizations)
- Auto-tuning avoids cache conflict and TLB capacity misses
- Additionally, it exploits SIMD where the compiler doesn't
- Includes an SPE/local store optimized version
- *collision() only
[Stacked bar chart; series: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages]
Slide 47: Roofline Model for LBMHD
- Far more adds than multiplies (imbalance)
- Huge data sets
[Figure: double-precision rooflines for the four architectures, with mul/add imbalance rather than FMA ceilings]
Slide 48: Roofline Model for LBMHD (overlay arithmetic intensity)
- Far more adds than multiplies (imbalance)
- Essentially random access to memory
- Flop:byte ratio ~0.7
- NUMA allocation/access
- Little ILP, no DLP
- High conflict misses
- No naïve SPE implementation
[Figure: rooflines with LBMHD's arithmetic intensity overlaid]
Slide 49: Roofline Model for LBMHD (out-of-the-box parallel performance)
- Same characteristics as slide 48 (flop:byte ~0.7; little ILP; no DLP; high conflict misses)
- Peak Victoria Falls performance with 64 threads (out of 128) due to high conflict misses
- No naïve SPE implementation
[Figure: out-of-the-box performance plotted on the rooflines]
Slide 50: Roofline Model for LBMHD (padding, vectorization, unrolling, reordering, ...)
- Vectorize the code to eliminate TLB capacity misses
- Ensures unit-stride access (bottom bandwidth ceiling)
- Tune for the optimal vector length
- Clovertown is pinned to the lower BW ceiling
- No naïve SPE implementation
[Figure: tuned performance rises toward the bandwidth diagonals]
Slide 51: Roofline Model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings)
- Use the cache bypass instruction: movntpd
- Increases the flop:byte ratio to ~1.0 on x86/Cell
[Figure: performance points approach the rooflines]
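Cache bypass means using non-temporal stores so that written lines are not first read into cache (no write-allocate), eliminating a third of the memory traffic for a read-modify-write kernel. A minimal x86/SSE2 sketch using the `_mm_stream_pd` intrinsic, which compiles to `movntpd`; alignment handling is simplified (this assumes a 16-byte-aligned destination and an even element count):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_pd maps to movntpd */

/* Copy loop with non-temporal stores: dst cache lines are written
 * straight to DRAM without being allocated/read first, which is
 * what lifts LBMHD's effective flop:byte from ~0.7 toward ~1.0.
 * Assumes dst is 16-byte aligned and n is even. */
static void stream_store(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);
        _mm_stream_pd(&dst[i], v);          /* movntpd */
    }
    _mm_sfence();   /* order non-temporal stores before later reads */
}
```

In the real kernel the streamed values are the freshly computed lattice components, not a copy; the store instruction is the point.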
Slide 52: Roofline Model for LBMHD (SIMDization + cache bypass, continued)
- Make SIMDization explicit; use movntpd cache bypass
- 3 out of 4 machines hit the Roofline
[Figure: same rooflines as slide 51]
Slide 53: The Heat Equation Stencil
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures", Supercomputing (SC), 2008 (to appear).
Slide 54: The Heat Equation Stencil
- Explicit heat equation (Laplacian, ~∇²F(x,y,z)) on a regular grid
- Storage: one double per point in space
- 7-point nearest-neighbor stencil
- Must: read every point from DRAM; perform 8 flops (linear combination); write every point back to DRAM
- Just over 0.5 flops/byte (ideal)
- Cache locality is important
- One problem size: 256^3
[Figure: PDE grid and the 7-point stencil touching the (x±1, y±1, z±1) neighbors of point (x, y, z)]
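The sweep described above can be sketched as a straightforward triple loop; `alpha` and `beta` stand in for the heat-equation coefficients (the names are illustrative):

```c
/* One Jacobi sweep of the 7-point heat-equation stencil on an
 * n^3 grid with a one-point ghost layer (array dimension m = n + 2).
 * Each interior point costs 6 adds + 2 multiplies = 8 flops. */
#define IDX(i, j, k, m) (((k) * (m) + (j)) * (m) + (i))

static void heat_sweep(int n, double alpha, double beta,
                       const double *in, double *out)
{
    int m = n + 2;
    for (int k = 1; k <= n; k++)
        for (int j = 1; j <= n; j++)
            for (int i = 1; i <= n; i++)
                out[IDX(i, j, k, m)] =
                    alpha * in[IDX(i, j, k, m)] +
                    beta  * (in[IDX(i - 1, j, k, m)] + in[IDX(i + 1, j, k, m)] +
                             in[IDX(i, j - 1, k, m)] + in[IDX(i, j + 1, k, m)] +
                             in[IDX(i, j, k - 1, m)] + in[IDX(i, j, k + 1, m)]);
}
```

Ideally 8 flops move 16 bytes (one read, one write), but write-allocate traffic on the output grid drags the compulsory ratio down to 1/3, as the roofline slides below note.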
Slide 55: Stencil Performance (out-of-the-box code)
- Expect performance to be between SpMV and LBMHD
- Scalability is universally poor
- Performance is poor
[Bar chart; series: Naïve]
Slide 56: Auto-tuned Stencil Performance (portable C)
- Proper NUMA management is essential on most architectures
- Moreover, proper cache blocking is still essential even with MBs of cache
[Stacked bar chart; series: Naïve, +NUMA, +Padding, +Thread/Cache Blocking, +Unrolling, +SW Prefetching, +Collaborative Threading]
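Cache blocking, one of the essential portable optimizations above, tiles the outer loops so the planes being reused stay resident in cache. A hedged sketch (the tile sizes `tj`/`tk` would be chosen by the auto-tuner's search; this shows the idea, not the tuned code):

```c
/* Cache-blocked variant of the 7-point sweep: tiling the j and k
 * loops keeps the input planes being reused resident in cache,
 * holding the flop:byte ratio near the compulsory 1/3. */
#define BIDX(i, j, k, m) (((k) * (m) + (j)) * (m) + (i))

static void heat_sweep_blocked(int n, int tj, int tk, double alpha,
                               double beta, const double *in, double *out)
{
    int m = n + 2;
    for (int kk = 1; kk <= n; kk += tk)
        for (int jj = 1; jj <= n; jj += tj)
            for (int k = kk; k <= n && k < kk + tk; k++)
                for (int j = jj; j <= n && j < jj + tj; j++)
                    for (int i = 1; i <= n; i++)   /* keep i unit-stride */
                        out[BIDX(i, j, k, m)] =
                            alpha * in[BIDX(i, j, k, m)] +
                            beta * (in[BIDX(i-1, j, k, m)] + in[BIDX(i+1, j, k, m)] +
                                    in[BIDX(i, j-1, k, m)] + in[BIDX(i, j+1, k, m)] +
                                    in[BIDX(i, j, k-1, m)] + in[BIDX(i, j, k+1, m)]);
}
```

The innermost i loop is left unblocked so hardware prefetchers still see long unit-stride streams.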
Slide 57: Auto-tuned Stencil Performance (architecture specific optimizations)
- Cache bypass can significantly improve Barcelona performance
- DMA, SIMD, and cache blocking were essential on Cell
[Stacked bar chart; adds +Explicit SIMDization and +Cache bypass / DMA to the slide 56 series]
Slide 58: Roofline Model for Stencil (out-of-the-box code)
- Large datasets; 2 unit-stride streams; no NUMA
- Little ILP, no DLP
- Far more adds than multiplies (imbalance)
- Ideal flop:byte ratio of 1/3
- High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio
[Figure: per-architecture rooflines]
Slide 59: Roofline Model for Stencil (out-of-the-box code, continued)
- Same characteristics as slide 58
- No naïve SPE implementation
[Figure: same rooflines]
Slide 61: Roofline Model for Stencil (NUMA, cache blocking, unrolling, prefetch, ...)
- Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3
- Clovertown has huge caches but is pinned to the lower BW ceiling
- Cache management is essential when capacity per thread is low
- No naïve SPE implementation
[Figure: tuned performance plotted on the rooflines]
Slide 62: Roofline Model for Stencil (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings)
- Use the cache bypass instruction: movntpd
- Increases the flop:byte ratio to ~0.5 on x86/Cell
[Figure: performance points approach the rooflines]
Roofline model for Stencil (SIMDization + cache bypass), continued

[Figure: the four Roofline models again, now annotated: 3 out of 4 machines are basically on the Roofline.]
Summary
Out-of-the-box Performance
Ideal productivity (just type 'make')
Kernels sorted by arithmetic intensity
Maximum performance with any concurrency
Note: Cell = 4 PPE threads (no SPEs)
Surprisingly, most architectures got similar performance
Portable Performance
Portable (C only) auto-tuning
Sacrifice some productivity up front; amortize it through reuse
Dramatic increases in performance on Barcelona and Victoria Falls
Clovertown is increasingly memory bound
Cell = 4 PPE threads (no SPEs)
Architecture Specific Performance
ISA-specific auto-tuning (reduced portability & productivity); explicit:
SPE parallelization with DMAs
SIMDization (SSE + Cell)
cache bypass
Cell gets all of its performance from the SPEs (DP limits performance)
Barcelona gets half its performance from SIMDization
Power Efficiency
Used a digital power meter to measure sustained power under load
Efficiency = Sustained Performance / Sustained Power
Victoria Falls' power efficiency is severely limited by FBDIMM power
Despite Cell's weak double precision, it delivers the highest power efficiency
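The metric on this slide is a one-line computation; a minimal sketch (the unit convention, Gflop/s in and Mflop/s-per-Watt out, is an illustrative choice, and the sample numbers in the comment are placeholders, not measured data):

```c
/* Efficiency = sustained performance / sustained power, as defined
 * on the slide. Takes sustained Gflop/s and sustained Watts under
 * load, returns Mflop/s per Watt.
 * e.g. 10 Gflop/s at 200 W -> 50 Mflop/s/W. */
double efficiency_mflops_per_watt(double gflops, double watts)
{
    return 1000.0 * gflops / watts;   /* Gflop/s -> Mflop/s */
}
```

Note that both numerator and denominator are *sustained* quantities measured under load, not peak ratings, so architectures with power-hungry memory systems (like Victoria Falls' FBDIMMs) are penalized even when idle compute power is low.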
Summary (auto-tuning)
Auto-tuning provides portable performance
Auto-tuning with architecture (ISA) specific optimizations is essential on some machines
The Roofline model quantifies performance: it tells us which optimizations are important, as well as how much further improvement is possible
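The bound the Roofline model draws is a single min(): attainable performance is limited either by peak compute or by streaming bandwidth times arithmetic intensity. A minimal sketch (the Barcelona-like sample numbers in the test comment are approximate illustrations, not figures quoted from the slides):

```c
/* Roofline bound: attainable Gflop/s = min(peak compute,
 * streaming bandwidth x arithmetic intensity). The in-core
 * ceilings on the figures ("w/out SIMD", "w/out ILP", ...) are
 * just lower values substituted for peak_gflops; the bandwidth
 * ceilings ("w/out NUMA", ...) are lower values of bandwidth_gbs. */
double roofline(double peak_gflops, double bandwidth_gbs, double flop_per_byte)
{
    double mem_bound = bandwidth_gbs * flop_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

This is how the model identifies which optimizations matter: if measured performance sits on the bandwidth-sloped part of the roof, only optimizations that raise bandwidth or arithmetic intensity can help; if it sits below a compute ceiling, in-core optimizations (SIMD, ILP, FMA balance) are the ones to chase.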
Summary (architectures)
Clovertown: severely bandwidth limited
Barcelona: delivers good performance with architecture-specific auto-tuning; bandwidth limited after ISA optimizations
Victoria Falls: challenged by the interaction of TLP with shared caches; often limited by in-core performance
Cell Broadband Engine: performance comes entirely from architecture-specific auto-tuning; bandwidth limited on SpMV; DP FP limited on Stencil (slightly) and LBMHD (severely); delivers the best power efficiency
Acknowledgements
Research supported by:
Microsoft and Intel funding (Award #20080469)
DOE Office of Science under contract number DE-AC02-05CH11231
NSF contract CNS-0325873
Sun Microsystems: Niagara2 / Victoria Falls machines
AMD: access to quad-core Opteron (Barcelona)
Forschungszentrum Jülich: access to QS20 Cell blades
IBM: virtual loaner program for QS20 Cell blades
Questions?

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.