BIPS
Computational Research Division

Tuning Sparse Matrix Vector Multiplication for multi-core SMPs
(details in paper at SC07)

Samuel Williams¹·², Richard Vuduc³, Leonid Oliker¹·², John Shalf², Katherine Yelick¹·², James Demmel¹·²

¹University of California, Berkeley  ²Lawrence Berkeley National Laboratory  ³Georgia Institute of Technology
Overview

- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel that is memory intensive and challenging for multicore
- Present two auto-tuned threaded implementations: a Pthread, cache-based implementation and a Cell local-store-based implementation
- Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine
- Compare with the leading MPI implementation (PETSc) with an auto-tuned serial kernel (OSKI)
- Show that Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity
Sparse Matrix Vector Multiplication

- Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros; this requires significant meta data
- Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors
- Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes/flop)
Test Suite

- Dataset (Matrices)
- Multicore SMPs
Matrices Used

- Pruned the original SPARSITY suite down to 14 matrices (none should fit in cache)
- Subdivided them into 4 categories; rank ranges from 2K to 1M
  - Dense: a 2K x 2K dense matrix stored in sparse format
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  - Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
  - Extreme aspect ratio (linear programming): LP
Multicore SMP Systems

[Block diagrams of the four systems; recoverable details:]

- Intel Clovertown: 2 sockets, each with two dual-core Core2 pairs sharing 4MB L2s; FSBs into a chipset (4x64b controllers) to Fully Buffered DRAM; 21.3 GB/s (read), 10.6 GB/s (write)
- AMD Opteron: 2 dual-core sockets, 1MB victim cache per core; per-socket memory controller/HT to DDR2 DRAM at 10.6 GB/s; 4 GB/s (each direction) HyperTransport between sockets
- Sun Niagara2: 8 multithreaded UltraSparc cores (each with an FPU and 8K D$) on a crossbar switch to a 4MB shared L2 (16-way); 179 GB/s (fill), 90 GB/s (write-thru) on chip; 4x128b FBDIMM memory controllers delivering 42.7 GB/s (read), 21.3 GB/s (write)
- IBM Cell Blade: 2 chips, each with a PPE (512K L2) and 8 SPEs (256K local store + MFC each) on the EIB (ring network, <<20 GB/s each direction); XDR DRAM at 25.6 GB/s per chip; chips connected via the BIF
Multicore SMP Systems (memory hierarchy)

[Same four block diagrams] The Clovertown, Opteron, and Niagara2 have a conventional cache-based memory hierarchy; the Cell blade has a disjoint local store memory hierarchy.
Multicore SMP Systems (2 implementations)

[Same four block diagrams] The cache-based machines (Clovertown, Opteron, Niagara2) run the Cache+Pthreads SpMV implementation; the Cell blade runs the Local Store SpMV implementation.
Multicore SMP Systems (cache)

[Same four block diagrams] Aggregate on-chip capacity: Clovertown 16MB (the vectors fit), Opteron 4MB, Niagara2 4MB, Cell 4MB (local store).
Multicore SMP Systems (peak flops)

[Same four block diagrams] Peak double-precision flops: Clovertown 75 Gflop/s (TLP + ILP + SIMD), Opteron 17 Gflop/s (TLP + ILP), Niagara2 11 Gflop/s (TLP), Cell 29 Gflop/s (TLP + ILP + SIMD).
Multicore SMP Systems (peak read bandwidth)

[Same four block diagrams] Peak read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s.
Multicore SMP Systems (NUMA)

[Same four block diagrams] Clovertown and Niagara2 are uniform memory access; the dual-socket Opteron and dual-chip Cell blade are non-uniform memory access.
Naïve Implementation for Cache-based Machines

Performance across the suite; also included a median performance number.
Vanilla C Performance

[Bar charts: GFlop/s (0.0-1.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2]

- Vanilla C implementation; matrix stored in CSR (compressed sparse row)
- Explored compiler options - only the best is presented here
- An x86 core delivers >10x the performance of a Niagara2 thread
Pthread Implementation for Cache-Based Machines

- SPMD, strong scaling
- Optimized for multicore/threading
- A variety of shared memory programming models are acceptable (not just Pthreads)
- More colors = more optimizations = more work
Naïve Parallel Performance

[Bar charts: GFlop/s (0.0-3.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2; legend: Naïve Pthreads vs. Naïve Single Thread]

- Simple row parallelization
- Pthreads (didn't have to be)
- 1P Niagara2 is 2x-5x faster than the 2P x86 machines
Naïve Parallel Performance

[Same bar charts, annotated with naïve parallel scaling:]

- Clovertown: 8x cores = 1.9x performance
- Opteron: 4x cores = 1.5x performance
- Niagara2: 64x threads = 41x performance
Naïve Parallel Performance

[Same bar charts, annotated with naïve parallel efficiency:]

- Clovertown: 1.4% of peak flops, 29% of bandwidth
- Opteron: 4% of peak flops, 20% of bandwidth
- Niagara2: 25% of peak flops, 39% of bandwidth
Case for Auto-tuning

- How do we deliver good performance across all these architectures, across all matrices, without exhaustively optimizing every combination?
- Auto-tuning (in general): write a Perl script that generates all possible optimizations, then heuristically or exhaustively search the optimizations. Existing SpMV solution: OSKI (developed at UCB)
- This work: tuning geared toward multicore/-threading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc. - a "prototype for parallel OSKI"
Performance (+NUMA & SW Prefetching)

[Bar charts: GFlop/s (0.0-3.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2; legend: +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread]

- NUMA-aware allocation
- Tag prefetches with the appropriate temporal locality
- Tune for the optimal distance
Matrix Compression

- For memory-bound kernels, minimizing memory traffic should maximize performance
- Compress the meta data; exploit structure to eliminate meta data
- Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc.
- Side effect: the matrix may be minimized to the point where it fits entirely in cache
Performance (+matrix compression)

[Bar charts: GFlop/s (0.0-6.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2; legend: +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread]

- Very significant on some matrices; missed the boat on others
Performance (+cache & TLB blocking)

[Bar charts: GFlop/s (0.0-6.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2; legend: +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread]

- Cache blocking geared towards sparse accesses
Performance (more DIMMs, firmware fix, ...)

[Bar charts: GFlop/s (0.0-6.0) per matrix plus median, for Intel Clovertown, AMD Opteron, and Sun Niagara2; legend: +More DIMMs / Rank configuration / etc., +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread]
Performance (more DIMMs, firmware fix, ...)

[Same bar charts, annotated with fully tuned efficiency:]

- Clovertown: 4% of peak flops, 52% of bandwidth
- Opteron: 20% of peak flops, 66% of bandwidth
- Niagara2: 52% of peak flops, 54% of bandwidth
Performance (more DIMMs, firmware fix, ...)

[Same bar charts, annotated with the essential optimizations per machine:]

- Clovertown - 3 essential optimizations: Parallelize, Prefetch, Compress
- Opteron - 4 essential optimizations: Parallelize, NUMA, Prefetch, Compress
- Niagara2 - 2 essential optimizations: Parallelize, Compress
Cell Implementation

- Comments
- Performance
Cell Implementation

- No vanilla C implementation (aside from the PPE)
- In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
- Even SIMDized double precision is extremely weak; scalar double precision is unbearable
- Minimum register blocking is 2x1 (SIMDizable), which can increase memory traffic by 66%
- Optional cache blocking becomes required local-store blocking: spatial and temporal locality is captured by software when the matrix is optimized. In essence, the high bits of column indices are grouped into DMA lists
- Branchless implementation
- Despite the following performance numbers, Cell was still handicapped by double precision
Performance (all)

[Bar charts: GFlop/s (0.0-12.0) per matrix plus median, for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell Broadband Engine; Cell legend: 16 SPEs / 8 SPEs / 1 SPE]
Hardware Efficiency

[Same four bar charts, annotated with fully tuned efficiency:]

- Clovertown: 4% of peak flops, 52% of bandwidth
- Opteron: 20% of peak flops, 66% of bandwidth
- Niagara2: 52% of peak flops, 54% of bandwidth
- Cell: 39% of peak flops, 89% of bandwidth
Productivity

[Same four bar charts, annotated with the essential optimizations per machine:]

- Clovertown - 3 essential optimizations: Parallelize, Prefetch, Compress
- Opteron - 4 essential optimizations: Parallelize, NUMA, Prefetch, Compress
- Niagara2 - 2 essential optimizations: Parallelize, Compress
- Cell - 5 essential optimizations: Parallelize, NUMA, DMA, Compress, Cache (LS) Block
Parallel Efficiency

[Line charts of parallel efficiency (0.0-1.0): Intel Clovertown vs. 1-8 cores, AMD Opteron vs. 1-4 cores, Sun Niagara2 vs. 0-64 threads, IBM Cell Broadband Engine vs. 1-16 cores]
Parallel Efficiency (breakdown)

[Same line charts, with efficiency broken down into multicore, multisocket, and multithreading (MT) components for each machine]
Multicore MPI Implementation

This is the default approach to programming multicore.
Multicore MPI Implementation

- Used PETSc with shared memory MPICH
- Used OSKI (developed @ UCB) to optimize each thread = a good auto-tuned shared memory MPI implementation

[Bar charts: GFlop/s (0.0-4.0) per matrix plus median, for Intel Clovertown and AMD Opteron; legend: MPI (autotuned), Pthreads (autotuned), Naïve Single Thread]
Summary
![Page 37: Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)](https://reader037.vdocument.in/reader037/viewer/2022103103/56814d70550346895dbac59e/html5/thumbnails/37.jpg)
BIPSBIPS
Median Performance & Efficiency
[Figure: two bar charts, "Median Performance" (GFlop/s) and "Median Power Efficiency" (MFlop/s/Watt), for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell; bars compare Naïve, PETSc+OSKI (auto-tuned), and Pthreads (auto-tuned)]
1P Niagara2 was consistently better than the 2P x86 machines (more potential bandwidth)
Cell delivered by far the best performance (better utilization)
Used a digital power meter to measure sustained system power
FBDIMM is power hungry (12W/DIMM): Clovertown (330W) and Niagara2 (350W) sustained power
Summary
Paradoxically, the most complex/advanced architectures required the most tuning, and delivered the lowest performance.
Niagara2 delivered both very good performance and productivity
Cell delivered very good performance and efficiency:
sustained ~90% of memory bandwidth (most architectures get ~50%)
high power efficiency
easily understood performance
extra traffic = lower performance (future work can address this)
Our multicore specific autotuned implementation significantly outperformed an autotuned MPI implementation
It exploited:
Matrix compression geared towards multicore, rather than a single core
NUMA-aware placement
Prefetching
Acknowledgments
UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
Sun Microsystems: Niagara2 access
Forschungszentrum Jülich: Cell blade cluster access
Questions?
Backup Slides
Parallelization
Matrix partitioned by rows and balanced by the number of nonzeros
SPMD-like approach
A barrier() is called before and after the SpMV kernel
Each sub-matrix is stored separately in CSR
Load balancing can be challenging
# of threads explored in powers of 2 (in paper)
[Figure: matrix A partitioned into row blocks, one per thread; source vector x and destination vector y]
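The partitioning scheme above can be sketched as follows. This is an illustrative Python model, not the paper's code: all function names are hypothetical, rows are split so each thread owns roughly the same number of nonzeros, and a serial loop stands in for the pthread pool running plain CSR SpMV over each row block.

```python
import bisect

def partition_rows_by_nnz(row_ptr, nthreads):
    """Return row boundaries so each thread gets ~nnz/nthreads nonzeros."""
    nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, nthreads):
        target = t * nnz // nthreads
        # first row whose cumulative nonzero count reaches the target
        bounds.append(bisect.bisect_left(row_ptr, target))
    bounds.append(len(row_ptr) - 1)  # total number of rows
    return bounds

def spmv_csr_block(row_ptr, col_idx, vals, x, y, r0, r1):
    """y[r0:r1] = A[r0:r1, :] * x for a CSR matrix (one thread's share)."""
    for i in range(r0, r1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc

# 4x4 example with an uneven nonzero distribution (3 nnz in row 0)
row_ptr = [0, 3, 4, 5, 6]
col_idx = [0, 1, 2, 1, 2, 3]
vals    = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 1.0, 1.0, 1.0]
y = [0.0] * 4
bounds = partition_rows_by_nnz(row_ptr, 2)   # [0, 1, 4]: 3 nnz per thread
for t in range(2):                           # serial stand-in for the thread pool
    spmv_csr_block(row_ptr, col_idx, vals, x, y, bounds[t], bounds[t + 1])
```

Note how balancing by nonzeros, not by rows, gives thread 0 a single dense row and thread 1 three sparse ones, matching the load-balancing goal stated above.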
Exploiting NUMA, Affinity
Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
Bind each sub-matrix together with the thread that processes it
Explored libnuma, Linux, and Solaris routines
Adjacent blocks bound to adjacent cores
[Figure: dual-socket Opteron diagrams, each socket with its own DDR2 DRAM, showing three cases: single thread; multiple threads on one memory controller; multiple threads using both memory controllers]
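The affinity policy above (adjacent blocks on adjacent cores, each sub-matrix pinned with its thread) can be sketched as follows. All names here are hypothetical; the Linux binding via `os.sched_setaffinity` is best-effort and first-touch initialization would then place each block's pages in the local socket's DRAM.

```python
import os

def block_to_core(block_id, cores_per_socket, nsockets):
    """Adjacent blocks -> adjacent cores, wrapping across sockets in order."""
    total_cores = cores_per_socket * nsockets
    return block_id % total_cores

def bind_current_thread(core_id):
    """Best-effort CPU binding; a no-op where unsupported or refused."""
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {core_id})
        except OSError:
            pass  # restricted environments may refuse the request

# e.g. a 2-socket, dual-core Opteron system: 4 blocks map to cores 0..3
mapping = [block_to_core(b, 2, 2) for b in range(4)]
bind_current_thread(mapping[0])
```

In the real implementation each worker thread would bind itself before first touching its sub-matrix, so data and compute stay on the same socket.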
Cache and TLB Blocking
Accesses to the matrix and destination vector are streaming
But access to the source vector can be random
Reorganize the matrix (and thus the access pattern) to maximize reuse
Applies equally to TLB blocking (caching PTEs)
Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity. Apply all previous optimizations individually to each cache block.
Search: neither, cache, cache&TLB
Better locality at the expense of confusing the hardware prefetchers.
[Figure: cache-blocked matrix A with the corresponding blocks of source vector x and destination vector y]
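The blocking heuristic above can be sketched as follows. This is an illustrative model under assumed parameters (`grow_cache_block`, the 8-element cache line, and the inputs are all hypothetical, not from the paper): for one block of destination rows, the column range is widened greedily until the source vector would touch more distinct cache lines than the cache holds. Substituting pages for lines gives the TLB-blocking variant.

```python
def grow_cache_block(cols_by_row, col_start, max_lines, line_elems=8):
    """Return the exclusive column bound for one row block.

    cols_by_row: per-row column indices of the nonzeros in the row block.
    Columns are scanned in order; stop before the block would touch more
    than max_lines distinct source-vector cache lines.
    """
    all_cols = sorted({c for row in cols_by_row for c in row if c >= col_start})
    touched = set()
    col_end = col_start
    for c in all_cols:
        line = c // line_elems            # cache line holding x[c]
        if line not in touched and len(touched) == max_lines:
            break                          # this column would overflow the cache
        touched.add(line)
        col_end = c + 1
    return col_end

# Two rows touching x[0], x[1], x[9], x[17], x[18]; with room for only two
# 8-element lines of x, the block stops before column 17 (a third line).
bound = grow_cache_block([[0, 9, 17], [1, 18]], 0, max_lines=2)
```

`bound` comes out as 10 here: columns 0-9 fit in two cache lines of x, and column 17 would require a third.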
Banks, Ranks, and DIMMs
In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory.
Most memory controllers have finite capability to reorder the requests. (DMA can avoid or minimize this)
Addressing/Bank conflicts become increasingly likely
Adding more DIMMs or changing the configuration of ranks can help
The Clovertown system was already fully populated