Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Sparse-Matrix-Vector-Multiplication (SpMV)
Jim DemmelEECS & Math Departments, UC Berkeley
Outline
• Motivation for Automatic Performance Tuning
• Results for sparse matrix kernels
– Sparse Matrix Vector Multiplication (SpMV)
– Sequential and multicore results
• OSKI = Optimized Sparse Kernel Interface
• Need to understand tuning of a single SpMV to understand opportunities and limits for tuning entire sparse solvers

Summer School Lecture 7
Berkeley Benchmarking and OPtimization (BeBOP)
• Prof. Katherine Yelick
• Current members
– Kaushik Datta, Mark Hoemmen, Marghoob Mohiyuddin, Shoaib Kamil, Rajesh Nishtala, Vasily Volkov, Sam Williams, …
• Previous members
– Hormozd Gahvari, Eun-Jin Im, Ankit Jain, Rich Vuduc, many undergrads, …
• Many results here are from current and previous students
• bebop.cs.berkeley.edu
Automatic Performance Tuning
• Goal: let the machine do the hard work of writing fast code
• What does tuning of dense BLAS, FFTs, signal processing, … have in common?
– Can do the tuning off-line: once per architecture, algorithm
– Can take as much time as necessary (hours, a week…)
– At run-time, algorithm choice may depend only on a few parameters (matrix dimensions, size of FFT, etc.)
• Can't always do tuning off-line
– Algorithm and implementation may strongly depend on data known only at run-time
– Ex: sparse matrix nonzero pattern determines both the best data structure and the implementation of sparse-matrix-vector multiplication (SpMV)
– Part of the search for the best algorithm must be done (very quickly!) at run-time
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Linear Programming Matrix
…
A Sparse Matrix You Encounter Every Day
SpMV with Compressed Sparse Row (CSR) Storage

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]

Only 2 flops per 2 mem_refs: limited by getting data from memory
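The pseudocode above can be sketched as a runnable C routine; the names `ptr`, `ind`, `val` follow the slide's conventions (this is an illustration, not OSKI's actual API):

```c
#include <assert.h>

/* CSR SpMV sketch: y += A*x. `ptr` has n+1 entries; row i's nonzeros
 * live in val[ptr[i] .. ptr[i+1]-1], with column indices in ind[].
 * Two flops (one multiply, one add) per nonzero fetched from memory. */
void spmv_csr(int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {              /* each row i */
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];          /* indirect access to x */
        y[i] = yi;
    }
}
```

Note the half-open loop bound `k < ptr[i+1]`, the usual C idiom for the slide's inclusive "to ptr[i+1]-1".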
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA structural analysis problem
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA structural analysis problem
• 8x8 dense substructure: exploit this to limit #mem_refs
Taking advantage of block structure in SpMV
• Bottleneck is time to get the matrix from memory
– Only 2 flops for each nonzero in the matrix
• Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
– Storage drops by up to 2x if r·c >> 1 (all 32-bit quantities)
– Time to fetch the matrix from memory decreases
• Change both data structure and algorithm
– Need to pick r and c
– Need to change the algorithm accordingly
• In the example, is r=c=8 the best choice?
– Minimizes storage, so it looks like a good idea…
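The register-blocked (BCSR) idea can be sketched for a fixed 2x2 block size. The layout here (one column index per block, 4 values per block stored row-major) and all names are illustrative choices, not OSKI's internal format:

```c
#include <assert.h>

/* Register-blocked SpMV sketch, fixed 2x2 blocks: y += A*x.
 * nb block rows; block row i covers scalar rows 2i and 2i+1.
 * One index per block instead of per nonzero, and the 2x2
 * block multiply is fully unrolled into registers. */
void spmv_bcsr_2x2(int nb, const int *bptr, const int *bind,
                   const double *val,   /* 4 values per block, row-major */
                   const double *x, double *y)
{
    for (int i = 0; i < nb; i++) {
        double y0 = y[2*i], y1 = y[2*i + 1];
        for (int k = bptr[i]; k < bptr[i + 1]; k++) {
            const double *b = val + 4*k;
            double x0 = x[2*bind[k]], x1 = x[2*bind[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;   /* unrolled 2x2 block multiply */
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*i] = y0;
        y[2*i + 1] = y1;
    }
}
```

Keeping y0/y1 (and x0/x1) in registers across the block is exactly the reuse that plain CSR cannot express.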
Speedups on Itanium 2: The Need for Search
Reference → Best: 4x2 (both in Mflop/s)
Register Profile: Itanium 2
190 to 1190 Mflop/s
Register Profiles: IBM and Intel IA-64

Power3 - 17% (122 to 252 Mflop/s)
Power4 - 16% (459 to 820 Mflop/s)
Itanium 1 - 8% (107 to 247 Mflop/s)
Itanium 2 - 33% (190 Mflop/s to 1.2 Gflop/s)
Register Profiles: Sun and Intel x86

Ultra 2i - 11% (35 to 72 Mflop/s)
Ultra 3 - 5% (50 to 90 Mflop/s)
Pentium III - 21% (42 to 108 Mflop/s)
Pentium III-M - 15% (58 to 122 Mflop/s)
Another example of tuning challenges
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1M
3x3 blocks look natural, but…
• More complicated non-zero structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of “fill-in”
Extra Work Can Improve Efficiency!
• More complicated non-zero structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
– Actual Mflop rate is 1.5² = 2.25x higher
Automatic Register Block Size Selection
• Selecting the r x c block size
– Off-line benchmark
• Precompute Mflops(r,c) using a dense A, for each r x c
• Once per machine/architecture
– Run-time "search"
• Sample A to estimate Fill(r,c) for each r x c
– Run-time heuristic model
• Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
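The selection step can be sketched as a small exhaustive scan over a precomputed benchmark table and a run-time fill estimate; the table contents, the `RMAX` limit, and the names are all placeholders for illustration:

```c
#include <assert.h>

/* Block-size selection sketch: given Mflops(r,c) from the off-line
 * dense benchmark and Fill(r,c) estimated at run-time, pick the (r,c)
 * minimizing estimated time ~ Fill(r,c) / Mflops(r,c).
 * Arrays are indexed [r-1][c-1] for r,c in 1..RMAX. */
#define RMAX 8
void choose_block(double mflops[RMAX][RMAX], double fill[RMAX][RMAX],
                  int *rbest, int *cbest)
{
    double best = -1.0;                     /* sentinel: no candidate yet */
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= RMAX; c++) {
            double t = fill[r-1][c-1] / mflops[r-1][c-1];
            if (best < 0.0 || t < best) {
                best = t;
                *rbest = r;
                *cbest = c;
            }
        }
}
```

The scan itself is trivial; all the cost and accuracy questions live in how Fill(r,c) is estimated, which the next slide addresses.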
Accurate and Efficient Adaptive Fill Estimation
• Idea: sample the matrix
– Fraction of matrix to sample: s ∈ [0,1]
– Control cost = O(s · nnz) by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence intervals, by monitoring variance
• Cost of tuning
– Lower bound: convert matrix in 5 to 40 unblocked SpMVs
– Heuristic: 1 to 11 SpMVs
• Tuning only useful when we do many SpMVs
– Common case, e.g., in sparse solvers
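A minimal sketch of the sampling idea, simplified in one respect: it visits every `stride`-th block row deterministically, whereas the scheme above samples randomly and sizes the sample via confidence intervals. Names and layout are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <math.h>

/* Fill-ratio estimation sketch for r x c blocking over CSR (ptr, ind):
 * scan a sample of block rows, count the distinct r x c blocks they
 * touch, and compare blocked storage (blocks * r * c) with the true
 * nonzero count in the sampled rows. */
double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                     int r, int c, int stride)
{
    int nbc = (ncols + c - 1) / c;          /* number of column blocks */
    int *seen = calloc(nbc, sizeof(int));   /* stamp of last block row */
    long true_nnz = 0, blocks = 0;
    for (int ib = 0; ib * r < n; ib += stride) {
        int stamp = ib + 1;
        int iend = (ib + 1) * r;
        if (iend > n) iend = n;
        for (int i = ib * r; i < iend; i++)
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                true_nnz++;
                int jb = ind[k] / c;        /* column block touched */
                if (seen[jb] != stamp) { seen[jb] = stamp; blocks++; }
            }
    }
    free(seen);
    return true_nnz ? (double)(blocks * r * c) / true_nnz : 1.0;
}
```

A fill ratio near 1 means r x c blocking stores few explicit zeros; the heuristic feeds this into Fill(r,c)/Mflops(r,c).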
Accuracy of the Tuning Heuristics (1/4)
NOTE: "fair" flops used (ops on explicit zeros not counted as "work"). See p. 375 of Vuduc's thesis for the list of matrices.
Accuracy of the Tuning Heuristics (2/4)
Accuracy of the Tuning Heuristics (3/4): DGEMV
Upper Bounds on Performance for blocked SpMV

• P = (flops) / (time)
– Flops = 2 * nnz(A)
• Upper bound on speed: two main assumptions
– 1. Count memory ops only (streaming)
– 2. Count only compulsory, capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge minimum access "latency" αᵢ at the Lᵢ cache, α_mem at memory
– e.g., Saavedra-Barrera and PMaC MAPS benchmarks

Time = Σᵢ Hitsᵢ · αᵢ + Misses_mem · α_mem, where Hits₁ = Loads − Misses₁ and Hitsᵢ = Missesᵢ₋₁ − Missesᵢ for i > 1.
Example: Bounds on Itanium 2
Summary of Other Performance Optimizations
• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Reordering to create dense structure + splitting: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
– A²·x: 2x over CSR, 1.5x over RB
– [A·x, A²·x, A³·x, …, Aᵏ·x] …. more to say later
Example: Sparse Triangular Factor
• Raefsky4 (structural problem) + SuperLU + colmmd
• N = 19779, nnz = 12.6 M
• Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!)
• 1.8x over CSR
Cache Optimizations for A·Aᵀ·x

• Cache-level: interleave multiplication by A and Aᵀ
– Only fetch A from memory once
• Register-level: aᵢᵀ can be an r×c block row, or diagonal

A·Aᵀ·x = [a₁ … aₙ] [a₁ᵀ; …; aₙᵀ] · x = Σᵢ₌₁ⁿ aᵢ · (aᵢᵀ·x)

Each term is a dot product (aᵢᵀ·x) followed by an "axpy" (y += scalar · aᵢ).
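The interleaving can be sketched in C for the companion kernel Aᵀ·A·x, where the rows of a CSR-stored A play the role of the aᵢᵀ terms: each sparse row is fetched from memory once and used twice, first for the dot product, then for the axpy. Names are illustrative:

```c
#include <assert.h>

/* Interleaved A^T*A*x sketch over CSR: for each sparse row a_i of A,
 * compute t = a_i . x (dot product), then y += t * a_i (axpy).
 * Sum over rows gives y += A^T * (A * x) with one pass over A. */
void atax_csr(int n, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* dot product */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* axpy */
            y[ind[k]] += t * val[k];
    }
}
```

The naive alternative (compute z = A·x, then y = Aᵀ·z) reads A from memory twice; this version halves that traffic, which is the point of the cache-level optimization above.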
Example: Combining Optimizations (1/2)
• Register blocking, symmetry, multiple (k) vectors
– Three low-level tuning parameters: r, c, v

[Figure: Y += A·X, with r×c register blocks in A and the k vectors of X and Y processed v at a time]
Example: Combining Optimizations (2/2)
• Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
– Symmetric, blocked, 1 vector
• Up to 2.6x over nonsymmetric, blocked, 1 vector
– Symmetric, blocked, k vectors
• Up to 2.1x over nonsymmetric, blocked, k vectors
• Up to 7.3x over nonsymmetric, nonblocked, 1 vector
– Symmetric storage: up to 64.7% savings
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
– Symmetric storage
– Register blocking
– Reordering, to create more dense blocks
• Reverse Cuthill-McKee ordering to reduce bandwidth
– Do breadth-first search, number nodes in reverse order visited
• Traveling Salesman Problem-based ordering to create blocks
– Nodes = columns of A
– Weights(u, v) = no. of nonzeros u, v have in common
– Tour = ordering of columns
– Choose maximum-weight tour
– See [Pinar & Heath '97]
• 2.1x speedup on IBM Power 4
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Post-RCM Reordering
100x100 Submatrix Along Diagonal
"Microscopic" Effect of RCM Reordering

Before: Green + Red; After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red; After: Green + Blue
(Omega3P)
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
– BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
– Hides complexity of run-time tuning
– Includes new, faster locality-aware kernels: AᵀA·x, Aᵏ·x
• Faster than standard implementations
– Up to 4x faster matvec, 1.8x trisolve, 4x AᵀA·x
• For "advanced" users & solver library writers
– Available as stand-alone library (OSKI 1.0.1h, 6/07)
– Available as PETSc extension (OSKI-PETSc .1d, 3/06)
– bebop.cs.berkeley.edu/oski
How the OSKI Tunes (Overview)

Library install-time (offline):
1. Build for target arch.
2. Benchmark
→ produces benchmark data and generated code variants

Application run-time:
1. Evaluate heuristic models (inputs: benchmark data, the user's matrix, history, and workload from program monitoring)
2. Select data structure & code
→ returns to the user a matrix handle for kernel calls

Extensibility: advanced users may write & dynamically add "code variants" and "heuristic models" to the system.
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: "wrap" existing data structures
– Step 2: make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: "wrap" existing data structures
– Step 2: make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: "wrap" existing data structures
– Step 2: make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */
How to Call OSKI: Tune with Explicit Hints
• User calls "tune" routine
– May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
How the User Calls OSKI: Implicit Tuning
• Ask library to infer workload
– Library profiles all kernel calls
– May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

for( i = 0; i < 500; i++ ) {
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
    oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}
Multicore SMPs Used for Tuning SpMV
AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown),
IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with Conventional cache-based memory hierarchy
AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown),
IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with local store-based memory hierarchy
AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown),
IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
Multicore SMPs with CMT = Chip-MultiThreading
AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown),
IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
Multicore SMPs: number of threads

AMD Opteron 2356 (Barcelona): 8 threads
Intel Xeon E5345 (Clovertown): 8 threads
Sun T2+ T5140 (Victoria Falls): 128 threads
IBM QS20 Cell Blade: 16* threads (*SPEs only)
Multicore SMPs: peak double precision flops

AMD Opteron 2356 (Barcelona): 74 GFlop/s
Intel Xeon E5345 (Clovertown): 75 GFlop/s
Sun T2+ T5140 (Victoria Falls): 19 GFlop/s
IBM QS20 Cell Blade: 29* GFlop/s (*SPEs only)
Multicore SMPs: total DRAM bandwidth

Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
AMD Opteron 2356 (Barcelona): 21 GB/s
Sun T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
IBM QS20 Cell Blade: 51 GB/s
Multicore SMPs with Non-Uniform Memory Access - NUMA
AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown),
IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
Set of 14 test matrices

• All bigger than the caches of our SMPs
• Dense: 2K x 2K dense matrix stored in sparse format
• Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
• Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
• Extreme aspect ratio (linear programming): LP
SpMV Performance: Naive parallelization
• Out-of-the-box SpMV performance on a suite of 14 matrices (Naïve vs. Naïve Pthreads)
• Scalability isn't great: compare to the thread counts (8, 8, 128, 16)
SpMV Performance: NUMA and Software Prefetching

NUMA-aware allocation is essential on NUMA SMPs.

Explicit software prefetching can boost bandwidth and change cache replacement policies.

Used exhaustive search to pick the best prefetch distance.
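The prefetching idea can be sketched with the GCC/Clang `__builtin_prefetch` intrinsic: issue prefetches a fixed distance ahead in the val/ind streams while multiplying. The distance `PFDIST` here is a made-up value; as noted above, good distances were found by exhaustive search:

```c
#include <assert.h>

/* CSR SpMV with software prefetch sketch (GCC/Clang builtin).
 * Arguments to __builtin_prefetch: address, rw=0 (read),
 * locality=0 (streaming data, low temporal reuse). */
#define PFDIST 64   /* hypothetical prefetch distance, in nonzeros */
void spmv_csr_pf(int n, const int *ptr, const int *ind,
                 const double *val, const double *x, double *y)
{
    int nnz = ptr[n];
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            if (k + PFDIST < nnz) {
                __builtin_prefetch(&val[k + PFDIST], 0, 0);
                __builtin_prefetch(&ind[k + PFDIST], 0, 0);
            }
            yi += val[k] * x[ind[k]];
        }
        y[i] = yi;
    }
}
```

Prefetching val/ind is easy because they are streamed in order; the indirect x[ind[k]] accesses are the hard part and are what cache/TLB blocking (next slides) targets.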
SpMV Performance: "Matrix Compression"

Compression includes register blocking, other formats, and smaller indices.

Use a heuristic rather than search.
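One form of "smaller indices" can be sketched directly: when the matrix has fewer than 65536 columns, the column indices fit in 16 bits, halving the index stream fetched from memory. This is an illustrative sketch; real tuners combine it with blocking and alternative formats:

```c
#include <assert.h>
#include <stdint.h>

/* CSR SpMV with compressed 16-bit column indices: y += A*x.
 * Valid only when ncols < 65536; ind[] is uint16_t instead of int,
 * so the index array is half the size (and half the DRAM traffic). */
void spmv_csr16(int n, const int *ptr, const uint16_t *ind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}
```

Since SpMV is memory-bound, shrinking the bytes moved per nonzero translates almost directly into speedup, which is why compression helps on every platform in the figures.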
SpMV Performance: cache and TLB blocking

Optimizations, applied cumulatively: Naïve → Naïve Pthreads → +NUMA/Affinity → +SW Prefetching → +Matrix Compression → +Cache/LS/TLB Blocking
SpMV Performance: Architecture specific optimizations
SpMV Performance: max speedup

• Fully auto-tuned SpMV performance across the suite of matrices
• Included SPE/local store optimized version
• Why do some optimizations work better on some architectures?

Maximum auto-tuning speedups over naïve: 2.7x, 4.0x, 2.9x, 35x
EXTRA SLIDES