8/10/2019 UCB Sparse Tutorial 1
Sparse Matrix Techniques (Tutorial)
X. Sherry Li, Lawrence Berkeley National Lab
Math 290 / CS 298, UCB
Jan. 31, 2007
Outline
Part I
Computer representations of sparse matrices
Sparse matrix-vector multiply with various storages
Performance optimizations
Part II
Techniques for sparse factorizations
(e.g., SuperLU solver)
Sparse Storage Schemes
Notation
N: dimension
NNZ: number of nonzeros
Assume arbitrary sparsity pattern
Triplet format ({i, j, val}) is not sufficient . . .
Storage: 2*NNZ integers, NNZ reals
Not easy to randomly access one row or column
Linked-list format provides flexibility, but is not friendly on modern architectures . . .
Cannot call BLAS directly
Compressed Row Storage (CRS)
Store nonzeros row by row contiguously
Example: N = 7, NNZ = 19 (dots denote zeros):
1 . . a . . .
. 2 . . b . .
c d 3 . . . .
. e . 4 f . .
. . . . 5 . g
. . . h i 6 j
. . k . l . 7
3 arrays:
nzval  1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
colind 1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
rowptr 1 3 5 8 11 13 17 20
Storage: NNZ reals, NNZ+N+1 integers
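The three CRS arrays are usually built from triplet input with a counting pass; a sketch in C (0-based indices, unlike the 1-based arrays on the slide; function and variable names are illustrative):

```c
#include <stdlib.h>

/* Build CRS (rowptr/colind/nzval) from nnz triplets (ri, ci, v).
   rowptr is first used for per-row counts, then prefix-summed. */
void coo_to_crs(int n, int nnz, const int *ri, const int *ci, const double *v,
                int *rowptr, int *colind, double *nzval) {
    for (int i = 0; i <= n; i++) rowptr[i] = 0;
    for (int k = 0; k < nnz; k++) rowptr[ri[k] + 1]++;       /* count per row */
    for (int i = 0; i < n; i++) rowptr[i + 1] += rowptr[i];  /* prefix sum */

    int *next = malloc(n * sizeof *next);  /* next free slot in each row */
    for (int i = 0; i < n; i++) next[i] = rowptr[i];
    for (int k = 0; k < nnz; k++) {
        int p = next[ri[k]]++;
        colind[p] = ci[k];
        nzval[p]  = v[k];
    }
    free(next);
}
```

Entries land in row order automatically; within a row they keep the order of the triplet input.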
SpMV (y = Ax) with CRS
dot product
No locality for x
Vector length usually short
Memory-bound: 3 reads, 2 flops
nzval  1 a 2 b c d 3 e 4 f 5 g h i 6 j k l 7
colind 1 4 2 5 1 2 3 2 4 5 5 7 4 5 6 7 3 5 7
rowptr 1 3 5 8 11 13 17 20
do i = 1, N                            . . . row i of A
   sum = 0.0
   do j = rowptr(i), rowptr(i+1) - 1
      sum = sum + nzval(j) * x(colind(j))
   enddo
   y(i) = sum
enddo
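The same loop in C (0-based indices; a sketch with names matching the slide's arrays):

```c
/* y = A*x with CRS.  Inner loop: 3 reads (nzval, colind, x) and
   2 flops per nonzero; the accesses to x jump around (no locality). */
void spmv_crs(int n, const int *rowptr, const int *colind,
              const double *nzval, const double *x, double *y) {
    for (int i = 0; i < n; i++) {                 /* row i of A */
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += nzval[j] * x[colind[j]];
        y[i] = sum;
    }
}
```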
Compressed Column Storage (CCS)
Also known as Harwell-Boeing format
Store nonzeros columnwise contiguously
Example matrix (dots denote zeros):
1 . . a . . .
. 2 . . b . .
c d 3 . . . .
. e . 4 f . .
. . . . 5 . g
. . . h i 6 j
. . k . l . 7
3 arrays:
nzval  1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
rowind 1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
colptr 1 3 6 8 11 16 17 20
Storage: NNZ reals, NNZ+N+1 integers
SpMV (y = Ax) with CCS
SAXPY
No locality for y
Vector length usually short
Memory-bound: 3 reads, 1 write, 2 flops
y(i) = 0.0, i = 1, N
do j = 1, N                            . . . column j of A
   t = x(j)
   do i = colptr(j), colptr(j+1) - 1
      y(rowind(i)) = y(rowind(i)) + nzval(i) * t
   enddo
enddo
nzval  1 c 2 d e 3 k a 4 h b f 5 i l 6 g j 7
rowind 1 3 2 3 4 3 7 1 4 6 2 4 5 6 7 6 5 6 7
colptr 1 3 6 8 11 16 17 20
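The column-oriented loop in C (0-based; a sketch mirroring the slide's Fortran):

```c
/* y = A*x with CCS: a SAXPY per column.  The updates to y are
   scattered through memory (no locality) and each one costs a
   read and a write, unlike the CRS dot-product form. */
void spmv_ccs(int n, const int *colptr, const int *rowind,
              const double *nzval, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int j = 0; j < n; j++) {                 /* column j of A */
        double t = x[j];
        for (int i = colptr[j]; i < colptr[j + 1]; i++)
            y[rowind[i]] += nzval[i] * t;
    }
}
```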
Jagged Diagonal Storage (JDS)
Also known as ITPACK or Ellpack storage [Saad, Kincaid et al.]
Force all rows to have the same length as the longest row, then columns are stored contiguously
2 arrays: nzval(N,L) and colind(N,L), where L = max row length
Storage: N*L reals, N*L integers
Usually L << N
SpMV with JDS
Neither dot nor SAXPY
Good for vector processors: long vector length (N)
Extra memory and flops for padded zeros; especially bad if row lengths vary a lot
y(i) = 0.0, i = 1, N
do j = 1, L
   do i = 1, N
      y(i) = y(i) + nzval(i, j) * x(colind(i, j))
   enddo
enddo
Padded matrix (rows padded with zeros to length L = 4):
1 a 0 0
2 b 0 0
c d 3 0
e 4 f 0
5 g 0 0
h i 6 j
k l 7 0
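A C sketch of the padded scheme (row-major nzval[n][L] here; a padded slot holds value 0 with any in-range column index, so it adds flops but not wrong results):

```c
/* y = A*x with JDS/ELLPACK padding: every row has exactly L stored
   entries.  The inner loop runs over all n rows -> long vector length,
   at the cost of multiply-adds on the padded zeros. */
void spmv_jds(int n, int L, const double *nzval, const int *colind,
              const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int j = 0; j < L; j++)              /* j-th stored entry per row */
        for (int i = 0; i < n; i++)
            y[i] += nzval[i * L + j] * x[colind[i * L + j]];
}
```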
Segmented-Sum [Blelloch et al.]
Data structure is an augmented form of CRS
Computational structure is similar to JDS
Each row is treated as a segment in a long vector
Starred elements (*) denote the beginning of each segment (i.e., a row of A)
Dimensions: S * L ~ NNZ, where L is chosen to approximate the hardware vector length
Example matrix (dots denote zeros):
1 . . a . . .
. 2 . . b . .
c d 3 . . . .
. e . 4 f . .
. . . . 5 . g
. . . h i 6 j
. . k . l . 7
nzval stored down the columns of an S-by-L array (S = 5, L = 4):
*1*  d  *5*  j
 a   3   g  *k*
*2* *e* *h*  l
 b   4   i   7
*c*  f   6
SpMV with Segmented-Sum
2 arrays: nzval(S, L) and colind(S, L), where S*L ~ NNZ
Storage: NNZ reals, NNZ integers
Good for vector processors
SpMV is performed bottom-up, with each row-sum (dot product) of Ax stored at the beginning of its segment
Similar to JDS, but with more control logic in the inner loop
Example matrix (dots denote zeros):
1 . . a . . .
. 2 . . b . .
c d 3 . . . .
. e . 4 f . .
. . . . 5 . g
. . . h i 6 j
. . k . l . 7
nzval stored down the columns of an S-by-L array (S = 5, L = 4; * marks segment heads):
*1*  d  *5*  j
 a   3   g  *k*
*2* *e* *h*  l
 b   4   i   7
*c*  f   6
do i = S, 1, -1
   do j = 1, L
      . . .
   enddo
enddo
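The core idea can be sketched scalar-style in C (hypothetical names; the real kernel works bottom-up on the S-by-L layout with vector operations): form the products nzval(k)*x(colind(k)) in CRS order, and let a head flag start a new sum at each segment boundary:

```c
/* Segmented sum over a flat product vector: head[k] == 1 marks the
   first entry of a segment (a row of A).  One sum comes out per
   segment; the return value is the number of segments seen. */
int segmented_sum(int nnz, const double *prod, const int *head, double *out) {
    int r = -1;
    for (int k = 0; k < nnz; k++) {
        if (head[k]) out[++r] = 0.0;     /* new segment = new row sum */
        out[r] += prod[k];
    }
    return r + 1;
}
```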
Performance (megaflop rate) [Gaeke et al.]
Test matrix: N = 10000, NNZ = 177782, random pattern, ~18 nonzeros per row on average
JDS does 4.6x more operations

machine           Ultra 2i     Pentium 4    VIRAM
Clock rate        333 MHz      1.5 GHz      200 MHz
Peak flop rate    667 Mflops   1.5 Gflops   1.6 Gflops
CRS               29           209          110
JDS (effective)   27 (6)       17 (4)       632 (137)
Seg-Sum           --           295          165
Optimization Techniques
Matrix reordering
For CRS SpMV, can improve x-vector locality by reducing the bandwidth of matrix A
Example: reverse Cuthill-McKee (breadth-first search)
Observed 2-3x improvement [Toledo et al.]
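A minimal sketch of the idea in C: plain BFS order, reversed. Full reverse Cuthill-McKee also visits neighbors in increasing-degree order and picks a pseudo-peripheral start node; both refinements are omitted here, and the graph (the symmetric pattern of A, in CRS form) is assumed connected:

```c
#include <stdlib.h>

/* Reverse-BFS ordering of an undirected graph in CRS form
   (xadj/adjncy, 0-based).  perm[k] = old index of the node placed
   k-th in the new ordering.  Assumes the graph is connected. */
void reverse_bfs_order(int n, const int *xadj, const int *adjncy,
                       int start, int *perm) {
    char *seen = calloc(n, 1);
    int head = 0, tail = 0;
    seen[start] = 1;
    perm[tail++] = start;            /* perm doubles as the BFS queue */
    while (head < tail) {
        int u = perm[head++];
        for (int p = xadj[u]; p < xadj[u + 1]; p++) {
            int v = adjncy[p];
            if (!seen[v]) { seen[v] = 1; perm[tail++] = v; }
        }
    }
    for (int k = 0; k < n / 2; k++) {   /* reverse the BFS order */
        int t = perm[k]; perm[k] = perm[n - 1 - k]; perm[n - 1 - k] = t;
    }
    free(seen);
}
```

Applying perm symmetrically to rows and columns clusters each row's nonzeros near the diagonal, which is what improves x-vector locality in CRS SpMV.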
Optimization Techniques
Register blocking
Find dense blocks of size r-by-c in A (if needed, allow some zeros to be filled in)
A*x is processed block by block
Keep c elements of x and r elements of y in registers; each x element is re-used r times, each y element c times
The amount of indexed loads and stores is reduced
Obtained up to 2.5x improvement [Vuduc et al.]
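A sketch of the r = c = 2 case in C (the block-level CRS arrays browptr/bcolind/bval are hypothetical names): each 2x2 block update reuses two x values and two y accumulators that can stay in registers, and only one column index is loaded per 4 nonzeros:

```c
/* y = A*x with 2x2 block CRS.  nb = number of block rows; bval holds
   each block's 4 entries row-major; bcolind[k] is the block column. */
void spmv_bcrs_2x2(int nb, const int *browptr, const int *bcolind,
                   const double *bval, const double *x, double *y) {
    for (int I = 0; I < nb; I++) {            /* block row I */
        double y0 = 0.0, y1 = 0.0;            /* r = 2 accumulators */
        for (int k = browptr[I]; k < browptr[I + 1]; k++) {
            const double *b = bval + 4 * k;   /* the 2x2 block */
            double x0 = x[2 * bcolind[k]];    /* c = 2 reused x values */
            double x1 = x[2 * bcolind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I]     = y0;
        y[2 * I + 1] = y1;
    }
}
```

Explicit zeros filled into a block cost flops, so the blocking pays off only when most blocks are nearly dense.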
SPARSITY [Im, Yelick]
Performance Improvement [Vuduc et al.]
Other Representations
Block entry formats (e.g., multiple degrees of freedom are associated with a single physical location)
Constant block size (BCRS)
Varying block sizes (VBCRS)
Skyline (or profile) storage (SKS)
Lower triangle stored row by row; upper triangle stored column by column
In each row (column), the first nonzero defines a profile
All entries within the profile (some may be zeros) are stored
References
Templates for the Solution of Linear Systems, Barrett et al., SIAM, 1994
BeBOP: http://bebop.cs.berkeley.edu/
Sparse BLAS standard: http://www.netlib.org/blas/blast-forum