CS 770G - Parallel Algorithms in Scientific Computing
Lecture 6, May 28, 2001
Dense Matrix Computation II: Solving Linear Systems
References
• Kumar, Grama, Gupta, Karypis, Introduction to Parallel Computing, Benjamin Cummings.
• Dongarra, Duff, Sorensen, van der Vorst, Numerical Linear Algebra for High-Performance Computers, SIAM.
• A portion of the notes comes from Prof. J. Demmel's CS267 course at UC Berkeley.
Review of Gaussian Elimination (GE) for Solving Ax=b
• Add multiples of each row to later rows to make A upper triangular.
• Solve the resulting triangular system Ux = c by substitution.

for i = 1 to n-1        … for each column i, zero it out below the diagonal
                          by adding multiples of row i to later rows
    for j = i+1 to n    … for each row j below row i
        for k = i to n  … add a multiple of row i to row j
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Refine GE Algorithm (1)
• Initial version:

for i = 1 to n-1        … for each column i, zero it out below the diagonal
    for j = i+1 to n    … for each row j below row i
        for k = i to n  … add a multiple of row i to row j
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Remove computation of the constant A(j,i)/A(i,i) from the inner loop:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (2)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)

• Don't compute what we already know: the zeros below the diagonal in column i.

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (3)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)

• Store the multipliers m below the diagonal in the zeroed entries for later use:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
Refine GE Algorithm (4)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)

• Express using matrix operations (BLAS):

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)
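
• For concreteness, a minimal NumPy sketch of this final version (assuming square A and nonzero pivots; the vector scale and rank-1 update mirror the slide):

    import numpy as np

    def ge_no_pivot(A):
        """In-place GE: overwrites A with the multipliers L below the diagonal
        (unit diagonal implicit) and U on and above the diagonal."""
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                              # scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # rank-1 update
        return A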
What GE Really Computes
• Call the strictly lower triangular matrix of multipliers M, and let L = I + M.
• Call the upper triangle of the final matrix U.
• Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero), then A = L*U.
• Solving A*x = b using GE:
  – Factorize A = L*U using GE (cost = 2/3 n^3 flops).
  – Solve L*y = b for y, using substitution (cost = n^2 flops).
  – Solve U*x = y for x, using substitution (cost = n^2 flops).
• Thus A*x = (L*U)*x = L*(U*x) = L*y = b, as desired.

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)
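
• The two substitution sweeps can be sketched the same way, reusing the packed factors produced by ge_no_pivot above:

    import numpy as np

    def solve_with_lu(LU, b):
        """Solve A*x = b given packed L/U factors (unit-diagonal L below,
        U on and above the diagonal)."""
        n = LU.shape[0]
        y = np.asarray(b, dtype=float).copy()
        for i in range(n):                # forward: L*y = b (~n^2 flops)
            y[i] -= LU[i, :i] @ y[:i]
        x = y.copy()
        for i in range(n - 1, -1, -1):    # backward: U*x = y (~n^2 flops)
            x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
        return x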
Problems with the Basic GE Algorithm
• What if some A(i,i) is zero? Or very small?
  – The result may not exist, or may be "unstable", so we need to pivot.
• The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest.

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)    … BLAS 1 (scale a vector)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n)     … BLAS 2 (rank-1 update)
                       - A(i+1:n , i) * A(i , i+1:n)

(Figure: performance chart comparing BLAS 1, BLAS 2, BLAS 3, and machine peak.)
Pivoting in Gaussian Elimination
• A = [ 0 1 ]  fails completely, even though A is "easy".
      [ 1 0 ]
• Illustrate the problem in 3-decimal-digit arithmetic:

      A = [ 1e-4 1 ]  and  b = [ 1 ];  the correct answer to 3 places is x = [ 1 ]
          [ 1    1 ]           [ 2 ]                                         [ 1 ]

• The result of the LU decomposition is

      L = [ 1          0 ]  =  [ 1   0 ]    … no roundoff error yet
          [ fl(1/1e-4) 1 ]     [ 1e4 1 ]

      U = [ 1e-4 1           ]  =  [ 1e-4 1    ]    … error in 4th decimal place
          [ 0    fl(1-1e4*1) ]     [ 0    -1e4 ]

      Check: L*U = [ 1e-4 1 ]    … (2,2) entry entirely wrong
                   [ 1    0 ]

• The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5.
• This is numerical instability: the computed solution x is totally inaccurate.
• Cure: pivot (swap rows of A) so that the entries of L and U stay bounded.
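
• The same effect can be reproduced in low precision: float16 carries roughly 3 decimal digits, so this small sketch mirrors the example above:

    import numpy as np

    # One GE step on the slide's matrix in half precision, no pivoting.
    A = np.array([[1e-4, 1.0],
                  [1.0,  0.0]], dtype=np.float16)
    m = A[1, 0] / A[0, 0]          # multiplier fl(1/1e-4), about 1e4
    u22 = A[1, 1] - m * A[0, 1]    # fl(A(2,2) - 1e4*1): A(2,2) is "forgotten"
    print(m, u22)                  # any |A(2,2)| below the rounding unit at
                                   # magnitude 1e4 gives the same L and U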
Gaussian Elimination with Partial Pivoting (GEPP)
• Partial pivoting: swap rows so that each multiplier satisfies |L(j,i)| = |A(j,i)/A(i,i)| <= 1.

for i = 1 to n-1
    find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
        … i.e. the largest entry in the rest of column i
    if |A(k,i)| = 0
        exit with a warning that A is singular, or nearly so
    elseif k != i
        swap rows i and k of A
    end if
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)    … each quotient lies in [-1,1]
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)

• Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix.
• Since every entry of L satisfies |L(i,j)| <= 1, this algorithm is considered numerically stable.
• For details, see the LAPACK code at www.netlib.org/lapack/single/sgetf2 and Dongarra's book.
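
• A minimal NumPy rendering of this loop (an illustration, not LAPACK's sgetf2 itself; it also accepts tall m-by-n panels so the blocked sketch later can reuse it):

    import numpy as np

    def gepp(A):
        """GEPP with packed L/U. Returns (A, piv) such that the original
        rows reordered as A_orig[piv] satisfy A_orig[piv] = L*U."""
        A = np.asarray(A, dtype=float).copy()
        m, n = A.shape
        piv = np.arange(m)
        for i in range(min(m - 1, n)):
            k = i + np.argmax(np.abs(A[i:, i]))   # largest entry in rest of column i
            if A[k, i] == 0.0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k]] = A[[k, i]]             # swap rows i and k
                piv[[i, k]] = piv[[k, i]]
            A[i+1:, i] /= A[i, i]                 # each multiplier lies in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A, piv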
Summary
• You have seen an example of how a typical (and important) matrix operation can be reduced to lower-level BLAS routines that have been optimized for a given hardware platform.
• Any other matrix operation you may think of (matrix-matrix multiply, Cholesky factorization, QR, Householder method, etc.) can be constructed from BLAS subprograms in a similar fashion, and in fact has been, in the package called LAPACK.
• Note that only Level 1 and 2 BLAS routines have been used in the LU decomposition so far. Efficiency considerations?
Overview of LAPACK
• Standard library for dense/banded linear algebra:
  – Linear systems: A*x = b
  – Least squares problems: min_x || A*x - b ||_2
  – Eigenvalue problems: Ax = λx, Ax = λBx
  – Singular value decomposition (SVD): A = UΣV^T
• Algorithms reorganized to use BLAS3 as much as possible.
• Basis of the math libraries on many computers.
• Many algorithmic innovations remain.
Performance of LAPACK (n=1000)

(Figure: performance plot, not reproduced.)

Performance of LAPACK (n=100)

(Figure: performance plot, not reproduced.)
Summary, Cont'd
• We need to devise algorithms that can make use of Level 3 BLAS (matrix-matrix) routines, for several reasons:
  – Level 3 routines are known to run much more efficiently, due to their larger ratio of computation to memory references/communication.
  – Parallel algorithms on distributed-memory machines require decomposing the original matrix into blocks that reside on each processor (similar to HW1).
  – Parallel algorithms require minimizing the surface-to-volume ratio of our decompositions, and blocking becomes the natural approach.
Converting BLAS2 to BLAS3 in GEPP
• Blocking
  – Used to optimize matrix multiplication.
  – Harder here because of the data dependencies in GEPP.
• Delayed updates
  – Save up the updates to the "trailing matrix" from several consecutive BLAS2 updates.
  – Apply the many saved updates simultaneously in one BLAS3 operation.
• The same idea works for much of dense linear algebra.
  – Open questions remain.
• Need to choose a block size b (k in Dongarra's book):
  – The algorithm will save and apply b updates.
  – b must be small enough that the active submatrix consisting of b columns of A fits in cache.
  – b must be large enough to make BLAS3 fast.
Blocked Algorithms - LU Factorization
• Partition A, L, and U into 3-by-3 blocks, so that A = L*U reads:

    [ A11 A12 A13 ]   [ L11         ] [ U11 U12 U13 ]
    [ A21 A22 A23 ] = [ L21 L22     ] [     U22 U23 ]
    [ A31 A32 A33 ]   [ L31 L32 L33 ] [         U33 ]

• Multiplying out the blocks, column by column:

    A11 = L11*U11    A12 = L11*U12              A13 = L11*U13
    A21 = L21*U11    A22 = L21*U12 + L22*U22    A23 = L21*U13 + L22*U23
    A31 = L31*U11    A32 = L31*U12 + L32*U22    A33 = L31*U13 + L32*U23 + L33*U33
Blocked Algorithms - LU Factorization
• With these block relationships, we can develop different algorithms by choosing the order in which the operations are performed.
• The block size, b, needs to be chosen carefully: b = 1 produces the usual algorithm, while b > 1 improves performance on a single processor.
• Three natural variants:
  – Left-Looking LU
  – Right-Looking LU
  – Crout LU
Blocked Algorithms - LU Factorization
• Assume you have already done the first block row and column of the GEPP, i.e. the first block column of L and the first block row of U are known (* marks blocks not yet computed):

    [ A11 A12 A13 ]   [ L11       ] [ U11 U12 U13 ]
    [ A21 A22 A23 ] = [ L21  *    ] [      *   *  ]
    [ A31 A32 A33 ]   [ L31  *  * ] [          *  ]

• And you have the sub-block below left to work on:

    [ A22 A23 ]
    [ A32 A33 ]

• Notice that the LU decomposition of this sub-block is independent of the portion you have already completed.
Blocked Algorithms - LU Factorization
• For simplicity, change notation and write the remaining factorization in 2-by-2 block form, with P the row permutation from pivoting:

    P [ A11 A12 ]   [ L11     ] [ U11 U12 ]
      [ A21 A22 ] = [ L21 L22 ] [     U22 ]

• Now you can re-arrange the block equations by substituting:

    A11 = L11*U11    A12 = L11*U12
    A21 = L21*U11    A22 = L21*U12 + L22*U22

• Notice that, once you have done Gaussian elimination on A11 and A21, you have already obtained L11, U11, and L21.
• The remaining blocks then follow from:

    U12 <- (L11)^-1 * A12
    A~  <- A22 - L21*U12

• And repeat the procedure recursively on A~.
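
• In NumPy terms, one step of this recursion might look like the sketch below (block_step is an illustrative helper; SciPy's solve_triangular stands in for applying (L11)^-1):

    import numpy as np
    from scipy.linalg import solve_triangular

    def block_step(L11, L21, A12, A22):
        """One step of the blocked recursion: given the factors of the first
        block column (unit-lower L11 and L21), form U12 and the sub-block A~."""
        U12 = solve_triangular(L11, A12, lower=True, unit_diagonal=True)  # L11*U12 = A12
        A_tilde = A22 - L21 @ U12    # A~ <- A22 - L21*U12; recurse on A_tilde
        return U12, A_tilde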
Blocked Algorithms - LU Factorization
• A graphical view of what is going on is given by the following figure.
Blocked Algorithms - LU Factorization

(Figure: Left-Looking LU, Right-Looking LU, and Crout LU side by side; shading distinguishes the pre-computed sub-blocks from the sub-blocks currently being operated on.)

• Variations in the algorithm are due to the order in which the submatrix operations are performed. There are slight advantages to Crout's algorithm (a hybrid of the first two).
Review: BLAS 3 (Blocked) GEPP

for ib = 1 to n-1 step b    … process matrix b columns at a time
    end = ib + b-1          … point to end of block of b columns
    apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P' * L' * U'
    … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
    A(ib:end , end+1:n) = LL^-1 * A(ib:end , end+1:n)    … update next b rows of U
    A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
                           - A(end+1:n , ib:end) * A(ib:end , end+1:n)
                           … apply delayed updates with a single matrix
                             multiply with inner dimension b (the BLAS 3 step)
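
• A hedged NumPy sketch of this blocked loop, reusing the gepp panel factorization sketched earlier (an assumption of this example) with SciPy's solve_triangular playing the role of LL^-1:

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_gepp(A, b=64):
        """Blocked right-looking LU with partial pivoting; packed L/U plus piv."""
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        piv = np.arange(n)
        for ib in range(0, n - 1, b):
            end = min(ib + b, n)
            panel, p = gepp(A[ib:, ib:end])   # BLAS2 GEPP on the b-column panel
            A[ib:, ib:end] = panel
            A[ib:, :ib] = A[ib:, :ib][p]      # apply the panel's row swaps
            A[ib:, end:] = A[ib:, end:][p]    # ... across the whole matrix
            piv[ib:] = piv[ib:][p]
            if end < n:
                # Update the next b rows of U: solve LL * X = A(ib:end, end:n).
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:], lower=True)
                # Apply the delayed updates with one matrix multiply (BLAS3).
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A, piv

• As the previous slides note, the block size b trades cache reuse in the panel against the size of the BLAS3 update.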
Parallel Algorithms for Dense Matrices
• All that follows is applicable to dense or full matrices only. Square matrices are discussed, but the arguments are valid for rectangular matrices as well.
• Typical parallelization steps:
  – Decomposition: identify parallel work and partitioning.
  – Mapping: decide which procs execute which portion of the work.
  – Assignment: load-balance the work among the procs.
  – Organization: communication and synchronization.
Parallel Algorithms for Dense Matrices
• The proc that owns a given portion of a matrix is responsible for doing all of the computation that involves that portion of the matrix.
• This is the sensible thing to do, since communication is minimized (although, due to data dependencies within the matrix, some communication will still be necessary).
• The question is: how should we subdivide a matrix so that parallel efficiency is maximized? There are various options.
Different Data Layouts for Parallel GE

(Figure: candidate layouts, annotated as follows.)
• Column blocked layout: bad load balance; P0 is idle after the first n/4 steps.
• Column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3.
• Column block cyclic layout: can trade load balance and BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck.
• Block skewed layout: complicated addressing.
• Row and column block cyclic layout: the winner!
Row and Column Block Cyclic Layout
• The matrix is composed of brow-by-bcol blocks.
• Procs are arranged in a 2D array indexed by (pi, pj), with 0 <= pi < Prow and 0 <= pj < Pcol.
• Element A(i,j) is mapped to proc (pi, pj) using the formulae:

    pi = floor(i / brow) mod Prow
    pj = floor(j / bcol) mod Pcol

• In the figure, p = 4, Prow = Pcol = brow = bcol = 2.
• There is Pcol-fold parallelism in any column, and calls to BLAS2 and BLAS3 operate on matrices of size brow-by-bcol.
• The serial bottleneck is eased.
• The layout need not be symmetric in rows and columns.
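
• The mapping is simple to code; a small Python rendering (names follow the slide):

    def block_cyclic_owner(i, j, brow, bcol, Prow, Pcol):
        """Proc (pi, pj) that owns element (i, j) in the 2D block cyclic layout."""
        return (i // brow) % Prow, (j // bcol) % Pcol

    # The figure's case: p = 4 procs, Prow = Pcol = brow = bcol = 2.
    print(block_cyclic_owner(3, 0, brow=2, bcol=2, Prow=2, Pcol=2))   # -> (1, 0)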
Row and Column Block Cyclic Layout
• In LU factorization, the distribution of work becomes uneven as the computation progresses.
• Larger block sizes result in greater load imbalance but reduce the frequency of communication between procs.
• The block size controls these tradeoffs.
• Some procs need to do more work between synchronization points than others (e.g., during partial pivoting over the rows of a single block column the other procs stay idle, and the computation of each block row of U requires the solution of a lower triangular system across the procs in a single row).
• The processor decomposition controls this type of tradeoff.
Parallel GE with a 2D Block Cyclic Layout
• The block size, b, in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol.
• Shaded regions in the (omitted) figure indicate busy processors or communication being performed.
• It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined.
• See Dongarra's book for more details.
(Slides 31-33: step-by-step animation of the blocked parallel GE; the trailing-matrix update is a matrix multiply of green = green - blue * pink.)
Parallel Matrix Transpose
• The transpose of a matrix A is defined by:

    A^T(i,j) = A(j,i),   0 <= i, j < n

• All elements below the diagonal move above the diagonal and vice versa.
• Assume it takes 1 unit of time to exchange a pair of matrix elements. The sequential time of transposing an n-by-n matrix is then:

    T_s = (n^2 - n) / 2

• Consider parallel architectures organized in both 2D mesh and hypercube structures.
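
• A direct rendering of that count: the strict lower triangle holds (n^2 - n)/2 entries, each exchanged once with its mirror image. A small Python check:

    import numpy as np

    def transpose_in_place(A):
        """Swap each strictly-lower entry with its mirror; return the swap count."""
        n = A.shape[0]
        swaps = 0
        for i in range(1, n):
            for j in range(i):
                A[i, j], A[j, i] = A[j, i], A[i, j]
                swaps += 1
        return swaps

    A = np.arange(16.0).reshape(4, 4)
    print(transpose_in_place(A), (4 * 4 - 4) // 2)   # both print 6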
Parallel Matrix Transpose - 2D Mesh

(Figure: a 4-by-4 grid of procs P0 through P15; the initial matrix on the left, the final, transposed matrix on the right.)

• Elements/blocks on the lower-left part of the matrix move up to the diagonal, and then right to their final location. Each step taken requires communication.
• Elements/blocks on the upper-right part of the matrix move down to the diagonal, and then left to their final location. Each step taken requires communication.
Parallel Matrix Transpose - 2D Mesh
• If each of the p procs contains a single number, then after all of these communication steps the matrix has been transposed.
• However, if each proc contains a sub-block of the matrix, then after all blocks have been communicated to their final locations, they still need to be locally transposed. Each sub-block contains (n/√p)-by-(n/√p) elements, and the cost of communication is higher than before.
• The cost of communication is dominated by the elements/blocks that reside in the top-right and bottom-left corners; they have to take approximately 2√p hops.
Parallel Matrix Transpose - 2D Mesh
• Each block contains n^2/p elements, so it takes at most

    2 (t_s + t_w n^2/p) √p

  for all blocks to move to their final destinations. After that, the local blocks need to be transposed, which takes approximately n^2/(2p) time.
• Thus, the total wall-clock time is

    T_P = n^2/(2p) + 2 t_s √p + 2 t_w n^2/√p

• Summing over all p processors, the total time consumed by the parallel algorithm is

    T_TOTAL = Θ(n^2 √p)

  which is higher than the sequential complexity Θ(n^2): this algorithm on a 2D mesh is not cost-optimal. The same is true regardless of whether store-and-forward or cut-through routing is used.
Parallel Matrix Transpose - Hypercube

(Figure: the 4-by-4 proc grid is split into quadrants that are exchanged across the diagonal; the subdivision then repeats within each quadrant.)

• This algorithm is called recursive subdivision and maps naturally onto a hypercube.
• After the blocks have all been transposed, the elements inside each block (local to a proc) still need to be transposed.
• Wall-clock time and total time:

    T_P = n^2/(2p) + 2 (t_s + t_w n^2/p) log p        T_TOTAL = Θ(n^2 log p)
Parallel Matrix Transpose - Hypercube
• With cut-through routing, the timing improves slightly to

    T_P = (2 t_s + t_w n^2/p + t_h p) log p   ⇒   T_TOTAL = Θ(n^2 log p)

  which is still suboptimal.
• Using striped partitioning (a.k.a. the column blocked layout) and cut-through routing on a hypercube, the total time becomes cost-optimal:

    T_P = (t_s + t_w n^2/p^2)(p - 1) + (1/2) t_h p log p   ⇒   T_TOTAL = Θ(n^2)

• Note that this type of partitioning may be cost-optimal for the transpose operation, but not necessarily for other matrix operations, such as LU factorization and matrix-matrix multiply.
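
• For intuition, the three wall-clock models above can be evaluated side by side; the machine parameters below are illustrative assumptions, not measurements:

    import math

    def mesh_sf(n, p, ts, tw):           # 2D mesh, store-and-forward
        return n*n/(2*p) + 2*ts*math.sqrt(p) + 2*tw*n*n/math.sqrt(p)

    def hypercube_sf(n, p, ts, tw):      # recursive subdivision on a hypercube
        return n*n/(2*p) + 2*(ts + tw*n*n/p)*math.log2(p)

    def striped_ct(n, p, ts, tw, th):    # striped partitioning with cut-through
        return (ts + tw*n*n/(p*p))*(p - 1) + 0.5*th*p*math.log2(p)

    n, p, ts, tw, th = 4096, 64, 50.0, 1.0, 0.1   # assumed parameter values
    print("mesh SF      :", mesh_sf(n, p, ts, tw))
    print("hypercube SF :", hypercube_sf(n, p, ts, tw))
    print("striped CT   :", striped_ct(n, p, ts, tw, th))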