CS 770G - Parallel Algorithms in Scientific Computing
Lecture 6, May 28, 2001
Dense Matrix Computation II: Solving Linear Systems
References
• Kumar, Grama, Gupta, Karypis, Introduction to Parallel Computing, Benjamin Cummings.
• Dongarra, Duff, Sorensen, van der Vorst, Numerical Linear Algebra for High-Performance Computers, SIAM.
• A portion of the notes comes from Prof. J. Demmel's CS267 course at UC Berkeley.
Review of Gaussian Elimination (GE) for Solving Ax=b
• Add multiples of each row to later rows to make A upper triangular.
• Solve the resulting triangular system Ux = c by substitution.

for i = 1 to n-1        … for each column i, zero it out below the diagonal
                          by adding multiples of row i to later rows
    for j = i+1 to n    … for each row j below row i
        for k = i to n  … add a multiple of row i to row j
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Refine GE Algorithm (1)
• Initial version:

for i = 1 to n-1        … for each column i, zero it out below the diagonal
    for j = i+1 to n    … for each row j below row i
        for k = i to n  … add a multiple of row i to row j
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Remove computation of the constant A(j,i)/A(i,i) from the inner loop:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (2)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)

• Don't compute what we already know: the zeros below the diagonal in column i.

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (3)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)

• Store the multipliers m below the diagonal in the zeroed entries for later use:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
Refine GE Algorithm (4)
• Last version:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)

• Express using matrix operations (BLAS):

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)
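
• For concreteness, a minimal NumPy sketch of this final version (assuming square A and nonzero pivots; the vector scale and rank-1 update mirror the slide):

    import numpy as np

    def ge_no_pivot(A):
        """In-place GE: overwrites A with the multipliers L below the diagonal
        (unit diagonal implicit) and U on and above the diagonal."""
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                              # scale a vector
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])  # rank-1 update
        return A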
What GE Really Computes
• Call the strictly lower triangular matrix of multipliers M, and let L = I + M.
• Call the upper triangle of the final matrix U.
• Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero), then A = L*U.
• Solving A*x = b using GE:
  – Factorize A = L*U using GE (cost = 2/3 n^3 flops).
  – Solve L*y = b for y, using substitution (cost = n^2 flops).
  – Solve U*x = y for x, using substitution (cost = n^2 flops).
• Thus A*x = (L*U)*x = L*(U*x) = L*y = b, as desired.

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)
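
• The two substitution sweeps can be sketched the same way, reusing the packed factors produced by ge_no_pivot above:

    import numpy as np

    def solve_with_lu(LU, b):
        """Solve A*x = b given packed L/U factors (unit-diagonal L below,
        U on and above the diagonal)."""
        n = LU.shape[0]
        y = np.asarray(b, dtype=float).copy()
        for i in range(n):                # forward: L*y = b (~n^2 flops)
            y[i] -= LU[i, :i] @ y[:i]
        x = y.copy()
        for i in range(n - 1, -1, -1):    # backward: U*x = y (~n^2 flops)
            x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
        return x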
Problems with the Basic GE Algorithm
• What if some A(i,i) is zero? Or very small?
  – The result may not exist, or may be "unstable", so we need to pivot.
• The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest.

for i = 1 to n-1
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)    … BLAS 1 (scale a vector)
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n)     … BLAS 2 (rank-1 update)
                       - A(i+1:n , i) * A(i , i+1:n)

(Figure: performance chart comparing BLAS 1, BLAS 2, BLAS 3, and machine peak.)
Pivoting in Gaussian Elimination
• A = [ 0 1 ]  fails completely, even though A is "easy".
      [ 1 0 ]
• Illustrate the problem in 3-decimal-digit arithmetic:

      A = [ 1e-4 1 ]  and  b = [ 1 ];  the correct answer to 3 places is x = [ 1 ]
          [ 1    1 ]           [ 2 ]                                         [ 1 ]

• The result of the LU decomposition is

      L = [ 1          0 ]  =  [ 1   0 ]    … no roundoff error yet
          [ fl(1/1e-4) 1 ]     [ 1e4 1 ]

      U = [ 1e-4 1           ]  =  [ 1e-4 1    ]    … error in 4th decimal place
          [ 0    fl(1-1e4*1) ]     [ 0    -1e4 ]

      Check: L*U = [ 1e-4 1 ]    … (2,2) entry entirely wrong
                   [ 1    0 ]

• The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5.
• This is numerical instability: the computed solution x is totally inaccurate.
• Cure: pivot (swap rows of A) so that the entries of L and U stay bounded.
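
• The same effect can be reproduced in low precision: float16 carries roughly 3 decimal digits, so this small sketch mirrors the example above:

    import numpy as np

    # One GE step on the slide's matrix in half precision, no pivoting.
    A = np.array([[1e-4, 1.0],
                  [1.0,  0.0]], dtype=np.float16)
    m = A[1, 0] / A[0, 0]          # multiplier fl(1/1e-4), about 1e4
    u22 = A[1, 1] - m * A[0, 1]    # fl(A(2,2) - 1e4*1): A(2,2) is "forgotten"
    print(m, u22)                  # any |A(2,2)| below the rounding unit at
                                   # magnitude 1e4 gives the same L and U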
Gaussian Elimination with Partial Pivoting (GEPP)
• Partial pivoting: swap rows so that each multiplier satisfies |L(j,i)| = |A(j,i)/A(i,i)| <= 1.

for i = 1 to n-1
    find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
        … i.e. the largest entry in the rest of column i
    if |A(k,i)| = 0
        exit with a warning that A is singular, or nearly so
    elseif k != i
        swap rows i and k of A
    end if
    A(i+1:n , i) = A(i+1:n , i) / A(i,i)    … each quotient lies in [-1,1]
    A(i+1:n , i+1:n) = A(i+1:n , i+1:n) - A(i+1:n , i) * A(i , i+1:n)

• Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix.
• Since every entry of L satisfies |L(i,j)| <= 1, this algorithm is considered numerically stable.
• For details, see the LAPACK code at www.netlib.org/lapack/single/sgetf2 and Dongarra's book.
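
• A minimal NumPy rendering of this loop (an illustration, not LAPACK's sgetf2 itself; it also accepts tall m-by-n panels so the blocked sketch later can reuse it):

    import numpy as np

    def gepp(A):
        """GEPP with packed L/U. Returns (A, piv) such that the original
        rows reordered as A_orig[piv] satisfy A_orig[piv] = L*U."""
        A = np.asarray(A, dtype=float).copy()
        m, n = A.shape
        piv = np.arange(m)
        for i in range(min(m - 1, n)):
            k = i + np.argmax(np.abs(A[i:, i]))   # largest entry in rest of column i
            if A[k, i] == 0.0:
                raise ValueError("A is singular, or nearly so")
            if k != i:
                A[[i, k]] = A[[k, i]]             # swap rows i and k
                piv[[i, k]] = piv[[k, i]]
            A[i+1:, i] /= A[i, i]                 # each multiplier lies in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return A, piv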
Summary
• You have seen an example of how a typical (and important) matrix operation can be reduced to lower-level BLAS routines that have been optimized for a given hardware platform.
• Any other matrix operation you may think of (matrix-matrix multiply, Cholesky factorization, QR, Householder method, etc.) can be constructed from BLAS subprograms in a similar fashion, and in fact has been, in the package called LAPACK.
• Note that only Level 1 and 2 BLAS routines have been used in the LU decomposition so far. Efficiency considerations?
Overview of LAPACK
• Standard library for dense/banded linear algebra:
  – Linear systems: A*x = b
  – Least squares problems: min_x || A*x - b ||_2
  – Eigenvalue problems: Ax = λx, Ax = λBx
  – Singular value decomposition (SVD): A = UΣV^T
• Algorithms reorganized to use BLAS3 as much as possible.
• Basis of the math libraries on many computers.
• Many algorithmic innovations remain.
Performance of LAPACK (n=1000)

(Figure: performance plot, not reproduced.)

Performance of LAPACK (n=100)

(Figure: performance plot, not reproduced.)
Summary, Cont'd
• We need to devise algorithms that can make use of Level 3 BLAS (matrix-matrix) routines, for several reasons:
  – Level 3 routines are known to run much more efficiently, due to their larger ratio of computation to memory references/communication.
  – Parallel algorithms on distributed-memory machines require decomposing the original matrix into blocks that reside on each processor (similar to HW1).
  – Parallel algorithms require minimizing the surface-to-volume ratio of our decompositions, and blocking becomes the natural approach.
Converting BLAS2 to BLAS3 in GEPP
• Blocking
  – Used to optimize matrix multiplication.
  – Harder here because of the data dependencies in GEPP.
• Delayed updates
  – Save up the updates to the "trailing matrix" from several consecutive BLAS2 updates.
  – Apply the many saved updates simultaneously in one BLAS3 operation.
• The same idea works for much of dense linear algebra.
  – Open questions remain.
• Need to choose a block size b (k in Dongarra's book):
  – The algorithm will save and apply b updates.
  – b must be small enough that the active submatrix consisting of b columns of A fits in cache.
  – b must be large enough to make BLAS3 fast.
Blocked Algorithms - LU Factorization
• Partition A, L, and U into 3-by-3 blocks, so that A = L*U reads:

    [ A11 A12 A13 ]   [ L11         ] [ U11 U12 U13 ]
    [ A21 A22 A23 ] = [ L21 L22     ] [     U22 U23 ]
    [ A31 A32 A33 ]   [ L31 L32 L33 ] [         U33 ]

• Multiplying out the blocks, column by column:

    A11 = L11*U11    A12 = L11*U12              A13 = L11*U13
    A21 = L21*U11    A22 = L21*U12 + L22*U22    A23 = L21*U13 + L22*U23
    A31 = L31*U11    A32 = L31*U12 + L32*U22    A33 = L31*U13 + L32*U23 + L33*U33
Blocked Algorithms - LU Factorization
• With these block relationships, we can develop different algorithms by choosing the order in which the operations are performed.
• The block size, b, needs to be chosen carefully: b = 1 produces the usual algorithm, while b > 1 improves performance on a single processor.
• Three natural variants:
  – Left-Looking LU
  – Right-Looking LU
  – Crout LU
Blocked Algorithms - LU Factorization
• Assume you have already done the first block row and column of the GEPP, i.e. the first block column of L and the first block row of U are known (* marks blocks not yet computed):

    [ A11 A12 A13 ]   [ L11       ] [ U11 U12 U13 ]
    [ A21 A22 A23 ] = [ L21  *    ] [      *   *  ]
    [ A31 A32 A33 ]   [ L31  *  * ] [          *  ]

• And you have the sub-block below left to work on:

    [ A22 A23 ]
    [ A32 A33 ]

• Notice that the LU decomposition of this sub-block is independent of the portion you have already completed.
Blocked Algorithms - LU Factorization
• For simplicity, change notation and write the remaining factorization in 2-by-2 block form, with P the row permutation from pivoting:

    P [ A11 A12 ]   [ L11     ] [ U11 U12 ]
      [ A21 A22 ] = [ L21 L22 ] [     U22 ]

• Now you can re-arrange the block equations by substituting:

    A11 = L11*U11    A12 = L11*U12
    A21 = L21*U11    A22 = L21*U12 + L22*U22

• Notice that, once you have done Gaussian elimination on A11 and A21, you have already obtained L11, U11, and L21.
• The remaining blocks then follow from:

    U12 <- (L11)^-1 * A12
    A~  <- A22 - L21*U12

• And repeat the procedure recursively on A~.
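
• In NumPy terms, one step of this recursion might look like the sketch below (block_step is an illustrative helper; SciPy's solve_triangular stands in for applying (L11)^-1):

    import numpy as np
    from scipy.linalg import solve_triangular

    def block_step(L11, L21, A12, A22):
        """One step of the blocked recursion: given the factors of the first
        block column (unit-lower L11 and L21), form U12 and the sub-block A~."""
        U12 = solve_triangular(L11, A12, lower=True, unit_diagonal=True)  # L11*U12 = A12
        A_tilde = A22 - L21 @ U12    # A~ <- A22 - L21*U12; recurse on A_tilde
        return U12, A_tilde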
Blocked Algorithms - LU Factorization
• A graphical view of what is going on is given by the following figure.
Blocked Algorithms - LU Factorization

(Figure: Left-Looking LU, Right-Looking LU, and Crout LU side by side; shading distinguishes the pre-computed sub-blocks from the sub-blocks currently being operated on.)

• Variations in the algorithm are due to the order in which the submatrix operations are performed. There are slight advantages to Crout's algorithm (a hybrid of the first two).
Review: BLAS 3 (Blocked) GEPP

for ib = 1 to n-1 step b    … process matrix b columns at a time
    end = ib + b-1          … point to end of block of b columns
    apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P' * L' * U'
    … let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
    A(ib:end , end+1:n) = LL^-1 * A(ib:end , end+1:n)    … update next b rows of U
    A(end+1:n , end+1:n) = A(end+1:n , end+1:n)
                           - A(end+1:n , ib:end) * A(ib:end , end+1:n)
                           … apply delayed updates with a single matrix
                             multiply with inner dimension b (the BLAS 3 step)
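
• A hedged NumPy sketch of this blocked loop, reusing the gepp panel factorization sketched earlier (an assumption of this example) with SciPy's solve_triangular playing the role of LL^-1:

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_gepp(A, b=64):
        """Blocked right-looking LU with partial pivoting; packed L/U plus piv."""
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        piv = np.arange(n)
        for ib in range(0, n - 1, b):
            end = min(ib + b, n)
            panel, p = gepp(A[ib:, ib:end])   # BLAS2 GEPP on the b-column panel
            A[ib:, ib:end] = panel
            A[ib:, :ib] = A[ib:, :ib][p]      # apply the panel's row swaps
            A[ib:, end:] = A[ib:, end:][p]    # ... across the whole matrix
            piv[ib:] = piv[ib:][p]
            if end < n:
                # Update the next b rows of U: solve LL * X = A(ib:end, end:n).
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:], lower=True)
                # Apply the delayed updates with one matrix multiply (BLAS3).
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
        return A, piv

• As the previous slides note, the block size b trades cache reuse in the panel against the size of the BLAS3 update.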
Parallel Algorithms for Dense Matrices
• All that follows is applicable to dense or full matrices only. Square matrices are discussed, but the arguments are valid for rectangular matrices as well.
• Typical parallelization steps:
  – Decomposition: identify parallel work and partitioning.
  – Mapping: decide which procs execute which portion of the work.
  – Assignment: load-balance the work among the procs.
  – Organization: communication and synchronization.
Parallel Algorithms for Dense Matrices
• The proc that owns a given portion of a matrix is responsible for doing all of the computation that involves that portion of the matrix.
• This is the sensible thing to do, since communication is minimized (although, due to data dependencies within the matrix, some communication will still be necessary).
• The question is: how should we subdivide a matrix so that parallel efficiency is maximized? There are various options.
Different Data Layouts for Parallel GE

(Figure: candidate layouts, annotated as follows.)
• Column blocked layout: bad load balance; P0 is idle after the first n/4 steps.
• Column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3.
• Column block cyclic layout: can trade load balance and BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck.
• Block skewed layout: complicated addressing.
• Row and column block cyclic layout: the winner!
Row and Column Block Cyclic Layout
• The matrix is composed of brow-by-bcol blocks.
• Procs are arranged in a 2D array indexed by (pi, pj), with 0 <= pi < Prow and 0 <= pj < Pcol.
• Element A(i,j) is mapped to proc (pi, pj) using the formulae:

    pi = floor(i / brow) mod Prow
    pj = floor(j / bcol) mod Pcol

• In the figure, p = 4, Prow = Pcol = brow = bcol = 2.
• There is Pcol-fold parallelism in any column, and calls to BLAS2 and BLAS3 operate on matrices of size brow-by-bcol.
• The serial bottleneck is eased.
• The layout need not be symmetric in rows and columns.
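
• The mapping is simple to code; a small Python rendering (names follow the slide):

    def block_cyclic_owner(i, j, brow, bcol, Prow, Pcol):
        """Proc (pi, pj) that owns element (i, j) in the 2D block cyclic layout."""
        return (i // brow) % Prow, (j // bcol) % Pcol

    # The figure's case: p = 4 procs, Prow = Pcol = brow = bcol = 2.
    print(block_cyclic_owner(3, 0, brow=2, bcol=2, Prow=2, Pcol=2))   # -> (1, 0)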
Row and Column Block Cyclic Layout
• In LU factorization, the distribution of work becomes uneven as the computation progresses.
• Larger block sizes result in greater load imbalance but reduce the frequency of communication between procs.
• The block size controls these tradeoffs.
• Some procs need to do more work between synchronization points than others (e.g., during partial pivoting over the rows of a single block column the other procs stay idle, and the computation of each block row of U requires the solution of a lower triangular system across the procs in a single row).
• The processor decomposition controls this type of tradeoff.
Parallel GE with a 2D Block Cyclic Layout
• The block size, b, in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol.
• Shaded regions in the (omitted) figure indicate busy processors or communication being performed.
• It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined.
• See Dongarra's book for more details.
(Slides 31-33: step-by-step animation of the blocked parallel GE; the trailing-matrix update is a matrix multiply of green = green - blue * pink.)
Parallel Matrix Transpose
• The transpose of a matrix A is defined by:

    A^T(i,j) = A(j,i),   0 <= i, j < n

• All elements below the diagonal move above the diagonal and vice versa.
• Assume it takes 1 unit of time to exchange a pair of matrix elements. The sequential time of transposing an n-by-n matrix is then:

    T_s = (n^2 - n) / 2

• Consider parallel architectures organized in both 2D mesh and hypercube structures.
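
• A direct rendering of that count: the strict lower triangle holds (n^2 - n)/2 entries, each exchanged once with its mirror image. A small Python check:

    import numpy as np

    def transpose_in_place(A):
        """Swap each strictly-lower entry with its mirror; return the swap count."""
        n = A.shape[0]
        swaps = 0
        for i in range(1, n):
            for j in range(i):
                A[i, j], A[j, i] = A[j, i], A[i, j]
                swaps += 1
        return swaps

    A = np.arange(16.0).reshape(4, 4)
    print(transpose_in_place(A), (4 * 4 - 4) // 2)   # both print 6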
Parallel Matrix Transpose - 2D Mesh

(Figure: a 4-by-4 grid of procs P0 through P15; the initial matrix on the left, the final, transposed matrix on the right.)

• Elements/blocks on the lower-left part of the matrix move up to the diagonal, and then right to their final location. Each step taken requires communication.
• Elements/blocks on the upper-right part of the matrix move down to the diagonal, and then left to their final location. Each step taken requires communication.
Parallel Matrix Transpose - 2D Mesh
• If each of the p procs contains a single number, then after all of these communication steps the matrix has been transposed.
• However, if each proc contains a sub-block of the matrix, then after all blocks have been communicated to their final locations, they still need to be locally transposed. Each sub-block contains (n/√p)-by-(n/√p) elements, and the cost of communication is higher than before.
• The cost of communication is dominated by the elements/blocks that reside in the top-right and bottom-left corners; they have to take approximately 2√p hops.
Parallel Matrix Transpose - 2D Mesh
• Each block contains n^2/p elements, so it takes at most

    2 (t_s + t_w n^2/p) √p

  for all blocks to move to their final destinations. After that, the local blocks need to be transposed, which takes approximately n^2/(2p) time.
• Thus, the total wall-clock time is

    T_P = n^2/(2p) + 2 t_s √p + 2 t_w n^2/√p

• Summing over all p processors, the total time consumed by the parallel algorithm is

    T_TOTAL = Θ(n^2 √p)

  which is higher than the sequential complexity Θ(n^2): this algorithm on a 2D mesh is not cost-optimal. The same is true regardless of whether store-and-forward or cut-through routing is used.
Parallel Matrix Transpose - Hypercube

(Figure: the 4-by-4 proc grid is split into quadrants that are exchanged across the diagonal; the subdivision then repeats within each quadrant.)

• This algorithm is called recursive subdivision and maps naturally onto a hypercube.
• After the blocks have all been transposed, the elements inside each block (local to a proc) still need to be transposed.
• Wall-clock time and total time:

    T_P = n^2/(2p) + 2 (t_s + t_w n^2/p) log p        T_TOTAL = Θ(n^2 log p)
Parallel Matrix Transpose - Hypercube
• With cut-through routing, the timing improves slightly to

    T_P = (2 t_s + t_w n^2/p + t_h p) log p   ⇒   T_TOTAL = Θ(n^2 log p)

  which is still suboptimal.
• Using striped partitioning (a.k.a. the column blocked layout) and cut-through routing on a hypercube, the total time becomes cost-optimal:

    T_P = (t_s + t_w n^2/p^2)(p - 1) + (1/2) t_h p log p   ⇒   T_TOTAL = Θ(n^2)

• Note that this type of partitioning may be cost-optimal for the transpose operation, but not necessarily for other matrix operations, such as LU factorization and matrix-matrix multiply.
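
• For intuition, the three wall-clock models above can be evaluated side by side; the machine parameters below are illustrative assumptions, not measurements:

    import math

    def mesh_sf(n, p, ts, tw):           # 2D mesh, store-and-forward
        return n*n/(2*p) + 2*ts*math.sqrt(p) + 2*tw*n*n/math.sqrt(p)

    def hypercube_sf(n, p, ts, tw):      # recursive subdivision on a hypercube
        return n*n/(2*p) + 2*(ts + tw*n*n/p)*math.log2(p)

    def striped_ct(n, p, ts, tw, th):    # striped partitioning with cut-through
        return (ts + tw*n*n/(p*p))*(p - 1) + 0.5*th*p*math.log2(p)

    n, p, ts, tw, th = 4096, 64, 50.0, 1.0, 0.1   # assumed parameter values
    print("mesh SF      :", mesh_sf(n, p, ts, tw))
    print("hypercube SF :", hypercube_sf(n, p, ts, tw))
    print("striped CT   :", striped_ct(n, p, ts, tw, th))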