TRANSCRIPT
CME342
Parallel Methods in Numerical Analysis
Matrix Computation: Iterative Methods II
Outline:
• CG & its parallelization.
• Sparse Matrix-vector Multiplication.
Basic iterative methods:
Ax = b
r = b−Ax (residual)
• Split the matrix: A = M − N.
• Solve M x_{k+1} = N x_k + b, i.e.
  x_{k+1} = x_k + M^{-1}(b − A x_k) = x_k + M^{-1} r_k.
• Write A = D + L + U (D = diagonal, L/U = strictly lower/upper parts).
• Jacobi: M = D, N = −(L + U).
• GS: M = D + L, N = −U.
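As an illustration (a minimal NumPy sketch, not from the original notes), the splitting iteration for the Jacobi choice M = D; the function name and arguments are hypothetical:

```python
# Minimal sketch of x_{k+1} = x_k + M^{-1} r_k with M = D (Jacobi).
# Dense A for clarity; every component update is independent, which is
# what makes Jacobi easy to parallelize.
import numpy as np

def jacobi(A, b, x0, tol=1e-8, max_iter=1000):
    x = x0.copy()
    d = np.diag(A)               # M = D, so M^{-1} r is an elementwise divide
    for k in range(max_iter):
        r = b - A @ x            # residual r_k = b - A x_k
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        x = x + r / d            # x_{k+1} = x_k + D^{-1} r_k
    return x, max_iter
```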
Summary: Jacobi vs GS
• Jacobi is parallel.
• GS is sequential because of the data dependency. This can be fixed
  by a coloring technique.
• Comparison:

                Jacobi   GS
  Parallelism   Good     Poor
  Convergence   Slow     Fast
Which one is faster on parallel machines?
Jacobi vs RB Gauss-Seidel
• RB GS converges twice as fast as Jacobi, but
requires twice as many parallel steps; about
the same run time in practice.
• Parallel efficiency alone is not sufficient to
determine overall performance.
• We also need fast converging algorithms.
Conjugate Gradient Method
• For A SPD, i.e.
  . symmetric: A = A^T,
  . positive definite: x^T A x > 0, ∀ x ≠ 0.
• Consider the quadratic function:
      Φ(x) = (1/2) x^T A x − x^T b.
  . Its unique minimum x satisfies:
      A x = b.
  . Hence, solving A x = b ⇔ min Φ(x).
• Given x^{k−1} and a search direction d^k, let
      x^k = x^{k−1} + α d^k,
  where α = step length.
CG (cont.)
[Figure: line search from x^{k−1} along the direction d^k to x^k.]
• Determine α by minimizing Φ along d^k:
      min_α Φ(x^{k−1} + α d^k).
• By differentiation, α_k = α_opt is given by
      α_k = (r^{k−1})^T d^k / ((d^k)^T A d^k),
  where r^{k−1} = b − A x^{k−1} = residual vector.
• The search direction d^k is chosen such that it is A-orthogonal
  (conjugate) to all previous directions, i.e.
      (d^k)^T A d^j = 0,  j = 1, . . . , k − 1.
Using this fact, it can be shown that:
      α_k = (r^{k−1})^T r^{k−1} / ((d^k)^T A d^k).
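To fill in the omitted steps (a short sketch, using the standard CG fact that r^{k−1} is orthogonal to the previous search direction d^{k−1}):

```latex
% Step 1: set the derivative of Phi along d^k to zero:
% 0 = d/d(alpha) Phi(x^{k-1} + alpha d^k)
%   = (d^k)^T ( A x^{k-1} + alpha A d^k - b )
%   = alpha (d^k)^T A d^k - (d^k)^T r^{k-1},  hence
\[
\alpha_k = \frac{(r^{k-1})^T d^k}{(d^k)^T A d^k}.
\]
% Step 2: with d^k = r^{k-1} + beta_{k-1} d^{k-1} and r^{k-1} \perp d^{k-1}:
\[
(r^{k-1})^T d^k
  = (r^{k-1})^T r^{k-1} + \beta_{k-1}\,(r^{k-1})^T d^{k-1}
  = (r^{k-1})^T r^{k-1}.
\]
```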
CG Algorithm
k = 0;  r^0 = b − A x^0;  ρ_0 = ‖r^0‖_2^2;
while ( √ρ_k > ε ‖r^0‖_2 ) do
    k = k + 1;
    ρ_{k−1} = (r^{k−1})^T r^{k−1};
    if (k = 1)
        d^1 = r^0;
    else
        β_{k−1} = ρ_{k−1} / ρ_{k−2};
        d^k = r^{k−1} + β_{k−1} d^{k−1};
    endif
    e^k = A d^k;
    α_k = ρ_{k−1} / ((d^k)^T e^k);
    x^k = x^{k−1} + α_k d^k;
    r^k = r^{k−1} − α_k e^k;
    ρ_k = (r^k)^T r^k;
end
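A direct NumPy transcription of the pseudocode above (a sketch, not the course's official code); A may be any object supporting the product A @ v, such as a dense array or a scipy sparse matrix:

```python
import numpy as np

def cg(A, b, x0, eps=1e-8, max_iter=1000):
    x = x0.copy()
    r = b - A @ x                   # r^0 = b - A x^0
    rho = r @ r                     # rho_0 = ||r^0||_2^2
    rho0 = rho
    k = 0
    while np.sqrt(rho) > eps * np.sqrt(rho0) and k < max_iter:
        k += 1
        if k == 1:
            d = r.copy()            # d^1 = r^0
        else:
            beta = rho / rho_prev   # beta_{k-1} = rho_{k-1}/rho_{k-2}
            d = r + beta * d        # d^k = r^{k-1} + beta_{k-1} d^{k-1}
        e = A @ d                   # e^k = A d^k (the only matvec)
        alpha = rho / (d @ e)       # alpha_k = rho_{k-1} / (d^k)^T e^k
        x = x + alpha * d
        r = r - alpha * e           # r^k = r^{k-1} - alpha_k e^k
        rho_prev = rho
        rho = r @ r                 # rho_k
    return x, k
```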
Properties of CG
• No need to form A. Only Av is needed.
• Minimal storage:
. Short recurrence for the update of x^k.
. Need only store: x^k, r^k, d^k, A d^k.
• Minimal error:
. Define K_k(A; r^0), the Krylov subspace of dimension k:
      K_k(A; r^0) ≡ span{r^0, A r^0, . . . , A^{k−1} r^0}.
  It can be proved that:
      K_k(A; r^0) = span{r^0, r^1, . . . , r^{k−1}} = span{d^1, d^2, . . . , d^k}.
. x^k ∈ x^0 + K_k(A; r^0) minimizes the error e^k ≡ x^k − x in the A-norm:
      ‖x^k − x‖_A ≤ ‖x̃ − x‖_A  ∀ x̃ ∈ x^0 + K_k(A; r^0),
  where ‖w‖_A = √(w^T A w).
Properties of CG (cont.)
• In exact arithmetic, CG reaches the exact solution in at most n
  steps (finite termination).
• In practice, an accurate approximation is obtained long before n
  steps, so CG is usually used as an iterative method.
• Convergence of CG:
. Upper bound:
      ‖x − x^k‖_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^k ‖x − x^0‖_A,
  where κ is the condition number of A:
      κ ≡ ‖A‖_2 ‖A^{−1}‖_2 = λ_max(A) / λ_min(A).
. In practice, the rate of convergence is usually much faster than
  the upper bound, especially when the eigenvalues of A are located
  in clusters.
• 2D mesh with N unknowns, Poisson eqn: CG converges in O(√N)
  steps, the same as optimal SOR.
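An illustrative snippet (assumed setup, not from the notes) showing how the bound degrades as κ grows, for the 1D Poisson matrix:

```python
# Evaluate the convergence factor (sqrt(kappa)-1)/(sqrt(kappa)+1) for
# the 1D Poisson matrix; kappa grows like O(n^2), so the factor
# approaches 1 and worst-case convergence slows as n grows.
import numpy as np

n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson, SPD
kappa = np.linalg.cond(A)            # = lambda_max / lambda_min for SPD A
factor = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
print(f"kappa = {kappa:.1f}, bound factor = {factor:.4f}")
```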
Parallel CG
p = # of processors.
• DAXPY: scalar × vector + vector (BLAS I)
. 3 daxpy operations: d^k, x^k, r^k.
. O(n/p) flops.
. No communication.
• Inner products (BLAS I)
. 2 inner products: (r^{k−1})^T r^{k−1}, (e^k)^T d^k.
. O(n/p) flops.
. Communication for the reduction process (collecting partial sums
  from all processors).
• Matrix-vector product (Essentially BLAS I)
. Major cost of CG.
. Efficiency depends on sparse structure of A.
• Minor note: {x^k} are not needed in the CG process. We may delay
  the updates of x^k, . . . , x^{k+j} by storing d^k, . . . , d^{k+j}.
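A sketch of the two communication patterns (hypothetical row-block distribution, using mpi4py): daxpy is purely local, while each inner product costs one global reduction:

```python
# Each rank owns n/p entries of every vector. DAXPY touches only local
# data; the inner product needs one all-reduce (the synchronization point).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def daxpy(alpha, x_local, y_local):
    return alpha * x_local + y_local          # no communication

def dist_dot(u_local, v_local):
    local = float(u_local @ v_local)          # local partial sum
    return comm.allreduce(local, op=MPI.SUM)  # reduction across all ranks
```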
Overlapping Communication
• Want to overlap communication time with
useful computations.
• In standard CG, there are 2 synchronization
points (inner products).
• Possible to reduce to 1 synch. point by rearranging terms.
• A version is given on the next slide.
• 1 additional daxpy operation.
• 2 inner products can be computed in parallel
→ 1 synch. point.
• Main disadvantage is the possible numerical
instability.
Variant CG Algorithm
r^0 = b − A x^0;
q^{−1} = p^{−1} = 0;  β_{−1} = 0;
s^0 = A r^0;
ρ_0 = (r^0)^T r^0;  µ_0 = (s^0)^T r^0;  α_0 = ρ_0 / µ_0;
for k = 0, 1, . . .
    p^k = r^k + β_{k−1} p^{k−1};
    q^k = s^k + β_{k−1} q^{k−1};
    x^{k+1} = x^k + α_k p^k;
    r^{k+1} = r^k − α_k q^k;
    check convergence; continue if necessary
    s^{k+1} = A r^{k+1};
    ρ_{k+1} = (r^{k+1})^T r^{k+1};
    µ_{k+1} = (s^{k+1})^T r^{k+1};
    β_k = ρ_{k+1} / ρ_k;
    α_{k+1} = ρ_{k+1} / (µ_{k+1} − ρ_{k+1} β_k / α_k);
end
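A NumPy transcription of this variant (a sketch; in a real parallel code the two inner products ρ and µ would share a single reduction):

```python
import numpy as np

def cg_one_sync(A, b, x0, tol=1e-8, max_iter=1000):
    x = x0.copy()
    r = b - A @ x
    p = np.zeros_like(r); q = np.zeros_like(r); beta = 0.0
    s = A @ r
    rho = r @ r
    mu = s @ r
    alpha = rho / mu
    for k in range(max_iter):
        p = r + beta * p
        q = s + beta * q                 # the extra daxpy vs. standard CG
        x = x + alpha * p
        r = r - alpha * q
        if np.sqrt(r @ r) <= tol:        # convergence check
            break
        s = A @ r
        rho_new = r @ r                  # these two inner products ...
        mu = s @ r                       # ... can share one reduction
        beta = rho_new / rho
        alpha = rho_new / (mu - rho_new * beta / alpha)
        rho = rho_new
    return x, k
```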
CG for Nonsymmetric Matrices
• Ax = b, where A is not SPD.
• Normal equations: A^T A x = A^T b (minimal residual).
• Or set x = A^T y:
  solve A A^T y = b (minimal error),
  then get x from y.
• When A is square and nonsingular, the condition number of A^T A is
  the square of the condition number of A: slower convergence.
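A sketch of solving the normal equations with CG without ever forming A^T A, using scipy's LinearOperator (the solver names are scipy's; the wrapper is hypothetical):

```python
# CG on A^T A x = A^T b; only products with A and A^T are needed,
# so A^T A is never formed explicitly.
import numpy as np
import scipy.sparse.linalg as spla

def solve_normal_equations(A, b):
    n = A.shape[1]
    AtA = spla.LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v))
    x, info = spla.cg(AtA, A.T @ b)   # SPD when A has full column rank
    return x, info
```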
Krylov Subspace Methods
• A can be any nonsymmetric matrix.
• Let V_k = [v^1, . . . , v^k] be a basis for K_k(A; r^0). Look for an
  approximate solution x^k ∈ K_k(A; r^0), i.e.,
      x^k = V_k y,
  for some y ∈ R^k.
• Four projection-type approaches to pick x^k:
  . Ritz-Galerkin (CG, Lanczos, FOM, GENCG):
      b − A x^k ⊥ K_k(A; r^0).
  . Minimum Residual (GMRES, MINRES):
      min_{x^k ∈ K_k} ‖b − A x^k‖_2.
  . Petrov-Galerkin (BiCG, BiCGSTAB, QMR):
      b − A x^k ⊥ S_k, e.g. S_k = K_k(A^T; s^0).
  . Minimum Error (SYMMLQ, GMERR):
      min_{x^k ∈ K_k} ‖x − x^k‖_2.
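For orientation (an illustration, not part of the notes): several of these methods are available in scipy.sparse.linalg, e.g. on a random diagonally dominant test system:

```python
# The solver names below are scipy.sparse.linalg's; the test matrix is
# an arbitrary nonsymmetric example made diagonally dominant so all
# methods converge.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.random(n, n, density=0.05, format="csr") + 10 * sp.eye(n)
b = np.ones(n)

x1, _ = spla.gmres(A, b)      # Minimum Residual family
x2, _ = spla.bicgstab(A, b)   # Petrov-Galerkin family
x3, _ = spla.qmr(A, b)        # Petrov-Galerkin family
```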
Sparse Matrix Computations
• Computations are carried out only on nonzero entries. Thus, only
  the nonzero entries are stored.
Storage formats:
• Coordinate format: A = {I, J, VAL};
  I, J, VAL are nnz × 1 arrays; nnz = # of nonzeros.
• Compressed sparse row (CSR) format: A = {I, J, VAL};
  VAL = nnz × 1 array of nonzero entries;
  J = nnz × 1 array of column indices;
  I = n × 1 array, the ith entry of which points to the 1st entry of
  the ith row in VAL & J (often stored with an extra (n+1)st entry
  pointing just past the last nonzero).
• Ellpack-Itpack format: A = {VAL, J};
  m = max # of nonzeros in any row;
  VAL = n × m array of nonzero entries;
  J = n × m array of column indices.
Example: storage formats
A = [ 1 0 0 2 ]
    [ 3 4 0 0 ]
    [ 0 5 6 0 ]
    [ 7 0 8 9 ]
• Coordinate format:
  I   = [1 1 2 2 3 3 4 4 4]
  J   = [1 4 1 2 2 3 1 3 4]
  VAL = [1 2 3 4 5 6 7 8 9]
• CSR format:
  VAL = [1 2 3 4 5 6 7 8 9]
  J   = [1 4 1 2 2 3 1 3 4]
  I   = [1 3 5 7]   (with the extra row pointer: [1 3 5 7 10])
• Ellpack-Itpack format (here m = 3; − marks padding, and padded
  column indices are set to −1):

  VAL = [ 1 2 − ]      J = [ 1 4 −1 ]
        [ 3 4 − ]          [ 1 2 −1 ]
        [ 5 6 − ]          [ 2 3 −1 ]
        [ 7 8 9 ]          [ 1 3  4 ]
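A sketch of the CSR matvec kernel for this example (0-based indices here, with the extra final row pointer; the names are illustrative):

```python
import numpy as np

VAL = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
J   = np.array([0, 3, 0, 1, 1, 2, 0, 2, 3])   # 0-based column indices
I   = np.array([0, 2, 4, 6, 9])               # row pointers (n+1 entries)

def csr_matvec(I, J, VAL, v):
    n = len(I) - 1
    y = np.zeros(n)
    for i in range(n):                 # each row is independent -> parallel
        for idx in range(I[i], I[i + 1]):
            y[i] += VAL[idx] * v[J[idx]]
    return y

v = np.ones(4)
print(csr_matvec(I, J, VAL, v))        # [ 3.  7. 11. 24.]
```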
Sparse Matrix-Vector Multiplication
• For iterative methods such as Jacobi, Gauss-Seidel, CG, etc., the
  major computational cost per iteration is Av.
• Sparse matrices usually have a special structure which should be
  exploited for maximum efficiency.
• Consider 3 cases:
. Matrices whose nonzero entries lie on a few
diagonals.
. Unstructured matrices; the locations of the
nonzero entries can be anywhere.
. Banded matrices; the nonzero elements are
confined within a band.
• Key Idea: data mapping.
Pentadiagonal Matrices
• Decompose the matrix into 3 parts:
[Figure: A decomposed into a sum of three parts, labeled I, II, and III.]
• Partition A by rows: P_i contains rows (i−1)(n/p)+1 to i(n/p), and
  vector elements (i−1)(n/p)+1 to i(n/p).
• I requires no communication.
• II requires comm between neighboring processors, i.e., P_i
  communicates with P_{i−1} for vector element (i−1)(n/p) and with
  P_{i+1} for element i(n/p)+1.
Timing:
    T_comm = 2(t_s + t_w),
where t_s = startup time & t_w = time per word.
Assume a hypercube architecture.
Pentadiagonal Matrices (cont.)
[Figure: with m = √n, processor P_i exchanges data with P_{i−p/m} and
P_{i+p/m} for part III.]
• III, p ≤ √n:
  Comm between neighboring processors, exchanging √n elements.
  Timing:
      T_comm = 2(t_s + t_w √n)
  (including T_comm for II).
• III, p > √n:
  Need comm with P_{i±p/√n} for n/p elements.
  Timing:
      T_comm = 2(t_s + t_w n/p + t_h log p),
  where t_h = time per hop between processors.
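A sequential sketch of the diagonal-by-diagonal kernel (offsets 0, ±1, ±m for the 2D mesh; in the parallel version the ±1 terms need neighbor values (II) and the ±m terms need values from ranks p/m away (III)):

```python
import numpy as np

def penta_matvec(diags, offsets, v):
    """diags[k][i] = A[i, i+offsets[k]]; e.g. offsets = [-m, -1, 0, 1, m]."""
    n = len(v)
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        if off >= 0:
            y[: n - off] += d[: n - off] * v[off:]    # rows 0 .. n-off-1
        else:
            y[-off:] += d[-off:] * v[: n + off]       # rows -off .. n-1
    return y
```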
Checkerboard (domain) Partitioning
[Figure: 3 × 3 checkerboard partitioning of the √n × √n mesh among
processors P1–P9; each processor owns a √(n/p) × √(n/p) block.]
• Each processor stores √(n/p) clusters of √(n/p) matrix rows, i.e.
  n/p elements.
• Need comm only along boundary data, of length 4√(n/p).
Timing:
    T_comm = 4(t_s + t_w √(n/p)).
Unstructured Sparse Matrices
• Partition by rows.
• Require the entire vector in general. Hence
all-to-all broadcast.
• Perform local matrix-vector multiplication.
• Parallel run time (hypercube):
      T_p = t_c m n/p + t_s log p + t_w n = O(n),
  where t_c = computation time per data element,
  m = max number of nonzeros per row.
  (Sequential run time = O(n).)
• Hence, no gain in parallel.
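A sketch of this row-partitioned scheme in mpi4py (hypothetical layout: each rank owns a CSR block of n/p rows); the allgather is the all-to-all broadcast whose O(t_w n) cost dominates:

```python
# All-to-all broadcast of the vector, then a purely local sparse matvec.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def matvec_rowblock(A_local, v_local):
    # Every rank ends up with the full vector: the t_w * n term.
    v_full = np.concatenate(comm.allgather(v_local))
    # Local multiply over n/p rows: O(m n / p) work.
    return A_local @ v_full
```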
Checkerboard Matrix Partitioning
• Hypercube:
      T_comm = t_s log p + (3/2) t_w n log p / √p.
• Average computation per processor = m n/p,
  where m = average number of nonzeros per row.
• Parallel run time, speedup, efficiency:
      T_p = t_c m n/p + t_s log p + (3/2) t_w n log p / √p
      S = m t_c p n / ( m t_c n + t_s p log p + (3/2) t_w n √p log p )
      E = T_1 / (p T_p)
        = m t_c / ( m t_c + (t_s p log p)/n + (3/2) t_w √p log p )
• E does not tend to 1 as n → ∞ ⇒ unscalable.
• T_p = O(log p) + O(n log p / √p):
  better than the sequential run time for large p.
Planar Graph
• Suppose A is a sparse matrix with a symmetric structure, i.e.
      a_{i,j} ≠ 0 iff a_{j,i} ≠ 0.
• Let G(A) be the graph of A. A scalable parallel implementation of
  matrix-vector multiply exists if G(A) is planar.
• If G(A) is planar, communication occurs only at the boundaries of
  each processor's partition.
• Hence, by reducing the number of partitions, or equivalently,
  increasing the size of the partitions, it is possible to increase
  the computation-to-comm. ratio of the processors.
• Partitioning a graph to minimize interprocessor comm. is NP-hard.
Banded Unstructured Sparse Matrices
• Suppose bandwidth = w, Ellpack-Itpack format, partition the matrix
  by rows, average # of nonzeros = m.
• Assume n/p ≪ w:
[Figure: band of width w; P_i communicates with processors
P_{i−wp/(2n)} through P_{i+wp/(2n)}.]
• Need vector elements with indices from i−(w−1)/2 to i+(w−1)/2
  ⇒ comm. with (w−1)p/n processors, i.e. P_i communicates with
  P_{i−wp/(2n)}, . . . , P_{i+wp/(2n)}.