TRANSCRIPT
CME342
Parallel Methods in Numerical Analysis
Matrix Computation: Iterative Methods II
Outline:
• CG & its parallelization.
• Sparse Matrix-vector Multiplication.
Basic iterative methods:
Ax = b
r = b−Ax (residual)
• Split the matrix: A = M − N.
• Solve M x_{k+1} = N x_k + b, i.e.
  x_{k+1} = x_k + M^{-1}(b − A x_k) = x_k + M^{-1} r_k.
• Write A = D + L + U (D = diagonal, L/U = strictly lower/upper parts).
• Jacobi: M = D, N = −(L + U).
• GS: M = D + L, N = −U.
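As an illustration (a minimal NumPy sketch, not from the original notes), the splitting iteration for the Jacobi choice M = D; the function name and arguments are hypothetical:

```python
# Minimal sketch of x_{k+1} = x_k + M^{-1} r_k with M = D (Jacobi).
# Dense A for clarity; every component update is independent, which is
# what makes Jacobi easy to parallelize.
import numpy as np

def jacobi(A, b, x0, tol=1e-8, max_iter=1000):
    x = x0.copy()
    d = np.diag(A)               # M = D, so M^{-1} r is an elementwise divide
    for k in range(max_iter):
        r = b - A @ x            # residual r_k = b - A x_k
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        x = x + r / d            # x_{k+1} = x_k + D^{-1} r_k
    return x, max_iter
```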
Summary: Jacobi vs GS
• Jacobi is parallel.
• GS is sequential because of the data dependency. This can be fixed
  by a coloring technique.
• Comparison:

                Jacobi   GS
  Parallelism   Good     Poor
  Convergence   Slow     Fast
Which one is faster on parallel machines?
Jacobi vs RB Gauss-Seidel
• RB GS converges twice as fast as Jacobi, but
requires twice as many parallel steps; about
the same run time in practice.
• Parallel efficiency alone is not sufficient to
determine overall performance.
• We also need fast converging algorithms.
Conjugate Gradient Method
• For A SPD, i.e.
  . symmetric: A = A^T,
  . positive definite: x^T A x > 0, ∀ x ≠ 0.
• Consider the quadratic function:
      Φ(x) = (1/2) x^T A x − x^T b.
  . Its unique minimum x satisfies:
      A x = b.
  . Hence, solving A x = b ⇔ min Φ(x).
• Given x^{k−1} and a search direction d^k, let
      x^k = x^{k−1} + α d^k,
  where α = step length.
CG (cont.)
[Figure: line search from x^{k−1} along the direction d^k to x^k.]
• Determine α by minimizing Φ along d^k:
      min_α Φ(x^{k−1} + α d^k).
• By differentiation, α_k = α_opt is given by
      α_k = (r^{k−1})^T d^k / ((d^k)^T A d^k),
  where r^{k−1} = b − A x^{k−1} = residual vector.
• The search direction d^k is chosen such that it is A-orthogonal
  (conjugate) to all previous directions, i.e.
      (d^k)^T A d^j = 0,  j = 1, . . . , k − 1.
Using this fact, it can be shown that:
      α_k = (r^{k−1})^T r^{k−1} / ((d^k)^T A d^k).
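To fill in the omitted steps (a short sketch, using the standard CG fact that r^{k−1} is orthogonal to the previous search direction d^{k−1}):

```latex
% Step 1: set the derivative of Phi along d^k to zero:
% 0 = d/d(alpha) Phi(x^{k-1} + alpha d^k)
%   = (d^k)^T ( A x^{k-1} + alpha A d^k - b )
%   = alpha (d^k)^T A d^k - (d^k)^T r^{k-1},  hence
\[
\alpha_k = \frac{(r^{k-1})^T d^k}{(d^k)^T A d^k}.
\]
% Step 2: with d^k = r^{k-1} + beta_{k-1} d^{k-1} and r^{k-1} \perp d^{k-1}:
\[
(r^{k-1})^T d^k
  = (r^{k-1})^T r^{k-1} + \beta_{k-1}\,(r^{k-1})^T d^{k-1}
  = (r^{k-1})^T r^{k-1}.
\]
```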
CG Algorithm
k = 0;  r^0 = b − A x^0;  ρ_0 = ‖r^0‖_2^2;
while ( √ρ_k > ε ‖r^0‖_2 ) do
    k = k + 1;
    ρ_{k−1} = (r^{k−1})^T r^{k−1};
    if (k = 1)
        d^1 = r^0;
    else
        β_{k−1} = ρ_{k−1} / ρ_{k−2};
        d^k = r^{k−1} + β_{k−1} d^{k−1};
    endif
    e^k = A d^k;
    α_k = ρ_{k−1} / ((d^k)^T e^k);
    x^k = x^{k−1} + α_k d^k;
    r^k = r^{k−1} − α_k e^k;
    ρ_k = (r^k)^T r^k;
end
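A direct NumPy transcription of the pseudocode above (a sketch, not the course's official code); A may be any object supporting the product A @ v, such as a dense array or a scipy sparse matrix:

```python
import numpy as np

def cg(A, b, x0, eps=1e-8, max_iter=1000):
    x = x0.copy()
    r = b - A @ x                   # r^0 = b - A x^0
    rho = r @ r                     # rho_0 = ||r^0||_2^2
    rho0 = rho
    k = 0
    while np.sqrt(rho) > eps * np.sqrt(rho0) and k < max_iter:
        k += 1
        if k == 1:
            d = r.copy()            # d^1 = r^0
        else:
            beta = rho / rho_prev   # beta_{k-1} = rho_{k-1}/rho_{k-2}
            d = r + beta * d        # d^k = r^{k-1} + beta_{k-1} d^{k-1}
        e = A @ d                   # e^k = A d^k (the only matvec)
        alpha = rho / (d @ e)       # alpha_k = rho_{k-1} / (d^k)^T e^k
        x = x + alpha * d
        r = r - alpha * e           # r^k = r^{k-1} - alpha_k e^k
        rho_prev = rho
        rho = r @ r                 # rho_k
    return x, k
```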
Properties of CG
• No need to form A. Only Av is needed.
• Minimal storage:
. Short recurrence for the update of x^k.
. Need only store: x^k, r^k, d^k, A d^k.
• Minimal error:
. Define K_k(A; r^0), the Krylov subspace of dimension k:
      K_k(A; r^0) ≡ span{r^0, A r^0, . . . , A^{k−1} r^0}.
  It can be proved that:
      K_k(A; r^0) = span{r^0, r^1, . . . , r^{k−1}} = span{d^1, d^2, . . . , d^k}.
. x^k ∈ x^0 + K_k(A; r^0) minimizes the error e^k ≡ x^k − x in the A-norm:
      ‖x^k − x‖_A ≤ ‖x̃ − x‖_A  ∀ x̃ ∈ x^0 + K_k(A; r^0),
  where ‖w‖_A = √(w^T A w).
Properties of CG (cont.)
• In exact arithmetic, CG reaches the exact solution in at most n
  steps (finite termination).
• In practice, an accurate approximation is obtained long before n
  steps, so CG is usually used as an iterative method.
• Convergence of CG:
. Upper bound:
      ‖x − x^k‖_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^k ‖x − x^0‖_A,
  where κ is the condition number of A:
      κ ≡ ‖A‖_2 ‖A^{−1}‖_2 = λ_max(A) / λ_min(A).
. In practice, the rate of convergence is usually much faster than
  the upper bound, especially when the eigenvalues of A are located
  in clusters.
• 2D mesh with N unknowns, Poisson eqn: CG converges in O(√N)
  steps, the same as optimal SOR.
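An illustrative snippet (assumed setup, not from the notes) showing how the bound degrades as κ grows, for the 1D Poisson matrix:

```python
# Evaluate the convergence factor (sqrt(kappa)-1)/(sqrt(kappa)+1) for
# the 1D Poisson matrix; kappa grows like O(n^2), so the factor
# approaches 1 and worst-case convergence slows as n grows.
import numpy as np

n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson, SPD
kappa = np.linalg.cond(A)            # = lambda_max / lambda_min for SPD A
factor = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
print(f"kappa = {kappa:.1f}, bound factor = {factor:.4f}")
```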
Parallel CG
p = # of processors.
• DAXPY: scalar × vector + vector (BLAS I)
. 3 daxpy operations: d^k, x^k, r^k.
. O(n/p) flops.
. No communication.
• Inner products (BLAS I)
. 2 inner products: (r^{k−1})^T r^{k−1}, (e^k)^T d^k.
. O(n/p) flops.
. Communication for the reduction process (collecting partial sums
  from all processors).
• Matrix-vector product (Essentially BLAS I)
. Major cost of CG.
. Efficiency depends on sparse structure of A.
• Minor note: {x^k} are not needed in the CG process. We may delay
  the updates of x^k, . . . , x^{k+j} by storing d^k, . . . , d^{k+j}.
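A sketch of the two communication patterns (hypothetical row-block distribution, using mpi4py): daxpy is purely local, while each inner product costs one global reduction:

```python
# Each rank owns n/p entries of every vector. DAXPY touches only local
# data; the inner product needs one all-reduce (the synchronization point).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def daxpy(alpha, x_local, y_local):
    return alpha * x_local + y_local          # no communication

def dist_dot(u_local, v_local):
    local = float(u_local @ v_local)          # local partial sum
    return comm.allreduce(local, op=MPI.SUM)  # reduction across all ranks
```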
Overlapping Communication
• Want to overlap communication time with
useful computations.
• In standard CG, there are 2 synchronization
points (inner products).
• Possible to reduce to 1 synch. point by rearranging terms.
• A version is given on the next slide.
• 1 additional daxpy operation.
• 2 inner products can be computed in parallel
→ 1 synch. point.
• Main disadvantage is the possible numerical
instability.
Variant CG Algorithm
r^0 = b − A x^0;
q^{−1} = p^{−1} = 0;  β_{−1} = 0;
s^0 = A r^0;
ρ_0 = (r^0)^T r^0;  µ_0 = (s^0)^T r^0;  α_0 = ρ_0 / µ_0;
for k = 0, 1, . . .
    p^k = r^k + β_{k−1} p^{k−1};
    q^k = s^k + β_{k−1} q^{k−1};
    x^{k+1} = x^k + α_k p^k;
    r^{k+1} = r^k − α_k q^k;
    check convergence; continue if necessary
    s^{k+1} = A r^{k+1};
    ρ_{k+1} = (r^{k+1})^T r^{k+1};
    µ_{k+1} = (s^{k+1})^T r^{k+1};
    β_k = ρ_{k+1} / ρ_k;
    α_{k+1} = ρ_{k+1} / (µ_{k+1} − ρ_{k+1} β_k / α_k);
end
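A NumPy transcription of this variant (a sketch; in a real parallel code the two inner products ρ and µ would share a single reduction):

```python
import numpy as np

def cg_one_sync(A, b, x0, tol=1e-8, max_iter=1000):
    x = x0.copy()
    r = b - A @ x
    p = np.zeros_like(r); q = np.zeros_like(r); beta = 0.0
    s = A @ r
    rho = r @ r
    mu = s @ r
    alpha = rho / mu
    for k in range(max_iter):
        p = r + beta * p
        q = s + beta * q                 # the extra daxpy vs. standard CG
        x = x + alpha * p
        r = r - alpha * q
        if np.sqrt(r @ r) <= tol:        # convergence check
            break
        s = A @ r
        rho_new = r @ r                  # these two inner products ...
        mu = s @ r                       # ... can share one reduction
        beta = rho_new / rho
        alpha = rho_new / (mu - rho_new * beta / alpha)
        rho = rho_new
    return x, k
```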
CG for Nonsymmetric Matrices
• Ax = b, where A is not SPD.
• Normal equations: A^T A x = A^T b (minimal residual).
• Or set x = A^T y:
  solve A A^T y = b (minimal error),
  then get x from y.
• When A is square and nonsingular, the condition number of A^T A is
  the square of the condition number of A: slower convergence.
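A sketch of solving the normal equations with CG without ever forming A^T A, using scipy's LinearOperator (the solver names are scipy's; the wrapper is hypothetical):

```python
# CG on A^T A x = A^T b; only products with A and A^T are needed,
# so A^T A is never formed explicitly.
import numpy as np
import scipy.sparse.linalg as spla

def solve_normal_equations(A, b):
    n = A.shape[1]
    AtA = spla.LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v))
    x, info = spla.cg(AtA, A.T @ b)   # SPD when A has full column rank
    return x, info
```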
Krylov Subspace Methods
• A can be any nonsymmetric matrix.
• Let V_k = [v^1, . . . , v^k] be a basis for K_k(A; r^0). Look for an
  approximate solution x^k ∈ K_k(A; r^0), i.e.,
      x^k = V_k y,
  for some y ∈ R^k.
• Four projection-type approaches to pick x^k:
  . Ritz-Galerkin (CG, Lanczos, FOM, GENCG):
      b − A x^k ⊥ K_k(A; r^0).
  . Minimum Residual (GMRES, MINRES):
      min_{x^k ∈ K_k} ‖b − A x^k‖_2.
  . Petrov-Galerkin (BiCG, BiCGSTAB, QMR):
      b − A x^k ⊥ S_k, e.g. S_k = K_k(A^T; s^0).
  . Minimum Error (SYMMLQ, GMERR):
      min_{x^k ∈ K_k} ‖x − x^k‖_2.
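For orientation (an illustration, not part of the notes): several of these methods are available in scipy.sparse.linalg, e.g. on a random diagonally dominant test system:

```python
# The solver names below are scipy.sparse.linalg's; the test matrix is
# an arbitrary nonsymmetric example made diagonally dominant so all
# methods converge.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.random(n, n, density=0.05, format="csr") + 10 * sp.eye(n)
b = np.ones(n)

x1, _ = spla.gmres(A, b)      # Minimum Residual family
x2, _ = spla.bicgstab(A, b)   # Petrov-Galerkin family
x3, _ = spla.qmr(A, b)        # Petrov-Galerkin family
```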
Sparse Matrix Computations
• Computations are carried out only on nonzero entries. Thus, only
  the nonzero entries are stored.
Storage formats:
• Coordinate format: A = {I, J, VAL};
  I, J, VAL are nnz × 1 arrays; nnz = # of nonzeros.
• Compressed sparse row (CSR) format: A = {I, J, VAL};
  VAL = nnz × 1 array of nonzero entries;
  J = nnz × 1 array of column indices;
  I = n × 1 array, the ith entry of which points to the 1st entry of
  the ith row in VAL & J (often stored with an extra (n+1)st entry
  pointing just past the last nonzero).
• Ellpack-Itpack format: A = {VAL, J};
  m = max # of nonzeros in any row;
  VAL = n × m array of nonzero entries;
  J = n × m array of column indices.
Example: storage formats
A = [ 1 0 0 2 ]
    [ 3 4 0 0 ]
    [ 0 5 6 0 ]
    [ 7 0 8 9 ]
• Coordinate format:
  I   = [1 1 2 2 3 3 4 4 4]
  J   = [1 4 1 2 2 3 1 3 4]
  VAL = [1 2 3 4 5 6 7 8 9]
• CSR format:
  VAL = [1 2 3 4 5 6 7 8 9]
  J   = [1 4 1 2 2 3 1 3 4]
  I   = [1 3 5 7]   (with the extra row pointer: [1 3 5 7 10])
• Ellpack-Itpack format (here m = 3; − marks padding, and padded
  column indices are set to −1):

  VAL = [ 1 2 − ]      J = [ 1 4 −1 ]
        [ 3 4 − ]          [ 1 2 −1 ]
        [ 5 6 − ]          [ 2 3 −1 ]
        [ 7 8 9 ]          [ 1 3  4 ]
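A sketch of the CSR matvec kernel for this example (0-based indices here, with the extra final row pointer; the names are illustrative):

```python
import numpy as np

VAL = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
J   = np.array([0, 3, 0, 1, 1, 2, 0, 2, 3])   # 0-based column indices
I   = np.array([0, 2, 4, 6, 9])               # row pointers (n+1 entries)

def csr_matvec(I, J, VAL, v):
    n = len(I) - 1
    y = np.zeros(n)
    for i in range(n):                 # each row is independent -> parallel
        for idx in range(I[i], I[i + 1]):
            y[i] += VAL[idx] * v[J[idx]]
    return y

v = np.ones(4)
print(csr_matvec(I, J, VAL, v))        # [ 3.  7. 11. 24.]
```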
Sparse Matrix-Vector Multiplication
• For iterative methods such as Jacobi, Gauss-Seidel, CG, etc., the
  major computational cost per iteration is Av.
• Sparse matrices usually have a special structure which should be
  exploited for maximum efficiency.
• Consider 3 cases:
. Matrices whose nonzero entries lie on a few
diagonals.
. Unstructured matrices; the locations of the
nonzero entries can be anywhere.
. Banded matrices; the nonzero elements are
confined within a band.
• Key Idea: data mapping.
Pentadiagonal Matrices
• Decompose the matrix into 3 parts:
[Figure: A decomposed into a sum of three parts, labeled I, II, and III.]
• Partition A by rows: P_i contains rows (i−1)(n/p)+1 to i(n/p), and
  vector elements (i−1)(n/p)+1 to i(n/p).
• I requires no communication.
• II requires comm between neighboring processors, i.e., P_i
  communicates with P_{i−1} for vector element (i−1)(n/p) and with
  P_{i+1} for element i(n/p)+1.
Timing:
    T_comm = 2(t_s + t_w),
where t_s = startup time & t_w = time per word.
Assume a hypercube architecture.
Pentadiagonal Matrices (cont.)
[Figure: with m = √n, processor P_i exchanges data with P_{i−p/m} and
P_{i+p/m} for part III.]
• III, p ≤ √n:
  Comm between neighboring processors, exchanging √n elements.
  Timing:
      T_comm = 2(t_s + t_w √n)
  (including T_comm for II).
• III, p > √n:
  Need comm with P_{i±p/√n} for n/p elements.
  Timing:
      T_comm = 2(t_s + t_w n/p + t_h log p),
  where t_h = time per hop between processors.
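A sequential sketch of the diagonal-by-diagonal kernel (offsets 0, ±1, ±m for the 2D mesh; in the parallel version the ±1 terms need neighbor values (II) and the ±m terms need values from ranks p/m away (III)):

```python
import numpy as np

def penta_matvec(diags, offsets, v):
    """diags[k][i] = A[i, i+offsets[k]]; e.g. offsets = [-m, -1, 0, 1, m]."""
    n = len(v)
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        if off >= 0:
            y[: n - off] += d[: n - off] * v[off:]    # rows 0 .. n-off-1
        else:
            y[-off:] += d[-off:] * v[: n + off]       # rows -off .. n-1
    return y
```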
Checkerboard (domain) Partitioning
[Figure: 3 × 3 checkerboard partitioning of the √n × √n mesh among
processors P1–P9; each processor owns a √(n/p) × √(n/p) block.]
• Each processor stores √(n/p) clusters of √(n/p) matrix rows, i.e.
  n/p elements.
• Need comm only along boundary data, of length 4√(n/p).
Timing:
    T_comm = 4(t_s + t_w √(n/p)).
Unstructured Sparse Matrices
• Partition by rows.
• Require the entire vector in general. Hence
all-to-all broadcast.
• Perform local matrix-vector multiplication.
• Parallel run time (hypercube):
      T_p = t_c m n/p + t_s log p + t_w n = O(n),
  where t_c = computation time per data element,
  m = max number of nonzeros per row.
  (Sequential run time = O(n).)
• Hence, no gain in parallel.
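A sketch of this row-partitioned scheme in mpi4py (hypothetical layout: each rank owns a CSR block of n/p rows); the allgather is the all-to-all broadcast whose O(t_w n) cost dominates:

```python
# All-to-all broadcast of the vector, then a purely local sparse matvec.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def matvec_rowblock(A_local, v_local):
    # Every rank ends up with the full vector: the t_w * n term.
    v_full = np.concatenate(comm.allgather(v_local))
    # Local multiply over n/p rows: O(m n / p) work.
    return A_local @ v_full
```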
Checkerboard Matrix Partitioning
• Hypercube:
      T_comm = t_s log p + (3/2) t_w n log p / √p.
• Average computation per processor = m n/p,
  where m = average number of nonzeros per row.
• Parallel run time, speedup, efficiency:
      T_p = t_c m n/p + t_s log p + (3/2) t_w n log p / √p
      S = m t_c p n / ( m t_c n + t_s p log p + (3/2) t_w n √p log p )
      E = T_1 / (p T_p)
        = m t_c / ( m t_c + (t_s p log p)/n + (3/2) t_w √p log p )
• E does not tend to 1 as n → ∞ ⇒ unscalable.
• T_p = O(log p) + O(n log p / √p):
  better than the sequential run time for large p.
Planar Graph
• Suppose A is a sparse matrix with a symmetric structure, i.e.
      a_{i,j} ≠ 0 iff a_{j,i} ≠ 0.
• Let G(A) be the graph of A. A scalable parallel implementation of
  matrix-vector multiply exists if G(A) is planar.
• If G(A) is planar, communication occurs only at the boundaries of
  each processor's partition.
• Hence, by reducing the number of partitions, or equivalently,
  increasing the size of the partitions, it is possible to increase
  the computation-to-comm. ratio of the processors.
• Partitioning a graph to minimize interprocessor comm. is NP-hard.
Banded Unstructured Sparse Matrices
• Suppose bandwidth = w, Ellpack-Itpack format, partition the matrix
  by rows, average # of nonzeros = m.
• Assume n/p ≪ w:
[Figure: band of width w; P_i communicates with processors
P_{i−wp/(2n)} through P_{i+wp/(2n)}.]
• Need vector elements with indices from i−(w−1)/2 to i+(w−1)/2
  ⇒ comm. with (w−1)p/n processors, i.e. P_i communicates with
  P_{i−wp/(2n)}, . . . , P_{i+wp/(2n)}.