Projection-based iterative methods – Uppsala University

Projection-based iterative methods – p. 1/49


Briefly on projectors
Projectors and properties
Definitions: Consider C^n and a mapping P : C^n → C^n. P is called a projector if P^2 = P. If P is a projector, then I − P is also a projector:
(I − P)^2 = I − 2P + P^2 = I − P.
N(P) = {x ∈ C^n : Px = 0} (null space (kernel) of P), R(P) = {Px : x ∈ C^n} (range of P).
A subspace S is called invariant under a square matrix A whenever AS ⊆ S.
– p. 3/49
P1: N(P) ∩ R(P) = {0}. Indeed,
if x ∈ R(P), then x = Py for some y, and hence Px = P^2 y = Py = x, i.e., x = Px.
If, in addition, x ∈ N(P), then Px = 0, so x = Px = 0.
P2: N(P) = R(I − P).
If x ∈ N(P), then Px = 0 and x = Ix − Px = (I − P)x, so x ∈ R(I − P). Conversely, if y ∈ R(I − P), then y = (I − P)y (since I − P is a projector), hence Py = 0.
P3: C^n = R(P) ⊕ N(P).
P4: Given two subspaces K and L of the same dimension m, the following two conditions are mathematically equivalent:
(i) no nonzero vector in K is orthogonal to L;
(ii) for every x ∈ C^n there exists a unique vector y such that y ∈ K and x − y ⊥ L.
Proof (i)⇒(ii): (i) means K ∩ L⊥ = {0}, and since dim K + dim L⊥ = n, it follows that C^n = K ⊕ L⊥. Hence every x ∈ C^n can be written uniquely as x = y + z with y ∈ K and z ∈ L⊥. Thus z = x − y ⊥ L, which gives (ii).
– p. 4/49
P5: Orthogonal and oblique projectors
P is orthogonal if N(P) = R(P)⊥; otherwise P is oblique. Thus, if P is an orthogonal projector onto K, then Px ∈ K and (I − P)x ⊥ K, or equivalently, ((I − P)x, y) = 0 for all y ∈ K.
[Figure: a vector x and its orthogonal projection Px onto the subspace K]
P6: If P is orthogonal, then ‖P‖2 = 1.
P7: Any orthogonal projector has only two eigenvalues, 0 and 1. Any vector in R(P) is an eigenvector corresponding to λ = 1, and any vector in N(P) is an eigenvector corresponding to λ = 0.
– p. 6/49
Theorem 1 Let P be an orthogonal projector onto K. Then for any vector x ∈ C^n there holds
min_{y∈K} ‖x − y‖2 = ‖x − Px‖2.  (1)
Proof For any y ∈ K we have Px − y ∈ K (since Px ∈ K) and (I − P)x ⊥ K, so
‖x − y‖2^2 = ‖(x − Px) + (Px − y)‖2^2 = ‖x − Px‖2^2 + ‖Px − y‖2^2 + 2(x − Px, Px − y) = ‖x − Px‖2^2 + ‖Px − y‖2^2.
Therefore, ‖x − y‖2^2 ≥ ‖x − Px‖2^2 for all y ∈ K, and the minimum is attained for y = Px.
Corollary 1 Let the subspace K ⊂ C^n and x ∈ C^n be given. Then min_{y∈K} ‖x − y‖2 = ‖x − y∗‖2 is equivalent to y∗ ∈ K and x − y∗ ⊥ K.
– p. 7/49
Projection-based iterative solution methods
Arnoldi method
Lanczos method
Minimal residual method (MINRES)
· · ·
We want to solve b − Ax = 0, with b, x ∈ R^n, A ∈ R^{n×n}.
Instead, choose two subspaces L ⊂ R^n and K ⊂ R^n, and
∗ find x ∈ x(0) + K such that b − Ax ⊥ L
K – search space, L – subspace of constraints
∗ – basic projection step
There are two major classes of projection methods:
orthogonal – if K ≡ L,
oblique – otherwise (for instance, L = AK).
– p. 11/49
Given x(0), K and L,
∗ find x ∈ x(0) + K such that b − Ax ⊥ L
Notation: x = x(0) + δ (δ – correction), r(0) = b − Ax(0) (r(0) – residual)
Equivalently:
∗ find δ ∈ K such that r(0) − Aδ ⊥ L
– p. 12/49
Matrix formulation
Choose a basis in K and in L: V = {v1, v2, ..., vm} and W = {w1, w2, ..., wm}. Then x = x0 + δ = x0 + V y for some y ∈ R^m.
The orthogonality condition can be written as
(∗∗) W^T (r0 − A V y) = 0
W^T r0 = W^T A V y
y = (W^T A V)^{-1} W^T r0
x = x0 + V (W^T A V)^{-1} W^T r0
In practice, m < n, even m ≪ n, for instance m = 1. The matrix W^T A V will be small and, hopefully, with a nice structure.
!!! W^T A V should be invertible.
Given x(0); x = x(0)
Choose basis V in K and W in L
Compute r = b−Ax
y = (WTAV )−1WT r
x = x+ V y
Degrees of freedom: m,K,L, V,W . Clearly, if K ≡ L, then V = W .
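Below is a minimal NumPy sketch of this basic projection step (an illustration, not from the slides; the function name and the dense solve of the small m-by-m system are free choices). V and W are n-by-m arrays whose columns are the chosen bases of K and L.

import numpy as np

def basic_projection_step(A, b, x0, V, W):
    # Find x in x0 + span(V) such that b - A x is orthogonal to span(W).
    r = b - A @ x0                              # residual r = b - A x0
    y = np.linalg.solve(W.T @ A @ V, W.T @ r)   # y = (W^T A V)^{-1} W^T r
    return x0 + V @ y                           # x = x0 + V y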
To do next: (1) Consider two important cases: L = K and L = AK
(2) Make a special choice of K.
– p. 14/49
Property 1:
Theorem 2 Let A be square and L = AK. Then a vector x̃ is the result of an (oblique) projection onto K orthogonally to AK with starting vector x0 if and only if x̃ minimizes the 2-norm of the residual over x0 + K, i.e.,
‖b − Ax̃‖2 = min_{x ∈ x0+K} ‖b − Ax‖2.  (2)
Referred to as minimal residual methods: CR, GCG, GMRES, ORTHOMIN
– p. 15/49
Property 2:
Theorem 3 Let A be symmetric positive definite, i.e., it defines a scalar product (A·, ·) and a norm ‖·‖_A. Let L = K, i.e., b − Ax̃ ⊥ K. Then a vector x̃ is the orthogonal projection onto K with starting vector x0 if and only if it minimizes the A-norm of the error e = x∗ − x over x0 + K, i.e.,
‖x∗ − x̃‖_A = min_{x ∈ x0+K} ‖x∗ − x‖_A.  (3)
– p. 17/49
Example: m = 1
Consider two vectors d and e. Let K = span{d} and L = span{e}. Then x = x0 + αd (δ = αd) and the orthogonality condition reads:
r0 − Aδ ⊥ e ⇒ (r0 − Aδ, e) = 0 ⇒ α(Ad, e) = (r0, e) ⇒ α = (r0, e)/(Ad, e).
If d = e, we obtain the Steepest Descent method (minimization along a line). If we minimize over a plane, we obtain ORTHOMIN.
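A minimal NumPy sketch of one such m = 1 step with d = e = r, the current residual, i.e., a steepest descent step (assumes A is symmetric positive definite; the function name is illustrative):

import numpy as np

def steepest_descent_step(A, b, x):
    # One projection step with K = L = span{r}.
    r = b - A @ x                        # r0
    alpha = (r @ r) / (r @ (A @ r))      # alpha = (r0, e)/(Ad, e) with d = e = r0
    return x + alpha * r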
– p. 18/49
Choice of K:
Krylov subspace methods
K = Km(A, r(0)) = span{r(0), Ar(0), ..., A^{m−1} r(0)}
L = K = Km(A, r(0)) ⇒ CG (for symmetric positive definite A)
L = AK = AKm(A, r(0)) ⇒ GMRES
– p. 19/49
Dealt with in the ’Numerical Linear Algebra’ course.
– p. 20/49
The Conjugate Gradient method
– p. 21/49
Let L = K or L = AK: the missing part is how to construct a basis in K.
Recall: choose a basis in K and in L: V = {v1, v2, ..., vm} and W = {w1, w2, ..., wm}. Then x = x0 + δ = x0 + V y for some y ∈ R^m.
The orthogonality condition can be written as
(∗∗) W^T (r0 − A V y) = 0
W^T r0 = W^T A V y
y = (W^T A V)^{-1} W^T r0
– p. 22/49
Arnoldi’s method for general matrices
Consider Km(A, v) = span{v, Av, A^2 v, ..., A^{m−1} v}, generated by some matrix A and vector v.
1. Choose a vector v1 such that ‖v1‖2 = 1
2. For j = 1, 2, ..., m
3.   For i = 1, 2, ..., j
4.     hij = (Avj, vi)
5.   End
6.   wj = Avj − Σ_{i=1}^{j} hij vi
7.   hj+1,j = ‖wj‖2
8.   If hj+1,j = 0, stop
9.   vj+1 = wj / hj+1,j
10. End
The algorithm breaks down at step j, i.e., hj+1,j = 0, if and only if the minimal polynomial of v with respect to A is of degree j.
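A NumPy sketch of steps 1–10 above (classical Gram-Schmidt, exactly as written; in practice modified Gram-Schmidt or reorthogonalization is preferred):

import numpy as np

def arnoldi(A, v, m):
    # Build an orthonormal basis V of K_m(A, v) and the Hessenberg matrix H
    # with h_ij = (A v_j, v_i).
    n = len(v)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / np.linalg.norm(v)           # step 1
    for j in range(m):                        # steps 2-10
        w = A @ V[:, j]
        for i in range(j + 1):                # steps 3-5
            H[i, j] = w @ V[:, i]
        w = w - V[:, :j + 1] @ H[:j + 1, j]   # step 6
        H[j + 1, j] = np.linalg.norm(w)       # step 7
        if H[j + 1, j] == 0.0:                # step 8: breakdown
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]         # step 9
    return V, H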
– p. 23/49
Vm = {v1, v2, ..., vm} is an orthonormal basis of Km(A, v), and
A Vm = Vm Hm + wm e_m^T, where wm = hm+1,m vm+1.
Since vm+1 ⊥ {v1, v2, ..., vm}, it follows that (Vm)^T A Vm = Hm. Hm is an upper-Hessenberg matrix.
– p. 25/49
Arnoldi’s method for symmetric matrices
Let now A be a real symmetric matrix. Then the Arnoldi method reduces to the Lanczos method.
Recall: Hm = (Vm)^T A Vm.
If A is symmetric, then Hm must be symmetric too, i.e., Hm is tridiagonal:
Hm = tridiag{βi, γi, βi+1},
and the Lanczos vectors satisfy the three-term recurrence
βi+1 vi+1 = A vi − γi vi − βi vi−1.
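A NumPy sketch of the Lanczos recurrence (illustrative; no reorthogonalization, so in floating point the basis slowly loses orthogonality):

import numpy as np

def lanczos(A, v, m):
    # Returns the orthonormal basis V and the entries gamma (diagonal) and
    # beta (off-diagonal) of the symmetric tridiagonal matrix H_m.
    n = len(v)
    V = np.zeros((n, m + 1))
    gamma = np.zeros(m)
    beta = np.zeros(m + 1)
    V[:, 0] = v / np.linalg.norm(v)
    for i in range(m):
        w = A @ V[:, i] - (beta[i] * V[:, i - 1] if i > 0 else 0.0)
        gamma[i] = w @ V[:, i]
        w = w - gamma[i] * V[:, i]
        beta[i + 1] = np.linalg.norm(w)
        if beta[i + 1] == 0.0:
            break
        V[:, i + 1] = w / beta[i + 1]
    return V, gamma, beta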
The CG algorithm:
Initialize: r(0) = Ax(0) − b, g(0) = −r(0)
For k = 0, 1, ..., until convergence
  τk = (r(k), r(k)) / (Ag(k), g(k))
  x(k+1) = x(k) + τk g(k)
  r(k+1) = r(k) + τk Ag(k)
  βk = (r(k+1), r(k+1)) / (r(k), r(k))
  g(k+1) = −r(k+1) + βk g(k)
end
– p. 27/49
CG: computer implementation
x = x0
r = A*x-b
delta0 = (r,r)
g = -r
Repeat:
  h = A*g
  tau = delta0/(g,h)
  x = x + tau*g
  r = r + tau*h
  delta1 = (r,r)
  if delta1 <= eps, stop
  beta = delta1/delta0; delta0 = delta1
  g = -r + beta*g
Computational complexity (arithmetic cost per iteration): O(n)
- one matrix-vector multiply - three vector updates - two scalar products
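A directly runnable NumPy version of the pseudocode above (a sketch; the stopping tolerance eps and the iteration cap are free choices):

import numpy as np

def cg(A, b, x0, eps=1e-10, max_iter=1000):
    x = x0.copy()
    r = A @ x - b
    delta0 = r @ r
    g = -r
    for _ in range(max_iter):
        h = A @ g                    # one matrix-vector multiply
        tau = delta0 / (g @ h)
        x = x + tau * g              # three vector updates ...
        r = r + tau * h
        delta1 = r @ r               # ... and two scalar products per iteration
        if delta1 <= eps:
            break
        beta = delta1 / delta0
        delta0 = delta1
        g = -r + beta * g
    return x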
– p. 28/49
Properties of the CG method, convergence
Finite termination property: there are no breakdowns of the CG algorithm.
Reasoning: if g(k) = 0, then τk is not defined. The vectors g(k) are computed from the formula g(k) = −r(k) + βk g(k−1), and since (r(k), g(k−1)) = 0 we have (g(k), r(k)) = −(r(k), r(k)). Hence g(k) = 0 implies r(k) = 0, i.e., the solution is already found. As soon as x(k) ≠ xexact, then r(k) ≠ 0 and therefore g(k) ≠ 0. However, we can generate at most n mutually orthogonal vectors in R^n; thus CG has the finite termination property.
Convergence of the CG method
Theorem: In exact arithmetic, CG has the property that xexact = x(m) for some m ≤ n, where n is the order of A.
– p. 29/49
Rate of convergence of the CG method
Theorem: Let A be symmetric and positive definite. Suppose that for some set S containing all eigenvalues of A, for some polynomial P(λ) ∈ Π1_k (polynomials of degree k with P(0) = 1) and some constant M there holds
max_{λ∈S} |P(λ)| ≤ M.
Then
‖e_k‖_A ≤ M ‖e_0‖_A.
In particular, taking S = [λmin, λmax] and shifted and scaled Chebyshev polynomials gives
‖e_k‖_A ≤ 2 ((√κ − 1)/(√κ + 1))^k ‖e_0‖_A,  where κ = κ(A) = λmax/λmin,
and therefore ‖e_k‖_A ≤ ε ‖e_0‖_A whenever
k > (1/2) √κ ln(2/ε).
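As an illustrative calculation (not from the slides): for κ(A) = 10^4 and ε = 10^{-6}, the estimate gives k > (1/2) · 100 · ln(2 · 10^6) ≈ 50 · 14.5 ≈ 726 iterations, independently of the problem size n.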
– p. 31/49
Basic GMRES
Choose v(1) to be the normalized residual: r(0) = b − Ax(0), β = ‖r(0)‖2, v(1) = r(0)/β. Any vector x ∈ x(0) + K is of the form x = x(0) + Vm y. Then, using A Vm = Vm+1 Hm with the (m+1) × m Hessenberg matrix Hm from Arnoldi,
b − Ax = b − A(x(0) + Vm y)
       = r(0) − A Vm y
       = β v(1) − Vm+1 Hm y
       = Vm+1 (β e1 − Hm y),
and, since the columns of Vm+1 are orthonormal,
‖b − Ax‖2 = ‖β e1 − Hm y‖2.
– p. 32/49
Basic GMRES
1. Compute r(0) = b − Ax(0), β = ‖r(0)‖2 and v(1) = r(0)/β
2. For k = 1, 2, ..., m
3.   Compute w(k) = Av(k)
4.   For i = 1, 2, ..., k
5.     hik = (w(k), v(i))
6.     w(k) = w(k) − hik v(i)
7.   End
8.   hk+1,k = ‖w(k)‖2; if hk+1,k = 0, set m = k, go to 11
9.   v(k+1) = w(k)/hk+1,k
10. End
11. Define the (m+1) × m Hessenberg matrix Hm = {hik}, 1 ≤ i ≤ m+1, 1 ≤ k ≤ m
12. Compute y(m) as the minimizer of ‖β e1 − Hm y‖2 and x(m) = x(0) + Vm y(m)
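A compact NumPy sketch of steps 1–12 (the least-squares problem in step 12 is solved with numpy.linalg.lstsq for simplicity instead of Givens rotations, and restarting is omitted):

import numpy as np

def gmres_basic(A, b, x0, m):
    n = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)                 # step 1
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    for k in range(m):
        w = A @ V[:, k]                       # step 3
        for i in range(k + 1):                # steps 4-7 (modified Gram-Schmidt)
            H[i, k] = w @ V[:, i]
            w = w - H[i, k] * V[:, i]
        H[k + 1, k] = np.linalg.norm(w)       # step 8
        if H[k + 1, k] == 0.0:
            m = k + 1
            break
        V[:, k + 1] = w / H[k + 1, k]         # step 9
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:m + 1, :m], e1, rcond=None)   # step 12
    return x0 + V[:, :m] @ y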
– p. 33/49
No breakdown of GMRES
As m increases, storage and work per iteration increase fast. Remedies:
restart GMRES (keep m constant);
truncate the orthogonalization process.
The norm of the residual in the GMRES method is monotonically decreasing. However, the convergence may stagnate. The rate of convergence of GMRES cannot be determined as easily as that of CG.
The convergence history depends on the initial guess.
– p. 34/49
GMRES: convergence
Theorem: Let A be diagonalizable, A = X^{-1} Λ X, where Λ = diag{λ1, ..., λn} contains the eigenvalues of A. Define
ε(m) = min_{p ∈ Π1_m} max_{i=1,...,n} |p(λi)|.
Then the residual norm at the m-th step of GMRES satisfies
‖r(m)‖2 ≤ κ2(X) ε(m) ‖r(0)‖2,
where κ2(X) = ‖X‖2 ‖X^{-1}‖2.
– p. 35/49
On what hardware platform?
What can we parallelize?
Can we modify the method to avoid bottlenecks?
What data distribution shall we utilize? How much knowledge about the problem shall we utilize in the parallel implementation?
If we have managed to parallelise CG, did we do the job?
– p. 36/49
How to parallelize the CG method?
Recall:
x = x0
r = A*x-b
delta0 = (r,r)
g = -r
Repeat:
  h = A*g                            matrix-vector
  tau = delta0/(g,h)                 scalar product
  x = x + tau*g                      vector update
  r = r + tau*h                      vector update
  delta1 = (r,r)                     scalar product
  if delta1 <= eps, stop
  beta = delta1/delta0, delta0 = delta1
  g = -r + beta*g                    vector update
What approaches have been taken in some of the software libraries?
– p. 37/49
’Matrix-given strategy’
PETSc – Portable, Extensible Toolkit for Scientific Computation. Latest version: 3.3 (with CUDA support)
Slides borrowed from: Ambra Giovannini, SuperComputing Applications and Innovation Department, www.corsi.cineca.it/courses/scuolaEstiva2/petscModule/petsc_2011.pdf
– p. 38/49
PETSc – Portable, Extensible Toolkit for Scientific Computation
PETSc is a suite of data structures and routines for the scalable (parallel) solution of scientific applications mainly modelled by partial differential equations.
ANL – Argonne National Laboratory; begun September 1991
Uses the MPI standard for all message-passing communication
C, Fortran, and C++
Consists of a variety of libraries; each library manipulates a particular family of objects and the operations one would like to perform on the objects
PETSc has been used for modelling in all of these areas: Acoustics, Aerodynamics, Air Pollution, Arterial Flow, Brain Surgery, Cancer Surgery and Treatment, Cardiology, Combustion, Corrosion, Earthquakes, Economics, Fission, Fusion, Magnetic Films, Material Science, Medical Imaging, Ocean Dynamics, PageRank, Polymer Injection Molding, Seismology, Semiconductors, ...
Relationship between libraries
PETSc numerical component
Vectors: fundamental objects for storing field solutions, right-hand sides, etc. Each process locally owns a subvector of contiguously numbered global indices.
Features
Has a direct interface to the values
Supports all vector space operations: VecDot(), VecNorm(), VecScale(), ...
Also unusual ops, e.g. VecSqrt(), VecInverse()
Automatic communication during assembly
Customizable communication (scatters)
Numerical vector operations
Matrices
What are PETSc matrices?
Fundamental objects for storing linear operators. Each process locally owns a submatrix of contiguous rows.
Features
Supports many data types: AIJ, Block AIJ, Symmetric AIJ, Block Diagonal, etc.
Supports structures for many packages: Spooles, MUMPS, SuperLU, UMFPack, DSCPack
A matrix is defined by its interface, the operations that you can perform with it, not by its data structure.
Numerical matrix operations
Matrix AIJ format
The default matrix representation within PETSc is the general sparse AIJ format (Yale sparse matrix or Compressed Sparse Row, CSR):
the nonzero elements are stored by rows,
with an array of the corresponding column numbers
and an array of pointers to the beginning of each row.
Note: the diagonal matrix entries are stored with the rest of the nonzeros.
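A small illustration (not from the slides) of the CSR/AIJ triplet, using SciPy:

import numpy as np
from scipy.sparse import csr_matrix

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
A_csr = csr_matrix(A)
print(A_csr.data)      # nonzeros stored by rows:        [1. 2. 3. 4. 5. 6.]
print(A_csr.indices)   # their column numbers:           [0 2 1 0 1 2]
print(A_csr.indptr)    # pointers to the start of each row: [0 2 3 6]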
Parallel sparse matrices
Each process locally owns a submatrix of contiguously numbered global rows. Each submatrix consists of a diagonal and an off-diagonal part.
[Figure: row-wise partitioning of the matrix over processes P0, P1, P2]
KSP: linear equations solvers
The object KSP provides uniform and efficient access to all of the package's linear system solvers.
KSP is intended for solving nonsingular systems of the form Ax = b.
KSPCreate(MPI_Comm comm, KSP *ksp)
KSPSetOperators(KSP ksp, Mat Amat, Mat Pmat, MatStructure flag)
KSPSolve(KSP ksp, Vec b, Vec x)
KSPGetIterationNumber(KSP ksp, int *its)
KSPDestroy(KSP ksp)
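A rough sketch of the same call sequence through petsc4py, the Python interface (the toy tridiagonal system and all parameter choices are purely illustrative; consult the petsc4py documentation for the exact API):

from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n], nnz=3)      # preallocate 3 nonzeros per row
for i in range(n):
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemble()

b = A.createVecLeft(); b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()                    # KSPCreate
ksp.setOperators(A)                           # KSPSetOperators
ksp.setType(PETSc.KSP.Type.GMRES)
ksp.solve(b, x)                               # KSPSolve
print(ksp.getIterationNumber())               # KSPGetIterationNumber
ksp.destroy()                                 # KSPDestroy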
PETSc KSP methods
PETSc cont.
External packages:
Hypre - a library for solving large, sparse linear systems of equations on massively parallel computers, LLNL
SuperLU - a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations on high performance machines, Berkeley, Jim Demmel and coworkers
SuperLU_DIST - SuperLU for distributed memory
Sundials – SUite of Nonlinear and DIfferential/ALgebraic equation Solvers
FFTW – a C library for computing the discrete Fourier transform (DFT) in various dimensions, of arbitrary input size, for real and complex data
Metis, ParMetis – serial and parallel graph partitioning
Added:
– Hybrid Chebyshev
– pipelined GMRES, which performs one non-blocking reduction per iteration instead of two blocking reductions
– flexible BiCGStab, which tolerates a nonlinear preconditioner
– improved flexible BiCGStab, which tolerates a nonlinear preconditioner and performs one reduction every other iteration
– p. 39/49
’Problem-given strategy’
– p. 40/49
FEM-DD-unpreconditioned CG
Assume A and b are distributed and an initial guess x(0) is given, which is replicated.
g(0) = b − Ax(0), r = replicate(g(0)), d(0) = r, δ0 = (g(0), r)
For k = 0, 1, ..., until convergence:
(1) h = Ad(k)
(2) τ = δ0/(h, d(k))
(3) x(k+1) = x(k) + τd(k)
(4) g(k+1) = g(k) − τh
(5) r = replicate(g(k+1))
(6) δ1 = (g(k+1), r)
(7) β = δ1/δ0, δ0 = δ1
(8) d(k+1) = r + βd(k)
– p. 42/49
Why are local communications not enough?
Consider the solution of Ax = b by the standard conjugate gradient, where
A = tridiag(−1, 2, −1) ∈ R^{n×n}, except that the last diagonal entry equals 1, and b = e1 = [1, 0, ..., 0]^T.
The exact solution is x = [1, 1, ..., 1]^T. Starting with x0 = [0, 0, ..., 0]^T one finds that after k iterations
xk = [k/(k+1), (k−1)/(k+1), ..., 1/(k+1), 0, ..., 0]^T
for 1 ≤ k ≤ n − 1, and xn = x. Hence, the information travels one step at a time from left to right, and it takes n steps before the last component has changed at all.
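A quick numerical check of this behaviour (a sketch; uses the plain CG recurrences with the convention r = b − Ax):

import numpy as np

def cg_iterates(A, b, x0, iters):
    x = x0.copy(); r = b - A @ x; g = r.copy(); delta0 = r @ r
    for _ in range(iters):
        h = A @ g
        tau = delta0 / (g @ h)
        x = x + tau * g
        r = r - tau * h
        delta1 = r @ r
        g = r + (delta1 / delta0) * g
        delta0 = delta1
    return x

n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[-1, -1] = 1.0
b = np.zeros(n); b[0] = 1.0
print(cg_iterates(A, b, np.zeros(n), 3))
# approx. [0.75 0.5 0.25 0. 0. 0. 0. 0.]: only the first k components have moved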
– p. 43/49
Include preconditioning: C^{-1} A x = C^{-1} b

Unpreconditioned CG:
x = x0
r = A*x-b
delta0 = (r,r)
g = -r
Repeat:
  h = A*g
  tau = delta0/(g,h)
  x = x + tau*g
  r = r + tau*h
  delta1 = (r,r)
  if delta1 <= eps, stop
  beta = delta1/delta0
  g = -r + beta*g

Preconditioned CG:
x = x0
r = A*x-b; C*h = r
delta0 = (r,h)
g = -h
Repeat:
  h = A*g
  tau = delta0/(g,h)
  x = x + tau*g
  r = r + tau*h; C*h = r
  delta1 = (r,h)
  if delta1 <= eps, stop
  beta = delta1/delta0
  g = -h + beta*g
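A NumPy sketch of the preconditioned variant above (apply_Cinv(r) returns h with C h = r; the Jacobi choice in the final comment is just an example):

import numpy as np

def pcg(A, b, x0, apply_Cinv, eps=1e-10, max_iter=1000):
    x = x0.copy()
    r = A @ x - b
    h = apply_Cinv(r)                 # C*h = r
    delta0 = r @ h
    g = -h
    for _ in range(max_iter):
        t = A @ g                     # (the pseudocode reuses the name h here)
        tau = delta0 / (g @ t)
        x = x + tau * g
        r = r + tau * t
        h = apply_Cinv(r)             # C*h = r
        delta1 = r @ h
        if delta1 <= eps:
            break
        beta = delta1 / delta0
        delta0 = delta1
        g = -h + beta * g
    return x

# Example with a Jacobi (diagonal) preconditioner:
# x = pcg(A, b, x0, lambda r: r / np.diag(A))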
– p. 44/49
FEM-DD- preconditioned CG
Assume A, B and b are distributed and the initial guess x(0) is replicated.
g(0) = Ax(0) − b, g(0) = replicate(g(0))
h = Bg(0), d(0) = −h, δ0 = (g(0), h)
For k = 0, 1, ..., until convergence:
(1) h = Ad(k)
(2) τ = δ0/(h, d(k))
(3) x(k+1) = x(k) + τd(k)
(4) g(k+1) = g(k) + τh, g(k+1) = replicate(g(k+1))
(5) h = Bg(k+1)
(6) δ1 = (g(k+1), h)
(7) β = δ1/δ0, δ0 = δ1
(8) d(k+1) = −h + βd(k)
– p. 45/49
II: Distributed memory paradigm
Each machine keeps the entire mesh and DoF handler locally, but only a share of the global matrix, sparsity pattern, and solution vector is stored on each machine.
The mesh and DoF handler can also be distributed, i.e., each processor stores only a share of the cells and degrees of freedom. No processor has knowledge of the entire mesh, matrix, or solution, and in fact problems solved in this mode are usually so large (say, hundreds of millions to billions of degrees of freedom) that no processor can or should store even a single solution vector.
– p. 46/49
Preconditioned Chronopoulos/Gear CG
r0 = b − A*x0
C*u0 = r0; w0 = A*u0
α0 = (r0, u0)/(w0, u0), β0 = 0, γ0 = (r0, u0)
Repeat:
  pi = ui + βi pi−1                  vector update
  si = wi + βi si−1                  vector update
  xi+1 = xi + αi pi                  vector update
  ri+1 = ri − αi si                  vector update
  C ui+1 = ri+1                      system solve
  wi+1 = A ui+1                      matrix-vector
  γi+1 = (ri+1, ui+1)                scalar product
  δ = (wi+1, ui+1)                   scalar product
  βi+1 = γi+1/γi
  αi+1 = γi+1/(δ − βi+1 γi+1/αi)
The communication phase for both dot-products from the algorithm can be combined in a single global reduction. The update for xi can be postponed and used to overlap the global reduction. However, even for small parallel machines the runtime of a single vector update will not be enough to fully cover the latency of the global communication.
– p. 47/49
Pipelined Chronopoulos/Gear CG
Hiding global synchronization latency in the PCG algorithm P. Ghysels, W. Vanroose, December 2012
r0 = b − A*x0; w0 = A*r0
Repeat:
  γi = (ri, ri)                      scalar product
  δ = (wi, ri)                       scalar product
  qi = A*wi                          matrix-vector
  if i > 0 then βi = γi/γi−1, αi = γi/(δ − βi γi/αi−1)
  else βi = 0, αi = γi/δ
  zi = qi + βi zi−1                  vector update
  si = wi + βi si−1                  vector update
  pi = ri + βi pi−1                  vector update
  xi+1 = xi + αi pi                  vector update
  ri+1 = ri − αi si                  vector update
  wi+1 = wi − αi zi                  vector update
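A sequential NumPy sketch of these recurrences (in a real parallel code the product q = A w would be overlapped with the global reductions for γ and δ; here everything simply runs in order):

import numpy as np

def pipelined_cg(A, b, x0, max_iter=100, tol=1e-10):
    x = x0.copy()
    r = b - A @ x
    w = A @ r
    z = np.zeros_like(b); s = np.zeros_like(b); p = np.zeros_like(b)
    alpha = 0.0; gamma_old = 0.0
    for i in range(max_iter):
        gamma = r @ r                 # scalar product
        delta = w @ r                 # scalar product
        q = A @ w                     # matrix-vector
        if gamma < tol**2:
            break
        if i > 0:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        else:
            beta = 0.0
            alpha = gamma / delta
        z = q + beta * z              # vector updates
        s = w + beta * s
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        gamma_old = gamma
    return x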
– p. 48/49
CG – more efficient global communications?
Danger: what is going on with the NUMERICAL STABILITY of the method?
However, with increasing s, the stability of the s-step Krylov basis deteriorates.
We report on extensive numerical tests that show the stability of the pipelined CG and CR methods.
Stability is also negatively impacted in the pipelined methods by the extra multiplication with the matrix A (ri, ui and wi are replaced by ri = b − Axi, ui = M^{-1} ri and wi = Aui every 50th iteration).
A possible improvement might be to add a shift in the matrix-vector product, similar to what is also done in s-step Krylov methods and for pipelined GMRES. The CG iteration can provide information on the spectrum of A, which can be used to determine good shifts.
To summarize: we need a preconditioner, and not only local communications, for good numerical performance.
– p. 49/49