Conjugate Gradient Method

• direct and indirect methods

• positive definite linear systems

• Krylov sequence

• spectral analysis of Krylov sequence

• preconditioning

Prof. S. Boyd, EE364b, Stanford University


Three classes of methods for linear equations

methods to solve linear system $Ax = b$, $A \in \mathbf{R}^{n \times n}$

• dense direct (factor-solve methods)

– runtime depends only on size; independent of data, structure, or sparsity
– work well for n up to a few thousand

• sparse direct (factor-solve methods)

– runtime depends on size, sparsity pattern; (almost) independent of data
– can work well for n up to $10^4$ or $10^5$ (or more)
– requires good heuristic for ordering


• indirect (iterative methods)

– runtime depends on data, size, sparsity, required accuracy
– requires tuning, preconditioning, . . .
– good choice in many cases; only choice for $n = 10^6$ or larger


Symmetric positive definite linear systems

SPD system of equations

$$Ax = b, \qquad A \in \mathbf{R}^{n \times n}, \quad A = A^T \succ 0$$

examples

• Newton/interior-point search direction: $\nabla^2 \phi(x)\,\Delta x = -\nabla \phi(x)$

• least-squares normal equations: $(A^T A)x = A^T b$

• regularized least-squares: $(A^T A + \mu I)x = A^T b$

• minimization of convex quadratic function $(1/2)x^T A x - b^T x$

• solving (discretized) elliptic PDE (e.g., Poisson equation)


• analysis of resistor circuit: Gv = i

– v is node voltage (vector), i is (given) source current
– G is circuit conductance matrix

$$G_{ij} = \begin{cases} \text{total conductance incident on node } i & i = j \\ -(\text{conductance between nodes } i \text{ and } j) & i \neq j \end{cases}$$
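
To make this concrete, here is a minimal NumPy sketch that assembles G from a branch list and solves a tiny grounded circuit; the branch-list format, the 3-node example, and the choice of grounded node are illustrative assumptions, not from the slides.

```python
import numpy as np

def conductance_matrix(n, branches):
    """Assemble G: each diagonal entry accumulates the total conductance
    incident on that node; off-diagonals get minus the branch conductance."""
    G = np.zeros((n, n))
    for i, j, g in branches:
        G[i, i] += g
        G[j, j] += g
        G[i, j] -= g
        G[j, i] -= g
    return G

# 3-node circuit: branches given as (node i, node j, conductance)
branches = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 0.5)]
G = conductance_matrix(3, branches)
i_src = np.array([1.0, 0.0, 0.0])   # source current injected at node 0
Gr, ir = G[:2, :2], i_src[:2]       # ground node 2: drop its row/column -> SPD system
v = np.linalg.solve(Gr, ir)         # node voltages at nodes 0 and 1
```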


CG overview

• proposed by Hestenes and Stiefel in 1952 (as direct method)

• solves SPD system Ax = b

– in theory (i.e., exact arithmetic) in n iterations
– each iteration requires a few inner products in $\mathbf{R}^n$, and one matrix-vector multiply $z \to Az$

• for A dense, matrix-vector multiply $z \to Az$ costs $n^2$, so total cost is $n^3$, same as direct methods

• get advantage over dense if matrix-vector multiply is cheaper than $n^2$

• with roundoff error, CG can work poorly (or not at all)

• but for some A (and b), can get good approximate solution in ≪ n iterations


Solution and error

• $x^\star = A^{-1}b$ is solution

• $x^\star$ minimizes (convex function) $f(x) = (1/2)x^T Ax - b^T x$

• $\nabla f(x) = Ax - b$ is gradient of $f$

• with $f^\star = f(x^\star)$, we have

$$\begin{aligned}
f(x) - f^\star &= (1/2)x^T Ax - b^T x - (1/2){x^\star}^T A x^\star + b^T x^\star \\
&= (1/2)(x - x^\star)^T A (x - x^\star) \\
&= (1/2)\|x - x^\star\|_A^2
\end{aligned}$$

i.e., $f(x) - f^\star$ is half of squared A-norm of error $x - x^\star$


• a relative measure (comparing x to 0):

$$\tau = \frac{f(x) - f^\star}{f(0) - f^\star} = \frac{\|x - x^\star\|_A^2}{\|x^\star\|_A^2}$$

(fraction of maximum possible reduction in f, compared to x = 0)


Residual

• r = b − Ax is called the residual at x

• $r = -\nabla f(x) = A(x^\star - x)$

• in terms of r, we have

$$f(x) - f^\star = (1/2)(x - x^\star)^T A (x - x^\star) = (1/2)r^T A^{-1} r = (1/2)\|r\|_{A^{-1}}^2$$

• a commonly used measure of relative accuracy: $\eta = \|r\|/\|b\|$

• $\tau \le \kappa(A)\eta^2$ ($\eta$ is easily computable from x; $\tau$ is not)
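
A quick numerical check of these two measures (the random SPD test matrix and the perturbed approximate solution below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                   # a well-conditioned SPD test matrix
b = rng.standard_normal(n)

x_star = np.linalg.solve(A, b)
x = x_star + 1e-3 * rng.standard_normal(n)    # some approximate solution

r = b - A @ x
eta = np.linalg.norm(r) / np.linalg.norm(b)   # relative residual
tau = ((x - x_star) @ A @ (x - x_star)) / (x_star @ A @ x_star)
kappa = np.linalg.cond(A)
print(tau <= kappa * eta**2)                  # True: tau <= kappa(A) * eta^2
```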


Krylov subspace

(a.k.a. controllability subspace)

$$\mathcal{K}_k = \operatorname{span}\{b, Ab, \ldots, A^{k-1}b\} = \{p(A)b \mid p \text{ polynomial},\ \deg p < k\}$$

we define the Krylov sequence $x^{(1)}, x^{(2)}, \ldots$ as

$$x^{(k)} = \operatorname*{argmin}_{x \in \mathcal{K}_k} f(x) = \operatorname*{argmin}_{x \in \mathcal{K}_k} \|x - x^\star\|_A^2$$

the CG algorithm (among others) generates the Krylov sequence
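
As a sanity check of this definition, the sketch below computes the Krylov sequence by brute force, minimizing f over an explicitly orthonormalized basis of each $\mathcal{K}_k$; the small random test problem is illustrative, and CG produces the same iterates far more cheaply.

```python
import numpy as np

def krylov_sequence(A, b, K):
    """Compute x^(k) = argmin over K_k of f(x), for k = 1..K, by projecting
    onto an orthonormal basis of span{b, Ab, ..., A^(k-1) b}."""
    xs, V, v = [], np.empty((len(b), 0)), b.copy()
    for k in range(1, K + 1):
        V = np.linalg.qr(np.hstack([V, v[:, None]]))[0]   # orthonormal basis of K_k
        y = np.linalg.solve(V.T @ A @ V, V.T @ b)         # minimize f over range(V)
        xs.append(V @ y)
        v = A @ v                                         # next Krylov vector A^k b
    return xs

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8))
A = B @ B.T + 8 * np.eye(8)
b = rng.standard_normal(8)
xs = krylov_sequence(A, b, 8)

f = lambda x: 0.5 * x @ A @ x - b @ x
print([round(f(x), 4) for x in xs])                    # nonincreasing values of f
print(np.linalg.norm(xs[-1] - np.linalg.solve(A, b)))  # ~ 0: x^(n) = x_star
```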


Properties of Krylov sequence

• $f(x^{(k+1)}) \le f(x^{(k)})$ (but $\|r\|$ can increase)

• $x^{(n)} = x^\star$ (i.e., $x^\star \in \mathcal{K}_n$ even when $\mathcal{K}_n \neq \mathbf{R}^n$)

• $x^{(k)} = p_k(A)b$, where $p_k$ is a polynomial with $\deg p_k < k$

• less obvious: there is a two-term recurrence

$$x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)} + \beta_k (x^{(k)} - x^{(k-1)})$$

for some $\alpha_k$, $\beta_k$ (basis of CG algorithm)


Cayley-Hamilton theorem

characteristic polynomial of A:

$$\chi(s) = \det(sI - A) = s^n + \alpha_1 s^{n-1} + \cdots + \alpha_n$$

by the Cayley-Hamilton theorem

$$\chi(A) = A^n + \alpha_1 A^{n-1} + \cdots + \alpha_n I = 0$$

and so

$$A^{-1} = -(1/\alpha_n)A^{n-1} - (\alpha_1/\alpha_n)A^{n-2} - \cdots - (\alpha_{n-1}/\alpha_n)I$$

in particular, we see that $x^\star = A^{-1}b \in \mathcal{K}_n$
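
A small numerical illustration of this identity, assuming NumPy's np.poly for the characteristic-polynomial coefficients and a random SPD test matrix of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)

coeffs = np.poly(A)        # [1, alpha_1, ..., alpha_n] of det(sI - A)
alpha = coeffs[1:]

# A^{-1} = -(1/alpha_n) A^{n-1} - (alpha_1/alpha_n) A^{n-2} - ... - (alpha_{n-1}/alpha_n) I
Ainv = -(1.0 / alpha[-1]) * np.linalg.matrix_power(A, n - 1)
for i in range(1, n):
    Ainv -= (alpha[i - 1] / alpha[-1]) * np.linalg.matrix_power(A, n - 1 - i)

print(np.max(np.abs(Ainv - np.linalg.inv(A))))   # tiny: the polynomial in A equals A^{-1}
```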


Spectral analysis of Krylov sequence

• $A = Q\Lambda Q^T$, $Q$ orthogonal, $\Lambda = \mathbf{diag}(\lambda_1, \ldots, \lambda_n)$

• define $y = Q^T x$, $\bar b = Q^T b$, $y^\star = Q^T x^\star$

• in terms of y, we have

$$\begin{aligned}
f(x) = \bar f(y) &= (1/2)x^T Q\Lambda Q^T x - b^T Q Q^T x \\
&= (1/2)y^T \Lambda y - \bar b^T y \\
&= \sum_{i=1}^n \left((1/2)\lambda_i y_i^2 - \bar b_i y_i\right)
\end{aligned}$$

so $y_i^\star = \bar b_i/\lambda_i$, $f^\star = -(1/2)\sum_{i=1}^n \bar b_i^2/\lambda_i$


Krylov sequence in terms of  y

$$y^{(k)} = \operatorname*{argmin}_{y \in \bar{\mathcal{K}}_k} \bar f(y), \qquad \bar{\mathcal{K}}_k = \operatorname{span}\{\bar b, \Lambda \bar b, \ldots, \Lambda^{k-1}\bar b\}$$

$$y^{(k)}_i = p_k(\lambda_i)\,\bar b_i, \qquad \deg p_k < k$$

$$p_k = \operatorname*{argmin}_{\deg p < k} \sum_{i=1}^n \bar b_i^2 \left((1/2)\lambda_i\, p(\lambda_i)^2 - p(\lambda_i)\right)$$


$$\begin{aligned}
f(x^{(k)}) - f^\star &= \bar f(y^{(k)}) - f^\star \\
&= \min_{\deg p < k}\ (1/2)\sum_{i=1}^n \bar b_i^2\, \frac{(\lambda_i p(\lambda_i) - 1)^2}{\lambda_i} \\
&= \min_{\deg p < k}\ (1/2)\sum_{i=1}^n {y^\star_i}^2 \lambda_i (\lambda_i p(\lambda_i) - 1)^2 \\
&= \min_{\deg q \le k,\ q(0)=1}\ (1/2)\sum_{i=1}^n {y^\star_i}^2 \lambda_i\, q(\lambda_i)^2 \\
&= \min_{\deg q \le k,\ q(0)=1}\ (1/2)\sum_{i=1}^n \bar b_i^2\, \frac{q(\lambda_i)^2}{\lambda_i}
\end{aligned}$$


$$\tau_k = \frac{\min_{\deg q \le k,\ q(0)=1} \sum_{i=1}^n {y^\star_i}^2 \lambda_i\, q(\lambda_i)^2}{\sum_{i=1}^n {y^\star_i}^2 \lambda_i} \le \min_{\deg q \le k,\ q(0)=1}\ \max_{i=1,\ldots,n} q(\lambda_i)^2$$

• if there is a polynomial q of degree k, with q(0) = 1, that is small on the spectrum of A, then $f(x^{(k)}) - f^\star$ is small

• if eigenvalues are clustered in k groups, then $y^{(k)}$ is a good approximate solution

• if solution $x^\star$ is approximately a linear combination of k eigenvectors of A, then $y^{(k)}$ is a good approximate solution


A bound on convergence rate

• taking q as the Chebyshev polynomial of degree k that is small on the interval $[\lambda_{\min}, \lambda_{\max}]$, we get

$$\tau_k \le \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}, \qquad \kappa = \lambda_{\max}/\lambda_{\min}$$

• convergence can be much faster than this, if spectrum of A is spread but clustered
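
A sketch of the clustering effect using SciPy's cg on a synthetic matrix whose eigenvalues sit in three tight clusters (the sizes and cluster centers are arbitrary choices): the relative residual drops sharply after about three iterations, much sooner than the κ-based bound above would suggest.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(3)
n = 300
# eigenvalues in three tight clusters around 1, 10, and 100
lams = np.concatenate([1 + 0.01 * rng.standard_normal(100),
                       10 + 0.1 * rng.standard_normal(100),
                       100 + 1.0 * rng.standard_normal(100)])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lams) @ Q.T
b = rng.standard_normal(n)

etas = []
cg(A, b, maxiter=10,
   callback=lambda xk: etas.append(np.linalg.norm(b - A @ xk) / np.linalg.norm(b)))
print(np.round(etas, 6))   # drops sharply after about 3 iterations, despite kappa ~ 100
```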


Small example

$A \in \mathbf{R}^{7\times 7}$, spectrum shown as filled circles; $p_1$, $p_2$, $p_3$, $p_4$, and $p_7$ shown

[figure: $1 - x\,p_k(x)$ versus $x$ on $[0, 10]$, with the eigenvalues of A marked on the x-axis]


Convergence

[figure: $\tau_k$ versus iteration $k$, $k = 0, \ldots, 7$]


Residual convergence

[figure: $\eta_k$ versus iteration $k$, $k = 0, \ldots, 7$]


Larger example

• solve Gv = i, resistor network with $10^5$ nodes

• average node degree 10; around $10^6$ nonzeros in G

• random topology with one grounded node

• nonzero branch conductances uniform on [0, 1]

• external current i uniform on [0, 1]

• sparse Cholesky factorization of  G requires too much memory
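
A scaled-down sketch of this experiment ($10^4$ nodes instead of $10^5$, and a simple ring-plus-random-chords topology of my own choosing): the conductance matrix is assembled in sparse form and handed to SciPy's cg, which only touches G through matrix-vector products.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(4)
n = 10_000                                    # nodes (the slide uses 10^5)
# branch list: a ring (to keep the graph connected) plus random chords,
# giving average node degree about 10
ring = np.column_stack([np.arange(n), np.roll(np.arange(n), -1)])
chords = rng.integers(0, n, size=(4 * n, 2))
chords = chords[chords[:, 0] != chords[:, 1]]
ij = np.vstack([ring, chords])
g = rng.uniform(0, 1, size=len(ij))           # branch conductances ~ U[0, 1]

# assemble G: duplicate entries are summed when converting COO -> CSR
rows = np.concatenate([ij[:, 0], ij[:, 1], ij[:, 0], ij[:, 1]])
cols = np.concatenate([ij[:, 0], ij[:, 1], ij[:, 1], ij[:, 0]])
vals = np.concatenate([g, g, -g, -g])
G = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

i_src = rng.uniform(0, 1, size=n)
Gr, ir = G[1:, 1:], i_src[1:]                 # ground node 0 -> SPD reduced system
v, info = cg(Gr, ir, maxiter=1000)            # CG only ever does G @ x products
print(info, np.linalg.norm(ir - Gr @ v) / np.linalg.norm(ir))
```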


Residual convergence

[figure: $\eta_k$ versus iteration $k$, $k = 0, \ldots, 60$, on a log scale from $10^{-8}$ to $10^{4}$]


CG algorithm

(follows C. T. Kelley)

$x := 0$, $r := b$, $\rho_0 := \|r\|^2$
for $k = 1, \ldots, N_{\max}$
    quit if $\sqrt{\rho_{k-1}} \le \epsilon \|b\|$
    if $k = 1$ then $p := r$; else $p := r + (\rho_{k-1}/\rho_{k-2})p$
    $w := Ap$
    $\alpha := \rho_{k-1}/p^T w$
    $x := x + \alpha p$
    $r := r - \alpha w$
    $\rho_k := \|r\|^2$
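
Below is a direct NumPy transcription of this pseudocode; the function signature, the optional initial guess, and the small test problem are my additions.

```python
import numpy as np

def cg(matvec, b, x0=None, tol=1e-8, max_iter=None):
    """CG for SPD Ax = b, transcribed from the pseudocode above.
    `matvec` is any function z -> A z, so A itself is never formed."""
    x = np.zeros_like(b) if x0 is None else x0.astype(float).copy()
    r = b - matvec(x)                    # r = b when x0 = 0, as on the slide
    rho, rho_prev = r @ r, None          # rho_k = ||r||^2
    p = None
    max_iter = len(b) if max_iter is None else max_iter
    for k in range(1, max_iter + 1):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        p = r.copy() if k == 1 else r + (rho / rho_prev) * p
        w = matvec(p)
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        rho_prev, rho = rho, r @ r
    return x

# quick check on a small random SPD system (illustrative)
rng = np.random.default_rng(5)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = cg(lambda z: A @ z, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # <= 1e-8 by the stopping rule
```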


Efficient matrix-vector multiply

• sparse A

• structured (e.g., sparse) plus low rank

• products of easy-to-multiply matrices

• fast transforms (FFT, wavelet, . . . )

• inverses of lower/upper triangular (by forward/backward substitution)

• fast Gauss transform, for $A_{ij} = \exp(-\|v_i - v_j\|^2/\sigma^2)$ (via multipole)
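
As one illustration, a "sparse plus low rank" matrix can be handed to CG purely as a matvec routine, here via SciPy's LinearOperator; the sizes, density, and rank below are arbitrary choices.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
n, k = 5000, 20
R = sp.random(n, n, density=1e-3, random_state=7)
S = R @ R.T + sp.identity(n)              # sparse SPD part
U = rng.standard_normal((n, k))           # dense low-rank factor

# A = S + U U^T is never formed explicitly; CG only needs z -> A z
def matvec(z):
    return S @ z + U @ (U.T @ z)

A = LinearOperator((n, n), matvec=matvec, dtype=float)
b = rng.standard_normal(n)
x, info = cg(A, b)
print(info, np.linalg.norm(b - matvec(x)) / np.linalg.norm(b))
```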


Shifting

• suppose we have a guess $\hat x$ of the solution $x^\star$

• we can solve $Az = b - A\hat x$ using CG, then get $x^\star = \hat x + z$

• in this case $x^{(k)} = \hat x + z^{(k)} = \operatorname*{argmin}_{x \in \hat x + \mathcal{K}_k} f(x)$

($\hat x + \mathcal{K}_k$ is called a shifted Krylov subspace)

• same as initializing CG algorithm with $x := \hat x$, $r := b - A\hat x$

• good for 'warm start', i.e., solving $Ax = b$ starting from a good initial guess (e.g., the solution of another system $\tilde A x = \tilde b$, with $\tilde A \approx A$, $\tilde b \approx b$)
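
A quick check of this equivalence with SciPy's cg (the random test system, the initial guess, and the 20-iteration cap are arbitrary choices): warm-starting at $\hat x$ and solving the shifted system from zero give the same iterate.

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(8)
n = 300
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)
b = rng.standard_normal(n)
x_hat = rng.standard_normal(n)        # some initial guess of x_star

# route 1: warm-start CG at x_hat
x1, _ = cg(A, b, x0=x_hat, maxiter=20)

# route 2: solve the shifted system A z = b - A x_hat from zero, then add x_hat
z, _ = cg(A, b - A @ x_hat, maxiter=20)
x2 = x_hat + z

print(np.linalg.norm(x1 - x2))        # tiny: the two routes give the same iterate
```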


Preconditioned conjugate gradient algorithm

• idea: apply CG after linear change of coordinates $x = Ty$, $\det T \neq 0$

• use CG to solve $T^T A T y = T^T b$; then set $x^\star = T^{-1} y^\star$

• $T$ or $M = TT^T$ is called the preconditioner

• in naive implementation, each iteration requires multiplies by $T$ and $T^T$ (and $A$); also need to compute $x^\star = T^{-1}y^\star$ at end

• can re-arrange computation so each iteration requires one multiply by $M$ (and $A$), and no final solve $x^\star = T^{-1}y^\star$

• called preconditioned conjugate gradient (PCG) algorithm


Choice of preconditioner

• if spectrum of $T^T A T$ (which is the same as the spectrum of $MA$) is clustered, PCG converges fast

• extreme case: $M = A^{-1}$

• trade-off between enhanced convergence, and extra cost of multiplication by $M$ at each step

• goal is to find $M$ that is cheap to multiply, and approximate inverse of A (or at least has a more clustered spectrum than A)


Some generic preconditioners

• diagonal: $M = \mathbf{diag}(1/A_{11}, \ldots, 1/A_{nn})$ (see the sketch after this list)

• incomplete/approximate Cholesky factorization: use $M = \hat A^{-1}$, where $\hat A = \hat L \hat L^T$ is an approximation of A with cheap Cholesky factorization

– compute Cholesky factorization of $\hat A$, $\hat A = \hat L \hat L^T$
– at each iteration, compute $Mz = \hat L^{-T}\hat L^{-1}z$ via forward/backward substitution

• examples

– $\hat A$ is central k-wide band of A
– $\hat L$ obtained by sparse Cholesky factorization of A, ignoring small elements in A, or refusing to create excessive fill-in
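
Here is a small demonstration of the diagonal (Jacobi) preconditioner from the first bullet, using SciPy's cg; the badly scaled test matrix is a synthetic construction chosen so that diagonal scaling helps.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(9)
n = 2000
d = 10.0 ** rng.uniform(-3, 3, size=n)             # wildly varying scales
R = sp.random(n, n, density=2e-3, random_state=10)
A = sp.diags(d) + R @ R.T                          # SPD, badly scaled
b = rng.standard_normal(n)

def n_iters(M=None):
    its = []
    cg(A, b, M=M, maxiter=5000, callback=lambda xk: its.append(1))
    return len(its)

M_jacobi = sp.diags(1.0 / A.diagonal())            # M = diag(1/A_11, ..., 1/A_nn)
print("no preconditioner:   ", n_iters())
print("diagonal (Jacobi) M: ", n_iters(M_jacobi))
```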


Larger example

residual convergence with and without diagonal preconditioning

[figure: $\eta_k$ versus iteration $k$, $k = 0, \ldots, 60$, log scale from $10^{-8}$ to $10^{4}$, with and without diagonal preconditioning]


CG summary

• in theory (with exact arithmetic) converges to solution in n steps

– the bad news: due to numerical round-off errors, can take more than n steps (or fail to converge)
– the good news: with luck (i.e., good spectrum of A), can get good approximate solution in ≪ n steps

• each step requires a $z \to Az$ multiplication

– can exploit a variety of structure in A
– in many cases, never form or store the matrix A

• compared to direct (factor-solve) methods, CG is less reliable, data dependent; often requires good (problem-dependent) preconditioner

• but, when it works, can solve extremely large systems
