
  • Golub-Kahan iterative bidiagonalization and

    determining the noise level in the data

    Iveta Hnětynková ∗,∗∗, Martin Plešinger ∗∗,∗∗∗, Zdeněk Strakoš ∗,∗∗

    * Charles University, Prague

    ** Academy of Sciences of the Czech republic, Prague

    *** Technical University, Liberec

    with thanks to P. C. Hansen, M. Kilmer and many others

    SIAM ALA, Monterey, October 2009

    1

  • A six-ton large-scale real-world ill-posed problem:

    2

  • Solving large scale discrete ill-posed problems is frequently based upon

    orthogonal projections-based model reduction using Krylov sub-

    spaces, see, e.g., hybrid methods. This can be viewed as

    approximation of a Riemann-Stieltjes distribution function

    via matching moments.

    3

  • Consider the Riemann-Stieltjes distribution function ω(λ) with the
    n points of increase associated with the HPD matrix B and the
    normalized initial vector s (or with the transfer function given by
    the Laplace transform of a linear dynamical system determined by
    B, s). Then

        s^* (λI − B)^{−1} s = Σ_{j=1}^{n} ω_j / (λ − λ_j) ≡ F_n(λ) ,

    where λ_j, j = 1, . . . , n, denote the eigenvalues of B and ω_j the
    squared size of the component of s in the corresponding invariant
    subspace.

    4

  • The function F_n(λ) can be written as the continued fraction

        F_n(λ) ≡ R_n(λ) / P_n(λ)
               ≡ 1 / (λ − γ_1 − δ_2^2 / (λ − γ_2 − δ_3^2 / (λ − γ_3 − · · · − δ_n^2 / (λ − γ_n)))) ,

    and the entries γ_1, . . . , γ_n and δ_2, . . . , δ_n form the Jacobi matrix

    5

  • T_n ≡
        [ γ_1  δ_2                ]
        [ δ_2  γ_2  ...           ]
        [      ...  ...   δ_n     ]
        [           δ_n   γ_n     ] ,   δ_ℓ > 0 ,  ℓ = 2, . . . , n .

    Consider the kth Gauss-Christoffel quadrature approximation ω(k)(λ)
    of the Riemann-Stieltjes distribution function ω(λ). Its algebraic
    degree is 2k − 1, i.e., it matches the first 2k moments

        ξ_{ℓ−1} = ∫ λ^{ℓ−1} dω(λ) = Σ_{j=1}^{k} ω_j^{(k)} (λ_j^{(k)})^{ℓ−1} ,   ℓ = 1, . . . , 2k .

    6
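The moment-matching property can be checked numerically. The following is a minimal sketch (numpy assumed; the matrix B, the vector s, and the `jacobi_matrix` helper are illustrative, not from the talk): build the Jacobi matrix T_k by the Lanczos process for a small SPD matrix, and verify that its eigenvalues and squared first eigenvector components reproduce the first 2k moments ∫ λ^{ℓ−1} dω(λ) = s^T B^{ℓ−1} s.

```python
import numpy as np

def jacobi_matrix(B, s, k):
    """k steps of the Lanczos tridiagonalization of the SPD matrix B
    started from s/||s||; returns the k-by-k Jacobi matrix T_k."""
    n = len(s)
    Q = np.zeros((n, k + 1))
    alpha, beta = np.zeros(k), np.zeros(k + 1)
    Q[:, 0] = s / np.linalg.norm(s)
    for j in range(k):
        w = B @ Q[:, j] - (beta[j] * Q[:, j - 1] if j > 0 else 0.0)
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # reorthogonalize
        beta[j + 1] = np.linalg.norm(w)
        Q[:, j + 1] = w / beta[j + 1]
    return np.diag(alpha) + np.diag(beta[1:k], 1) + np.diag(beta[1:k], -1)

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
B = M @ M.T + 8 * np.eye(8)          # a small SPD test matrix (illustrative)
s = rng.standard_normal(8)
s /= np.linalg.norm(s)

k = 3
Tk = jacobi_matrix(B, s, k)
nodes, vecs = np.linalg.eigh(Tk)     # nodes lambda_j^{(k)} of the quadrature
weights = vecs[0, :] ** 2            # squared first eigenvector components

# Gauss-Christoffel quadrature of algebraic degree 2k-1:
# the first 2k moments match
for ell in range(1, 2 * k + 1):
    moment = s @ np.linalg.matrix_power(B, ell - 1) @ s
    quad = weights @ nodes ** (ell - 1)
    assert abs(quad - moment) <= 1e-8 * abs(moment)
```

The weights sum to one because they are the squared first components of a complete orthonormal eigenvector basis.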

  • The nodes and weights of ω(k)(λ) are given by the eigenvalues and
    the corresponding squared first elements of the normalized
    eigenvectors of T_k.

    Expansion of the continued fraction F_n(λ) in terms of the decreasing
    powers of λ and the approximation by its kth convergent F_k(λ) gives

        F_n(λ) = Σ_{ℓ=1}^{2k} ξ_{ℓ−1} / λ^ℓ + O(1/λ^{2k+1}) = F_k(λ) + O(1/λ^{2k+1}) .

    Here F_k(λ) approximates F_n(λ) with the error proportional to
    λ^{−(2k+1)}, which represents the minimal partial realization matching
    the first 2k moments, cf. [Stieltjes – 1894, Chebyshev – 1855].

    7

  • Discrete ill-posed problem,
    the smallest node and weight in approximation of ω(λ):

    [Figure: weight of the leftmost node. Plotted: ω(λ) with nodes σ_j^2
    and weights |(b/β_1, u_j)|^2; the approximations with nodes
    (θ_1^{(k)})^2 and weights |(p_1^{(k)}, e_1)|^2; and the level δnoise^2.]

    8

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    9

  • Consider an ill-posed square nonsingular linear algebraic system

        A x ≈ b ,   A ∈ R^{n×n} ,   b ∈ R^n ,

    with the right-hand side corrupted by white noise,

        b = bexact + bnoise ≠ 0 ,   ‖bexact‖ ≫ ‖bnoise‖ ,

    and the goal to approximate xexact ≡ A^{−1} bexact.

    The noise level

        δnoise ≡ ‖bnoise‖ / ‖bexact‖ ≪ 1 .

    10

  • Properties (assumptions):

    • matrices A, A^T, A A^T have a smoothing property;

    • left singular vectors uj of A represent increasing frequencies

    as j increases;

    • the system A x = bexact satisfies the discrete Picard condition.

    Discrete Picard condition (DPC):

    On average, the components |(bexact, uj)| of the true right-hand

    side bexact in the left singular subspaces of A decay faster than

    the singular values σj of A, j = 1, . . . , n .

    White noise:

    The components |(bnoise, uj)|, j = 1, . . . , n do not exhibit any trend.

    11

  • Problem Shaw: Noise level, Singular values, and DPC:

    [Hansen – Regularization Tools]

    [Figure, left: σ_j in double precision arithmetic, σ_j in high precision
    arithmetic, and |(bexact, u_j)| in high precision arithmetic, versus the
    singular value number. Right: σ_j and |(b, u_j)| for
    δnoise = 10^{−14}, 10^{−8}, and 10^{−4}.]

    12

  • Regularization is used to suppress the effect of errors in the data

    and extract the essential information about the solution.

    In hybrid methods, see [O'Leary, Simmons – 81], [Hansen – 98], or
    [Fierro, Golub, Hansen, O'Leary – 97], [Kilmer, O'Leary – 01], [Kilmer,
    Hansen, Español – 06], the outer bidiagonalization is combined with an
    inner regularization – e.g., truncated SVD (TSVD) or Tikhonov
    regularization – of the projected small problem (i.e., of the reduced
    model).

    The bidiagonalization is stopped when the regularized solution of the

    reduced model matches some selected stopping criteria.

    13

  • Stopping criteria are typically based, amongst others, see [Björck –
    88], [Björck, Grimme, Van Dooren – 94], on

    • estimation of the L-curve [Calvetti, Golub, Reichel – 99], [Calvetti,
      Morigi, Reichel, Sgallari – 00], [Calvetti, Reichel – 04];

    • estimation of the distance between the exact and regularized
      solution [O'Leary – 01];

    • the discrepancy principle [Morozov – 66], [Morozov – 84];

    • cross validation methods [Chung, Nagy, O'Leary – 04], [Golub,
      Von Matt – 97], [Nguyen, Milanfar, Golub – 01].

    For an extensive study and comparison see [Hansen – 98], [Kilmer,

    O’Leary – 01].

    14

  • Stopping criteria based on information from residual vectors:

    A vector x̂ is a good approximation to xexact = A^{−1} bexact if the
    corresponding residual approximates the (white) noise in the data,

        r̂ ≡ b − A x̂ ≈ bnoise .

    Behavior of r̂ can be expressed in the frequency domain using

    • discrete Fourier transform, see [Rust – 98], [Rust – 00],

    [Rust, O’Leary – 08], or

    • discrete cosine transform, see [Hansen, Kilmer, Kjeldsen – 06],

    and then analyzed using statistical tools – cumulative periodograms.

    15
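The whiteness of a residual can be illustrated with a cumulative periodogram. A minimal numpy sketch (the helper name and the test signals are illustrative, not from the cited works): for white noise the normalized cumulative periodogram stays close to a straight line, while a smooth residual concentrates its spectral mass at low frequencies.

```python
import numpy as np

def cumulative_periodogram(r):
    """Normalized cumulative periodogram of a real vector r
    (zero-frequency term dropped)."""
    power = np.abs(np.fft.rfft(r)) ** 2
    power = power[1:]
    c = np.cumsum(power)
    return c / c[-1]

rng = np.random.default_rng(0)
n = 4096
t = np.arange(n) / n

c_white = cumulative_periodogram(rng.standard_normal(n))
c_smooth = cumulative_periodogram(np.sin(2 * np.pi * 3 * t))

# white noise: close to the straight line from 0 to 1
line = np.arange(1, len(c_white) + 1) / len(c_white)
assert np.max(np.abs(c_white - line)) < 0.1

# smooth (low-frequency) signal: almost all mass in the first few frequencies
assert c_smooth[4] > 0.99
```

A Kolmogorov-Smirnov-type band around the straight line is what the cited periodogram-based stopping criteria test in practice.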

  • This talk:

    Under the given (quite natural) assumptions, the Golub-Kahan itera-

    tive bidiagonalization reveals the noise level δnoise .

    16

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    17

  • Golub-Kahan iterative bidiagonalization (GK) of A :

    given w_0 = 0 , s_1 = b / β_1 , where β_1 = ‖b‖ , for j = 1, 2, . . .

        α_j w_j = A^T s_j − β_j w_{j−1} ,   ‖w_j‖ = 1 ,
        β_{j+1} s_{j+1} = A w_j − α_j s_j ,   ‖s_{j+1}‖ = 1 .

    Let Sk = [ s1, . . . , sk ] , Wk = [w1, . . . , wk ] be the associated

    matrices with orthonormal columns.

    18
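The GK recurrence above can be sketched in a few lines of numpy (a minimal illustration with a random test matrix; reorthogonalization is added for numerical safety, and the `golub_kahan` helper is not from the talk). The asserts verify the matrix formulation A^T S_k = W_k L_k^T and A W_k = S_{k+1} L_{k+}:

```python
import numpy as np

def golub_kahan(A, b, k):
    """k steps of Golub-Kahan bidiagonalization started from s_1 = b/||b||.
    Returns S_{k+1}, W_k and the coefficients alpha_1..alpha_k,
    beta_1..beta_{k+1} (stored as beta[0..k])."""
    n = A.shape[0]
    S = np.zeros((n, k + 1))
    W = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k + 1)
    beta[0] = np.linalg.norm(b)
    S[:, 0] = b / beta[0]
    for j in range(k):
        w = A.T @ S[:, j] - (beta[j] * W[:, j - 1] if j > 0 else 0.0)
        w -= W[:, :j] @ (W[:, :j].T @ w)          # reorthogonalize
        alpha[j] = np.linalg.norm(w)
        W[:, j] = w / alpha[j]
        s = A @ W[:, j] - alpha[j] * S[:, j]
        s -= S[:, :j + 1] @ (S[:, :j + 1].T @ s)  # reorthogonalize
        beta[j + 1] = np.linalg.norm(s)
        S[:, j + 1] = s / beta[j + 1]
    return S, W, alpha, beta

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
b = rng.standard_normal(10)
k = 4
S, W, alpha, beta = golub_kahan(A, b, k)

# lower bidiagonal L_k and the extended matrix L_{k+}
Lk = np.diag(alpha) + np.diag(beta[1:k], -1)
Lkp = np.vstack([Lk, beta[k] * np.eye(1, k, k - 1)])

assert np.allclose(A.T @ S[:, :k], W @ Lk.T)   # A^T S_k = W_k L_k^T
assert np.allclose(A @ W, S @ Lkp)             # A W_k = S_{k+1} L_{k+}
assert np.allclose(S.T @ S, np.eye(k + 1))     # orthonormal columns
```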

  • Denote

        L_k =
            [ α_1                ]
            [ β_2  α_2           ]
            [      ...  ...      ]
            [           β_k  α_k ] ,

        L_{k+} = [      L_k       ]
                 [ β_{k+1} e_k^T  ] ,

    the lower bidiagonal matrices containing the normalization
    coefficients. Then GK can be written in the matrix form as

        A^T S_k = W_k L_k^T ,

        A W_k = [ S_k , s_{k+1} ] L_{k+} = S_{k+1} L_{k+} .

    19

  • GK is closely related to the Lanczos tridiagonalization of the
    symmetric matrix A A^T with the starting vector s_1 = b / β_1 :

        A A^T S_k = S_k T_k + α_k β_{k+1} s_{k+1} e_k^T ,

        T_k = L_k L_k^T =
            [ α_1^2     α_1 β_2                              ]
            [ α_1 β_2   α_2^2 + β_2^2   ...                  ]
            [           ...             ...    α_{k−1} β_k   ]
            [                  α_{k−1} β_k   α_k^2 + β_k^2   ] ,

    i.e., the matrix L_k from GK represents a Cholesky factor of the
    symmetric tridiagonal matrix T_k from the Lanczos process.

    20
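Both relations can be checked numerically. A minimal sketch (numpy assumed; the random test data are illustrative) runs the plain GK recurrence and verifies the Lanczos relation for A A^T together with the Cholesky property of L_k:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# plain GK recurrence (no reorthogonalization; fine for this tiny example)
S = np.zeros((n, k + 1)); W = np.zeros((n, k))
alpha, beta = np.zeros(k), np.zeros(k + 1)
beta[0] = np.linalg.norm(b)
S[:, 0] = b / beta[0]
for j in range(k):
    w = A.T @ S[:, j] - (beta[j] * W[:, j - 1] if j > 0 else 0.0)
    alpha[j] = np.linalg.norm(w); W[:, j] = w / alpha[j]
    s = A @ W[:, j] - alpha[j] * S[:, j]
    beta[j + 1] = np.linalg.norm(s); S[:, j + 1] = s / beta[j + 1]

Lk = np.diag(alpha) + np.diag(beta[1:k], -1)
Tk = Lk @ Lk.T     # symmetric tridiagonal; L_k is its Cholesky factor

# Lanczos relation: A A^T S_k = S_k T_k + alpha_k beta_{k+1} s_{k+1} e_k^T
R = np.zeros((n, k))
R[:, k - 1] = alpha[k - 1] * beta[k] * S[:, k]
assert np.allclose(A @ (A.T @ S[:, :k]), S[:, :k] @ Tk + R)
```

Since L_k is lower triangular with positive diagonal, it is exactly the (unique) Cholesky factor that `np.linalg.cholesky(Tk)` returns.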

  • Approximation of the distribution function:

    The Lanczos tridiagonalization of the given (SPD) matrix B ∈ R^{n×n}
    generates at each step k a non-decreasing piecewise constant
    distribution function ω(k), with the nodes being the (distinct)
    eigenvalues of the Lanczos matrix T_k and the weights ω_j^{(k)} being
    the squared first entries of the corresponding normalized eigenvectors
    [Hestenes, Stiefel – 52].

    The distribution functions ω(k)(λ), k = 1, 2, . . . , represent
    Gauss-Christoffel quadrature (i.e. minimal partial realization)
    approximations of the distribution function ω(λ), [Hestenes,
    Stiefel – 52], [Fischer – 96], [Meurant, Strakoš – 06].

    21

  • Consider the SVD

        L_k = P_k Θ_k Q_k^T ,

        P_k = [ p_1^{(k)}, . . . , p_k^{(k)} ] ,   Q_k = [ q_1^{(k)}, . . . , q_k^{(k)} ] ,
        Θ_k = diag ( θ_1^{(k)}, . . . , θ_k^{(k)} ) ,

    with the singular values ordered in the increasing order,

        0 < θ_1^{(k)} < . . . < θ_k^{(k)} .

    Then T_k = L_k L_k^T = P_k Θ_k^2 P_k^T is the spectral decomposition
    of T_k ; (θ_ℓ^{(k)})^2 are its eigenvalues (the Ritz values of A A^T)
    and p_ℓ^{(k)} its eigenvectors (which determine the Ritz vectors of
    A A^T), ℓ = 1, . . . , k .

    22

  • Summarizing:

    The GK bidiagonalization generates at each step k the distribution
    function

        ω(k)(λ) with nodes (θ_ℓ^{(k)})^2 and weights ω_ℓ^{(k)} = |(p_ℓ^{(k)}, e_1)|^2

    that approximates the distribution function

        ω(λ) with nodes σ_j^2 and weights ω_j = |(b/β_1, u_j)|^2 ,

    where σ_j^2, u_j are the eigenpairs of A A^T, for j = n, . . . , 1 .

    Note that unlike the Ritz values (θ_ℓ^{(k)})^2, the squared singular
    values σ_j^2 are enumerated in descending order.

    23

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    24

  • GK starts with the normalized noisy right-hand side s1 = b / ‖b‖ .

    Consequently, vectors sj contain information about the noise.

    Can this information be used to determine the level of the noise in

    the observation vector b ?

    Consider the problem Shaw from [Hansen – Regularization Tools]
    (computed via [A,b_exact,x] = shaw(400)) with a noisy right-hand
    side (the noise was artificially added using the MATLAB function
    randn). As an example we set

        δnoise ≡ ‖bnoise‖ / ‖bexact‖ = 10^{−14} .

    25
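The shaw problem itself comes from the MATLAB Regularization Tools and is not reproduced here; the noise-adding step is simple, though. A minimal numpy sketch (the `add_white_noise` helper and the smooth stand-in right-hand side are illustrative) scales a randn vector so that ‖bnoise‖/‖bexact‖ equals a prescribed δnoise exactly:

```python
import numpy as np

def add_white_noise(b_exact, delta_noise, rng):
    """Return (b, b_noise) with ||b_noise|| / ||b_exact|| == delta_noise."""
    e = rng.standard_normal(b_exact.shape)
    b_noise = e * (delta_noise * np.linalg.norm(b_exact) / np.linalg.norm(e))
    return b_exact + b_noise, b_noise

rng = np.random.default_rng(0)
# a smooth stand-in for shaw's exact right-hand side (illustrative)
b_exact = np.sin(np.linspace(0.0, np.pi, 400))
b, b_noise = add_white_noise(b_exact, 1e-14, rng)

ratio = np.linalg.norm(b_noise) / np.linalg.norm(b_exact)
assert abs(ratio - 1e-14) < 1e-20
```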

  • Components of several bidiagonalization vectors sj

    computed via GK with double reorthogonalization:

    [Figure: componentwise plots of s_1, s_6, s_11, s_16, s_17, s_18,
    s_19, s_20, s_21, and s_22.]

    26

  • The first 80 spectral coefficients of the vectors sj

    in the basis of the left singular vectors uj of A:

    [Figure: U^T s_j for j = 1, 6, 11, 16, 17, 18, 19, 20, 21, 22.]

    27

  • Signal space – noise space diagrams:

    [Figure: transitions s_1 → s_2, s_5 → s_6, s_6 → s_7, s_10 → s_11,
    s_13 → s_14, s_15 → s_16, s_16 → s_17, s_17 → s_18, s_18 → s_19,
    s_19 → s_20, s_20 → s_21, s_21 → s_22.]

    s_k (triangle) and s_{k+1} (circle) in the signal space
    span{u_1, . . . , u_{k+1}} (horizontal axis) and the noise space
    span{u_{k+2}, . . . , u_n} (vertical axis).

    28

  • The noise is amplified with the ratio α_k / β_{k+1} :

    GK for the spectral components:

        α_1 (V^T w_1) = Σ (U^T s_1) ,
        β_2 (U^T s_2) = Σ (V^T w_1) − α_1 (U^T s_1) ,

    and for k = 2, 3, . . .

        α_k (V^T w_k) = Σ (U^T s_k) − β_k (V^T w_{k−1}) ,
        β_{k+1} (U^T s_{k+1}) = Σ (V^T w_k) − α_k (U^T s_k) .

    29

  • Since dominance in Σ (U^T s_k) and (V^T w_{k−1}) is shifted by one
    component, one cannot expect a significant cancellation in

        α_k (V^T w_k) = Σ (U^T s_k) − β_k (V^T w_{k−1}) ,

    and therefore

        α_k ≈ β_k .

    Whereas Σ (V^T w_k) and (U^T s_k) do exhibit dominance in the
    direction of the same components. If this dominance is strong enough,
    then the required orthogonality of s_{k+1} and s_k in

        β_{k+1} (U^T s_{k+1}) = Σ (V^T w_k) − α_k (U^T s_k)

    cannot be achieved without a significant cancellation, and one can
    expect

        β_{k+1} ≪ α_k .

    30

  • Absolute values of the first 25 components of Σ (V^T w_k),
    α_k (U^T s_k), and β_{k+1} (U^T s_{k+1}) for k = 7,
    β_8/α_7 = 0.0524 (left) and for k = 12, β_13/α_12 = 0.677 (right).

    [Figure: the two semilogarithmic plots.]

    Shaw with the noise level δnoise = 10^{−14}.

    31

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    32

  • Depending on the noise level, the smaller nodes of ω(λ) are
    completely dominated by noise, i.e., there exists an index Jnoise such
    that for j ≥ Jnoise

        |(b/β_1, u_j)|^2 ≈ |(bnoise/β_1, u_j)|^2 ,

    and the weight of the set of the associated nodes is given by

        δ^2 ≡ Σ_{j=Jnoise}^{n} |(bnoise/β_1, u_j)|^2 .

    Recall that the large nodes σ_1^2, σ_2^2, . . . are well-separated
    (relative to the small ones) and their weights on average decrease
    faster than σ_1^2, σ_2^2, see (DPC). Therefore the large nodes
    essentially control the behavior of the early stages of the Lanczos
    process.

    33

  • At any iteration step, the weight corresponding to the smallest node
    (θ_1^{(k)})^2 must be larger than the sum of the weights of all σ_j^2
    smaller than this (θ_1^{(k)})^2, see [Fischer, Freund – 94]. As k
    increases, (θ_1^{(k)})^2 eventually approaches (or becomes smaller
    than) the node σ_{Jnoise}^2, and its weight becomes

        |(p_1^{(k)}, e_1)|^2 ≈ δ^2 ≈ δnoise^2 .

    The weight |(p_1^{(k)}, e_1)|^2 corresponding to the smallest Ritz
    value (θ_1^{(k)})^2 is strictly decreasing. At some iteration step it
    sharply starts to (almost) stagnate on the level close to the squared
    noise level δnoise^2.

    34
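The resulting estimator needs only the bidiagonal coefficients. The following sketch is illustrative, not the talk's exact procedure: the synthetic DPC-satisfying test problem and the simple stagnation heuristic (first step where the decrease of the weight nearly stops) are assumptions made here for a self-contained example. For each k, take the SVD of L_k and record the first component of the left singular vector belonging to the smallest singular value; the level where |(p_1^{(k)}, e_1)| stagnates estimates δnoise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kmax, delta_noise = 32, 25, 1e-6

# synthetic ill-posed problem satisfying the DPC (illustrative):
# singular values decay geometrically, |(b_exact, u_j)| decays faster
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 10.0 ** (-12 * np.arange(n) / (n - 1))
A = U @ np.diag(sigma) @ V.T
b_exact = U @ sigma ** 2
e = rng.standard_normal(n)
b = b_exact + e * (delta_noise * np.linalg.norm(b_exact) / np.linalg.norm(e))

# GK bidiagonalization with full reorthogonalization
S = np.zeros((n, kmax + 1)); W = np.zeros((n, kmax))
alpha, beta = np.zeros(kmax), np.zeros(kmax + 1)
beta[0] = np.linalg.norm(b); S[:, 0] = b / beta[0]
for j in range(kmax):
    w = A.T @ S[:, j] - (beta[j] * W[:, j - 1] if j > 0 else 0.0)
    w -= W[:, :j] @ (W[:, :j].T @ w)
    alpha[j] = np.linalg.norm(w); W[:, j] = w / alpha[j]
    s = A @ W[:, j] - alpha[j] * S[:, j]
    s -= S[:, :j + 1] @ (S[:, :j + 1].T @ s)
    beta[j + 1] = np.linalg.norm(s); S[:, j + 1] = s / beta[j + 1]

# |(p_1^{(k)}, e_1)| for k = 1, ..., kmax
w1 = []
for k in range(1, kmax + 1):
    Lk = np.diag(alpha[:k]) + np.diag(beta[1:k], -1)
    P, _, _ = np.linalg.svd(Lk)    # singular values in decreasing order
    w1.append(abs(P[0, -1]))       # first entry of p_1^{(k)} (smallest theta)
w1 = np.array(w1)

# stagnation heuristic (an assumption): first step where the decrease
# of |(p_1^{(k)}, e_1)| nearly stops
ratios = w1[1:] / w1[:-1]
k_noise = int(np.argmax(ratios > 0.7)) + 1
delta_estimate = w1[k_noise]
assert 1e-8 < delta_estimate < 1e-4   # close to the true delta_noise = 1e-6
```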

  • Square roots of the weights |(p_1^{(k)}, e_1)|^2, k = 1, 2, . . . :

    [Figure: the values versus the iteration number k, with the level
    δnoise marked; k_noise = 18 − 1 = 17.]

    Shaw with the noise level δnoise = 10^{−14}.

    35

  • Square roots of the weights |(p_1^{(k)}, e_1)|^2, k = 1, 2, . . . :

    [Figure: the values versus the iteration number k, with the level
    δnoise marked; k_noise = 8 − 1 = 7.]

    Shaw with the noise level δnoise = 10^{−4}.

    36

  • The smallest node and weight in approximation of ω(λ):

    [Figure: weight of the leftmost node. Plotted: ω(λ) with nodes σ_j^2
    and weights |(b/β_1, u_j)|^2; the approximations with nodes
    (θ_1^{(k)})^2 and weights |(p_1^{(k)}, e_1)|^2; and the level δnoise^2.]

    37

  • The smallest node and weight in approximation of ω(λ):

    [Figure: weight of the leftmost node, for the larger noise level.
    Plotted: ω(λ) with nodes σ_j^2 and weights |(b/β_1, u_j)|^2; the
    approximations with nodes (θ_1^{(k)})^2 and weights
    |(p_1^{(k)}, e_1)|^2; and the level δnoise^2.]

    38

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    39

  • Image deblurring problem, image size 324 × 470 pixels,
    problem dimension n = 152280, the exact solution (left) and
    the noisy right-hand side (right), δnoise = 3 × 10^{−3}.

    [Figure: xexact (left) and bexact + bnoise (right).]

    40

  • Square roots of the weights |(p_1^{(k)}, e_1)|^2, k = 1, 2, . . . (top)
    and error history of LSQR solutions (bottom):

    [Figure, top: |(p_1^{(k)}, e_1)| versus k, with the level
    δnoise = ‖bnoise‖ / ‖bexact‖ = 0.003 marked. Bottom: the error history
    ‖x_k^{LSQR} − xexact‖, with the minimal error at k = 41.]

    41

  • The best LSQR reconstruction (left), x_41^{LSQR},
    and the corresponding componentwise error (right).
    GK without any reorthogonalization!

    [Figure: the LSQR reconstruction with minimal error, x_41^{LSQR}, and
    the error of the best LSQR reconstruction, |xexact − x_41^{LSQR}|.]

    42

  • Outline

    1. Problem formulation

    2. Golub-Kahan iterative bidiagonalization, Lanczos tridiagonalization,
       and the approximation of the Riemann-Stieltjes distribution function

    3. Propagation of the noise in the Golub-Kahan bidiagonalization

    4. Determination of the noise level

    5. Numerical illustration

    6. Summary and future work

    43

  • Message:

    Using GK, information about the noise can be obtained in a
    straightforward way.

    Future work:

    • Large scale problems;

    • Behavior in finite precision arithmetic

    (GK without reorthogonalization);

    • Regularization;

    • Denoising;

    • Colored noise.

    44

  • References

    • Golub, Kahan: Calculating the singular values and pseudoinverse of
      a matrix, SIAM J. Numer. Anal. Ser. B 2, 1965.

    • Hansen: Rank-deficient and discrete ill-posed problems, SIAM
      Monographs Math. Modeling Comp., 1998.

    • Hansen, Kilmer, Kjeldsen: Exploiting residual information in the
      parameter choice for discrete ill-posed problems, BIT, 2006.

    • Hnětynková, Strakoš: Lanczos tridiag. and core problem, LAA, 2007.

    • Meurant, Strakoš: The Lanczos and CG algorithms in finite precision
      arithmetic, Acta Numerica, 2006.

    • Paige, Strakoš: Core problem in linear algebraic systems, SIMAX,
      2006.

    • Rust: Truncating the SVD for ill-posed problems, Technical Report,
      1998.

    • Rust, O'Leary: Residual periodograms for choosing regularization
      parameters for ill-posed problems, Inverse Problems, 2008.

    • Hnětynková, Plešinger, Strakoš: The regularizing effect of the
      Golub-Kahan iterative bidiagonalization and revealing the noise
      level, BIT, 2009.

    • ...

    45

  • Main message:

    Whenever you see a blurred elephant which is a bit too noisy,

    the best thing is to apply the GK iterative bidiagonalization.

    Full version of the talk can be found at

    www.cs.cas.cz/strakos

    46

  • Thank you for your kind attention!

    47