
Statistical Inverse Problems,

Model Reduction and

Inverse Crimes

Erkki Somersalo, Helsinki University of Technology, Finland

Firenze, March 22–26, 2004

CONTENTS OF THE LECTURES

1. Statistical inverse problems: A brief review

2. Model reduction, discretization invariance

3. Inverse crimes

Material based on the forthcoming book

Jari Kaipio and Erkki Somersalo: Computational and Statistical Inverse Problems. Springer-Verlag (2004)

STATISTICAL INVERSE PROBLEMS

Bayesian paradigm, or “subjective probability”:

1. All variables are random variables

2. The randomness reflects the subject's uncertainty about the actual values

3. The uncertainty is encoded into probability distributions of the variables

Notation: Random variables X, Y, E, etc.

Realizations: If X : Ω → R^n, we denote

X(ω) = x ∈ R^n.

Probability densities:

P(X ∈ B) = ∫_B π_X(x) dx = ∫_B π(x) dx.

Hierarchy of the variables:

1. Unobservable variables of primary interest, X

2. Unobservable variables of secondary interest, E

3. Observable variables, Y

Example: Linear inverse problem with additive noise,

y = Ax + e, A ∈ R^{m×n}.

Stochastic extension: Y = AX + E.

Conditioning: Joint probability density of X and Y:

P(X ∈ A, Y ∈ B) = ∫_{A×B} π(x, y) dx dy.

Marginal densities:

P(X ∈ A) = P(X ∈ A, Y ∈ R^m) = ∫_{A×R^m} π(x, y) dx dy,

in other words,

π(x) = ∫_{R^m} π(x, y) dy.

Conditional probability:

P(X ∈ A | Y ∈ B) = ∫_{A×B} π(x, y) dx dy / ∫_B π(y) dy.

Shrink B into a single point y:

P(X ∈ A | Y = y) = ∫_A (π(x, y)/π(y)) dx = ∫_A π(x | y) dx,

where

π(x | y) = π(x, y)/π(y), or π(x, y) = π(x | y) π(y).

Bayesian solution of an inverse problem: Given a measurement y = y_observed

of the observable variable Y, find the posterior density of X,

π_post(x) = π(x | y_observed).

The prior density π_pr(x) expresses all prior information independent of the measurement.

The likelihood density π(y | x) is the likelihood of a measurement outcome y given x.

Bayes formula:

π(x | y) = π_pr(x) π(y | x) / π(y).

Three steps of Bayesian inversion:

1. Construct the prior density

2. Construct the likelihood density

3. Extract useful information from the posterior density

Example: Linear model with additive noise,

Y = AX + E,

where the noise density π_noise is known. Fixing X = x yields

π(y | x) = π_noise(y − Ax),

and so π(x | y) ∝ π_pr(x) π_noise(y − Ax).

Assume that X and E are mutually independent and Gaussian,

X ∼ N(x_0, Γ_pr), E ∼ N(0, Γ_e),

where Γ_pr ∈ R^{n×n} and Γ_e ∈ R^{m×m} are symmetric positive (semi)definite.

π_pr(x) ∝ exp(−(1/2)(x − x_0)^T Γ_pr^{-1} (x − x_0)),

π(y | x) ∝ exp(−(1/2)(y − Ax)^T Γ_e^{-1} (y − Ax)).

From Bayes formula, the posterior density is Gaussian,

π(x | y) ∼ N(x∗, Γ_post),

where

x∗ = x_0 + Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} (y − A x_0),

Γ_post = Γ_pr − Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} A Γ_pr.
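As a quick numerical illustration (not from the lecture), here is a minimal NumPy sketch of these two formulas on a small synthetic problem; the forward matrix A, the dimensions and the covariances are invented for the example.

import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 15                          # unknowns, measurements (arbitrary sizes)
A = rng.standard_normal((m, n))        # hypothetical forward matrix

x0 = np.zeros(n)                       # prior mean
Gamma_pr = np.eye(n)                   # prior covariance
Gamma_e = 0.01 * np.eye(m)             # noise covariance

x_true = rng.multivariate_normal(x0, Gamma_pr)
y = A @ x_true + rng.multivariate_normal(np.zeros(m), Gamma_e)

# Posterior mean x* and covariance Gamma_post of the Gaussian linear model
S = A @ Gamma_pr @ A.T + Gamma_e                     # A Γ_pr A^T + Γ_e (m x m)
x_star = x0 + Gamma_pr @ A.T @ np.linalg.solve(S, y - A @ x0)
Gamma_post = Gamma_pr - Gamma_pr @ A.T @ np.linalg.solve(S, A @ Gamma_pr)

print(x_star[:5])
print(np.diag(Gamma_post)[:5])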

Special case: Assume that

x_0 = 0, Γ_pr = γ²I, Γ_e = σ²I.

In this case,

x∗ = A^T (A A^T + α²I)^{-1} y, α = σ/γ,

known as the Wiener filtered solution (m × m problem), or, equivalently,

x∗ = (A^T A + α²I)^{-1} A^T y,

which is the Tikhonov regularized solution (n × n problem).

Engineering rule of thumb: If n < m, use Tikhonov; if m < n, use Wiener.

(In practice, A^T A or A A^T should often not be calculated.)
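The equivalence of the two forms is easy to check numerically; a minimal sketch with an invented A follows (at this small scale the products A A^T and A^T A are formed directly, which the remark above warns against for large problems).

import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 80                  # more unknowns than data, so the m x m form is cheaper
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
alpha = 0.5

x_wiener   = A.T @ np.linalg.solve(A @ A.T + alpha**2 * np.eye(m), y)   # m x m system
x_tikhonov = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ y)   # n x n system

print(np.allclose(x_wiener, x_tikhonov))   # True: the two forms agree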

Frequently asked question: How do you determine α?

Bayesian paradigm: Either

1. You know γ and σ; then α = σ/γ,

or

2. You don’t know them; make them part of the estimation problem.

This is the empirical Bayes approach.

Example: If γ in the previous example is unknown, write

π_pr(x | γ) ∝ (1/γ^n) exp(−‖x‖²/(2γ²)),

and write

π_pr(x, γ) = π_pr(x | γ) π_h(γ),

where π_h is a hyperprior or hierarchical prior.

Determine π(x, γ | y).
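A minimal sketch of this idea for the linear Gaussian model: the hyperprior π_h(γ) below is an invented choice, and the joint MAP estimate is found by profiling over γ on a grid rather than by any method prescribed in the lecture.

import numpy as np

rng = np.random.default_rng(2)
m, n = 40, 60
A = rng.standard_normal((m, n))
sigma = 0.05
x_true = rng.standard_normal(n)
y = A @ x_true + sigma * rng.standard_normal(m)

def neg_log_posterior(x, gamma, a=1.0, b=1.0):
    # -log pi(x, gamma | y) up to an additive constant.
    # The hyperprior pi_h(gamma) ~ gamma^(a-1) exp(-b*gamma) is a made-up choice.
    misfit = np.sum((y - A @ x) ** 2) / (2 * sigma**2)
    prior  = np.sum(x ** 2) / (2 * gamma**2) + n * np.log(gamma)
    hyper  = -(a - 1) * np.log(gamma) + b * gamma
    return misfit + prior + hyper

# For fixed gamma the x-minimizer is the Tikhonov solution with alpha = sigma/gamma;
# scan gamma on a grid and keep the pair (x, gamma) with the smallest value.
best = None
for gamma in np.geomspace(0.1, 10.0, 50):
    alpha = sigma / gamma
    x = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ y)
    val = neg_log_posterior(x, gamma)
    if best is None or val < best[0]:
        best = (val, gamma, x)

print("estimated gamma:", best[1])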

BAYESIAN ESTIMATION

Classical inversion methods produce estimates of the unknown.

In contrast, the Bayesian approach produces a probability density that can be used

• to produce estimates,

• to assess the quality of estimates (statistical and classical).

Example: Conditional mean (CM) and maximum a posteriori (MAP) estimates:

x_CM = ∫_{R^n} x π(x | y) dx,

x_MAP = arg max π(x | y).

Calculating the MAP estimate is an optimization problem, the CM estimate an integration problem.

Monte Carlo integration: If n is large, quadrature methods are not feasible.

MC methods: Assume that we have a sample

S = {x_1, x_2, . . . , x_N}, x_j ∈ R^n.

Write

x_CM = ∫_{R^n} x π(x | y) dx ≈ Σ_{j=1}^N w_j x_j,

where w_j ∝ π(x_j | y), normalized so that Σ_j w_j = 1.

Importance sampling: Generate the sample S randomly.

Simple but inefficient (in particular when n is large).

A better idea: Generate the sample using the density π(x | y).

Ideal case: The points x_j are distributed according to the density π(x | y), and

x_CM = ∫_{R^n} x π(x | y) dx ≈ (1/N) Σ_{j=1}^N x_j.
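A toy sketch of the two sampling strategies, with a two-dimensional Gaussian standing in for π(x | y); the density, the box used for the uniform proposal and the sample sizes are all invented.

import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -0.5])             # toy 2-D "posterior": N(mu, Sigma)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def post(x):                           # unnormalized posterior density pi(x | y)
    d = x - mu
    return np.exp(-0.5 * d @ Sigma_inv @ d)

N = 20000

# (a) Importance sampling with a uniform proposal over a box
xs = rng.uniform(-4, 6, size=(N, 2))
w = np.array([post(x) for x in xs])
w /= w.sum()                           # normalize the weights
cm_weighted = (w[:, None] * xs).sum(axis=0)

# (b) Ideal case: points drawn from the posterior itself, plain average
xs_post = rng.multivariate_normal(mu, Sigma, size=N)
cm_direct = xs_post.mean(axis=0)

print(cm_weighted, cm_direct)          # both should be close to mu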

Markov chain Monte Carlo methods (MCMC): Generate the sample sequentially,

x^0 → x^1 → . . . → x^j → x^{j+1} → . . . → x^N.

Idea: Define a transition probability P(x^j, B_{j+1}),

P(x^j, B_{j+1}) = P(X^{j+1} ∈ B_{j+1} | X^j = x^j).

Assuming that X^j has probability density π_j(x^j),

P(X^{j+1} ∈ B_{j+1}) = ∫_{R^n} P(x^j, B_{j+1}) π_j(x^j) dx^j = π_{j+1}(B_{j+1}).

Choose the transition kernel so that π(x | y) is the invariant measure:

∫_B π(x | y) dx = ∫_{R^n} P(x′, B) π(x′ | y) dx′.

Then all the variables X^j are distributed according to π(x | y).

Best known algorithms:

Metropolis-Hastings, Gibbs sampler.
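A minimal random-walk Metropolis-Hastings sketch for a toy two-dimensional Gaussian posterior; the target density, step size and chain length are invented for illustration.

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.5, 1.0])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def log_post(x):                       # log pi(x | y) up to an additive constant
    d = x - mu
    return -0.5 * d @ Sigma_inv @ d

N, step = 50000, 0.5
chain = np.empty((N, 2))
x = np.zeros(2)
lp = log_post(x)
accepted = 0
for j in range(N):
    x_prop = x + step * rng.standard_normal(2)     # symmetric random-walk proposal
    lp_prop = log_post(x_prop)
    if np.log(rng.uniform()) < lp_prop - lp:       # Metropolis acceptance rule
        x, lp = x_prop, lp_prop
        accepted += 1
    chain[j] = x

burn = N // 10                                     # discard a burn-in period
print("acceptance rate:", accepted / N)
print("CM estimate:", chain[burn:].mean(axis=0))   # should be close to mu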


Gibbs sampler: Update one component at a time as follows:

Given x^j = [x^j_1, x^j_2, . . . , x^j_n],

draw x^{j+1}_1 from t ↦ π(t, x^j_2, . . . , x^j_n | y),

draw x^{j+1}_2 from t ↦ π(x^{j+1}_1, t, x^j_3, . . . , x^j_n | y),

...

draw x^{j+1}_n from t ↦ π(x^{j+1}_1, x^{j+1}_2, . . . , x^{j+1}_{n−1}, t | y).
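A minimal Gibbs sampler sketch for a toy correlated two-dimensional Gaussian posterior, where both one-dimensional conditionals are known in closed form; all numbers are invented.

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, 2.0])
s1, s2, rho = 1.0, 0.5, 0.9            # toy correlated 2-D Gaussian posterior

N = 20000
chain = np.empty((N, 2))
x1, x2 = 0.0, 0.0
for j in range(N):
    # draw x1 from pi(x1 | x2, y): a 1-D Gaussian conditional
    m1 = mu[0] + rho * (s1 / s2) * (x2 - mu[1])
    x1 = rng.normal(m1, s1 * np.sqrt(1 - rho**2))
    # draw x2 from pi(x2 | x1, y)
    m2 = mu[1] + rho * (s2 / s1) * (x1 - mu[0])
    x2 = rng.normal(m2, s2 * np.sqrt(1 - rho**2))
    chain[j] = (x1, x2)

print(chain[N // 10:].mean(axis=0))    # close to mu after burn-in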


Define a cost function Ψ : R^n × R^n → R.

The Bayes cost of an estimator x̂ = x̂(y) is defined as

B(x̂) = E[Ψ(X, x̂(Y))] = ∫∫ Ψ(x, x̂(y)) π(x, y) dx dy.

Further, we can write

B(x̂) = ∫ (∫ Ψ(x, x̂) π(y | x) dy) π_pr(x) dx = ∫ B(x̂ | x) π_pr(x) dx = E[B(x̂ | x)],

where

B(x̂ | x) = ∫ Ψ(x, x̂) π(y | x) dy

is the conditional Bayes cost.

The Bayes cost method: Fix Ψ and define the estimator x̂_B so that

B(x̂_B) ≤ B(x̂)

for all estimators x̂ of x.

By Bayes formula,

B(x̂) = ∫ (∫ Ψ(x, x̂) π(x | y) dx) π(y) dy.

Since π(y) ≥ 0 and x̂(y) depends only on y,

x̂_B(y) = arg min ∫ Ψ(x, x̂) π(x | y) dx = arg min E[Ψ(X, x̂) | y].

Mean square error criterion: Choose Ψ(x, x̂) = ‖x − x̂‖², giving

B(x̂) = E[‖X − X̂‖²] = trace(corr(X − X̂)),

where X̂ = x̂(Y), and

corr(X − X̂) = E[(X − X̂)(X − X̂)^T] ∈ R^{n×n}.

This Bayes estimator is called the mean square estimator x̂_MS. We have

x̂_MS = ∫ x π(x | y) dx = x_CM.

We have

E[‖X − x̂‖² | y] = E[‖X‖² | y] − 2 E[X | y]^T x̂ + ‖x̂‖²

= E[‖X‖² | y] − ‖E[X | y]‖² + ‖E[X | y] − x̂‖²

≥ E[‖X‖² | y] − ‖E[X | y]‖²,

and equality holds only if

x̂(y) = E[X | y] = x_CM.

Furthermore,

E[X − x_CM] = E[X − E[X | y]] = 0.

Question: x_CM is optimal, but is it informative?

[Figure: two one-dimensional posterior densities, panels (a) and (b), with the CM and MAP estimates marked.]

No estimate is foolproof. Optimality is subjective.

DISCRETIZED MODELS

Consider a linear model with additive noise,

y = Af + e, f ∈ H, y, e ∈ R^m.

Discretization, e.g. by collocation,

x^n = [f(p_1); f(p_2); . . . ; f(p_n)] ∈ R^n,

Af ≈ A^n x^n, A^n ∈ R^{m×n}.

Assume that the discretization scheme is convergent,

lim_{n→∞} ‖Af − A^n x^n‖ = 0.

Accurate discrete model:

y = A^N x^N + e, ‖A^N x^N − Af‖ < tol.

Stochastic extension: Y = A^N X^N + E,

where Y, X^N and E are random variables.

Passing into a coarse mesh. Possible reasons:

1. 2D and 3D applications, problems too large

2. Real time applications

3. Inverse modelling based on prescribed meshing

Coarse mesh model with n < N,

Af ≈ A^n x^n, ‖A^n x^n − Af‖ > tol.

Stochastic extension of the simple reduced model is

Y = A^n X^n + E.

Inverse crime:

• Write

Y = A^n X^n + E, (1)

and develop the inversion scheme based on this model,

• generate data with the simple reduced model and test the inversion method with this data.

Usually, inverse crime results are overly optimistic.

Questions:

1. How to model the discretization error?

2. How to model the prior information?

3. Is the inverse crime always significant?

PRIOR MODELLING

Assume a Gaussian model,

X^N ∼ N(x^N_0, Γ^N),

i.e., the prior density is

π_pr(x^N) ∝ exp(−(1/2)(x^N − x^N_0)^T (Γ^N)^{-1} (x^N − x^N_0)).

Projection (e.g. interpolation, averaging or downsampling),

P : R^N → R^n, X^N ↦ X^n.

Then,

E[X^n] = E[P X^N] = P E[X^N] = P x^N_0,

E[X^n (X^n)^T] = E[P X^N (X^N)^T P^T] = P E[X^N (X^N)^T] P^T,

and therefore,

X^n ∼ N(x^n_0, Γ^n) = N(P x^N_0, P Γ^N P^T).

However, this is not what we normally do!

Example: H = continuous functions on [0, 1].

Discretization by multiresolution bases. Let

ϕ(t) = 1 if 0 ≤ t < 1, and ϕ(t) = 0 if t < 0 or t ≥ 1.

Define V^j, 0 ≤ j < ∞, V^j ⊂ V^{j+1},

V^j = span{ϕ^j_k | 1 ≤ k ≤ 2^j},

where

ϕ^j_k(t) = 2^{j/2} ϕ(2^j t − k + 1).

Discrete representation,

f^j(t) = Σ_{k=1}^{2^j} x^j_k ϕ^j_k(t) ∈ V^j.

Projector P : x^j ↦ x^{j−1},

P = I_{j−1} ⊗ e_1 = (1/√2) [ 1 1 0 0 . . . 0 0
                             0 0 1 1 . . . 0 0
                             . . .
                             0 0 0 0 . . . 1 1 ] ∈ R^{2^{j−1} × 2^j}.

Assume the prior information f ∈ C²_0([0, 1]).

Second order smoothness prior of X^N, N = 2^j:

π_pr(x^N) ∝ exp(−(α/2)‖L^N x^N‖²) = exp(−(1/2)(x^N)^T [α (L^N)^T L^N] x^N),

where

L^N = 2^{2j} [ −2  1  0 . . . 0
                1 −2  1
                0  1 −2
                          . . .  1
                0  . . .  1 −2 ] ∈ R^{N×N}.

The prior covariance is

Γ^N = [α (L^N)^T L^N]^{-1}.

Passing to level n = 2^{j−1} = N/2:

L^n = 2^{2(j−1)} [ −2  1  0 . . . 0
                    1 −2  1
                    0  1 −2
                              . . .  1
                    0  . . .  1 −2 ] = P L^N P^T ∈ R^{n×n}.

A natural candidate for the smoothness prior for X^n is

π_pr(x^n) ∝ exp(−(α/2)‖L^n x^n‖²) = exp(−(1/2)(x^n)^T [α (L^n)^T L^n] x^n).

But this is inconsistent, since

[α (L^n)^T L^n]^{-1} ≠ P [α (L^N)^T L^N]^{-1} P^T = P Γ^N P^T = Γ^n.
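A small numerical check of this inconsistency; the pairwise-averaging projector P below is an assumed form consistent with the Haar construction above, and the sizes are chosen only for illustration.

import numpy as np

def L_matrix(j):
    # Second difference matrix L = 2^(2j) * tridiag(1, -2, 1) of size 2^j
    N = 2 ** j
    L = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
    return 2.0 ** (2 * j) * L

alpha, j = 1.0, 5
N, n = 2 ** j, 2 ** (j - 1)

LN, Ln = L_matrix(j), L_matrix(j - 1)
Gamma_N = np.linalg.inv(alpha * LN.T @ LN)
Gamma_n_naive = np.linalg.inv(alpha * Ln.T @ Ln)        # smoothness prior built on the coarse mesh

P = np.kron(np.eye(n), np.ones((1, 2))) / np.sqrt(2)    # pairwise-averaging projector R^N -> R^n
Gamma_n_proj = P @ Gamma_N @ P.T                        # covariance of the projected variable

# The two coarse-mesh covariances differ, which is the inconsistency noted above
print(np.max(np.abs(Gamma_n_naive - Gamma_n_proj)))
print(Gamma_n_naive[0, 0], Gamma_n_proj[0, 0])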

Numerical example:

Af(t) = ∫_0^1 K(t − s) f(s) ds, K(s) = e^{−κs²},

where κ = 15. Sampling:

y_j = Af(t_j) + e_j, t_j = (j − 1/2)/50, 1 ≤ j ≤ 50,

and

E ∼ N(0, σ²I), σ = 2% of max_j(Af(t_j)).

Smoothness prior

π_pr(x^N) ∝ exp(−(α/2)‖L^N x^N‖²), N = 512.

Reduced model with n = 8.
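A sketch of how the fine- and coarse-mesh forward matrices for this convolution model might be assembled; the midpoint quadrature is an assumption, since the slides do not specify the discretization rule.

import numpy as np

kappa, m = 15.0, 50
t = (np.arange(1, m + 1) - 0.5) / m          # measurement points t_j = (j - 1/2)/50

def forward_matrix(n):
    # Midpoint-rule collocation of Af(t) = int K(t - s) f(s) ds on n cells
    s = (np.arange(1, n + 1) - 0.5) / n      # cell midpoints
    return np.exp(-kappa * (t[:, None] - s[None, :]) ** 2) / n

A_N = forward_matrix(512)                    # "accurate" fine-mesh model
A_n = forward_matrix(8)                      # reduced coarse-mesh model
print(A_N.shape, A_n.shape)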

Figure 1: MAP estimates with N = 512, n = 8, computed with the two coarse-mesh prior covariances [α (L^n)^T L^n]^{-1} and Γ^n = P Γ^N P^T (black and red dots).

DISCRETIZATION ERROR

From fine mesh to coarse mesh: Complete error model

Y = A^N X^N + E (2)

= A^n X^n + (A^N − A^n P) X^N + E

= A^n X^n + E_discr + E.

Error covariance: Assume that E and X^N are mutually independent,

E ∼ N(0, Γ_e), X^N ∼ N(x^N_0, Γ^N).

The complete error Ẽ = E_discr + E is Gaussian,

Ẽ ∼ N(e_0, Γ̃_e),

where

e_0 = (A^N − A^n P) x^N_0,

Γ̃_e = (A^N − A^n P) Γ^N (A^N − A^n P)^T + Γ_e.
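A minimal sketch of these error statistics for a small instance of the convolution example above; the prior covariance Γ^N, the noise level and the cell-averaging projector P are placeholder choices.

import numpy as np

m, N, n = 20, 64, 8                        # sizes chosen only for illustration
kappa = 15.0
t = (np.arange(1, m + 1) - 0.5) / m

def forward(ncells):
    s = (np.arange(1, ncells + 1) - 0.5) / ncells
    return np.exp(-kappa * (t[:, None] - s[None, :]) ** 2) / ncells

A_N, A_n = forward(N), forward(n)
P = np.kron(np.eye(n), np.ones((1, N // n))) * (n / N)   # cell-averaging projector (assumed)

x0_N = np.zeros(N)
Gamma_N = np.eye(N)                        # placeholder prior covariance of X^N
Gamma_e = 1e-4 * np.eye(m)                 # measurement noise covariance

M = A_N - A_n @ P                          # discretization error operator A^N - A^n P
e0 = M @ x0_N                              # mean of the complete error
Gamma_e_tilde = M @ Gamma_N @ M.T + Gamma_e

var_discr = np.trace(M @ Gamma_N @ M.T)
var_noise = np.trace(Gamma_e)
print("noise dominated" if var_discr < var_noise else "modelling error dominated")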

Error variance:

var(Ẽ) = E[‖Ẽ − e_0‖²] = E[‖E_discr − e_0‖²] + E[‖E‖²]

= trace((A^N − A^n P) Γ^N (A^N − A^n P)^T) + trace(Γ_e)

= var(E_discr) + var(E).

The complete error model is noise dominated if

var(E_discr) < var(E),

and modelling error dominated if

var(E_discr) > var(E).

Enhanced error model: Use the likelihood and prior

π(y | x^n) ∝ exp(−(1/2)(y − A^n x^n − y_0)^T Γ̃_e^{-1} (y − A^n x^n − y_0)),

π_pr(x^n) ∝ exp(−(1/2)(x^n − x^n_0)^T (Γ^n_pr)^{-1} (x^n − x^n_0)),

where

y_0 = E[Y] = A^n E[X^n] + e_0 = A^n P x^N_0 + (A^N − A^n P) x^N_0 = A^N x^N_0.

The MAP estimate, denoted by x^n_eem, is

x^n_eem = arg min { ‖L^n_pr (x^n − x^n_0)‖² + ‖L_e (A^n x^n − (y − y_0))‖² }

= arg min ‖ [ L^n_pr ; L_e A^n ] x^n − [ L^n_pr x^n_0 ; L_e (y − y_0) ] ‖²,

where L^n_pr and L_e are Cholesky factors of (Γ^n_pr)^{-1} and Γ̃_e^{-1}, respectively.

This leads to a normal equation of size n × n.
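A sketch of this stacked least-squares formulation; the fine-mesh forward map, the coarse model A^n and all covariances below are placeholder choices, so the numbers only illustrate the mechanics.

import numpy as np

rng = np.random.default_rng(7)
m, N, n = 20, 64, 8
A_N = rng.standard_normal((m, N)) / N          # stand-in fine-mesh forward map
P = np.kron(np.eye(n), np.ones((1, N // n))) * (n / N)
A_n = A_N @ np.linalg.pinv(P)                  # crude coarse-mesh model, for illustration only

x0_N = np.zeros(N)
Gamma_N = np.eye(N)
Gamma_e = 1e-3 * np.eye(m)
M = A_N - A_n @ P
Gamma_e_tilde = M @ Gamma_N @ M.T + Gamma_e    # enhanced error covariance
y0 = A_N @ x0_N                                # E[Y]

x_true = rng.multivariate_normal(x0_N, Gamma_N)
y = A_N @ x_true + rng.multivariate_normal(np.zeros(m), Gamma_e)

Gamma_n_pr = P @ Gamma_N @ P.T                 # coarse-mesh prior covariance
x0_n = P @ x0_N
L_pr = np.linalg.cholesky(np.linalg.inv(Gamma_n_pr)).T     # L_pr^T L_pr = (Gamma_n_pr)^-1
L_e = np.linalg.cholesky(np.linalg.inv(Gamma_e_tilde)).T   # L_e^T L_e = Gamma_e_tilde^-1

# Stacked least-squares form of the MAP estimate (an n x n normal equation)
K = np.vstack([L_pr, L_e @ A_n])
b = np.concatenate([L_pr @ x0_n, L_e @ (y - y0)])
x_eem, *_ = np.linalg.lstsq(K, b, rcond=None)
print(x_eem)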

Note: The enhanced error model is not the complete error model, because X^n is correlated with the complete error Ẽ through X^N.

Complete error model: Assume, for a while, that X^n and Y are zero mean. We have

X^n = P X^N, Y = A^N X^N + E.

The variable Z = [X^n; Y] is Gaussian, with covariance

E[Z Z^T] = [ E[X^n (X^n)^T]  E[X^n Y^T]
             E[Y (X^n)^T]    E[Y Y^T]   ]

         = [ P Γ^N P^T       P Γ^N (A^N)^T
             A^N Γ^N P^T     A^N Γ^N (A^N)^T + Γ_e ].

From this, calculate the conditional density π(x^n | y):

π(x^n | y) ∼ N(x^n_cem, Γ^n_cem),

where

x^n_cem = P x^N_0 + P Γ^N_pr (A^N)^T [A^N Γ^N_pr (A^N)^T + Γ_e]^{-1} (y − A^N x^N_0),

and

Γ^n_cem = P Γ^N_pr P^T − P Γ^N_pr (A^N)^T [A^N Γ^N_pr (A^N)^T + Γ_e]^{-1} A^N Γ^N_pr P^T.

Note: The computation of x^n_cem requires solving an m × m system, independently of n. (Compare to x^n_eem.)

Example: Full angle tomography.

Figure 2: True object and the discretized model, with the X-ray source and detector indicated.

Intensity decrease along a line segment dℓ:

dI = −I µ dℓ,

where µ = µ(p) ≥ 0, p ∈ Ω, is the mass absorption.

Let I_0 be the intensity of the transmitted X-ray. The received intensity I satisfies

log(I/I_0) = ∫_{I_0}^{I} dI/I = −∫_ℓ µ(p) dℓ(p).

Inverse problem of X-ray tomography: Estimate µ : Ω → R_+ from the values of its integrals along a set of straight lines passing through Ω.

Figure 3: Sinogram data.

Gaussian structural smoothness prior: Three weakly correlated subregions; inside each region the pixels are mutually correlated.

Figure 4: Prior geometry.

Construction of the prior: Pixel centers p_j, 1 ≤ j ≤ N.

Divide the pixels into cliques C_1, C_2 and C_3. In medical imaging, this is called image segmentation.

Define the neighbourhood system N = {N_i | 1 ≤ i ≤ N}, N_i ⊂ {1, 2, . . . , N}, where

j ∈ N_i if and only if pixels p_i and p_j are neighbours and belong to the same clique.

Define the density of a Markov random field X as

π_MRF(x) ∝ exp(−(α/2) Σ_{j=1}^N |x_j − c_j Σ_{i∈N_j} x_i|²) = exp(−(α/2) x^T B x),

where the coupling constant c_j depends on the clique.

The matrix B is singular.

Remedy: Select a few points {p_j | j ∈ I″}, where I″ ⊂ I = {1, 2, . . . , N}. Let I′ = I \ I″.

Denote x = [x′; x″].

The conditional density π_MRF(x′ | x″) (i.e., x″ fixed) is a proper measure with respect to x′.

Define

π_pr(x) = π_MRF(x′ | x″) π_0(x″),

where π_0 is Gaussian, e.g.,

π_0 ∼ N(0, γ_0² I).

Figure 5: Four random draws from the prior density.

Data generated in an N = 84 × 84 mesh, inverse solutions computed in an n = 42 × 42 mesh.

Proper data y ∈ R^m and inverse crime data y_ic ∈ R^m:

y = A^N x^N_true + e, y_ic = A^n P x^N_true + e,

where x^N_true is drawn from the prior density, e is a realization of

E ∼ N(0, σ²I),

where

σ² = κ m^{-1} trace((A^N − A^n P) Γ^N (A^N − A^n P)^T), 0.1 ≤ κ ≤ 10.

In other words,

0.1 ≤ κ = (noise variance)/(discretization error variance) ≤ 10.

What is the structure of the discretization error? Can we approximate it by Gaussian white noise?

Figure 6: The diagonal Γ_A(k, k) and the first off-diagonal Γ_A(k, k+1) of the discretization error covariance, plotted against the projection number.

Error analysis:

1. Draw a sample x^N_1, x^N_2, . . . , x^N_S, S = 500, from the prior density.

2. Choose the noise level σ = σ(κ) and generate data y_1(κ), y_2(κ), . . . , y_S(κ), both proper and inverse crime versions.

3. Calculate the estimates x̂(y_1(κ)), x̂(y_2(κ)), . . . , x̂(y_S(κ)).

4. Estimate the estimation error,

E[‖X − X̂(κ)‖²] ≈ (1/S) Σ_{j=1}^S ‖x̂(y_j(κ)) − x^N_j‖².

Estimators: CM, CM with the enhanced error model, and truncated CGNR with the Morozov discrepancy principle, discrepancy

δ² = τ E[‖E‖²] = τ m σ(κ)², τ = 1.1.

Figure 7: Estimation errors ‖x̂ − x‖² with various noise levels for the estimators CG, CG IC, CM, and CM Corr. The dashed line is var(E_discr).

[Figures labelled by error level: 0.0029247, 0.0047491, 0.0060516, 0.0077115, 0.11093.]

Example: Estimation error. If x̂ = x̂(y) is an estimator, define the relative estimation error as

D(x̂) = E[‖X − X̂‖²] / E[‖X‖²].

Observe: D(0) = 1, and

D(x_CM) ≤ D(x̂)

for any estimator x̂.

Test case: Limited angle tomography. Reconstructions with the truncated singular value decomposition (TSVD) versus the CM estimate.

Calculate D(x_TSVD) and D(x_CM) by ensemble averaging (S = 500).

TSVD estimate: y = Ax + e.

SVD decomposition: A = U D V^T, where

U = [u_1, u_2, . . . , u_m] ∈ R^{m×m}, V = [v_1, v_2, . . . , v_n] ∈ R^{n×n},

and

D = diag(d_1, d_2, . . . , d_{min(n,m)}) ∈ R^{m×n}, d_1 ≥ d_2 ≥ . . . ≥ d_{min(n,m)} ≥ 0.

The TSVD estimate is

x_TSVD(y, r) = Σ_{j=1}^r (1/d_j)(u_j^T y) v_j,

and the truncation parameter r is chosen, e.g., by the Morozov discrepancy principle,

‖y − A x_TSVD(y, r)‖² ≤ τ E[‖E‖²] < ‖y − A x_TSVD(y, r − 1)‖².
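A minimal sketch of the TSVD estimate with the Morozov stopping rule on an invented small problem:

import numpy as np

rng = np.random.default_rng(8)
m, n = 40, 60
A = rng.standard_normal((m, n))
sigma = 0.05
x_true = rng.standard_normal(n)
y = A @ x_true + sigma * rng.standard_normal(m)

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(d) V^T, d in descending order
tau = 1.1
delta2 = tau * m * sigma**2                        # tau * E[||E||^2]

def x_tsvd(r):
    # Truncated SVD solution with r terms
    coeff = (U[:, :r].T @ y) / d[:r]
    return Vt[:r].T @ coeff

# Morozov discrepancy principle: smallest r with ||y - A x_r||^2 <= tau * E[||E||^2]
for r in range(1, len(d) + 1):
    if np.sum((y - A @ x_tsvd(r)) ** 2) <= delta2:
        break

print("truncation index r =", r)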

[Figures: three image panels, followed by histograms of the estimation errors ‖x̂ − x‖² (density versus ‖x̂ − x‖²) for the CM and TSVD estimates.]

CONCLUSIONS

• The Bayesian approach is useful for incorporating complex prior information into inverse solvers.

• It is not a method for producing a single estimator, although it can be used as a tool for that, too.

• It facilitates the error analysis of discretization, modelling and estimation by deterministic methods.

• Working with ensembles makes it possible to analyze non-linear problems as well (e.g. EIT, OAST).
