Lectures 1–2:

Bayesian inference and MCMC foundations

Youssef Marzouk

Department of Aeronautics and Astronautics

Center for Computational Engineering

Statistics and Data Science Center

Massachusetts Institute of Technology

http://uqgroup.mit.edu

[email protected]

19–22 March 2018

Marzouk (MIT), SFB 1294 Spring School, 19–22 March 2018

Agenda

Plan for the lectures:

1 Lectures 1–2: Bayesian inference and MCMC foundations

Bayesian modeling

Computational approaches/demos

2 Lecture 3: Bayesian approach to inverse problems

What distinguishes inverse problems? Elements of a Bayesian inverse

problem formulation

Linear–Gaussian problems in detail

Computational issues: surrogate modeling/likelihood approximations,

parameter dimension reduction, . . .

3 Lecture 4: Bayesian optimal experimental design or some other

topic TBD


Statistical inference

Why is a statistical perspective useful in data assimilation?

To characterize uncertainty in the parameters and/or state of a

system

To understand how this uncertainty depends on the number and

quality of observations, features of the model, prior information, etc.

To make probabilistic predictions

To choose useful observations or experiments

To address questions of model error and model validity; to perform

model selection


Bayesian inference

Bayes’ rule

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

Key idea: model parameters θ are treated as random variables

(For simplicity, we let our random variables have densities)

Notation

θ are model parameters; y are the data; assume both to be

finite-dimensional unless otherwise indicated

p(θ) is the prior probability density

L(θ) := p(y |θ) is the likelihood function

p(θ|y) is the posterior probability density

p(y) is the evidence, or equivalently, the marginal likelihood


Bayesian inference example

Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes $(y_i)_{i=1}^n \in \{0, 1\}^n$.

Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”

Elements of the Bayesian formulation:

Likelihood function

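Single observation: $Y_i \sim \mathrm{Ber}(\theta)$, where $\theta := P[Y_i = 1]$. Hence:

$$P(y_i \mid \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i}$$

Multiple observations are conditionally independent given θ:

$$P(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^n P(y_i \mid \theta) = \prod_{i=1}^n \theta^{y_i} (1 - \theta)^{1 - y_i}$$

Rewrite more compactly, with $k := \sum_{i=1}^n y_i$ heads in n trials:

$$P(k \mid \theta, n) = \binom{n}{k} \theta^k (1 - \theta)^{n-k}, \quad \text{i.e., } K \sim \mathrm{Binomial}(\theta, n)$$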


Bayesian inference example

Prior distribution

Use Θ ∼ Beta(β1, β2):

$$p(\theta \mid \beta_1, \beta_2) \propto \theta^{\beta_1 - 1} (1 - \theta)^{\beta_2 - 1}$$

Uniform distribution is a special case: Beta(1, 1) = U(0, 1)

Posterior distribution

Posterior density follows from simple algebra:

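$$p(\theta \mid k, n, \beta_1, \beta_2) \propto p(k \mid \theta, n)\, p(\theta \mid \beta_1, \beta_2) \propto \theta^{k + \beta_1 - 1} (1 - \theta)^{n - k + \beta_2 - 1}$$

i.e., $\Theta \mid k, n \sim \mathrm{Beta}(\beta_1 + k, \beta_2 + n - k)$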

This happens to be a conjugate Bayesian model!
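The conjugate update above amounts to adding counts to the prior parameters. A minimal Python sketch, assuming a uniform Beta(1, 1) prior and a made-up sequence of flips (the function names and the data are illustrative, not from the lecture):

```python
def beta_binomial_update(beta1, beta2, flips):
    """Return the Beta posterior parameters after observing 0/1 coin flips."""
    k = sum(flips)   # number of heads
    n = len(flips)   # number of trials
    return beta1 + k, beta2 + n - k


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: 6 heads in 8 flips, with a uniform Beta(1, 1) prior.
flips = [1, 0, 1, 1, 0, 1, 1, 1]
a_post, b_post = beta_binomial_update(1.0, 1.0, flips)
post_mean = beta_mean(a_post, b_post)
print(a_post, b_post, post_mean)
```

Because the posterior stays in the Beta family, the same function can be applied flip by flip, using each posterior as the next prior.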


Coin flip example 1/2 [Sivia 2006]

[Figure: Fig. 2.1 of Sivia & Skilling.] The evolution of the posterior pdf for the bias-weighting of a coin, prob(H | {data}, I), as the number of data available increases. The figure in the top right-hand corner of each panel shows the number of data analysed; in the early panels, the H or T in the top left-hand corner shows whether the result of that (last) flip was a head or a tail.

from Sivia & Skilling, Data Analysis: a Bayesian Tutorial, Oxford UP (2006).


Coin flip example 2/2 [Sivia 2006]

[Figure: Fig. 2.2 of Sivia & Skilling.] The effect of different priors, prob(H | I), on the posterior pdf for the bias-weighting of a coin. The solid line is the same as in Fig. 2.1, and is included for ease of comparison. The cases for two alternative priors, reflecting slightly different assumptions in the conditioning information I, are shown with dashed and dotted lines.


Linear regression example

Bayesian linear regression

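$$y_i = \theta_0 + \theta_1^\top x_i + \xi_i, \quad y_i, \theta_0 \in \mathbb{R}; \; x_i, \theta_1 \in \mathbb{R}^d; \; \xi_i \sim N(0, \sigma^2)$$

Let the data consist of n observations $D_n \equiv \{(y_i, x_i)\}_{i=1}^n$

Define the matrix $X \in \mathbb{R}^{n \times d}$ with rows $x_i$, $y \in \mathbb{R}^n$ as a column vector of $y_1 \ldots y_n$, $\theta = [\theta_0; \theta_1] \in \mathbb{R}^{d+1}$, $\mathbf{1} \in \mathbb{R}^n$ as a vector of ones, and $I_n$ as the n-dimensional identity matrix

Bayesian model: likelihood and prior:

$$y \mid \theta_0, \theta_1 \sim N(\mathbf{1}\theta_0 + X\theta_1, \; \sigma^2 I_n)$$

$$\theta \sim N(\mu_{pr}, \Sigma_{pr})$$

Yields the joint ((d + 1)-dimensional) posterior distribution of constant term θ0 and “slopes” θ1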
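Since the model is linear in θ with a Gaussian likelihood and a Gaussian prior, the joint posterior is Gaussian in closed form. The block below sketches that standard result, assuming σ² is known; the stacked design matrix A and the names μ_pos and Σ_pos are notation introduced here for illustration and do not appear on the slides.

```latex
A := [\, \mathbf{1} \;\; X \,] \in \mathbb{R}^{n \times (d+1)}, \qquad
\theta \mid y \sim N(\mu_{pos}, \Sigma_{pos}),
\\[4pt]
\Sigma_{pos} = \left( \Sigma_{pr}^{-1} + \sigma^{-2} A^\top A \right)^{-1}, \qquad
\mu_{pos} = \Sigma_{pos} \left( \Sigma_{pr}^{-1} \mu_{pr} + \sigma^{-2} A^\top y \right)
```

The posterior precision is the sum of the prior precision and the data precision, and the posterior mean is the corresponding precision-weighted combination of the prior mean and the data.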


Bayesian inference

Summaries of the posterior distribution

What information to extract?

Posterior mean of θ; maximum a posteriori (MAP) estimate of θ

Posterior covariance or higher moments of θ

Quantiles

Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α.

Credible intervals are not uniquely defined above; thus consider, for

example, the HPD (highest posterior density) region.

Posterior realizations: for direct assessment, or to estimate posterior

expectations
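These summaries can all be estimated from posterior realizations. The sketch below assumes a hypothetical Beta(7, 3) posterior (for example, 6 heads in 8 flips under a uniform prior) and uses simple sorting-based estimators; the helper names are illustrative, and the HPD estimate assumes a unimodal posterior.

```python
import random


def equal_tailed_interval(samples, alpha=0.05):
    """Credible interval that cuts probability alpha/2 from each tail."""
    s = sorted(samples)
    n = len(s)
    lo = s[int(round((alpha / 2) * (n - 1)))]
    hi = s[int(round((1 - alpha / 2) * (n - 1)))]
    return lo, hi


def hpd_interval(samples, alpha=0.05):
    """Shortest interval containing a fraction 1 - alpha of the samples."""
    s = sorted(samples)
    n = len(s)
    m = int(round((1 - alpha) * n))  # number of samples inside the interval
    start = min(range(n - m + 1), key=lambda i: s[i + m - 1] - s[i])
    return s[start], s[start + m - 1]


rng = random.Random(0)
samples = [rng.betavariate(7, 3) for _ in range(20000)]  # posterior realizations

post_mean = sum(samples) / len(samples)
et = equal_tailed_interval(samples)
hpd = hpd_interval(samples)
print(post_mean, et, hpd)
```

For a skewed posterior such as this one, the HPD interval is shifted toward the mode relative to the equal-tailed interval and is never longer, up to sampling error.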


Bayesian and frequentist paradigms

Understanding both perspectives is useful and important. . .

Key differences between these two statistical paradigms

Frequentists do not assign probabilities to unknown parameters θ.

One can write likelihoods pθ(y) ≡ p(y |θ) but not priors p(θ) or

posteriors. θ is not a random variable.

In the frequentist viewpoint, there is no single preferred

methodology for inverting the relationship between parameters and

data. Instead, consider various estimators θ̂(y) of θ.

The estimator θ̂ is a random variable. Why? Frequentist paradigm

considers y to result from a random and repeatable experiment.


Bayesian and frequentist paradigms

Key differences (continued)

Evaluate quality of θ̂ through various criteria: bias, variance,

mean-square error, consistency, efficiency, . . .

One common frequentist approach is maximum likelihood

estimation: $\hat{\theta}_{ML} = \arg\max_\theta p(y \mid \theta)$. (View p(y|θ) as a family of

distributions indexed by θ.)

Link to Bayesian approach: MAP estimate maximizes a “penalized

likelihood.”

What about Bayesian versus frequentist prediction of ynew ⊥⊥ y | θ?

Frequentist: use “plug-in” estimate of θ

Bayesian: posterior prediction via integration
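For the coin-flip model the two kinds of prediction can be compared directly. In this sketch the frequentist prediction plugs in the maximum likelihood estimate k/n, while the Bayesian prediction integrates θ against the Beta posterior, which gives the posterior mean. The function names, the uniform prior, and the three-flip data set are assumptions made for illustration.

```python
def plug_in_prediction(flips):
    """P(y_new = 1) evaluated at the maximum likelihood estimate k / n."""
    return sum(flips) / len(flips)


def posterior_prediction(flips, beta1=1.0, beta2=1.0):
    """P(y_new = 1 | y): the integral of theta against the Beta posterior."""
    k, n = sum(flips), len(flips)
    return (beta1 + k) / (beta1 + beta2 + n)


flips = [1, 1, 1]  # hypothetical data: three heads in a row
p_plug = plug_in_prediction(flips)
p_bayes = posterior_prediction(flips)
print(p_plug, p_bayes)
```

With so few observations the plug-in prediction declares tails impossible, whereas the posterior prediction still carries the remaining uncertainty in θ.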


Statistical problems

Canonical statistical problems

Density estimation: observe realizations $\{y^{(i)}\}$ of a random variable

Y and use them to learn the probability distribution (density) of Y .

Parametric (e.g., pθ(y)) and nonparametric approaches.

Regression: observe dependence of a response or output variable Y

on a covariate or input variable X . Consider a model p(y |x , θ); learn

θ and predict future y |x .

Classification: like regression, but response variable ranges over a

finite set.

Not all statistical problems fall cleanly into one of these three categories.

But core aspects of these problems are worth studying!


Bayesian inference

Likelihood functions (initial summary)

In general, p(y |θ) = pθ(y) is a probabilistic model for the data

Preview: in inverse problems, the likelihood function is where the

forward model appears, along with a noise model and (if applicable)

an expression for model discrepancy

Preview: in filtering, the likelihood function might be simpler (e.g.,

direct noisy observations of the state)


Bayesian inference

Prior distributions (initial summary)

Much can be written about choosing priors.

Intuitive idea: assign lower probability to neighborhoods of θ that

you don’t expect to see, higher probability to neighborhoods of θ

that you do expect to see.

Preview: in ill-posed parameter estimation problems, e.g., inverse

problems, prior information plays a key role!

Preview: in filtering problems, the prior is often the result of

“applying” the dynamics to an earlier distribution on the state


Bayesian inference

Hierarchical modeling

One of the key flexibilities of the Bayesian construction!

Hierarchical modeling has important implications for the design of

efficient MCMC samplers (later in the lecture)

Examples:

1 Unknown noise variance

2 Unknown scale of the prior (cf. choosing the regularization parameter in an inverse problem)

3 Many more, as dictated by the physical and statistical models at hand


Hierarchical modeling example

State-space model with static parameters

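[Figure: graphical model with static parameters Θ, states Z0, Z1, Z2, Z3, …, ZN, and observations Y0, Y1, Y2, Y3, …, YN.]

Ingredients of the Bayesian model:

Transition density $\pi_{Z_k \mid Z_{k-1}, \Theta}$

Observation density (likelihood) $\pi_{Y_k \mid Z_k}$

Prior on static parameters $\pi_\Theta$

Prior on initial condition $\pi_{Z_0}$

Posterior density:

$$\pi_{Z_{0:N}, \Theta \mid y_{0:N}} \propto \pi_\Theta \, \pi_{Z_0} \left( \prod_{k=1}^N \pi_{Z_k \mid Z_{k-1}, \Theta} \right) \prod_{j=1}^N \pi_{y_j \mid Z_j}$$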


Questions yet to answer

How to simulate from or explore general non-Gaussian posterior

distributions? (This lecture)

How to make Bayesian inference computationally tractable when the

forward model is expensive (e.g., a PDE) and the parameters are

high- or infinite-dimensional? (Lecture #3)


MCMC overview

Markov chain Monte Carlo (MCMC)

Metropolis-Hastings algorithm, transition kernels, ergodicity

Mixture and cycles of kernels

Gibbs sampling

Gradient-exploiting MCMC, adaptive MCMC, other practicalities

Using approximations (e.g., approximate likelihoods) within MCMC


Why Markov chain Monte Carlo (MCMC)?

In general, MCMC provides a means of sampling (“simulating”) from an

arbitrary distribution.

The density π(x) need be known only up to a normalizing constant

Utility in inference and prediction: write both as posterior expectations, $E_\pi f$.

Then

$$E_\pi f \approx \frac{1}{n} \sum_{i=1}^n f\left(x^{(i)}\right)$$

$x^{(i)}$ will be asymptotically distributed according to π

$x^{(i)}$ will not be i.i.d. In other words, we must pay a price!


Construction of an MCMC sampler

Define a Markov chain (i.e., discrete time). For real-valued random

variables, the chain has a continuous-valued state space (e.g., Rd).

Ingredients of the definition:

Initial distribution, $x_0 \sim \pi_0$

Transition kernel $K(x_n, x_{n+1})$.

$$P(X_{n+1} \in A \mid X_n = x) = \int_A K(x, x') \, dx'$$

(Analogy: consider matrix of transition probabilities for a finite state

space.)

Markov property: Xn+1 depends only on Xn.

Goal: design transition kernel K such that chain converges

asymptotically to the target distribution π independently of the initial

distribution (starting point).


Construction of an MCMC sampler (cont.)

Goal: choose transition kernel K such that chain converges

asymptotically to the target distribution π independently of the starting

point.

Use realizations of Xn,Xn−1, . . . in a Monte Carlo estimator of

posterior expectations (an ergodic average)

Would like to converge to the target distribution quickly and to

have samples as close to independent as possible

Price for non-i.i.d. samples: greater variance in MC estimates of

posterior expectations


Metropolis-Hastings algorithm

A simple recipe! From xn to xn+1:

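1 Draw a proposal y from $q(y \mid x_n)$

2 Calculate acceptance ratio

$$\alpha(x_n, y) = \min\left\{ 1, \frac{\pi(y)\, q(x_n \mid y)}{\pi(x_n)\, q(y \mid x_n)} \right\}$$

3 Put

$$x_{n+1} = \begin{cases} y, & \text{with probability } \alpha(x_n, y) \\ x_n, & \text{with probability } 1 - \alpha(x_n, y) \end{cases}$$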



Notes on the algorithm:

If q(y |xn) ∝ π(y) then α = 1. Thus we “correct” for sampling from

q, rather than from π, via the Metropolis acceptance step.

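q does not have to be symmetric; allowing asymmetric proposals is the “Hastings” generalization. If the proposal is symmetric, the

acceptance probability simplifies to min{1, π(y)/π(xn)} (the original “Metropolis” algorithm).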

π need be evaluated only up to a multiplicative constant
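The recipe can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's MATLAB demo: it assumes a Gaussian random-walk proposal (symmetric, so the q terms cancel in the acceptance ratio) and an unnormalized standard normal target; the function name, step size, and burn-in length are choices made here.

```python
import math
import random


def random_walk_metropolis(log_target, x0, n_steps, step, seed=0):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal."""
    rng = random.Random(seed)
    x, log_px = x0, log_target(x0)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        y = x + step * rng.gauss(0.0, 1.0)    # draw proposal y ~ q(. | x)
        log_py = log_target(y)
        log_alpha = min(0.0, log_py - log_px)  # log acceptance probability
        # Accept with probability alpha; 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_alpha:
            x, log_px = y, log_py
            n_accept += 1
        chain.append(x)                        # a rejected step repeats x
    return chain, n_accept / n_steps


def log_target(x):
    """Unnormalized log density of a standard normal."""
    return -0.5 * x * x


chain, acc_rate = random_walk_metropolis(log_target, x0=5.0, n_steps=50000, step=2.4)
kept = chain[1000:]  # drop an initial transient
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(acc_rate, mean, var)
```

Only differences of log π enter the accept/reject step, which is why the normalizing constant is never needed.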



What is the transition kernel of the Markov chain we have just defined?

Hint: it is not q!

Informally, it is

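$$K(x_n, x_{n+1}) = p(x_{n+1} \mid \text{accept})\, P[\text{accept}] + p(x_{n+1} \mid \text{reject})\, P[\text{reject}]$$

More precisely, we have:

$$K(x_n, x_{n+1}) = p(x_{n+1} \mid x_n) = q(x_{n+1} \mid x_n)\, \alpha(x_n, x_{n+1}) + \delta_{x_n}(x_{n+1})\, r(x_n),$$

where $r(x_n) \equiv \int q(y \mid x_n) \left(1 - \alpha(x_n, y)\right) dy$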



Now, some theory. What are the key questions?

1 Is π a stationary distribution of the chain? (Is the chain π-invariant?)

Stationarity: π is such that $X_n \sim \pi \Rightarrow X_{n+1} \sim \pi$

2 Does the chain converge to stationarity? In other words, as n → ∞, does $\mathcal{L}(X_n)$ converge to π?

3 Can we use paths of the chain in Monte Carlo estimates?

A sufficient (but not necessary) condition for (1) is detailed balance (also called ‘reversibility’):

$$\pi(x_n) K(x_n, x_{n+1}) = \pi(x_{n+1}) K(x_{n+1}, x_n)$$

This expresses an equilibrium in the flow of the chain

Hence

$$\int \pi(x_n) K(x_n, x_{n+1}) \, dx_n = \int \pi(x_{n+1}) K(x_{n+1}, x_n) \, dx_n = \pi(x_{n+1}) \int K(x_{n+1}, x_n) \, dx_n = \pi(x_{n+1}).$$

As an exercise, verify detailed balance for the M-H kernel defined on

the previous slide.
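Before doing the exercise analytically, it can help to check the claim numerically on a finite state space, where the kernel is a transition matrix. The sketch below builds the M-H matrix for a three-state target and a deliberately non-symmetric proposal matrix, then measures the violation of detailed balance and of stationarity. The target weights and proposal probabilities are arbitrary values chosen for illustration.

```python
def mh_transition_matrix(pi, q):
    """M-H transition matrix for target pi and proposal matrix q (rows sum to 1)."""
    m = len(pi)
    K = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j and q[i][j] > 0.0:
                alpha = min(1.0, (pi[j] * q[j][i]) / (pi[i] * q[i][j]))
                K[i][j] = q[i][j] * alpha
        # Remaining mass stays at i: proposals to i itself plus all rejections.
        K[i][i] = 1.0 - sum(K[i])
    return K


pi = [0.2, 0.5, 0.3]           # target distribution
q = [[0.10, 0.60, 0.30],       # non-symmetric proposal probabilities
     [0.30, 0.20, 0.50],
     [0.50, 0.25, 0.25]]
K = mh_transition_matrix(pi, q)

m = len(pi)
# Detailed balance: pi_i K_ij = pi_j K_ji for all pairs (i, j).
balance_gap = max(abs(pi[i] * K[i][j] - pi[j] * K[j][i])
                  for i in range(m) for j in range(m))
# Stationarity: sum_i pi_i K_ij = pi_j for every j.
stationarity_gap = max(abs(sum(pi[i] * K[i][j] for i in range(m)) - pi[j])
                       for j in range(m))
print(balance_gap, stationarity_gap)
```

Both gaps are zero up to floating-point rounding, mirroring the continuous-state argument: for i ≠ j, π_i K_ij equals min(π_i q_ij, π_j q_ji), which is symmetric in i and j.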



Beyond π-invariance, we also need to establish (2) and (3) from the

previous slide. This leads to additional technical requirements:

π-irreducibility: for every set A with π(A) > 0, there exists n such that $K^n(x, A) > 0$ ∀x.

Intuition: chain visits any measurable subset with nonzero probability

in a finite number of steps. Helps you “forget” the initial condition.

Sufficient to have q(y |x) > 0 for every (x , y) ∈ χ× χ.

Aperiodicity: “don’t get trapped in cycles”



When these requirements are satisfied (i.e., chain is irreducible and

aperiodic, with stationary distribution π) we have

1 Convergence in total variation:

$$\lim_{n \to \infty} \left\| \int K^n(x, \cdot)\, \mu(dx) - \pi(\cdot) \right\|_{TV} = 0$$

for every initial distribution µ.

$K^n$ is the kernel for n transitions

This yields the law of $X_n$: $\int K^n(x, \cdot)\, \mu(dx) = \mathcal{L}(X_n)$

The total variation distance $\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)|$ is the largest possible difference between the probabilities that the two measures can assign to the same event.



When these requirements are satisfied (i.e., chain is irreducible and

aperiodic, with stationary distribution π) we have

2 For $h \in L^1_\pi$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n h\left(x^{(i)}\right) = E_\pi[h] \quad \text{w.p. } 1$$

This is a strong law of large numbers that allows computation of

posterior expectations.

Obtaining a central limit theorem, or more generally saying anything

about the rate of convergence to stationarity, requires additional

conditions (e.g., geometric ergodicity).

See [Roberts & Rosenthal 2004] for an excellent survey of MCMC

convergence results.


Metropolis-Hastings diagnostics

What about the quality of MCMC estimates?

What is the price one pays for correlated samples?

Compare Monte Carlo (iid) and MCMC estimates of Eπh (and for the

latter, assume we have a CLT):

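Monte Carlo:

$$\mathrm{Var}\left[\bar{h}_n\right] = \frac{\mathrm{Var}_\pi[h(X)]}{n}$$

MCMC:

$$\mathrm{Var}\left[\bar{h}_n\right] = \theta \, \frac{\mathrm{Var}_\pi[h(X)]}{n}$$

where

$$\theta = 1 + 2 \sum_{s > 0}^{\infty} \mathrm{corr}\left(h(X_i), h(X_{i+s})\right)$$

is the integrated autocorrelation time.

Effective sample size (ESS) is then n/θ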
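In practice the integrated autocorrelation time has to be estimated from the chain itself. The sketch below uses one simple heuristic, which is an assumption of this example rather than a method from the lecture: sum the empirical autocorrelations until the first non-positive estimate. It is tested on a synthetic AR(1) sequence with coefficient 0.5, whose integrated autocorrelation time is (1 + 0.5)/(1 − 0.5) = 3.

```python
import random


def integrated_autocorr_time(x, max_lag=200):
    """Estimate 1 + 2 * sum_s rho_s, truncating at the first non-positive rho_s."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]             # centered chain
    var = sum(v * v for v in c) / n
    tau = 1.0
    for s in range(1, max_lag + 1):
        rho = sum(c[i] * c[i + s] for i in range(n - s)) / (n * var)
        if rho <= 0.0:                    # remaining terms are mostly noise
            break
        tau += 2.0 * rho
    return tau


# Synthetic correlated sequence: AR(1) with coefficient phi = 0.5.
rng = random.Random(0)
phi = 0.5
x = [0.0]
for _ in range(19999):
    x.append(phi * x[-1] + rng.gauss(0.0, 1.0))

tau = integrated_autocorr_time(x)
ess = len(x) / tau
print(tau, ess)
```

Here 20000 correlated draws carry roughly the information of 20000/3 independent ones.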



Now try a very simple computational demonstration (in MATLAB):

MCMC sampling from a univariate distribution

Look at autocorrelation and visual diagnostics (e.g., trace of chain)


Other Metropolis-Hastings diagnostics

Example: multivariate potential scale reduction factor (MPSRF) [Brooks

& Gelman 1998]

Run multiple “replicate” chains from over-dispersed starting points.

Compute:

Pooled-sample covariance estimate (across all chains) V (tends to

over-estimate)

Average of individual-chain sample covariance estimates W (tends to

under-estimate)

Let $R^{1/2}$ be the largest generalized eigenvalue of the pencil (V, W).

Diagnostic: value of statistic $R^{1/2}$ approaches 1 (from above) as

the chains become similar


More sophisticated Metropolis-Hastings

M-H construction was extremely general.

Achieving efficient sampling (good “mixing”) requires more exploitation of problem structure.

1 Mixtures of kernels

2 Cycles of kernels; Gibbs sampling

3 Adaptive MCMC

4 Gradient- and Hessian-exploiting MCMC

5 MCMC in infinite dimensions


Mixtures and cycles

Mixtures of kernels

Let $K_i$ all have π as limiting distribution

Use a convex combination: $K^* = \sum_i \nu_i K_i$

$\nu_i$ is the probability of picking transition kernel $K_i$ at a given step of

the chain

Kernels can correspond to transitions that each have desirable

properties, e.g., local versus global proposals

Cycles of kernels

Split multivariate state vector into blocks that are updated

separately; each update is accomplished by transition kernel Kj

Need to combine kernels. Cycle = a systematic scan, $K^* = \prod_j K_j$


Componentwise Metropolis-Hastings

This is an example of using a cycle of kernels

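Let $x = (x^1, \ldots, x^d) \in \mathbb{R}^d$

Proposal $q_i(y \mid x)$ updates only component i

Walk through components of the state sequentially, i = 1 . . . d:

Propose a new value for component i using

$$q_i\left(y^i \mid x^1_{n+1}, \ldots, x^{i-1}_{n+1}, x^i_n, x^{i+1}_n, \ldots, x^d_n\right)$$

Accept ($x^i_{n+1} = y^i$) or reject ($x^i_{n+1} = x^i_n$) this component update with acceptance probability

$$\alpha_i(\mathbf{x}_i, \mathbf{y}_i) = \min\left\{ 1, \frac{\pi(\mathbf{y}_i)\, q_i(x^i_n \mid \mathbf{y}_i)}{\pi(\mathbf{x}_i)\, q_i(y^i \mid \mathbf{x}_i)} \right\}$$

where $\mathbf{x}_i$ and $\mathbf{y}_i$ differ only in component i:

$$\mathbf{y}_i \equiv \left(x^1_{n+1}, \ldots, x^{i-1}_{n+1}, y^i, x^{i+1}_n, \ldots, x^d_n\right)$$

and

$$\mathbf{x}_i \equiv \left(x^1_{n+1}, \ldots, x^{i-1}_{n+1}, x^i_n, x^{i+1}_n, \ldots, x^d_n\right)$$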


Gibbs sampling

One very useful cycle is the Gibbs sampler.

Requires the ability to sample directly from the full conditional distribution $\pi(x_i \mid x_{\sim i})$.

$x_{\sim i}$ denotes all components of x other than $x_i$

In problems with appropriate structure, generating independent samples from the full conditional may be feasible while sampling from π is not.

xi can represent a block of the state vector, rather than just an

individual component

A Gibbs update is a proposal from the full conditional; the

acceptance probability is identically one!

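$$\alpha_i(\mathbf{x}_i, \mathbf{y}_i) = \min\left\{ 1, \frac{\pi(\mathbf{y}_i)\, q_i(x^i_n \mid \mathbf{y}_i)}{\pi(\mathbf{x}_i)\, q_i(y^i \mid \mathbf{x}_i)} \right\} = \min\left\{ 1, \frac{\pi(y^i \mid x_{\sim i})\, \pi(x_{\sim i})\, \pi(x^i_n \mid x_{\sim i})}{\pi(x^i_n \mid x_{\sim i})\, \pi(x_{\sim i})\, \pi(y^i \mid x_{\sim i})} \right\} = 1.$$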


Gibbs sampling example #1

Correlated bivariate normal

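$$x \sim N\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix} \right)$$

Full conditionals are:

$$x_1 \mid x_2 \sim N\left( \mu_1 + \frac{\sigma_1}{\sigma_2} \rho (x_2 - \mu_2), \; (1 - \rho^2) \sigma_1^2 \right)$$

$$x_2 \mid x_1 \sim \ldots$$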

See computational demo
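A minimal Python sketch of this Gibbs sampler follows; it is an illustration written for these notes, not the lecture's own demo. The conditional for x2 given x1 is obtained from the one for x1 given x2 by exchanging the roles of the two components, and the parameter values are arbitrary.

```python
import math
import random


def gibbs_bivariate_normal(mu1, mu2, s1, s2, rho, n_steps, seed=0):
    """Alternate exact draws from the two full conditionals."""
    rng = random.Random(seed)
    sd1 = math.sqrt(1.0 - rho ** 2) * s1   # conditional std of x1 | x2
    sd2 = math.sqrt(1.0 - rho ** 2) * s2   # conditional std of x2 | x1
    x1, x2 = mu1, mu2
    chain = []
    for _ in range(n_steps):
        x1 = rng.gauss(mu1 + (s1 / s2) * rho * (x2 - mu2), sd1)
        x2 = rng.gauss(mu2 + (s2 / s1) * rho * (x1 - mu1), sd2)
        chain.append((x1, x2))
    return chain


chain = gibbs_bivariate_normal(mu1=1.0, mu2=-1.0, s1=1.0, s2=2.0,
                               rho=0.8, n_steps=40000)
n = len(chain)
m1 = sum(a for a, _ in chain) / n
m2 = sum(b for _, b in chain) / n
v1 = sum((a - m1) ** 2 for a, _ in chain) / n
v2 = sum((b - m2) ** 2 for _, b in chain) / n
cov = sum((a - m1) * (b - m2) for a, b in chain) / n
corr = cov / math.sqrt(v1 * v2)
print(m1, m2, corr)
```

As ρ approaches ±1 the conditional variances shrink and successive draws become strongly correlated, which is the usual motivation for reparameterizing or blocking.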


Gibbs sampling example #2

Bayesian linear regression with a variance hyperparameter

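$$y_i = \beta^\top x_i + \sigma z_i, \quad y_i \in \mathbb{R}; \; \beta, x_i \in \mathbb{R}^d; \; z_i \sim N(0, 1)$$

This problem has a non-Gaussian posterior but is amenable to block Gibbs sampling

Let the data consist of n observations $D_n \equiv \{(y_i, x_i)\}_{i=1}^n$

Bayesian hierarchical model, likelihood and priors:

$$y \mid \beta, \sigma^2 \sim N\left(X\beta, \sigma^2 I_n\right)$$

$$\beta \mid \sigma^2 \sim N\left(0, \tau^2 \sigma^2 I_d\right)$$

$$1/\sigma^2 \sim \Gamma(\alpha, \gamma)$$

where $X \in \mathbb{R}^{n \times d}$ has rows $x_i$ and $y \in \mathbb{R}^n$ is a vector of $y_1 \ldots y_n$.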


Gibbs sampling example #2 (cont.)

Posterior density:

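$$\pi(\beta, \sigma^2) \equiv p(\beta, \sigma^2 \mid D_n) \propto \frac{1}{\sigma^n} \exp\left( -\frac{1}{2\sigma^2} (y - X\beta)^\top (y - X\beta) \right) \frac{1}{(\tau\sigma)^d} \exp\left( -\frac{1}{2\tau^2\sigma^2} \beta^\top \beta \right) \left( \frac{1}{\sigma^2} \right)^{\alpha - 1} \exp\left( -\gamma / \sigma^2 \right)$$

Full conditionals $\beta \mid \sigma^2, D_n$ and $\sigma^2 \mid \beta, D_n$ have a closed form! Try to obtain by inspecting the joint density above. (See next page for answer.)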


Gibbs sampling example #2 (cont.)

Full conditional for β is Gaussian:

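$$\beta \mid \sigma^2, D_n \sim N\left(\mu, \sigma^2 \Sigma\right)$$

where

$$\Sigma^{-1} = \left( \frac{1}{\tau^2} I_d + X^\top X \right) \quad \text{and} \quad \mu = \Sigma X^\top y.$$

Full conditional for 1/σ² is Gamma:

$$1/\sigma^2 \mid \beta, D_n \sim \Gamma(\hat{\alpha}, \hat{\gamma})$$

where

$$\hat{\alpha} = \alpha + n/2 + d/2 \quad \text{and} \quad \hat{\gamma} = \gamma + \frac{1}{2\tau^2} \beta^\top \beta + \frac{1}{2} (y - X\beta)^\top (y - X\beta).$$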

Alternately sample from these FCs in order to simulate the joint

posterior.

Also, this is an example of the use of conjugate priors.
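The alternating scheme can be sketched for the scalar case d = 1, where the matrix expressions reduce to sums and no linear algebra library is needed. Everything specific here is an assumption made for illustration: the synthetic data (true slope 2, noise standard deviation 0.5), the hyperparameters τ = 10, α = 2, γ = 1, and the reading of Γ(α, γ) as a shape and rate parameterization.

```python
import random


def gibbs_regression(x, y, tau, alpha, gamma, n_steps, seed=0):
    """Block Gibbs sampler for (beta, sigma^2) with scalar beta (d = 1)."""
    rng = random.Random(seed)
    n, d = len(x), 1
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    Sigma = 1.0 / (1.0 / tau ** 2 + sxx)  # scalar (I / tau^2 + X^T X)^{-1}
    mu = Sigma * sxy                      # scalar Sigma X^T y
    sigma2 = 1.0                          # arbitrary starting value
    chain = []
    for _ in range(n_steps):
        # beta | sigma^2, D_n ~ N(mu, sigma^2 * Sigma)
        beta = rng.gauss(mu, (sigma2 * Sigma) ** 0.5)
        # 1/sigma^2 | beta, D_n ~ Gamma(shape, rate)
        resid = sum((b - beta * a) ** 2 for a, b in zip(x, y))
        shape = alpha + n / 2 + d / 2
        rate = gamma + beta ** 2 / (2 * tau ** 2) + 0.5 * resid
        precision = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        sigma2 = 1.0 / precision
        chain.append((beta, sigma2))
    return chain


# Synthetic data, for illustration only.
data_rng = random.Random(1)
x = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
y = [2.0 * xi + 0.5 * data_rng.gauss(0.0, 1.0) for xi in x]

chain = gibbs_regression(x, y, tau=10.0, alpha=2.0, gamma=1.0, n_steps=3000)
kept = chain[500:]
beta_mean = sum(b for b, _ in kept) / len(kept)
sigma2_mean = sum(s for _, s in kept) / len(kept)
print(beta_mean, sigma2_mean)
```

For d > 1 the same loop applies, with the Gaussian draw replaced by a multivariate normal sample using a factorization of σ²Σ.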


Metropolis-within-Gibbs

What if we cannot sample from the full conditionals?

Solution: “Metropolis-within-Gibbs”

This is just componentwise Metropolis-Hastings (which is where we

started)


Adaptive Metropolis

Intuitive idea: learn a better proposal q(y|x) from past samples.

Learn an appropriate proposal scale.

Learn an appropriate proposal orientation and anisotropy; this is essential in problems with strong correlation in π

Adaptive Metropolis scheme of [Haario et al. 2001]:

Covariance matrix at step n

$$C^*_n = s_d \, \mathrm{Cov}(x_0, \ldots, x_n) + s_d \, \varepsilon I_d$$

where ε > 0, d is the dimension of the state, and $s_d = 2.4^2 / d$ (scaling rule-of-thumb).

Proposals are Gaussians centered at $x_n$. Use a fixed covariance $C_0$ for the first $n_0$ steps, then use $C^*_n$.

Chain is not Markov, and previous convergence proofs do not apply.

Nonetheless, one can prove that the chain converges to π. See paper

in references.
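A scalar (d = 1) sketch of the adaptive Metropolis idea is given below, so that the covariance reduces to a running variance and the scaling rule of thumb becomes 2.4². The running-variance update, the default values of c0, n0 and ε, and the Gaussian test target are assumptions of this example and are not taken from Haario et al.

```python
import math
import random


def adaptive_metropolis_1d(log_target, x0, n_steps, c0=1.0, n0=100,
                           eps=1e-6, seed=0):
    """Random-walk Metropolis whose proposal variance is learned from the chain."""
    rng = random.Random(seed)
    s_d = 2.4 ** 2                     # scaling rule of thumb 2.4^2 / d with d = 1
    x, log_px = x0, log_target(x0)
    count, mean, m2 = 1, x0, 0.0       # running statistics of x_0, ..., x_n
    chain = [x0]
    for n in range(n_steps):
        if n < n0:
            prop_var = c0              # fixed variance for the first n0 steps
        else:
            prop_var = s_d * (m2 / (count - 1)) + s_d * eps
        y = x + math.sqrt(prop_var) * rng.gauss(0.0, 1.0)
        log_py = log_target(y)
        if math.log(1.0 - rng.random()) < log_py - log_px:
            x, log_px = y, log_py
        chain.append(x)
        # Welford update of the running mean and sum of squared deviations.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return chain


def log_target(x):
    """Unnormalized log density of N(3, 2^2)."""
    return -((x - 3.0) ** 2) / 8.0


chain = adaptive_metropolis_1d(log_target, x0=0.0, n_steps=30000)
kept = chain[2000:]
am_mean = sum(kept) / len(kept)
am_var = sum((v - am_mean) ** 2 for v in kept) / len(kept)
print(am_mean, am_var)
```

The proposal scale is tuned automatically here, at the cost of making each step depend on the whole history of the chain.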

Many other adaptive MCMC ideas have been developed in recent

years


Adaptive Metropolized independence samplers

Independence proposal: does not depend on current state

Consider a proposal q(x ;ψ) with parameter ψ.

Key idea: minimize Kullback-Leibler divergence between this

proposal and the target distribution:

$$\min_\psi \; D_{KL}\left( \pi(x) \,\|\, q(x; \psi) \right)$$

Equivalently, maximize $\int \pi(x) \log q(x; \psi) \, dx$

Solve this optimization problem with successive steps of stochastic

approximation (e.g., Robbins-Monro), while approximating the

integral via MCMC samples

Common choice: let q be a mixture of Gaussians or other

exponential-family distributions


Online MCMC demo

Very cool demo, thanks to Chi Feng (MIT):

https://chi-feng.github.io/mcmc-demo

Let’s look at RWM and AM on various targets



Langevin MCMC

Intuitive idea: use gradient of the posterior to steer samples towards

higher density regions

Consider the SDE

$$dX_t = \frac{1}{2} \nabla \log \pi(X_t) \, dt + dW_t$$

This SDE has π as its stationary distribution

Discretize the SDE (e.g., Euler-Maruyama)

$$X^{t+1} = X^t + \frac{\sigma^2}{2} \nabla \log \pi(X^t) + \sigma \varepsilon^t, \quad \varepsilon^t \sim N(0, I)$$

Discretized process $X^t$ no longer has π as its stationary distribution!

But we can use $X^{t+1}$ as a proposal in the regular Metropolis-Hastings framework, and accept or reject it accordingly.

$\sigma^2$ (discretization time step) is an adjustable free parameter.

Langevin schemes require access to the gradient of the posterior.
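The resulting Metropolis-adjusted Langevin algorithm (MALA) can be sketched in one dimension as follows. Because the drift makes the proposal asymmetric, both proposal densities must appear in the acceptance ratio. The standard normal target, the value σ = 1, and the function names are assumptions chosen for illustration.

```python
import math
import random


def mala(log_target, grad_log_target, x0, n_steps, sigma, seed=0):
    """Metropolis-adjusted Langevin algorithm for a scalar state."""
    rng = random.Random(seed)

    def log_q(y, x):
        # Log density (up to a constant) of proposing y from x.
        drift_mean = x + 0.5 * sigma ** 2 * grad_log_target(x)
        return -((y - drift_mean) ** 2) / (2.0 * sigma ** 2)

    x = x0
    chain = []
    for _ in range(n_steps):
        # One Euler-Maruyama step of the Langevin SDE, used as a proposal.
        y = x + 0.5 * sigma ** 2 * grad_log_target(x) + sigma * rng.gauss(0.0, 1.0)
        log_alpha = (log_target(y) + log_q(x, y)) - (log_target(x) + log_q(y, x))
        if math.log(1.0 - rng.random()) < log_alpha:
            x = y
        chain.append(x)
    return chain


def log_target(x):
    return -0.5 * x * x      # unnormalized standard normal


def grad_log_target(x):
    return -x


chain = mala(log_target, grad_log_target, x0=3.0, n_steps=30000, sigma=1.0)
kept = chain[1000:]
mala_mean = sum(kept) / len(kept)
mala_var = sum((v - mala_mean) ** 2 for v in kept) / len(kept)
print(mala_mean, mala_var)
```

Without the accept/reject step, this discretization with σ = 1 would settle on a variance of 4/3 rather than 1 for this target, which is the bias the Metropolis correction removes.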


Preconditioned Langevin

Introduce a positive definite matrix A to the Langevin SDE:

$$dX_t = \frac{1}{2} A \nabla \log \pi(X_t) \, dt + A^{1/2} dW_t$$

Let A reflect covariance structure of target

For example: let A be the local inverse Hessian of the log-posterior,

or the inverse Hessian at the posterior mode, or posterior-averaged

Hessian information, or some other estimate of the posterior

covariance

Key idea for inverse problems: use low-rank approximations of the

posterior covariance/precision developed for the linear-Gaussian case


Hamiltonian MCMC

Let x be “position” variables; introduce auxiliary “momentum”

variables w

Consider a separable Hamiltonian, $H(x, w) = U(x) + w^\top M^{-1} w / 2$.

Put U(x) = − log π(x).

Hamiltonian dynamics are reversible and conserve H. Use them to propose new states x!

In particular, sample from $p(x, w) = \frac{1}{Z} \exp\left(-H(x, w)\right)$:

First, sample the momentum variables w from their Gaussian distribution

Second, integrate Hamilton’s equations to propose a new state (x, w); then apply Metropolis accept/reject

Features:

Enables faraway moves in x-space while leaving the value of the density (essentially) unchanged. Good mixing!

Requires good symplectic integrators and access to derivatives

Recent extension: Riemannian manifold HMC [Girolami & Calderhead

JRSSB 2011]
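The two-step procedure can be sketched for a scalar state with unit mass (M = 1), using the leapfrog scheme as the symplectic integrator. The step size, the number of leapfrog steps, and the standard normal target are arbitrary choices for this illustration.

```python
import math
import random


def hmc(U, grad_U, x0, n_steps, eps=0.2, n_leapfrog=10, seed=0):
    """Hamiltonian Monte Carlo for a scalar state with unit mass (M = 1)."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        w = rng.gauss(0.0, 1.0)               # resample the momentum
        h_old = U(x) + 0.5 * w * w
        x_new, w_new = x, w
        # Leapfrog: half step in momentum, full steps in position, half step.
        w_new -= 0.5 * eps * grad_U(x_new)
        for step in range(n_leapfrog):
            x_new += eps * w_new
            if step < n_leapfrog - 1:
                w_new -= eps * grad_U(x_new)
        w_new -= 0.5 * eps * grad_U(x_new)
        h_new = U(x_new) + 0.5 * w_new * w_new
        # Metropolis accept/reject corrects the integration error in H.
        if math.log(1.0 - rng.random()) < h_old - h_new:
            x = x_new
        chain.append(x)
    return chain


def U(x):
    return 0.5 * x * x       # U = -log pi for a standard normal target


def grad_U(x):
    return x


chain = hmc(U, grad_U, x0=3.0, n_steps=10000)
kept = chain[500:]
hmc_mean = sum(kept) / len(kept)
hmc_var = sum((v - hmc_mean) ** 2 for v in kept) / len(kept)
print(hmc_mean, hmc_var)
```

Since the leapfrog integrator nearly conserves H, almost all proposals are accepted even though each one moves far from the current state.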


Online MCMC demo

Back to the demo:


Now look at MALA and HMC/NUTS on various targets



MCMC in infinite dimensions

Would like to construct a well-defined MCMC sampler for functions

u ∈ H.

First, the posterior measure $\mu^y$ should be a well-defined probability measure on H (see Stuart Acta Numerica 2010). For simplicity, let the prior $\mu_0$ be N(0, C).

Now let q be the proposal distribution, and consider the pair of measures

$$\nu(du, du') = q(u, du') \, \mu^y(du), \qquad \nu^\perp(du, du') = q(u', du) \, \mu^y(du')$$

Then the MCMC acceptance probability is

$$\alpha(u_k, u') = \min\left\{ 1, \frac{d\nu^\perp}{d\nu}(u_k, u') \right\}$$

To define a valid transition kernel, we need absolute continuity $\nu^\perp \ll \nu$; in turn, this places requirements on the proposal q


MCMC in infinite dimensions (cont.)

One way to produce a valid transition kernel is the preconditioned

Crank-Nicolson (pCN) proposal (Cotter et al. 2013):

$$u' = (1 - \beta^2)^{1/2} u_k + \beta \xi_k, \quad \xi_k \sim N(0, C), \quad \beta \in (0, 1).$$

Practical impact: sampling efficiency does not degenerate as

discretization of u is refined

More sophisticated versions: combine pCN with Hessian/geometry

information, e.g., DILI (dimension-independent likelihood-informed)

proposals (Cui, Law, M 2016)
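A finite-dimensional sketch of the pCN sampler is given below. It relies on a standard property of pCN that is not stated on the slide: the proposal preserves the Gaussian prior, so the acceptance probability involves only the negative log-likelihood Φ, as min{1, exp(Φ(u_k) − Φ(u′))}. The identity prior covariance, the dimension, and the toy likelihood (a single noisy observation of the first coordinate) are assumptions chosen so that the answer is known: the first coordinate has posterior mean 0.5.

```python
import math
import random


def pcn(neg_log_likelihood, d, n_steps, beta, seed=0):
    """Preconditioned Crank-Nicolson sampler with prior N(0, I_d)."""
    rng = random.Random(seed)
    u = [0.0] * d
    phi_u = neg_log_likelihood(u)
    scale = math.sqrt(1.0 - beta ** 2)
    chain = []
    for _ in range(n_steps):
        xi = [rng.gauss(0.0, 1.0) for _ in range(d)]   # xi ~ N(0, C), here C = I
        v = [scale * ui + beta * x for ui, x in zip(u, xi)]
        phi_v = neg_log_likelihood(v)
        # Acceptance depends only on the likelihood, not on the prior.
        if math.log(1.0 - rng.random()) < phi_u - phi_v:
            u, phi_u = v, phi_v
        chain.append(u)
    return chain


def neg_log_likelihood(u):
    """Phi(u) for one observation y = 1 of u[0] with unit noise variance."""
    return 0.5 * (1.0 - u[0]) ** 2


chain = pcn(neg_log_likelihood, d=5, n_steps=60000, beta=0.5)
kept = chain[2000:]
mean_first = sum(u[0] for u in kept) / len(kept)
mean_second = sum(u[1] for u in kept) / len(kept)
print(mean_first, mean_second)
```

Increasing d in this example leaves the acceptance behaviour unchanged, because the extra coordinates are not informed by the data and the proposal samples them exactly from the prior.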


MCMC practicalities

Effective use of MCMC still requires some (problem-specific) experience.

Some useful rules of thumb:

Adaptive schemes are not a panacea.

Whenever possible, (re-)parameterize the problem in order to

minimize posterior correlations.

What to do, if anything, about “burn-in?”

Visual inspection of chain components is often the first and best

convergence diagnostic.

Also look at autocorrelation plots. Run multiple chains from

different starting points. Evaluate MPSRF or other diagnostics.


MCMC practicalities

Additional advice:

“The best Monte Carlo is a dead Monte Carlo”: If you can tackle any part of the problem analytically, do it!

Example: Rao-Blackwellization in Cui et al., “Likelihood-informed

dimension reduction for nonlinear inverse problems,” Inverse Problems

30: 114015 (2014).

[Figure: prior density π_prior and posterior density π_posterior, with the data-dominated region labeled.]


MCMC references

A small selection of useful “general” MCMC references.

C. Andrieu, N. de Freitas, A. Doucet, M. I. Jordan, “An introduction to MCMC

for machine learning,” Machine Learning 50 (2003) 5–43.

S. Brooks, A. Gelman, G. Jones and X. Meng, editors. Handbook of MCMC.

Chapman & Hall/CRC, 2011.

A. Gelman, J. B. Carlin, H. S. Stern, D. Dunson, A. Vehtari, D. B. Rubin.

Bayesian Data Analysis. Chapman & Hall CRC, 3rd edition, 2013.

P. J. Green. “Reversible jump Markov chain Monte Carlo computation and

Bayesian model determination.” Biometrika, 82: 711–732, 1995.

H. Haario, M. Laine, A. Mira, and E. Saksman. “DRAM: Efficient adaptive

MCMC.” Statistics and Computing, 16(4): 339–354, 2006.

C. P. Robert, G. Casella, Monte Carlo Statistical Methods, 2nd Edition,

Springer, 2004.

G. Roberts, J. Rosenthal. “General state space Markov chains and MCMC

algorithms.” Probability Surveys, 1: 20–71, 2004.

