
Page 1: Title

An introduction to Markov chain Monte Carlo methods

Aretha Teckentrup

School of Mathematics, University of Edinburgh

MCQMC '20 - August 10, 2020

A. Teckentrup (Edinburgh) MCMC August 10, 2020 1 / 33

Page 2: Outline

1. Motivation
2. Metropolis Hastings Algorithm
3. Choice of proposal density

Page 3: Motivation - Sampling methods

In many applications, one is interested in computing the expected value of a quantity of interest φ under the target distribution π:

    E[φ] = ∫_{R^{d_u}} φ(u) π(u) du.

The integral is typically intractable, and can be approximated using a sampling method:

    E[φ] ≈ (1/N) ∑_{i=1}^N φ(u^{(i)}),

where u^{(i)} ∼ π for all 1 ≤ i ≤ N.

Page 4: Motivation - Sampling methods

Generating samples u^{(i)} ∼ π is often difficult since:

- π is not known in closed form, e.g. only up to a normalisation constant ⇒ it is not possible to generate independent (i.i.d.) samples from π,
- the state variable u ∈ R^{d_u} is high-dimensional,
- π can concentrate on low-dimensional manifolds.

Page 5: Motivation - Example: Bayesian inference [Kaipio, Somersalo '04]

In a Bayesian inference problem (parameter identification problem), we are interested in π being the posterior distribution:

    π(u) = (1/Z) exp(−‖Γ^{−1/2}(y − F(u))‖_2^2) π_0(u),

where the normalising constant Z is intractable.

This arises from

- incorporating knowledge on u in a prior distribution π_0,
- observing data y = F(u) + η, with noise η ∼ N(0, Γ),
- conditioning π_0 on y, resulting in the posterior distribution π.

[Figures: truth and a posterior sample]

Page 6: Metropolis Hastings Algorithm - Markov chain Monte Carlo (MCMC) [Robert, Casella '99]

A Markov chain Monte Carlo (MCMC) estimator of E[φ] is of the form

    E^{MCMC}_N := (1/N) ∑_{i=1}^N φ(u^{(i)}),

where {u^{(i)}}_{i=1}^∞ is a Markov chain.

Definition (Markov chain)

The family of random variables {u^{(i)}}_{i=1}^∞ is called a Markov chain if

    Pr[u^{(i)} = x_i | u^{(1)} = x_1, ..., u^{(i−1)} = x_{i−1}] = Pr[u^{(i)} = x_i | u^{(i−1)} = x_{i−1}],

for all i ≥ 2 and x_1, ..., x_i ∈ R^{d_u}.

We want the distribution of each u^{(i)} to be (close to) π.

Page 8: Metropolis Hastings Algorithm - Markov chain Monte Carlo (MCMC) [Robert, Casella '99]

Why Markov chains?

Practical advantage:

- Allows for sequential construction: construct u^{(i+1)} from u^{(i)}.

Theoretical advantages:

- Stationary distributions: we can construct {u^{(i)}}_{i=1}^∞ such that u^{(i)} ∼ π as i → ∞, for any u^{(1)} ∈ R^{d_u}.
- Ergodic averages: we can construct {u^{(i)}}_{i=1}^∞ such that (1/N) ∑_{i=1}^N φ(u^{(i)}) → E[φ] as N → ∞.

Page 9: Metropolis Hastings Algorithm - Standard Algorithm [Robert, Casella '99]

A particular example is the Metropolis-Hastings (MH-MCMC) estimator, which uses the following algorithm to construct {u^{(i)}}_{i=1}^∞:

ALGORITHM 1. (Metropolis-Hastings)

1. Choose u^{(1)} with π(u^{(1)}) > 0.
2. At state u^{(i)}, sample a proposal u′ from the density q(u′ | u^{(i)}).
3. Accept the sample u′ with probability

    α(u′ | u^{(i)}) = min(1, [π(u′) q(u^{(i)} | u′)] / [π(u^{(i)}) q(u′ | u^{(i)})]),

i.e. set u^{(i+1)} = u′ with probability α(u′ | u^{(i)}); otherwise stay at u^{(i+1)} = u^{(i)}.

- The proposal density q is chosen to be easy to sample from.
- The accept/reject step is added in order to obtain samples from π.
- Knowledge of the normalising constant Z of π is not required.
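A minimal sketch of Algorithm 1 in Python. The helper names, the unnormalised Gaussian target and the random walk proposal are illustrative choices, not from the slides; note that only log π up to an additive constant is needed, mirroring the point that Z is not required.

```python
import numpy as np

def metropolis_hastings(log_target, propose, log_q, u0, N, rng):
    """Metropolis-Hastings: build a chain u^(1), ..., u^(N).

    log_target -- log pi(u) up to an additive constant (Z not needed)
    propose    -- draws u' ~ q(. | u)
    log_q      -- log q(u' | u), up to an additive constant
    """
    chain = [u0]
    u = u0
    for _ in range(N - 1):
        v = propose(u, rng)
        # log of the MH ratio pi(u') q(u|u') / (pi(u) q(u'|u))
        log_alpha = (log_target(v) + log_q(u, v)) - (log_target(u) + log_q(v, u))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            u = v            # accept the proposal
        chain.append(u)      # otherwise stay at u^(i)
    return np.array(chain)

rng = np.random.default_rng(0)
# unnormalised standard Gaussian target, Gaussian random walk proposal N(u, 0.5^2)
log_pi = lambda u: -0.5 * u**2
prop = lambda u, rng: u + 0.5 * rng.standard_normal()
log_q = lambda v, u: -0.5 * (v - u)**2 / 0.25   # symmetric, so it cancels here

chain = metropolis_hastings(log_pi, prop, log_q, 0.0, 20000, rng)
```

The accept/reject test is done on the log scale for numerical stability; for a symmetric proposal the q-terms cancel and the algorithm reduces to the original Metropolis scheme.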

Page 13: Metropolis Hastings Algorithm - Properties of acceptance probability α

Given the current state u^{(i)}, we accept the proposed state u′ with probability

    α(u′ | u^{(i)}) = min(1, [π(u′) q(u^{(i)} | u′)] / [π(u^{(i)}) q(u′ | u^{(i)})]).

The proposal u′ has a high probability of acceptance when

- π(u′)/π(u^{(i)}) is large ⇒ u′ has a high target probability ⇒ we choose samples from regions of high target probability,
- q(u^{(i)} | u′)/q(u′ | u^{(i)}) is large ⇒ there is a high probability of moving back to u^{(i)} from u′.

Page 15: Metropolis Hastings Algorithm - Distribution of the Markov chain [Robert, Casella '99]

What is the distribution of the samples {u^{(i)}}_{i=1}^∞ produced by Algorithm 1?

Denote by K^{(i−1)}(· | u^{(1)}) the distribution of u^{(i)} starting at u^{(1)} ∈ R^{d_u}.

Theorem (Convergence to stationary distribution)

Suppose Pr[α = 1] < 1 and

    q(u | u*) > 0 for all u, u* with π(u), π(u*) > 0.

Then, as i → ∞,

    K^{(i−1)}(· | u^{(1)}) → π in total variation.    (u^{(i)} ∼ π)

Here, convergence in total variation means

    sup_{A ∈ B(R^{d_u})} |K^{(i−1)}(A | u^{(1)}) − π(A)| → 0.

Pr[α = 1] < 1 is sufficient for aperiodicity; q(u | u*) > 0 is sufficient for irreducibility.

Page 18: Metropolis Hastings Algorithm - Central Limit Theorem [Robert, Casella '99]

Define an auxiliary chain {u^{(i)}}_{i=1}^∞, generated by Algorithm 1 with u^{(1)} ∼ π. Define the asymptotic variance

    σ_φ^2 := V[φ(u^{(1)})] + 2 ∑_{i=2}^∞ Cov[φ(u^{(1)}), φ(u^{(i)})].

Theorem (Central Limit Theorem)

Suppose σ_φ^2 < ∞, Pr[α = 1] < 1 and q(u | u*) > 0 for all u, u* such that π(u), π(u*) > 0. Then, as N → ∞, we have

    √N σ_φ^{−1} ( (1/N) ∑_{i=1}^N φ(u^{(i)}) − E[φ] ) → N(0, 1) in distribution.

The estimator (1/N) ∑_{i=1}^N φ(u^{(i)}) is asymptotically normally distributed, with mean E[φ] and variance σ_φ^2 / N.

Page 20: Choice of proposal density - Aspects to consider

The proposal density is chosen to balance

accuracy:

- We want to make the asymptotic variance σ_φ^2 small.
- This requires reducing the correlation between samples.
- A related notion is the effective sample size N_eff = N V[φ] / σ_φ^2; increasing N_eff again corresponds to reducing correlation between samples.

cost:

- Sampling from the proposal q(· | u^{(i)}) has varying cost depending on the particular choice.
- This could involve computing the gradient ∇ log π(u^{(i)}) and/or higher-order derivatives.
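The effective sample size can be estimated in practice from the chain's empirical autocorrelations. A rough sketch follows; the truncation rule (stop at the first non-positive autocorrelation) and the AR(1) test chain are illustrative simplifications, not from the slides:

```python
import numpy as np

def effective_sample_size(phi_vals, max_lag=200):
    """Estimate N_eff = N * V[phi] / sigma_phi^2 from one chain.

    Uses sigma_phi^2 ~= V[phi] * (1 + 2 * sum_k rho_k), truncating the
    sum at the first non-positive empirical autocorrelation (a crude rule;
    more careful windowing schemes exist).
    """
    x = np.asarray(phi_vals) - np.mean(phi_vals)
    n = len(x)
    var = np.dot(x, x) / n
    acf_sum = 0.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)   # lag-k autocorrelation
        if rho <= 0.0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

rng = np.random.default_rng(1)
# AR(1) chain with correlation 0.9: N_eff should be well below N = 50000
a = 0.9
z = np.empty(50000)
z[0] = 0.0
for i in range(1, len(z)):
    z[i] = a * z[i - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
n_eff = effective_sample_size(z)
```

For this AR(1) chain the exact integrated autocorrelation time is (1 + a)/(1 − a) = 19, so the estimate should land near 50000/19 ≈ 2600.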

Page 21: Choice of proposal density - Independent proposal

The independence sampler chooses a proposal distribution independent of the current state u^{(i)}: q(· | u^{(i)}) = ν(·).

- This either works very well or very poorly...
- It can work well, e.g. in the Bayesian inference problem with ν = π_0, if the prior π_0 and the posterior π are sufficiently close.
- Note that we do not get independent samples {u^{(i)}}_{i=1}^∞, due to the accept/reject step.
- The independence sampler does not use the current state u^{(i)}...
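As a sketch (the target N(0, 1) and the heavier-tailed proposal ν = N(0, 2^2) are toy choices, not from the slides): for the independence sampler the MH ratio reduces to a ratio of importance weights π/ν, since the q-terms do not depend on the current state.

```python
import numpy as np

def independence_sampler(log_pi, log_nu, sample_nu, u0, N, rng):
    """MH with proposal q(. | u) = nu(.), independent of the current state."""
    u, chain = u0, [u0]
    for _ in range(N - 1):
        v = sample_nu(rng)
        # alpha = min(1, [pi(v)/nu(v)] / [pi(u)/nu(u)])
        log_alpha = (log_pi(v) - log_nu(v)) - (log_pi(u) - log_nu(u))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            u = v
        chain.append(u)
    return np.array(chain)

rng = np.random.default_rng(2)
log_pi = lambda u: -0.5 * u**2              # target N(0,1), unnormalised
log_nu = lambda u: -0.5 * u**2 / 4.0        # proposal N(0, 2^2), heavier tails
chain = independence_sampler(log_pi, log_nu,
                             lambda rng: 2.0 * rng.standard_normal(),
                             0.0, 20000, rng)
```

Because ν has heavier tails than π here, the weight π/ν is bounded and the sampler mixes well; with ν much narrower than π it would get stuck, which is the "very well or very poorly" dichotomy above.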

Page 23: Choice of proposal density - Random walk proposal [Roberts et al '97]

The random walk proposal is given by q(u′ | u^{(i)}) = N(u^{(i)}, β^2 I) for some β > 0, i.e.

    u′ = u^{(i)} + β ξ_i, where ξ_i ∼ N(0, I).

Here, β is a step size that needs to be tuned:

- if β is too small, you don't explore the state space,
- if β is too large, you reject too often,
- both scenarios lead to a large asymptotic variance σ_φ^2.

A general rule of thumb is to tune β such that the average acceptance probability is α ≈ 0.234, to achieve a small asymptotic variance. ⇒ β ∼ d_u^{−1}
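The rule of thumb can be sketched as follows. The 10-dimensional Gaussian target and the bisection on β against the observed acceptance rate are illustrative choices (one of many possible tuning schemes), not prescribed by the slides:

```python
import numpy as np

def rw_acceptance_rate(beta, d, N, rng):
    """Average acceptance rate of random-walk MH, q = N(u, beta^2 I),
    for an unnormalised standard Gaussian target on R^d."""
    log_pi = lambda u: -0.5 * np.dot(u, u)
    u = np.zeros(d)
    accepts = 0
    for _ in range(N):
        v = u + beta * rng.standard_normal(d)
        # symmetric proposal, so the q-terms cancel in the MH ratio
        if np.log(rng.uniform()) < min(0.0, log_pi(v) - log_pi(u)):
            u = v
            accepts += 1
    return accepts / N

rng = np.random.default_rng(3)
d = 10
# bisection: the acceptance rate decreases monotonically as beta grows
lo, hi = 0.01, 5.0
for _ in range(20):
    beta = 0.5 * (lo + hi)
    rate = rw_acceptance_rate(beta, d, 4000, rng)
    if rate > 0.234:
        lo = beta     # steps too timid, enlarge
    else:
        hi = beta     # rejecting too often, shrink
```

For this Gaussian target the tuned β should land near the classical 2.38/√d ≈ 0.75.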

Page 25: Choice of proposal density - Random walk proposal in high dimensions [Cotter et al '13]

Challenge in high dimensions: for any β, the average acceptance rate α → 0 as d_u → ∞. ⇒ σ_φ^2 → ∞ as d_u → ∞

Example in a discretised PDE, with d_u = (Δx)^{−2}. [Figure]

Page 26: Choice of proposal density - Pre-conditioned Crank-Nicolson (pCN) [Cotter et al '13]

- The pre-conditioned Crank-Nicolson (pCN) proposal is well-defined in the infinite-dimensional setting d_u = ∞. ⇒ σ_φ^2 independent of d_u
- The specific form of the pCN proposal depends on a reference measure π_0. If π_0 is N(0, C_0), then q(u′ | u^{(i)}) is defined by

    u′ = √(1 − β^2) u^{(i)} + β ξ_i, where ξ_i ∼ N(0, C_0), β > 0.

- β is a step size parameter that needs to be tuned. The same heuristic of tuning β such that the average acceptance probability is α ≈ 0.234 is often used. ⇒ β ∼ 1
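A minimal pCN sketch, assuming the simplest reference measure π_0 = N(0, I) in d = 50 dimensions and a toy likelihood potential Φ from one noisy observation of the first coordinate (all illustrative assumptions, not from the slides):

```python
import numpy as np

def pcn_step(u, beta, phi, rng):
    """One pCN step for pi(u) ∝ exp(-phi(u)) pi0(u), with pi0 = N(0, I).

    Proposal: u' = sqrt(1 - beta^2) u + beta xi, xi ~ N(0, I).
    The acceptance probability depends only on the potential phi.
    """
    v = np.sqrt(1.0 - beta**2) * u + beta * rng.standard_normal(u.shape)
    if np.log(rng.uniform()) < min(0.0, phi(u) - phi(v)):
        return v, True
    return u, False

rng = np.random.default_rng(4)
d, beta = 50, 0.2
# toy potential: one observation y of u[0] with Gaussian noise (sd 0.5)
y, noise = 1.0, 0.5
phi = lambda u: 0.5 * (y - u[0])**2 / noise**2

u = np.zeros(d)
samples = []
for _ in range(20000):
    u, _ = pcn_step(u, beta, phi, rng)
    samples.append(u[0])
samples = np.array(samples)
```

For this conjugate toy problem the posterior of u[0] is known exactly: N(0.8, 0.2), which the chain's marginal should reproduce. Note that the acceptance test involves only Φ, never π_0 or q, which is the point of π_0-reversibility.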

Page 28: Choice of proposal density - Pre-conditioned Crank-Nicolson (pCN) [Cotter et al '13]

pCN was originally developed for measures µ that are given by a change of measure from a reference Gaussian measure µ_0:

    dµ/dµ_0(u) ∝ exp(−Φ(u)),    π(u) ∝ exp(−Φ(u)) π_0(u).

The pCN proposal is π_0-reversible, i.e.

    π_0(u^{(i)}) q(u′ | u^{(i)}) = π_0(u′) q(u^{(i)} | u′).

The acceptance probability then becomes

    α(u′ | u^{(i)}) = min(1, [exp(−Φ(u′)) π_0(u′) q(u^{(i)} | u′)] / [exp(−Φ(u^{(i)})) π_0(u^{(i)}) q(u′ | u^{(i)})]) = min(1, exp(Φ(u^{(i)}) − Φ(u′))),

since the π_0- and q-terms cancel by reversibility; it depends on u′ only through its likelihood exp(−Φ(u′)).

Page 31: Choice of proposal density - How to incorporate gradient information?

- The proposals we have seen so far are agnostic about which parts of the state space are more probable.
- Ideally we would like proposals that take this into account (⇒ make it more probable to move to areas where π is large).
- Connecting to optimisation, a possible way to do this is to use gradient information and propose the next move as

    u′ = u^{(i)} + β ∇π(u^{(i)}).

1. This is a deterministic move (we are losing randomness, and the ability to explore the state space, as we would converge to a local maximum).
2. How can we do this properly?

Page 33: Choice of proposal density - Metropolis Adjusted Langevin Algorithm (MALA)

MALA: q(u′ | u^{(i)}) = N(u^{(i)} + β ∇ log π(u^{(i)}), 2β I) for some β > 0, i.e.

    u′ = u^{(i)} + β ∇ log π(u^{(i)}) + √(2β) ξ_i, where ξ_i ∼ N(0, I).

For optimal efficiency, the step size β should be tuned to obtain an average acceptance rate of α ≈ 0.574. ⇒ β ∼ d_u^{−1/3}
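A MALA sketch (the 2-dimensional Gaussian target and step size β = 0.3 are illustrative choices). Unlike the random walk, the proposal is asymmetric, so the q-terms in the MH ratio do not cancel and must be evaluated in both directions:

```python
import numpy as np

def mala_step(u, beta, log_pi, grad_log_pi, rng):
    """One MALA step: proposal N(u + beta * grad log pi(u), 2 beta I)."""
    def log_q(b, a):  # log q(b | a), up to an additive constant
        drift = a + beta * grad_log_pi(a)
        return -np.dot(b - drift, b - drift) / (4.0 * beta)
    v = u + beta * grad_log_pi(u) + np.sqrt(2.0 * beta) * rng.standard_normal(u.shape)
    log_alpha = (log_pi(v) + log_q(u, v)) - (log_pi(u) + log_q(v, u))
    return (v, True) if np.log(rng.uniform()) < min(0.0, log_alpha) else (u, False)

rng = np.random.default_rng(5)
log_pi = lambda u: -0.5 * np.dot(u, u)       # standard Gaussian target on R^2
grad_log_pi = lambda u: -u

u = np.zeros(2)
xs = []
for _ in range(20000):
    u, _ = mala_step(u, 0.3, log_pi, grad_log_pi, rng)
    xs.append(u[0])
xs = np.array(xs)
```

The drift β ∇ log π pushes proposals toward high-probability regions, so larger steps can be accepted than with a plain random walk of comparable acceptance rate.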

Page 34: Choice of proposal density - Fokker-Planck equation

Let X(t) satisfy the stochastic differential equation

    dX = f(X) dt + √2 dW,    X(0) = y.

Then the probability density function ρ(x, t) of X(t) satisfies the following partial differential equation:

    ∂ρ/∂t = −∇ · (fρ) + Δρ,    ρ(x, 0) = δ(x − y).

If there exists a probability density function ρ_∞ for which

    ∂ρ_∞/∂t = 0,

we call this a stationary probability density. X(0) ∼ ρ_∞ ⇒ X(t) ∼ ρ_∞ for all t.

Page 35: Choice of proposal density - Designing an SDE with a specific stationary density

If the distribution of X(t) converges to ρ_∞ as t → ∞, then we can sample from ρ_∞ by simulating paths of the SDE.

Question

How can one design an SDE for which ρ_∞ = π?

Page 36: Choice of proposal density - Langevin diffusion

The desired SDE is the Langevin diffusion

    dX = ∇ log π(X) dt + √2 dW.    (1)

Proof: Substituting f = ∇ log π into the right-hand side of the Fokker-Planck equation with ρ = π,

    −∇ · ((∇ log π) π) + ∇ · ∇π = −∇ · (((1/π) ∇π) π) + ∇ · ∇π
                                = −∇ · ∇π + ∇ · ∇π
                                = 0,

and hence π is the stationary density of the Langevin SDE (1).

Page 37: Choice of proposal density - Ornstein-Uhlenbeck process I

Consider

    dX = −X dt + √2 dW,    X(0) = y.    (2)

Using the variation of parameters formula, we can solve this equation to obtain

    X(t) = y e^{−t} + √2 ∫_0^t e^{−(t−s)} dW(s).

Using the properties of the Itô stochastic integral, we can show that

    X(t) ∼ N(y e^{−t}, 1 − e^{−2t}).
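The closed-form law above can be checked numerically by discretising the Itô integral as a sum over Brownian increments (the parameter values y = 2, t = 0.7 and the grid size are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(6)
y, t, M, n = 2.0, 0.7, 20000, 500   # M paths, n grid points on [0, t]
ds = t / n
s = np.arange(n) * ds               # left endpoints of the grid cells

# X(t) = y e^{-t} + sqrt(2) * int_0^t e^{-(t-s)} dW(s),
# with the integral approximated by the Ito (left-endpoint) sum
dW = np.sqrt(ds) * rng.standard_normal((M, n))
x_t = y * np.exp(-t) + np.sqrt(2.0) * (np.exp(-(t - s)) * dW).sum(axis=1)

# predicted law: X(t) ~ N(y e^{-t}, 1 - e^{-2t})
mean_t = y * np.exp(-t)
var_t = 1.0 - np.exp(-2.0 * t)
```

The variance 1 − e^{−2t} comes from the Itô isometry: Var = 2 ∫_0^t e^{−2(t−s)} ds, which the sample variance of `x_t` should match up to Monte Carlo and discretisation error.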

Page 38: Choice of proposal density - Ornstein-Uhlenbeck process II

- In the limit t → ∞, we have X(t) ∼ N(0, 1).
- We know that ρ_∞ = N(0, 1), since ∇ log π(x) = −x ⇒ π(x) ∝ e^{−x^2/2}.
- Starting at a deterministic y ∈ R^{d_u}, in the large time limit the solutions of the SDE are distributed according to the stationary distribution.
- In particular, one can show that under mild assumptions on π, the probability density function ρ(x, t) of X(t) solving the Langevin SDE satisfies

    ‖ρ(x, t) − π(x)‖_TV → 0 as t → ∞.

Page 39: Choice of proposal density - So why aren't we done yet?

- The OU process is simple enough to solve analytically.
- This will not be the case for more complicated SDEs arising from more complicated π's.
- In this case we would need to solve the SDE numerically, in which case we are no longer sampling from π as t → ∞.

Page 40: Choice of proposal density - Euler-Maruyama scheme

Consider the Langevin SDE

    dX = ∇ log π(X) dt + √2 dW.

The simplest numerical scheme for solving an equation of this type is the Euler-Maruyama scheme

    X_{n+1} = X_n + Δt ∇ log π(X_n) + √(2Δt) ξ_n,    ξ_n ∼ N(0, I).

Example: In the case of the OU process, the Euler-Maruyama scheme reads

    X_{n+1} = X_n − Δt X_n + √(2Δt) ξ_n.

Page 41: Choice of proposal density - Stationary distribution for the numerical solution

We have that X_n ∼ N(µ_n, σ_n), where

    µ_{n+1} = (1 − Δt) µ_n,
    σ_{n+1} = (1 − Δt)^2 σ_n + 2Δt.

Hence, as long as |1 − Δt| < 1, we have

    µ_n → 0,    σ_n → 1 / (1 − Δt/2).

The stationary distribution of the numerical solution is different from the true stationary distribution.

⇒ Use the discretised SDE dynamics to define the proposal q!
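The discretisation bias above can be seen directly by running Euler-Maruyama for the OU process with a deliberately large step (Δt = 0.5 and the ensemble size are illustration choices): the chain equilibrates to variance 1/(1 − Δt/2) = 4/3, not the true stationary variance 1.

```python
import numpy as np

rng = np.random.default_rng(7)
dt, M, steps = 0.5, 100000, 200

# Euler-Maruyama for dX = -X dt + sqrt(2) dW, run as an ensemble of M chains
x = np.zeros(M)
for _ in range(steps):
    x = x - dt * x + np.sqrt(2.0 * dt) * rng.standard_normal(M)

empirical_var = x.var()
predicted_var = 1.0 / (1.0 - dt / 2.0)   # fixed point of the variance recursion
```

This bias is exactly what the MH accept/reject step corrects: MALA uses the Euler-Maruyama step only as a proposal, and the correction restores π as the stationary distribution.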

Page 42: Choice of proposal density - Metropolis Adjusted Langevin Algorithm (MALA)

MALA: q(u′ | u^{(i)}) = N(u^{(i)} + β ∇ log π(u^{(i)}), 2β I) for some β > 0, i.e.

    u′ = u^{(i)} + β ∇ log π(u^{(i)}) + √(2β) ξ_i, where ξ_i ∼ N(0, I).

Euler-Maruyama scheme for the Langevin SDE:

    X_{n+1} = X_n + Δt ∇ log π(X_n) + √(2Δt) ξ_n,    ξ_n ∼ N(0, I).

Page 43: Choice of proposal density - Incorporating anisotropy

We have seen a range of proposal densities.

Next step: incorporate Hessian information!

- Riemann manifold Langevin proposals [Girolami, Calderhead '11]
- Dimension-independent likelihood-informed proposals [Cui, Law, Marzouk '16]

Page 44: References I

P. Billingsley, Probability and measure, John Wiley & Sons, 1995.

S. Cotter, M. Dashti, and A. Stuart, Variational data assimilation using targetted random walks, International Journal for Numerical Methods in Fluids, 68 (2012), pp. 403–421.

T. Dodwell, C. Ketelsen, R. Scheichl, and A. Teckentrup, A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 1075–1108.

Y. Efendiev, B. Jin, M. Presho, and X. Tan, Multilevel Markov chain Monte Carlo method for high-contrast single-phase flow problems, Communications in Computational Physics, 17 (2015), pp. 259–286.

Page 45: References II

M. Giles, Multilevel Monte Carlo path simulation, Operations Research, 56 (2008), pp. 607–617.

S. Heinrich, Multilevel Monte Carlo methods, in International Conference on Large-Scale Scientific Computing, Springer, 2001, pp. 58–67.

V. Hoang, C. Schwab, and A. Stuart, Complexity analysis of accelerated MCMC methods for Bayesian inversion, Inverse Problems, 29 (2013), p. 085010.

B. Hosseini, Two Metropolis-Hastings algorithms for posterior measures with non-Gaussian priors in infinite dimensions, SIAM/ASA Journal on Uncertainty Quantification, 7 (2019), pp. 1185–1223.

J. Kaipio and E. Somersalo, Statistical and computational inverse problems, Springer, 2004.

Page 46: References III

P. L'Ecuyer, Random number generation, in Handbook of Computational Statistics, Springer, 2011, pp. 35–71.

C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 1999.

G. O. Roberts, A. Gelman, W. R. Gilks, et al., Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability, 7 (1997), pp. 110–120.

D. Rudolf, Explicit error bounds for Markov chain Monte Carlo, arXiv preprint arXiv:1108.3201, (2011).

A. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451–559.

Page 47: References IV

S. J. Vollmer, Dimension-independent MCMC sampling for inverse problems with non-Gaussian priors, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 535–561.