
Page 1: Title

An introduction to Markov chain Monte Carlo methods

Aretha Teckentrup

School of Mathematics, University of Edinburgh

MCQMC '20 - August 10, 2020

A. Teckentrup (Edinburgh) MCMC August 10, 2020 1 / 33

Page 2: Outline

1. Motivation
2. Metropolis Hastings Algorithm
3. Choice of proposal density

Page 3: Motivation - Sampling methods

In many applications, one is interested in computing the expected value of a quantity of interest φ under the target distribution π:

    E[φ] = ∫_{R^{d_u}} φ(u) π(u) du.

The integral is typically intractable, and can be approximated using a sampling method:

    E[φ] ≈ (1/N) ∑_{i=1}^N φ(u^{(i)}),

where u^{(i)} ∼ π for all 1 ≤ i ≤ N.

Page 4: Motivation - Sampling methods

Generating samples u^{(i)} ∼ π is often difficult since:

- π is not known in closed form, e.g. only up to a normalisation constant ⇒ it is not possible to generate independent (i.i.d.) samples from π,
- the state variable u ∈ R^{d_u} is high-dimensional,
- π can concentrate on low-dimensional manifolds.

Page 5: Motivation - Example: Bayesian inference [Kaipio, Somersalo '04]

In a Bayesian inference problem (parameter identification problem), we are interested in π being the posterior distribution:

    π(u) = (1/Z) exp(−‖Γ^{−1/2}(y − F(u))‖_2^2) π_0(u),

where the normalising constant Z is intractable.

This arises from

- incorporating knowledge on u in a prior distribution π_0,
- observing data y = F(u) + η, with noise η ∼ N(0, Γ),
- conditioning π_0 on y, resulting in the posterior distribution π.

[Figures: truth and a posterior sample]

Page 6: Metropolis Hastings Algorithm - Markov chain Monte Carlo (MCMC) [Robert, Casella '99]

A Markov chain Monte Carlo (MCMC) estimator of E[φ] is of the form

    E^{MCMC}_N := (1/N) ∑_{i=1}^N φ(u^{(i)}),

where {u^{(i)}}_{i=1}^∞ is a Markov chain.

Definition (Markov chain)

The family of random variables {u^{(i)}}_{i=1}^∞ is called a Markov chain if

    Pr[u^{(i)} = x_i | u^{(1)} = x_1, ..., u^{(i−1)} = x_{i−1}] = Pr[u^{(i)} = x_i | u^{(i−1)} = x_{i−1}],

for all i ≥ 2 and x_1, ..., x_i ∈ R^{d_u}.

We want the distribution of each u^{(i)} to be (close to) π.

Page 8: Metropolis Hastings Algorithm - Markov chain Monte Carlo (MCMC) [Robert, Casella '99]

Why Markov chains?

Practical advantage:

- Allows for sequential construction: construct u^{(i+1)} from u^{(i)}.

Theoretical advantages:

- Stationary distributions: we can construct {u^{(i)}}_{i=1}^∞ such that u^{(i)} ∼ π as i → ∞, for any u^{(1)} ∈ R^{d_u}.
- Ergodic averages: we can construct {u^{(i)}}_{i=1}^∞ such that (1/N) ∑_{i=1}^N φ(u^{(i)}) → E[φ] as N → ∞.

Page 9: Metropolis Hastings Algorithm - Standard Algorithm [Robert, Casella '99]

A particular example is the Metropolis-Hastings (MH-MCMC) estimator, which uses the following algorithm to construct {u^{(i)}}_{i=1}^∞:

ALGORITHM 1. (Metropolis-Hastings)

1. Choose u^{(1)} with π(u^{(1)}) > 0.
2. At state u^{(i)}, sample a proposal u′ from the density q(u′ | u^{(i)}).
3. Accept the sample u′ with probability

    α(u′ | u^{(i)}) = min(1, [π(u′) q(u^{(i)} | u′)] / [π(u^{(i)}) q(u′ | u^{(i)})]),

i.e. set u^{(i+1)} = u′ with probability α(u′ | u^{(i)}); otherwise stay at u^{(i+1)} = u^{(i)}.

- The proposal density q is chosen to be easy to sample from.
- The accept/reject step is added in order to obtain samples from π.
- Knowledge of the normalising constant Z of π is not required.
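A minimal sketch of Algorithm 1 in Python. The helper names, the unnormalised Gaussian target and the random walk proposal are illustrative choices, not from the slides; note that only log π up to an additive constant is needed, mirroring the point that Z is not required.

```python
import numpy as np

def metropolis_hastings(log_target, propose, log_q, u0, N, rng):
    """Metropolis-Hastings: build a chain u^(1), ..., u^(N).

    log_target -- log pi(u) up to an additive constant (Z not needed)
    propose    -- draws u' ~ q(. | u)
    log_q      -- log q(u' | u), up to an additive constant
    """
    chain = [u0]
    u = u0
    for _ in range(N - 1):
        v = propose(u, rng)
        # log of the MH ratio pi(u') q(u|u') / (pi(u) q(u'|u))
        log_alpha = (log_target(v) + log_q(u, v)) - (log_target(u) + log_q(v, u))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            u = v            # accept the proposal
        chain.append(u)      # otherwise stay at u^(i)
    return np.array(chain)

rng = np.random.default_rng(0)
# unnormalised standard Gaussian target, Gaussian random walk proposal N(u, 0.5^2)
log_pi = lambda u: -0.5 * u**2
prop = lambda u, rng: u + 0.5 * rng.standard_normal()
log_q = lambda v, u: -0.5 * (v - u)**2 / 0.25   # symmetric, so it cancels here

chain = metropolis_hastings(log_pi, prop, log_q, 0.0, 20000, rng)
```

The accept/reject test is done on the log scale for numerical stability; for a symmetric proposal the q-terms cancel and the algorithm reduces to the original Metropolis scheme.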

Page 13: Metropolis Hastings Algorithm - Properties of acceptance probability α

Given the current state u^{(i)}, we accept the proposed state u′ with probability

    α(u′ | u^{(i)}) = min(1, [π(u′) q(u^{(i)} | u′)] / [π(u^{(i)}) q(u′ | u^{(i)})]).

The proposal u′ has a high probability of acceptance when

- π(u′)/π(u^{(i)}) is large ⇒ u′ has a high target probability ⇒ we choose samples from regions of high target probability,
- q(u^{(i)} | u′)/q(u′ | u^{(i)}) is large ⇒ there is a high probability of moving back to u^{(i)} from u′.

Page 15: Metropolis Hastings Algorithm - Distribution of the Markov chain [Robert, Casella '99]

What is the distribution of the samples {u^{(i)}}_{i=1}^∞ produced by Algorithm 1?

Denote by K^{(i−1)}(· | u^{(1)}) the distribution of u^{(i)} starting at u^{(1)} ∈ R^{d_u}.

Theorem (Convergence to stationary distribution)

Suppose Pr[α = 1] < 1 and

    q(u | u*) > 0 for all u, u* with π(u), π(u*) > 0.

Then, as i → ∞,

    K^{(i−1)}(· | u^{(1)}) → π in total variation.    (u^{(i)} ∼ π)

Here, convergence in total variation means

    sup_{A ∈ B(R^{d_u})} |K^{(i−1)}(A | u^{(1)}) − π(A)| → 0.

Pr[α = 1] < 1 is sufficient for aperiodicity; q(u | u*) > 0 is sufficient for irreducibility.

Page 18: Metropolis Hastings Algorithm - Central Limit Theorem [Robert, Casella '99]

Define an auxiliary chain {u^{(i)}}_{i=1}^∞, generated by Algorithm 1 with u^{(1)} ∼ π. Define the asymptotic variance

    σ_φ^2 := V[φ(u^{(1)})] + 2 ∑_{i=2}^∞ Cov[φ(u^{(1)}), φ(u^{(i)})].

Theorem (Central Limit Theorem)

Suppose σ_φ^2 < ∞, Pr[α = 1] < 1 and q(u | u*) > 0 for all u, u* such that π(u), π(u*) > 0. Then, as N → ∞, we have

    √N σ_φ^{−1} ( (1/N) ∑_{i=1}^N φ(u^{(i)}) − E[φ] ) → N(0, 1) in distribution.

The estimator (1/N) ∑_{i=1}^N φ(u^{(i)}) is asymptotically normally distributed, with mean E[φ] and variance σ_φ^2 / N.

Page 20: Choice of proposal density - Aspects to consider

The proposal density is chosen to balance

accuracy:

- We want to make the asymptotic variance σ_φ^2 small.
- This requires reducing the correlation between samples.
- A related notion is the effective sample size N_eff = N V[φ] / σ_φ^2; increasing N_eff again corresponds to reducing correlation between samples.

cost:

- Sampling from the proposal q(· | u^{(i)}) has varying cost depending on the particular choice.
- This could involve computing the gradient ∇ log π(u^{(i)}) and/or higher-order derivatives.
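The effective sample size can be estimated in practice from the chain's empirical autocorrelations. A rough sketch follows; the truncation rule (stop at the first non-positive autocorrelation) and the AR(1) test chain are illustrative simplifications, not from the slides:

```python
import numpy as np

def effective_sample_size(phi_vals, max_lag=200):
    """Estimate N_eff = N * V[phi] / sigma_phi^2 from one chain.

    Uses sigma_phi^2 ~= V[phi] * (1 + 2 * sum_k rho_k), truncating the
    sum at the first non-positive empirical autocorrelation (a crude rule;
    more careful windowing schemes exist).
    """
    x = np.asarray(phi_vals) - np.mean(phi_vals)
    n = len(x)
    var = np.dot(x, x) / n
    acf_sum = 0.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)   # lag-k autocorrelation
        if rho <= 0.0:
            break
        acf_sum += rho
    return n / (1.0 + 2.0 * acf_sum)

rng = np.random.default_rng(1)
# AR(1) chain with correlation 0.9: N_eff should be well below N = 50000
a = 0.9
z = np.empty(50000)
z[0] = 0.0
for i in range(1, len(z)):
    z[i] = a * z[i - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
n_eff = effective_sample_size(z)
```

For this AR(1) chain the exact integrated autocorrelation time is (1 + a)/(1 − a) = 19, so the estimate should land near 50000/19 ≈ 2600.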

Page 21: Choice of proposal density - Independent proposal

The independence sampler chooses a proposal distribution independent of the current state u^{(i)}: q(· | u^{(i)}) = ν(·).

- This either works very well or very poorly...
- It can work well, e.g. in the Bayesian inference problem with ν = π_0, if the prior π_0 and the posterior π are sufficiently close.
- Note that we do not get independent samples {u^{(i)}}_{i=1}^∞, due to the accept/reject step.
- The independence sampler does not use the current state u^{(i)}...
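As a sketch (the target N(0, 1) and the heavier-tailed proposal ν = N(0, 2^2) are toy choices, not from the slides): for the independence sampler the MH ratio reduces to a ratio of importance weights π/ν, since the q-terms do not depend on the current state.

```python
import numpy as np

def independence_sampler(log_pi, log_nu, sample_nu, u0, N, rng):
    """MH with proposal q(. | u) = nu(.), independent of the current state."""
    u, chain = u0, [u0]
    for _ in range(N - 1):
        v = sample_nu(rng)
        # alpha = min(1, [pi(v)/nu(v)] / [pi(u)/nu(u)])
        log_alpha = (log_pi(v) - log_nu(v)) - (log_pi(u) - log_nu(u))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            u = v
        chain.append(u)
    return np.array(chain)

rng = np.random.default_rng(2)
log_pi = lambda u: -0.5 * u**2              # target N(0,1), unnormalised
log_nu = lambda u: -0.5 * u**2 / 4.0        # proposal N(0, 2^2), heavier tails
chain = independence_sampler(log_pi, log_nu,
                             lambda rng: 2.0 * rng.standard_normal(),
                             0.0, 20000, rng)
```

Because ν has heavier tails than π here, the weight π/ν is bounded and the sampler mixes well; with ν much narrower than π it would get stuck, which is the "very well or very poorly" dichotomy above.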

Page 23: Choice of proposal density - Random walk proposal [Roberts et al '97]

The random walk proposal is given by q(u′ | u^{(i)}) = N(u^{(i)}, β^2 I) for some β > 0, i.e.

    u′ = u^{(i)} + β ξ_i, where ξ_i ∼ N(0, I).

Here, β is a step size that needs to be tuned:

- if β is too small, you don't explore the state space,
- if β is too large, you reject too often,
- both scenarios lead to a large asymptotic variance σ_φ^2.

A general rule of thumb is to tune β such that the average acceptance probability is α ≈ 0.234, to achieve a small asymptotic variance. ⇒ β ∼ d_u^{−1}
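The rule of thumb can be sketched as follows. The 10-dimensional Gaussian target and the bisection on β against the observed acceptance rate are illustrative choices (one of many possible tuning schemes), not prescribed by the slides:

```python
import numpy as np

def rw_acceptance_rate(beta, d, N, rng):
    """Average acceptance rate of random-walk MH, q = N(u, beta^2 I),
    for an unnormalised standard Gaussian target on R^d."""
    log_pi = lambda u: -0.5 * np.dot(u, u)
    u = np.zeros(d)
    accepts = 0
    for _ in range(N):
        v = u + beta * rng.standard_normal(d)
        # symmetric proposal, so the q-terms cancel in the MH ratio
        if np.log(rng.uniform()) < min(0.0, log_pi(v) - log_pi(u)):
            u = v
            accepts += 1
    return accepts / N

rng = np.random.default_rng(3)
d = 10
# bisection: the acceptance rate decreases monotonically as beta grows
lo, hi = 0.01, 5.0
for _ in range(20):
    beta = 0.5 * (lo + hi)
    rate = rw_acceptance_rate(beta, d, 4000, rng)
    if rate > 0.234:
        lo = beta     # steps too timid, enlarge
    else:
        hi = beta     # rejecting too often, shrink
```

For this Gaussian target the tuned β should land near the classical 2.38/√d ≈ 0.75.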

Page 25: Choice of proposal density - Random walk proposal in high dimensions [Cotter et al '13]

Challenge in high dimensions: for any β, the average acceptance rate α → 0 as d_u → ∞. ⇒ σ_φ^2 → ∞ as d_u → ∞

Example in a discretised PDE, with d_u = (Δx)^{−2}. [Figure]

Page 26: Choice of proposal density - Pre-conditioned Crank-Nicolson (pCN) [Cotter et al '13]

- The pre-conditioned Crank-Nicolson (pCN) proposal is well-defined in the infinite-dimensional setting d_u = ∞. ⇒ σ_φ^2 independent of d_u
- The specific form of the pCN proposal depends on a reference measure π_0. If π_0 is N(0, C_0), then q(u′ | u^{(i)}) is defined by

    u′ = √(1 − β^2) u^{(i)} + β ξ_i, where ξ_i ∼ N(0, C_0), β > 0.

- β is a step size parameter that needs to be tuned. The same heuristic of tuning β such that the average acceptance probability is α ≈ 0.234 is often used. ⇒ β ∼ 1
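A minimal pCN sketch, assuming the simplest reference measure π_0 = N(0, I) in d = 50 dimensions and a toy likelihood potential Φ from one noisy observation of the first coordinate (all illustrative assumptions, not from the slides):

```python
import numpy as np

def pcn_step(u, beta, phi, rng):
    """One pCN step for pi(u) ∝ exp(-phi(u)) pi0(u), with pi0 = N(0, I).

    Proposal: u' = sqrt(1 - beta^2) u + beta xi, xi ~ N(0, I).
    The acceptance probability depends only on the potential phi.
    """
    v = np.sqrt(1.0 - beta**2) * u + beta * rng.standard_normal(u.shape)
    if np.log(rng.uniform()) < min(0.0, phi(u) - phi(v)):
        return v, True
    return u, False

rng = np.random.default_rng(4)
d, beta = 50, 0.2
# toy potential: one observation y of u[0] with Gaussian noise (sd 0.5)
y, noise = 1.0, 0.5
phi = lambda u: 0.5 * (y - u[0])**2 / noise**2

u = np.zeros(d)
samples = []
for _ in range(20000):
    u, _ = pcn_step(u, beta, phi, rng)
    samples.append(u[0])
samples = np.array(samples)
```

For this conjugate toy problem the posterior of u[0] is known exactly: N(0.8, 0.2), which the chain's marginal should reproduce. Note that the acceptance test involves only Φ, never π_0 or q, which is the point of π_0-reversibility.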

Page 28: Choice of proposal density - Pre-conditioned Crank-Nicolson (pCN) [Cotter et al '13]

pCN was originally developed for measures µ that are given by a change of measure from a reference Gaussian measure µ_0:

    dµ/dµ_0(u) ∝ exp(−Φ(u)),    π(u) ∝ exp(−Φ(u)) π_0(u).

The pCN proposal is π_0-reversible, i.e.

    π_0(u^{(i)}) q(u′ | u^{(i)}) = π_0(u′) q(u^{(i)} | u′).

The acceptance probability then becomes

    α(u′ | u^{(i)}) = min(1, [exp(−Φ(u′)) π_0(u′) q(u^{(i)} | u′)] / [exp(−Φ(u^{(i)})) π_0(u^{(i)}) q(u′ | u^{(i)})]) = min(1, exp(Φ(u^{(i)}) − Φ(u′))),

since the π_0- and q-terms cancel by reversibility; it depends on u′ only through its likelihood exp(−Φ(u′)).

Page 31: Choice of proposal density - How to incorporate gradient information?

- The proposals we have seen so far are agnostic about which parts of the state space are more probable.
- Ideally we would like proposals that take this into account (⇒ make it more probable to move to areas where π is large).
- Connecting to optimisation, a possible way to do this is to use gradient information and propose the next move as

    u′ = u^{(i)} + β ∇π(u^{(i)}).

1. This is a deterministic move (we are losing randomness, and the ability to explore the state space, as we would converge to a local maximum).
2. How can we do this properly?

Page 33: Choice of proposal density - Metropolis Adjusted Langevin Algorithm (MALA)

MALA: q(u′ | u^{(i)}) = N(u^{(i)} + β ∇ log π(u^{(i)}), 2β I) for some β > 0, i.e.

    u′ = u^{(i)} + β ∇ log π(u^{(i)}) + √(2β) ξ_i, where ξ_i ∼ N(0, I).

For optimal efficiency, the step size β should be tuned to obtain an average acceptance rate of α ≈ 0.574. ⇒ β ∼ d_u^{−1/3}
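A MALA sketch (the 2-dimensional Gaussian target and step size β = 0.3 are illustrative choices). Unlike the random walk, the proposal is asymmetric, so the q-terms in the MH ratio do not cancel and must be evaluated in both directions:

```python
import numpy as np

def mala_step(u, beta, log_pi, grad_log_pi, rng):
    """One MALA step: proposal N(u + beta * grad log pi(u), 2 beta I)."""
    def log_q(b, a):  # log q(b | a), up to an additive constant
        drift = a + beta * grad_log_pi(a)
        return -np.dot(b - drift, b - drift) / (4.0 * beta)
    v = u + beta * grad_log_pi(u) + np.sqrt(2.0 * beta) * rng.standard_normal(u.shape)
    log_alpha = (log_pi(v) + log_q(u, v)) - (log_pi(u) + log_q(v, u))
    return (v, True) if np.log(rng.uniform()) < min(0.0, log_alpha) else (u, False)

rng = np.random.default_rng(5)
log_pi = lambda u: -0.5 * np.dot(u, u)       # standard Gaussian target on R^2
grad_log_pi = lambda u: -u

u = np.zeros(2)
xs = []
for _ in range(20000):
    u, _ = mala_step(u, 0.3, log_pi, grad_log_pi, rng)
    xs.append(u[0])
xs = np.array(xs)
```

The drift β ∇ log π pushes proposals toward high-probability regions, so larger steps can be accepted than with a plain random walk of comparable acceptance rate.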

Page 34: Choice of proposal density - Fokker-Planck equation

Let X(t) satisfy the stochastic differential equation

    dX = f(X) dt + √2 dW,    X(0) = y.

Then the probability density function ρ(x, t) of X(t) satisfies the following partial differential equation:

    ∂ρ/∂t = −∇ · (fρ) + Δρ,    ρ(x, 0) = δ(x − y).

If there exists a probability density function ρ_∞ for which

    ∂ρ_∞/∂t = 0,

we call this a stationary probability density. X(0) ∼ ρ_∞ ⇒ X(t) ∼ ρ_∞ for all t.

Page 35: Choice of proposal density - Designing an SDE with a specific stationary density

If the distribution of X(t) converges to ρ_∞ as t → ∞, then we can sample from ρ_∞ by simulating paths of the SDE.

Question

How can one design an SDE for which ρ_∞ = π?

Page 36: Choice of proposal density - Langevin diffusion

The desired SDE is the Langevin diffusion

    dX = ∇ log π(X) dt + √2 dW.    (1)

Proof: Substituting f = ∇ log π into the right-hand side of the Fokker-Planck equation with ρ = π,

    −∇ · ((∇ log π) π) + ∇ · ∇π = −∇ · (((1/π) ∇π) π) + ∇ · ∇π
                                = −∇ · ∇π + ∇ · ∇π
                                = 0,

and hence π is the stationary density of the Langevin SDE (1).

Page 37: Choice of proposal density - Ornstein-Uhlenbeck process I

Consider

    dX = −X dt + √2 dW,    X(0) = y.    (2)

Using the variation of parameters formula, we can solve this equation to obtain

    X(t) = y e^{−t} + √2 ∫_0^t e^{−(t−s)} dW(s).

Using the properties of the Itô stochastic integral, we can show that

    X(t) ∼ N(y e^{−t}, 1 − e^{−2t}).
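The closed-form law above can be checked numerically by discretising the Itô integral as a sum over Brownian increments (the parameter values y = 2, t = 0.7 and the grid size are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(6)
y, t, M, n = 2.0, 0.7, 20000, 500   # M paths, n grid points on [0, t]
ds = t / n
s = np.arange(n) * ds               # left endpoints of the grid cells

# X(t) = y e^{-t} + sqrt(2) * int_0^t e^{-(t-s)} dW(s),
# with the integral approximated by the Ito (left-endpoint) sum
dW = np.sqrt(ds) * rng.standard_normal((M, n))
x_t = y * np.exp(-t) + np.sqrt(2.0) * (np.exp(-(t - s)) * dW).sum(axis=1)

# predicted law: X(t) ~ N(y e^{-t}, 1 - e^{-2t})
mean_t = y * np.exp(-t)
var_t = 1.0 - np.exp(-2.0 * t)
```

The variance 1 − e^{−2t} comes from the Itô isometry: Var = 2 ∫_0^t e^{−2(t−s)} ds, which the sample variance of `x_t` should match up to Monte Carlo and discretisation error.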

Page 38: Choice of proposal density - Ornstein-Uhlenbeck process II

- In the limit t → ∞, we have X(t) ∼ N(0, 1).
- We know that ρ_∞ = N(0, 1), since ∇ log π(x) = −x ⇒ π(x) ∝ e^{−x^2/2}.
- Starting at a deterministic y ∈ R^{d_u}, in the large time limit the solutions of the SDE are distributed according to the stationary distribution.
- In particular, one can show that under mild assumptions on π, the probability density function ρ(x, t) of X(t) solving the Langevin SDE satisfies

    ‖ρ(x, t) − π(x)‖_TV → 0 as t → ∞.

Page 39: Choice of proposal density - So why aren't we done yet?

- The OU process is simple enough to solve analytically.
- This will not be the case for more complicated SDEs arising from more complicated π's.
- In this case we would need to solve the SDE numerically, in which case we are no longer sampling from π as t → ∞.

Page 40: Choice of proposal density - Euler-Maruyama scheme

Consider the Langevin SDE

    dX = ∇ log π(X) dt + √2 dW.

The simplest numerical scheme for solving an equation of this type is the Euler-Maruyama scheme

    X_{n+1} = X_n + Δt ∇ log π(X_n) + √(2Δt) ξ_n,    ξ_n ∼ N(0, I).

Example: In the case of the OU process, the Euler-Maruyama scheme reads

    X_{n+1} = X_n − Δt X_n + √(2Δt) ξ_n.

Page 41: Choice of proposal density - Stationary distribution for the numerical solution

We have that X_n ∼ N(µ_n, σ_n), where

    µ_{n+1} = (1 − Δt) µ_n,
    σ_{n+1} = (1 − Δt)^2 σ_n + 2Δt.

Hence, as long as |1 − Δt| < 1, we have

    µ_n → 0,    σ_n → 1 / (1 − Δt/2).

The stationary distribution of the numerical solution is different from the true stationary distribution.

⇒ Use the discretised SDE dynamics to define the proposal q!
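The discretisation bias above can be seen directly by running Euler-Maruyama for the OU process with a deliberately large step (Δt = 0.5 and the ensemble size are illustration choices): the chain equilibrates to variance 1/(1 − Δt/2) = 4/3, not the true stationary variance 1.

```python
import numpy as np

rng = np.random.default_rng(7)
dt, M, steps = 0.5, 100000, 200

# Euler-Maruyama for dX = -X dt + sqrt(2) dW, run as an ensemble of M chains
x = np.zeros(M)
for _ in range(steps):
    x = x - dt * x + np.sqrt(2.0 * dt) * rng.standard_normal(M)

empirical_var = x.var()
predicted_var = 1.0 / (1.0 - dt / 2.0)   # fixed point of the variance recursion
```

This bias is exactly what the MH accept/reject step corrects: MALA uses the Euler-Maruyama step only as a proposal, and the correction restores π as the stationary distribution.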

Page 42: Choice of proposal density - Metropolis Adjusted Langevin Algorithm (MALA)

MALA: q(u′ | u^{(i)}) = N(u^{(i)} + β ∇ log π(u^{(i)}), 2β I) for some β > 0, i.e.

    u′ = u^{(i)} + β ∇ log π(u^{(i)}) + √(2β) ξ_i, where ξ_i ∼ N(0, I).

Euler-Maruyama scheme for the Langevin SDE:

    X_{n+1} = X_n + Δt ∇ log π(X_n) + √(2Δt) ξ_n,    ξ_n ∼ N(0, I).

Page 43: Choice of proposal density - Incorporating anisotropy

We have seen a range of proposal densities.

Next step: incorporate Hessian information!

- Riemann manifold Langevin proposals [Girolami, Calderhead '11]
- Dimension-independent likelihood-informed proposals [Cui, Law, Marzouk '16]

Page 44: References I

P. Billingsley, Probability and measure, John Wiley & Sons, 1995.

S. Cotter, M. Dashti, and A. Stuart, Variational data assimilation using targetted random walks, International Journal for Numerical Methods in Fluids, 68 (2012), pp. 403–421.

T. Dodwell, C. Ketelsen, R. Scheichl, and A. Teckentrup, A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 1075–1108.

Y. Efendiev, B. Jin, M. Presho, and X. Tan, Multilevel Markov chain Monte Carlo method for high-contrast single-phase flow problems, Communications in Computational Physics, 17 (2015), pp. 259–286.

Page 45: References II

M. Giles, Multilevel Monte Carlo path simulation, Operations Research, 56 (2008), pp. 607–617.

S. Heinrich, Multilevel Monte Carlo methods, in International Conference on Large-Scale Scientific Computing, Springer, 2001, pp. 58–67.

V. Hoang, C. Schwab, and A. Stuart, Complexity analysis of accelerated MCMC methods for Bayesian inversion, Inverse Problems, 29 (2013), p. 085010.

B. Hosseini, Two Metropolis-Hastings algorithms for posterior measures with non-Gaussian priors in infinite dimensions, SIAM/ASA Journal on Uncertainty Quantification, 7 (2019), pp. 1185–1223.

J. Kaipio and E. Somersalo, Statistical and computational inverse problems, Springer, 2004.

Page 46: References III

P. L'Ecuyer, Random number generation, in Handbook of Computational Statistics, Springer, 2011, pp. 35–71.

C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 1999.

G. O. Roberts, A. Gelman, W. R. Gilks, et al., Weak convergence and optimal scaling of random walk Metropolis algorithms, The Annals of Applied Probability, 7 (1997), pp. 110–120.

D. Rudolf, Explicit error bounds for Markov chain Monte Carlo, arXiv preprint arXiv:1108.3201, (2011).

A. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451–559.

Page 47: References IV

S. J. Vollmer, Dimension-independent MCMC sampling for inverse problems with non-Gaussian priors, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 535–561.