lectures 1{2: bayesian inference and mcmc foundations
TRANSCRIPT
Lectures 1–2:
Bayesian inference and MCMC foundations
Youssef Marzouk
Department of Aeronautics and Astronautics
Center for Computational Engineering
Statistics and Data Science Center
Massachusetts Institute of Technology
http://uqgroup.mit.edu
19–22 March 2018
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 1 / 53
Agenda
Plan for the lectures:
1 Lectures 1–2: Bayesian inference and MCMC foundations
Bayesian modeling
Computational approaches/demos
2 Lecture 3: Bayesian approach to inverse problems
What distinguishes inverse problems? Elements of a Bayesian inverse
problem formulation
Linear–Gaussian problems in detail
Computational issues: surrogate modeling/likelihood approximations,
parameter dimension reduction, . . .
3 Lecture 4: Bayesian optimal experimental design or some other
topic TBD
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 2 / 53
Statistical inference
Why is a statistical perspective useful in data assimilation?
To characterize uncertainty in the parameters and/or state of a
system
To understand how this uncertainty depends on the number and
quality of observations, features of the model, prior information, etc.
To make probabilistic predictions
To choose useful observations or experiments
To address questions of model error and model validity; to perform
model selection
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 3 / 53
Bayesian inference
Bayes’ rule
p(θ|y) =p(y |θ)p(θ)
p(y)
Key idea: model parameters θ are treated as random variables
(For simplicity, we let our random variables have densities)
Notation
θ are model parameters; y are the data; assume both to be
finite-dimensional unless otherwise indicated
p(θ) is the prior probability density
L(θ) := p(y |θ) is the likelihood function
p(θ|y) is the posterior probability density
p(y) is the evidence, or equivalently, the marginal likelihood
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 4 / 53
Bayesian inference example
Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes (yi)ni=1 ∈ {0, 1}n.
Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”
Elements of the Bayesian formulation:
Likelihood function
Single observation: Yi ∼ Ber(θ), where θ := P [Yi = 1] . Hence:
P(yi |θ) = θyi (1− θ)(1−yi )
Multiple observations are conditionally independent given θ:
P (y1, . . . , yn|θ) =
n∏i=1
P(yi |θ) =
n∏i=1
θyi (1− θ)1−yi
Rewrite more compactly, with k :=∑ni=1 yi heads in n trials:
P(k |θ, n) =
(n
k
)θk(1− θ)n−k , i.e., K ∼ Binomial(θ, n)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 5 / 53
Bayesian inference example
Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes (yi)ni=1 ∈ {0, 1}n.
Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”
Elements of the Bayesian formulation:
Likelihood function
Single observation: Yi ∼ Ber(θ), where θ := P [Yi = 1] . Hence:
P(yi |θ) = θyi (1− θ)(1−yi )
Multiple observations are conditionally independent given θ:
P (y1, . . . , yn|θ) =
n∏i=1
P(yi |θ) =
n∏i=1
θyi (1− θ)1−yi
Rewrite more compactly, with k :=∑ni=1 yi heads in n trials:
P(k |θ, n) =
(n
k
)θk(1− θ)n−k , i.e., K ∼ Binomial(θ, n)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 5 / 53
Bayesian inference example
Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes (yi)ni=1 ∈ {0, 1}n.
Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”
Elements of the Bayesian formulation:
Likelihood function
Single observation: Yi ∼ Ber(θ), where θ := P [Yi = 1] . Hence:
P(yi |θ) = θyi (1− θ)(1−yi )
Multiple observations are conditionally independent given θ:
P (y1, . . . , yn|θ) =
n∏i=1
P(yi |θ) =
n∏i=1
θyi (1− θ)1−yi
Rewrite more compactly, with k :=∑ni=1 yi heads in n trials:
P(k |θ, n) =
(n
k
)θk(1− θ)n−k , i.e., K ∼ Binomial(θ, n)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 5 / 53
Bayesian inference example
Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes (yi)ni=1 ∈ {0, 1}n.
Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”
Elements of the Bayesian formulation:
Prior distribution
Use Θ ∼ Beta(β1, β2):
p(θ |β1, β2) ∝ θ β1−1(1− θ)β2−1
Uniform distribution is a special case: Beta(1, 1) = U(0, 1)
Posterior distribution
Posterior density follows from simple algebra:
p(θ|k , n, β1, β2) ∝ p(k |θ, n)p(θ |β1, β2) ∝ θk+β1−1(1− θ)n−k+β2−1
i.e., Θ|k , n ∼ Beta(β1 + k , β2 + n − k)
This happens to be a conjugate Bayesian model!
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 6 / 53
Bayesian inference example
Infer the bias θ ∈ [0, 1] of a coin, given flip outcomes (yi)ni=1 ∈ {0, 1}n.
Convention: outcome yi = 1 is “heads” and yi = 0 is “tails.”
Elements of the Bayesian formulation:
Prior distribution
Use Θ ∼ Beta(β1, β2):
p(θ |β1, β2) ∝ θ β1−1(1− θ)β2−1
Uniform distribution is a special case: Beta(1, 1) = U(0, 1)
Posterior distribution
Posterior density follows from simple algebra:
p(θ|k , n, β1, β2) ∝ p(k |θ, n)p(θ |β1, β2) ∝ θk+β1−1(1− θ)n−k+β2−1
i.e., Θ|k , n ∼ Beta(β1 + k , β2 + n − k)
This happens to be a conjugate Bayesian model!
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 6 / 53
Coin flip example 1/2 [Sivia 2006]16 Parameter estimation I
Fig. 2.1 The evolution of the posterior pdf for the bias-weighting of a coin, prob(H |{data}, I ),as the number of data available increases. The figure on the top right-hand corner of each panelshows the number of data analysed; in the early panels, the H or T in the top left-hand cornershows whether the result of that (last) flip was a head or a tail.
from Sivia & Skilling, Data Analysis: a Bayesian Tutorial, Oxford UP (2006).
Coin flip example 1/2 [Sivia 2006]
16 Parameter estimation I
Fig. 2.1 The evolution of the posterior pdf for the bias-weighting of a coin, prob(H |{data}, I ),as the number of data available increases. The figure on the top right-hand corner of each panelshows the number of data analysed; in the early panels, the H or T in the top left-hand cornershows whether the result of that (last) flip was a head or a tail.
from Sivia & Skilling, Data Analysis: a Bayesian Tutorial, Oxford UP (2006).
Coin flip example 2/2 [Sivia 2006]18 Parameter estimation I
Fig. 2.2 The effect of different priors, prob(H |I ), on the posterior pdf for the bias-weighting ofa coin. The solid line is the same as in Fig. 2.1, and is included for ease of comparison. The casefor two alternative priors, reflecting slightly different assumptions in the conditioning informationI , are shown with dashed and dotted lines.
from Sivia & Skilling, Data Analysis: a Bayesian Tutorial, Oxford UP (2006).
Coin flip example 2/2 [Sivia 2006]
18 Parameter estimation I
Fig. 2.2 The effect of different priors, prob(H |I ), on the posterior pdf for the bias-weighting ofa coin. The solid line is the same as in Fig. 2.1, and is included for ease of comparison. The casefor two alternative priors, reflecting slightly different assumptions in the conditioning informationI , are shown with dashed and dotted lines.
from Sivia & Skilling, Data Analysis: a Bayesian Tutorial, Oxford UP (2006).
Linear regression example
Bayesian linear regression
yi = θ0 + θ>1 xi + ξi , yi , θ0 ∈ R; xi , θ1 ∈ Rd ; ξi ∼ N(0, σ2)
Let the data consist of n observations Dn ≡ {(yi , xi)}ni=1Define the matrix X ∈ Rn×d with rows xi , y ∈ Rn as a column
vector of y1 . . . yn, θ = [θ0; θ1] ∈ Rd+1, 1 ∈ Rn as a vector of ones,
and In as the n-dimensional identity matrix
Bayesian model: likelihood and prior:
y | θ0, θ1 ∼ N(1θ0 + Xθ1, σ
2In)
θ ∼ N (µpr ,ΣΣΣpr )
Yields the joint (d + 1–dimensional) posterior distribution of
constant term θ0 and “slopes” θ1
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 11 / 53
Bayesian inference
Summaries of the posterior distribution
What information to extract?
Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
Posterior covariance or higher moments of θ
Quantiles
Credibile intervals: C (y) such that P [θ ∈ C (y) | y ] = 1− α.
Credible intervals are not uniquely defined above; thus consider, for
example, the HPD (highest posterior density) region.
Posterior realizations: for direct assessment, or to estimate posterior
expectations
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 12 / 53
Bayesian and frequentist paradigms
Understanding both perspectives is useful and important. . .
Key differences between these two statistical paradigms
Frequentists do not assign probabilities to unknown parameters θ.
One can write likelihoods pθ(y) ≡ p(y |θ) but not priors p(θ) or
posteriors. θ is not a random variable.
In the frequentist viewpoint, there is no single preferred
methodology for inverting the relationship between parameters and
data. Instead, consider various estimators θ(y) of θ.
The estimator θ is a random variable. Why? Frequentist paradigm
considers y to result from a random and repeatable experiment.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 13 / 53
Bayesian and frequentist paradigms
Key differences (continued)
Evaluate quality of θ through various criteria: bias, variance,
mean-square error, consistency, efficiency, . . .
One common frequentist approach is maximum likelihood
estimation: θML = argmaxθ p(y |θ). (View p(y |θ) as a family of
distributions indexed by θ.)
Link to Bayesian approach: MAP estimate maximizes a “penalized
likelihood.”
What about Bayesian versus frequentist prediction of ynew ⊥⊥ y | θ?
Frequentist: use “plug-in” estimate of θ
Bayesian: posterior prediction via integration
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 14 / 53
Statistical problems
Canonical statistical problems
Density estimation: observe realizations {y (i)} of a random variable
Y and use them to learn the probability distribution (density) of Y .
Parametric (e.g., pθ(y)) and nonparametric approaches.
Regression: observe dependence of a response or output variable Y
on a covariate or input variable X . Consider a model p(y |x , θ); learn
θ and predict future y |x .
Classification: like regression, but response variable ranges over a
finite set.
Not all statistical problems fall cleanly into one of these three categories.
But core aspects of these problems are worth studying!
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 15 / 53
Bayesian inference
Likelihood functions (initial summary)
In general, p(y |θ) = pθ(y) is a probabilistic model for the data
Preview: in inverse problems, the likelihood function is where the
forward model appears, along with a noise model and (if applicable)
an expression for model discrepancy
Preview: in filtering, the likelihood function might be simpler (e.g.,
direct noisy observations of the state)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 16 / 53
Bayesian inference
Prior distributions (initial summary)
Much can be written about choosing priors.
Intuitive idea: assign lower probability to neighborhoods of θ that
you don’t expect to see, higher probability to neighborhoods of θ
that you do expect to see.
Preview: in ill-posed parameter estimation problems, e.g., inverse
problems, prior information plays a key role!
Preview: in filtering problems, the prior is often the result of
“applying” the dynamics to an earlier distribution on the state
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 16 / 53
Bayesian inference
Hierarchical modeling
One of the key flexibilities of the Bayesian construction!
Hierarchical modeling has important implications for the design of
efficient MCMC samplers (later in the lecture)
Examples:
1 Unknown noise variance2 Unknown scale of the prior (cf. choosing the regularization
parameter in an inverse problem)3 Many more, as dictated by the physical and statistical models at hand
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 17 / 53
Hierarchical modeling example
State-space model with static parameters
Z0 Z1 Z2 Z3 ZN
Y0 Y1 Y2 Y3 YN
Θ
1
Ingredients of the Bayesian model:Transition density πZk |Zk−1,Θ
Observation density (likelihood) πYk |ZkPrior on static parameters πΘ
Prior on initial condition πZ0
Posterior density:
πZ0:N ,Θ | y0:N ∝ πΘπZ0
(N∏k=1
πZk |Zk−1,Θ
) N∏j=1
πyj |Zj
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 18 / 53
Hierarchical modeling example
State-space model with static parameters
Z0 Z1 Z2 Z3 ZN
Y0 Y1 Y2 Y3 YN
Θ
1
Ingredients of the Bayesian model:Transition density πZk |Zk−1,Θ
Observation density (likelihood) πYk |ZkPrior on static parameters πΘ
Prior on initial condition πZ0
Posterior density:
πZ0:N ,Θ | y0:N ∝ πΘπZ0
(N∏k=1
πZk |Zk−1,Θ
) N∏j=1
πyj |Zj
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 18 / 53
Questions yet to answer
How to simulate from or explore general non-Gaussian posterior
distributions? (This lecture)
How to make Bayesian inference computationally tractable when the
forward model is expensive (e.g., a PDE) and the parameters are
high- or infinite-dimensional? (Lecture #3)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 19 / 53
MCMC overview
Markov chain Monte Carlo (MCMC)
Metropolis-Hastings algorithm, transition kernels, ergodicity
Mixture and cycles of kernels
Gibbs sampling
Gradient-exploiting MCMC, adaptive MCMC, other practicalities
Using approximations (e.g., approximate likelihoods) within MCMC
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 20 / 53
Why Markov chain Monte Carlo (MCMC)?
In general, MCMC provides a means of sampling (“simulating”) from an
arbitrary distribution.
The density π(x) need be known only up to a normalizing constant
Utility in inference and prediction: write both as posterior
expectations, Eπf .
Then
Eπf ≈1
n
n∑i
f(
x (i))
x (i) will be asymptotically distributed according to π
x (i) will not be i.i.d. In other words, we must pay a price!
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 21 / 53
Why Markov chain Monte Carlo (MCMC)?
In general, MCMC provides a means of sampling (“simulating”) from an
arbitrary distribution.
The density π(x) need be known only up to a normalizing constant
Utility in inference and prediction: write both as posterior
expectations, Eπf .
Then
Eπf ≈1
n
n∑i
f(
x (i))
x (i) will be asymptotically distributed according to π
x (i) will not be i.i.d. In other words, we must pay a price!
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 21 / 53
Construction of an MCMC sampler
Define a Markov chain (i.e., discrete time). For real-valued random
variables, the chain has a continuous-valued state space (e.g., Rd).
Ingredients of the definition:
Initial distribution, x0 ∼ π0Transition kernel K (xn, xn+1).
P (Xn+1 ∈ A|Xn = x) =
∫A
K (x , x ′) dx ′
(Analogy: consider matrix of transition probabilities for a finite state
space.)
Markov property: Xn+1 depends only on Xn.
Goal: design transition kernel K such that chain converges
asymptotically to the target distribution π independently of the initial
distribution (starting point).
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 22 / 53
Construction of an MCMC sampler (cont.)
Goal: choose transition kernel K such that chain converges
asymptotically to the target distribution π independently of the starting
point.
Use realizations of Xn,Xn−1, . . . in a Monte Carlo estimator of
posterior expectations (an ergodic average)
Would like to converge to the target distribution quickly and to
have samples as close to independent as possible
Price for non-i.i.d. samples: greater variance in MC estimates of
posterior expectations
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 23 / 53
Metropolis-Hastings algorithm
A simple recipe! From xn to xn+1:
1 Draw a proposal y from q (y |xn)2 Calculate acceptance ratio
α(xn, y) = min
{1,π(y)q(xn|y)
π(xn)q(y |xn)
}3 Put
xn+1 =
{y , with probability α(xn, y)xn, with probability 1− α(xn, y)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 24 / 53
Metropolis-Hastings algorithm
Notes on the algorithm:
If q(y |xn) ∝ π(y) then α = 1. Thus we “correct” for sampling from
q, rather than from π, via the Metropolis acceptance step.
q does not have to be symmetric. If the proposal is symmetric, the
acceptance probability simplifies (a “Hastings” proposal).
π need be evaluated only up to a multiplicative constant
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 25 / 53
Metropolis-Hastings algorithm
What is the transition kernel of the Markov chain we have just defined?
Hint: it is not q!
Informally, it is
K (xn, xn+1) = p (xn+1|accept)P[accept] + p (xn+1|reject)P[reject]
More precisely, we have:
K (xn, xn+1) = p (xn+1|xn)= q(xn+1|xn)α(xn, xn+1) + δxn (xn+1) r(xn),
where r(xn) ≡∫
q(y |xn) (1− α(xn, y)) dy
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 26 / 53
Metropolis-Hastings algorithm
What is the transition kernel of the Markov chain we have just defined?
Hint: it is not q!
Informally, it is
K (xn, xn+1) = p (xn+1|accept)P[accept] + p (xn+1|reject)P[reject]
More precisely, we have:
K (xn, xn+1) = p (xn+1|xn)= q(xn+1|xn)α(xn, xn+1) + δxn (xn+1) r(xn),
where r(xn) ≡∫
q(y |xn) (1− α(xn, y)) dy
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 26 / 53
Metropolis-Hastings algorithm
What is the transition kernel of the Markov chain we have just defined?
Hint: it is not q!
Informally, it is
K (xn, xn+1) = p (xn+1|accept)P[accept] + p (xn+1|reject)P[reject]
More precisely, we have:
K (xn, xn+1) = p (xn+1|xn)= q(xn+1|xn)α(xn, xn+1) + δxn (xn+1) r(xn),
where r(xn) ≡∫
q(y |xn) (1− α(xn, y)) dy
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 26 / 53
Metropolis-Hastings algorithm
Now, some theory. What are the key questions?
1 Is π a stationary distribution of the chain? (Is the chainπ-invariant?)
Stationarity: π is such that Xn ∼ π ⇒ Xn+1 ∼ π2 Does the chain converge to stationarity? In other words, as n →∞,
does L(Xn) converge to π?3 Can we use paths of the chain in Monte Carlo estimates?
A sufficient (but not necessary) condition for (1) is detailed balance
(also called ‘reversibility’):
π(xn)K (xn, xn+1) = π(xn+1)K (xn+1, xn)
This expresses an equilibrium in the flow of the chain
Hence∫π(xn)K (xn, xn+1) dxn =
∫π(xn+1)K (xn+1, xn) dxn =
π(xn+1)∫
K (xn+1, xn) dxn = π(xn+1).
As an exercise, verify detailed balance for the M-H kernel defined on
the previous slide.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 27 / 53
Metropolis-Hastings algorithm
Now, some theory. What are the key questions?
1 Is π a stationary distribution of the chain? (Is the chainπ-invariant?)
Stationarity: π is such that Xn ∼ π ⇒ Xn+1 ∼ π2 Does the chain converge to stationarity? In other words, as n →∞,
does L(Xn) converge to π?3 Can we use paths of the chain in Monte Carlo estimates?
A sufficient (but not necessary) condition for (1) is detailed balance
(also called ‘reversibility’):
π(xn)K (xn, xn+1) = π(xn+1)K (xn+1, xn)
This expresses an equilibrium in the flow of the chain
Hence∫π(xn)K (xn, xn+1) dxn =
∫π(xn+1)K (xn+1, xn) dxn =
π(xn+1)∫
K (xn+1, xn) dxn = π(xn+1).
As an exercise, verify detailed balance for the M-H kernel defined on
the previous slide.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 27 / 53
Metropolis-Hastings algorithm
Now, some theory. What are the key questions?
1 Is π a stationary distribution of the chain? (Is the chainπ-invariant?)
Stationarity: π is such that Xn ∼ π ⇒ Xn+1 ∼ π2 Does the chain converge to stationarity? In other words, as n →∞,
does L(Xn) converge to π?3 Can we use paths of the chain in Monte Carlo estimates?
A sufficient (but not necessary) condition for (1) is detailed balance
(also called ‘reversibility’):
π(xn)K (xn, xn+1) = π(xn+1)K (xn+1, xn)
This expresses an equilibrium in the flow of the chain
Hence∫π(xn)K (xn, xn+1) dxn =
∫π(xn+1)K (xn+1, xn) dxn =
π(xn+1)∫
K (xn+1, xn) dxn = π(xn+1).
As an exercise, verify detailed balance for the M-H kernel defined on
the previous slide.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 27 / 53
Metropolis-Hastings algorithm
Beyond π-invariance, we also need to establish (2) and (3) from the
previous slide. This leads to additional technical requirements:
π-irreducibility: for every set A with π(A) > 0, there exists n suchthat Kn(x ,A) > 0 ∀x .
Intuition: chain visits any measurable subset with nonzero probability
in a finite number of steps. Helps you “forget” the initial condition.
Sufficient to have q(y |x) > 0 for every (x , y) ∈ χ× χ.
Aperiodicity: “don’t get trapped in cycles”
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 28 / 53
Metropolis-Hastings algorithm
When these requirements are satisfied (i.e., chain is irreducible and
aperiodic, with stationary distribution π) we have
1
limn→∞
∥∥∥∥∫ Kn (x , ·)µ(dx)− π(·)∥∥∥∥TV
= 0
for every initial distribution µ.
K n is the kernel for n transitions
This yields the law of Xn:∫
K n (x , ·)µ(dx) = L(Xn)The total variation distance ‖µ1 − µ2‖TV = supA |µ1(A)− µ2(A)| is
the largest possible difference between the probabilities that the two
measures can assign to the same event.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 29 / 53
Metropolis-Hastings algorithm
When these requirements are satisfied (i.e., chain is irreducible and
aperiodic, with stationary distribution π) we have
2 For h ∈ L1π,
limn→∞
1
n
n∑i
h(
x (i))
= Eπ[h] w.p. 1
This is a strong law of large numbers that allows computation of
posterior expectations.
Obtaining a central limit theorem, or more generally saying anything
about the rate of convergence to stationarity, requires additional
conditions (e.g., geometric ergodicity).
See [Roberts & Rosenthal 2004] for an excellent survey of MCMC
convergence results.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 29 / 53
Metropolis-Hastings diagnostics
What about the quality of MCMC estimates?
What is the price one pays for correlated samples?
Compare Monte Carlo (iid) and MCMC estimates of Eπh (and for the
latter, assume we have a CLT):
Monte Carlo
Var[hn]
=Varπ [h(X )]
nMCMC
Var[hn]
= θVarπ [h(X )]
nwhere
θ = 1 + 2
∞∑s>0
corr (h(Xi), h(Xi+s))
is the integrated autocorrelation time.
Effective sample size (ESS) is then n/θ
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 30 / 53
Metropolis-Hastings diagnostics
What about the quality of MCMC estimates?
What is the price one pays for correlated samples?
Compare Monte Carlo (iid) and MCMC estimates of Eπh (and for the
latter, assume we have a CLT):
Monte Carlo
Var[hn]
=Varπ [h(X )]
nMCMC
Var[hn]
= θVarπ [h(X )]
nwhere
θ = 1 + 2
∞∑s>0
corr (h(Xi), h(Xi+s))
is the integrated autocorrelation time.
Effective sample size (ESS) is then n/θ
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 30 / 53
Metropolis-Hastings diagnostics
What about the quality of MCMC estimates?
What is the price one pays for correlated samples?
Compare Monte Carlo (iid) and MCMC estimates of Eπh (and for the
latter, assume we have a CLT):
Monte Carlo
Var[hn]
=Varπ [h(X )]
nMCMC
Var[hn]
= θVarπ [h(X )]
nwhere
θ = 1 + 2
∞∑s>0
corr (h(Xi), h(Xi+s))
is the integrated autocorrelation time.
Effective sample size (ESS) is then n/θ
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 30 / 53
Metropolis-Hastings algorithm
Now try a very simple computational demonstration (in MATLAB):
MCMC sampling from a univariate distribution
Look at autocorrelation and visual diagnostics (e.g., trace of chain)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 31 / 53
Other Metropolis-Hastings diagnostics
Example: multivariate potential scale reduction factor (MPSRF) [Brooks
& Gelman 1998]
Run multiple “replicate” chains from over-dispersed starting points.
Compute:
Pooled-sample covariance estimate (across all chains) V (tends to
over-estimate)
Average of individual-chain sample covariance estimates W (tends to
under-estimate)
Let R1/2 be the largest generalized eigenvalue of the pencil (V,W).
Diagnostic: value of statistic R1/2 approaches 1 (from above) as
the chains become similar
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 32 / 53
More sophisticated Metropolis-Hastings
M-H construction was extremely general.
Achieving efficient sampling (good “mixing”) requires moreexploitation of problem structure.1 Mixtures of kernels2 Cycles of kernels; Gibbs sampling3 Adaptive MCMC4 Gradient- and Hessian-exploiting MCMC5 MCMC in infinite dimensions
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 33 / 53
Mixtures and cycles
Mixtures of kernels
Let Ki all have π as limiting distribution
Use a convex combination: K ∗ =∑i νiKi
νi is the probability of picking transition kernel Ki at a given step of
the chain
Kernels can correspond to transitions that each have desirable
properties, e.g., local versus global proposals
Cycles of kernels
Split multivariate state vector into blocks that are updated
separately; each update is accomplished by transition kernel Kj
Need to combine kernels. Cycle = a systematic scan, K ∗ =∏j Kj
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 34 / 53
Componentwise Metropolis-Hastings
This is an example of using a cycle of kernels
Let x = (x1, . . . , xd) ∈ Rd
Proposal qi(y |x) updates only component i
Walk through components of the state sequentially, i = 1 . . . d :
Propose a new value for component i using
qi(
y i |x1n+1, . . . , x i−1n+1, x in, x i+1n , . . . , xdn)
Accept (x in+1 = y i) or reject (x in+1 = x in) this component update with
acceptance probability
αi (xi , yi ) = min
{1,π(yi )qi (x in|yi )π(xi )qi (y i |xi )
}where xi and yi differ only in component i
yi ≡(
x1n+1, . . . , x i−1n+1, y , x i+1n , . . . , xdn)
and
xi ≡(
x1n+1, . . . , x i−1n+1, x in, x i+1n , . . . , xdn)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 35 / 53
Gibbs sampling
One very useful cycle is the Gibbs sampler.
Requires the ability to sample directly from the full conditionaldistribution π(xi |x∼i).
x∼i denotes all components of x other than xiIn problems with appropriate structure, generating independent
samples from the full conditional may be feasible while sampling from
π is not.
xi can represent a block of the state vector, rather than just an
individual component
A Gibbs update is a proposal from the full conditional; the
acceptance probability is identically one!
αi(xi , yi) = min
{1,π(yi) qi(x in|yi)π(xi) qi(y i |xi)
}= min
{1,π(yi |x∼i)π(x∼i)π(x in|x∼i)π(x in|x∼i)π(x∼i)π(y i |x∼i)
}= 1.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 36 / 53
Gibbs sampling example #1
Correlated bivariate normal
x ∼ N
([µ1µ2
],
[σ21 ρσ1σ2
ρσ1σ2 σ22
])Full conditionals are:
x1|x2 ∼ N
(µ1 +
σ1σ2ρ(x2 − µ2), (1− ρ2)σ21
)x2|x1 ∼ . . .
See computational demo
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 37 / 53
Gibbs sampling example #2
Bayesian linear regression with a variance hyperparameter
yi = βTxi + σzi , yi ∈ R; β, xi ∈ Rd ; zi ∼ N(0, 1)
This problem has a non-Gaussian posterior but is amenable to block
Gibbs sampling
Let the data consist of n observations Dn ≡ {(yi , xi)}ni=1Bayesian hierarchical model, likelihood and priors:
y |β, σ2 ∼ N(Xβ, σ2In
)β |σ2 ∼ N
(0, τ2σ2Id
)1/σ2 ∼ Γ (α, γ)
where X ∈ Rn×d has rows xi and y ∈ Rn is a vector of y1 . . . yn.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 38 / 53
Gibbs sampling example #2 (cont.)
Posterior density:
π(β, σ2
)≡ p
(β, σ2|Dn
)∝
1
σnexp
(−
1
2σ2(y − Xβ)T (y − Xβ)
)1
(τσ)dexp
(−
1
2τ2σ2βTβ
)(
1
σ2
)α−1exp
(−γ/σ2
)Full conditionals β|σ2,Dn and σ2|β,Dn have a closed form! Try to
obtain by inspecting the joint density above. (See next page for
answer.)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 39 / 53
Gibbs sampling example #2 (cont.)
Full conditional for β is Gaussian:
β |σ2,Dn ∼ N(µ, σ2ΣΣΣ
)where
ΣΣΣ−1 =
(1
τ2Id + XTX
)and µ = ΣΣΣXTy.
Full conditional for 1/σ2 is Gamma:
1/σ2 |β,Dn ∼ Γ (α, γ)
where
a = a + n/2 + d/2 and γ = γ +1
2τ2βTβ +
1
2(y − Xβ)T (y − Xβ) .
Alternately sample from these FCs in order to simulate the joint
posterior.
Also, this is an example of the use of conjugate priors.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 40 / 53
Metropolis-within-Gibbs
What if we cannot sample from the full conditionals?
Solution: “Metropolis-within-Gibbs”
This is just componentwise Metropolis-Hastings (which is where we
started)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 41 / 53
Metropolis-within-Gibbs
What if we cannot sample from the full conditionals?
Solution: “Metropolis-within-Gibbs”
This is just componentwise Metropolis-Hastings (which is where we
started)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 41 / 53
Adaptive Metropolis
Intuitive idea: learn a better proposal q(y |x) from past samples.Learn an appropriate proposal scale.
Learn an appropriate proposal orientation and anisotropy; this is
essential in problems with strong correlation in π
Adaptive Metropolis scheme of [Haario et al. 2001]:Covariance matrix at step n
C ∗n = sd Cov (x0, . . . , xn) + sdεId
where ε > 0, d is the dimension of the state, and sd = 2.42/d
(scaling rule-of-thumb).
Proposals are Gaussians centered at xn. Use a fixed covariance C0 for
the first n0 steps, then use C ∗n .
Chain is not Markov, and previous convergence proofs do not apply.
Nonetheless, one can prove that the chain converges to π. See paper
in references.
Many other adaptive MCMC ideas have been developed in recent
years
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 42 / 53
Adaptive Metropolized independence samplers
Independence proposal: does not depend on current state
Consider a proposal q(x ;ψ) with parameter ψ.
Key idea: minimize Kullback-Leibler divergence between this
proposal and the target distribution:
minψ
DKL (π(x)‖q(x ;ψ))
Equivalently, maximize∫π(x) log q(x ;ψ)dx
Solve this optimization problem with successive steps of stochastic
approximation (e.g., Robbins-Monro), while approximating the
integral via MCMC samples
Common choice: let q be a mixture of Gaussians or other
exponential-family distributions
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 43 / 53
Online MCMC demo
Very cool demo, thanks to Chi Feng (MIT):
https://chi-feng.github.io/mcmc-demo
Let’s look at RWM and AM on various targets
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 44 / 53
Langevin MCMC
Intuitive idea: use gradient of the posterior to steer samples towards
higher density regions
Consider the SDE
dXt =1
2∇ logπ(Xt)dt + dWt
This SDE has π as its stationary distribution
Discretize the SDE (e.g., Euler-Maruyama)
X t+1 = X t +σ2
2∇ logπ(X t) + σεt , εt ∼ N(0, I )
Discretized process X t no longer has π as its stationary distribution!
But we can use X t+1 as a proposal in the regular
Metropolis-Hastings framework, and accept or reject it accordingly.
σ2 (discretization time step) is an adjustable free parameter.
Langevin schemes require access to the gradient of the posterior.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 45 / 53
Preconditioned Langevin
Introduce a positive definite matrix A to the Langevin SDE:
dXt =1
2A∇ logπ(Xt)dt + A1/2dWt
Let A reflect covariance structure of target
For example: let A be the local inverse Hessian of the log-posterior,
or the inverse Hessian at the posterior mode, or posterior-averaged
Hessian information, or some other estimate of the posterior
covariance
Key idea for inverse problems: use low-rank approximations of the
posterior covariance/precision developed for the linear-Gaussian case
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 46 / 53
Preconditioned Langevin
Introduce a positive definite matrix A to the Langevin SDE:
dXt =1
2A∇ logπ(Xt)dt + A1/2dWt
Let A reflect covariance structure of target
For example: let A be the local inverse Hessian of the log-posterior,
or the inverse Hessian at the posterior mode, or posterior-averaged
Hessian information, or some other estimate of the posterior
covariance
Key idea for inverse problems: use low-rank approximations of the
posterior covariance/precision developed for the linear-Gaussian case
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 46 / 53
Hamiltonian MCMC
Let x be “position” variables; introduce auxiliary “momentum”
variables w
Consider a separable Hamiltonian, H(x ,w) = U(x) + wTM−1w/2.
Put U(x) = − logπ(x).
Hamiltonian dynamics are reversible and conserve H. Use them to
propose new states x!
In particular, sample from p(x ,w) = 1Z exp (−H(x ,w)):
First, sample the momentum variables w from their Gaussian
distribution
Second, integrate Hamilton’s equations to propose a new state
(x ,w); then apply Metropolis accept/reject
Features:Enables faraway moves in x-space while leaving the value of the
density (essentially) unchanged. Good mixing!
Requires good symplectic integrators and access to derivatives
Recent extension: Riemannian manifold HMC [Girolami & Calderhead
JRSSB 2011]
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 47 / 53
Online MCMC demo
Back to the demo:
https://chi-feng.github.io/mcmc-demo
Now look at MALA and HMC/NUTS on various targets
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 48 / 53
MCMC in infinite dimensions
Would like to construct a well-defined MCMC sampler for functions
u ∈ H.
First, the posterior measure µy should be a well-defined probability
measure on H (see Stuart Acta Numerica 2010). For simplicity, let
the prior µ0 be N (0,C ).
Now let q be the proposal distribution, and consider pair of measures
ν(du, du′) = q(u, du′)µy (du), ν⊥(du, du′) = q(u′, du)µy (du′);
Then the MCMC acceptance probability is
α(uk , u′) = min
{1,
dν⊥
dν(uk , u′)
}To define a valid transition kernel, we need absolute continuity
ν⊥ � ν; in turn, this places requirements on the proposal q
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 49 / 53
MCMC in infinite dimensions
Would like to construct a well-defined MCMC sampler for functions
u ∈ H.
First, the posterior measure µy should be a well-defined probability
measure on H (see Stuart Acta Numerica 2010). For simplicity, let
the prior µ0 be N (0,C ).
Now let q be the proposal distribution, and consider pair of measures
ν(du, du′) = q(u, du′)µy (du), ν⊥(du, du′) = q(u′, du)µy (du′);
Then the MCMC acceptance probability is
α(uk , u′) = min
{1,
dν⊥
dν(uk , u′)
}To define a valid transition kernel, we need absolute continuity
ν⊥ � ν; in turn, this places requirements on the proposal q
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 49 / 53
MCMC in infinite dimensions (cont.)
One way to produce a valid transition kernel is the preconditioned
Crank-Nicolson (pCN) proposal (Cotter et al. 2013):
u′ = (1− β2)1/2uk + βξk , ξk ∼ N (0,C ), β ∈ (0, 1).
Practical impact: sampling efficiency does not degenerate as
discretization of u is refined
More sophisticated versions: combine pCN with Hessian/geometry
information, e.g., DILI (dimension-independent likelihood-informed)
proposals (Cui, Law, M 2016)
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 50 / 53
MCMC practicalities
Effective use of MCMC still requires some (problem-specific) experience.
Some useful rules of thumb:
Adaptive schemes are not a panacea.
Whenever possible, (re-)parameterize the problem in order to
minimize posterior correlations.
What to do, if anything, about “burn-in?”
Visual inspection of chain components is often the first and best
convergence diagnostic.
Also look at autocorrelation plots. Run multiple chains from
different starting points. Evaluate MPSRF or other diagnostics.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 51 / 53
MCMC practicalities
Additional advice:
“The best Monte Carlo is a dead Monte Carlo”: If you can tackleany part of the problem analytically, do it!
Example: Rao-Blackwellization in Cui et al., “Likelihood-informed
dimension reduction for nonlinear inverse problems,” Inverse Problems
30: 114015 (2014).
πprior πposterior
Data dominated
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 52 / 53
MCMC references
A small selection of useful “general” MCMC references.
C. Andrieu, N. de Freitas, A. Doucet, M. I. Jordan, “An introduction to MCMC
for machine learning,” Machine Learning 50 (2003) 5–43.
S. Brooks, A. Gelman, G. Jones and X. Meng, editors. Handbook of MCMC.
Chapman & Hall/CRC, 2011.
A. Gelman, J. B. Carlin, H. S. Stern, D. Dunson, A. Vehtari, D. B. Rubin.
Bayesian Data Analysis. Chapman & Hall CRC, 3rd edition, 2013.
P. J. Green. “Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination.” Biometrika, 82: 711–732, 1995.
H. Haario, M. Laine, A. Mira, and E. Saksman. “DRAM: Efficient adaptive
MCMC.” Statistics and Computing, 16(4): 339–354, 2006.
C. P. Robert, G. Casella, Monte Carlo Statistical Methods, 2nd Edition,
Springer, 2004.
G. Roberts, J. Rosenthal. “General state space Markov chains and MCMC
algorithms.” Probability Surveys, 1: 20–71, 2004.
Marzouk (MIT) SFB 1294 Spring School 19–22 March 2018 53 / 53