Chapter 3: Monte Carlo methods
Maarten Jansen
Overview
1. Aspects of Monte Carlo Methods
1.1 Monte Carlo integration and importance sampling
1.2 Random number generators
1.2.1 Quantile method
1.2.2 Rejection sampling
2. Markov Chain Monte Carlo Methods
2.1 Markov Chains
2.2 Models for multivariate RV
2.2.1 Markov Random Fields (MRF)
2.2.2 Gibbs Random Fields (GRF)
2.2.3 The Hammersley-Clifford Theorem
2.3 MCMC samplers for integration
2.3.1 Gibbs sampler
2.3.2 Metropolis-Hastings sampler
2.4 Simulated annealing - MCMC optimization
© Maarten Jansen, STAT-F-408 Comp. Stat., Chap. 3: Monte Carlo methods
1. Aspects of Monte Carlo Methods
Monte Carlo simulation or stochastic simulation
• tries to re-formulate a problem such that its solution is the unknown
parameter of an artificial random variable
• generates instances (an artificial sample) from that random variable
• applies statistical techniques to
– find (estimate) the parameter from the artificial sample
– evaluate the quality of the numerical outcome
• but it is essentially a method from numerical analysis
• Many of the applications of this numerical method come from statistical
problems:
statistical problem → numerical solution → statistical technique
Two main categories of problems
• Integration
• Optimization
1.1 Monte Carlo integration and importance sampling
Suppose we want to evaluate $I = \int_a^b y(x)\,dx$
• Suppose X ∼ uniform[a, b], then $I = (b-a) \cdot E(y(X))$
• Generate $X_i$, with i = 1, . . . , n
• Estimate $\hat{I} = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$
Accuracy of the stochastic approximation
We use statistical measures to evaluate the approximation
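1. Bias: $E(\hat{I}) = \frac{b-a}{n} \sum_{i=1}^{n} E(y(X_i)) = (b-a) \cdot E(y(X)) = (b-a) \int_a^b y(x) \cdot \frac{1}{b-a}\,dx = I$
The estimator is unbiased
2. Variance: $\mathrm{var}(\hat{I}) = \frac{(b-a)^2}{n} \mathrm{var}(y(X)) = \frac{(b-a)^2}{n} \int_a^b \left(y(x) - \frac{I}{b-a}\right)^2 \cdot \frac{1}{b-a}\,dx$
(the mean subtracted inside the square is $E(y(X)) = I/(b-a)$)
Variance has two components:
– Order of magnitude:
∗ $\sigma_{\hat{I}} = O(n^{-1/2})$
∗ typical result for the variance of a sample mean
∗ Independent of the dimension
– Variance of one observation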
Two questions
• How does this compare to competitors?
• How can we improve? → not on order of magnitude
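As a minimal sketch of the estimator and its accuracy measures, the following Python code estimates an integral from uniform draws and reports the estimated standard error. The function name `mc_integrate`, the test integrand sin on [0, π] and the sample size are illustrative assumptions, not part of the slides.

```python
import math
import random

def mc_integrate(y, a, b, n, rng):
    """Estimate I = int_a^b y(x) dx from n uniform draws.

    Returns the estimate and its estimated standard error.
    """
    values = [y(rng.uniform(a, b)) for _ in range(n)]
    mean = sum(values) / n
    # Unbiased sample variance of one observation y(X).
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(var / n)
    return estimate, std_error

rng = random.Random(0)
estimate, std_error = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
print(estimate, std_error)  # the exact value of the integral is 2
```

The standard error shrinks like n^(-1/2): quadrupling n roughly halves it.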
Competitors: numerical integration (= quadrature)
Numerical integration is based on the principle: approximate the integrand by
a function that is easy to integrate.
The approximation is based on a limited number of observations of the
integrand only, and it is constructed using interpolation or smoothing.
The error of numerical integration methods depends on several factors
• The smoothness of the integrand, in particular: is the integrand easy to
approximate (see figures below)
• The number of function evaluations or observations n
• The location xi in which integrand is observed or evaluated
• The dimension (curse of dimensionality)
Functions that are difficult to approximate
Functions with
1. infinite slope
2. singularities
3. heavy oscillations
These features require locally dense observations/function evaluations
A very brief overview of quadrature methods
• For given xi, y(xi), quadrature formulas are based on
– Approximation of the integrand by polynomials:
∗ Rectangular rule or Midpoint rule
∗ Trapezoid rule
∗ Simpson’s rule
– Breaking up the interval [a, b] into subintervals→ composite rules
• When the xi are free to choose, the order of approximation can be optimised by
choosing the xi to be the zeros of orthogonal polynomials → Gauss
quadrature
Accuracy of quadrature methods
Assuming that the integrand is “sufficiently smooth”, in one dimension the
approximation $I_q$ for I satisfies
$|I - I_q| \le C \cdot n^{-1}$,
and for many methods
$|I - I_q| \le C \cdot n^{-\alpha}$,
with α > 1
Compare with random simulation: $\left[E(\hat{I} - I)^2\right]^{1/2} \sim n^{-1/2}$
(Note that the error measures are different)
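To illustrate the quadrature rate, here is a small sketch with the composite trapezoid rule, for which α = 2 when the integrand is smooth. The helper `trapezoid` and the test integrand sin on [0, π] are assumptions chosen for illustration.

```python
import math

def trapezoid(y, a, b, n):
    """Composite trapezoid rule with n equal subintervals (n + 1 evaluations)."""
    h = (b - a) / n
    inner = sum(y(a + i * h) for i in range(1, n))
    return h * (0.5 * y(a) + inner + 0.5 * y(b))

# Absolute error for sin on [0, pi]; the exact integral is 2.
errors = {n: abs(trapezoid(math.sin, 0.0, math.pi, n) - 2.0) for n in (10, 20, 40)}
for n, err in errors.items():
    print(n, err)
```

Doubling n divides the error by about 4, whereas a Monte Carlo standard error would only shrink by a factor of about 1.4.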
Curse of dimensionality
Observation
If $n_1$ observations (function evaluations) are needed for a given accuracy of a
numerical integration technique in one dimension, then the same technique
extended to d dimensions requires $n_1^d$ observations.
Reason
• Accuracy of numerical integration is deterministic:
– We must cover every area in the region of integration to be sure that the accuracy is met.
– Accuracy is thus directly linked to the interpoint distance
– High dimension means many dimensions in which two points can be distant from
each other.
– Many more observations are needed for the same interpoint distance
• Quadrature is based on clever approximations of functions. It’s hard to be clever in high
dimensions: hard to find equally good approximations.
No curse of dimensionality for stochastic simulation
Applications in statistics
• Computation of expected values $E(h(X)) = \int_{-\infty}^{\infty} f_X(x) h(x)\,dx$
• Computation of probabilities $P(X \in A) = E(\chi_A(X)) = \int_A f_X(x)\,dx$
($\chi_A(X)$ is the characteristic or indicator function of A)
• Computation of quantiles $Q_X(p) = F_X^{-1}(p)$ with $F_X(u) = \int_{-\infty}^{u} f_X(x)\,dx$
These problems appear in
• Bootstrapping and simulation
• Bayesian analysis: computation of posterior means, medians
• . . .
Non-uniform sampling
Above we have the general expression $\mu = E(h(X)) = \int_{-\infty}^{\infty} f_X(x) h(x)\,dx$
which we can estimate by $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$
So, if we have an integral $I = \int_a^b y(x)\,dx$
then we can define h(x) as $h(x) = \frac{y(x) \cdot \chi_{[a,b]}(x)}{f_X(x)}$ (if this ratio is bounded near zeros of $f_X(x)$)
and estimate $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{y(X_i) \cdot \chi_{[a,b]}(X_i)}{f_X(X_i)}$
where all $X_i$ are IID and have density $f_X(x)$.
Examples
• $f_X(x) = y(x)/x$, i.e., h(x) = x
(only possible if y(x)/x is positive with integral equal to 1)
Then $I = \mu_X = E(X) = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} X_i$
• $h(x) = \chi_A(x)$, then $I = p = P(X \in A) = \int_A f_X(x)\,dx$ and
$\hat{I} = \frac{1}{n} \sum_{i=1}^{n} \chi_A(X_i) = \frac{\#\{i \mid X_i \in A\}}{n}$
• $f_X(x) = \frac{1}{b-a} \cdot \chi_{[a,b]}(x)$ and take h(x) such that $h(x) \cdot f_X(x) = y(x)$ (where we assume
that y(x) is zero outside [a, b]; note that h(x) outside [a, b] is free to choose)
$I = \int_a^b y(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$
From these examples, it is clear that there are many ways to estimate an integral. We formalise
this problem.
The importance function
If X has density function fX(x) and we want to estimate
$\mu = E(h(X)) = \int_{-\infty}^{\infty} h(x) \cdot f_X(x)\,dx$,
then we may estimate this from a sample $X_i$ as
$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$
If it is easier to sample from $f_U(u)$ (for instance, uniform random variables are
easy to generate), then we can write
$E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot f_X(u)\,du = \int_{-\infty}^{\infty} h(u) \cdot \frac{f_X(u)}{f_U(u)} \cdot f_U(u)\,du$
We call the new sampling distribution
$f_U(u)$ the importance function
and we denote $w(u) = \frac{f_X(u)}{f_U(u)}$
As a result $\mu = E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot w(u) \cdot f_U(u)\,du$
We can now estimate $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(U_i) \cdot w(U_i)$
The question is now how to choose $f_U(u)$
• It must be easy to generate samples from it
• The variance of the estimator must be as low as possible
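The re-weighted estimator can be sketched as follows for a rare-event probability. The setting is an assumption chosen for illustration: X ∼ N(0, 1), h the indicator of (c, ∞), and importance function U ∼ N(c, 1), so that w(u) = exp(−c·u + c²/2). The function name `tail_probability_is` is hypothetical.

```python
import math
import random

def tail_probability_is(c, n, rng):
    """Estimate P(X > c) for X ~ N(0, 1), sampling U ~ N(c, 1) instead of X."""
    total = 0.0
    for _ in range(n):
        u = rng.gauss(c, 1.0)
        if u > c:  # h(u) is the indicator of (c, infinity)
            # w(u) = f_X(u) / f_U(u) = exp(-u^2/2) / exp(-(u - c)^2/2)
            total += math.exp(-c * u + 0.5 * c * c)
    return total / n

exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # P(X > 4), about 3.2e-5
estimate = tail_probability_is(4.0, 100_000, random.Random(0))
print(estimate, exact)
```

Sampling directly from N(0, 1) would produce only a handful of draws beyond 4 in 100,000 trials, so the shifted importance function gives a far smaller variance here.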
The variance of importance sampling
The variance equals $\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}[h(U) \cdot w(U)]$
We can develop the variance of one observation as
$\mathrm{var}[h(U) \cdot w(U)] = E([h(U) \cdot w(U)]^2) - (E[h(U) \cdot w(U)])^2$
$= E([h(U) \cdot w(U)]^2) - \mu^2$
$= E([|h(U)| \cdot w(U)]^2) - \mu^2$
$\ge (E[|h(U)| \cdot w(U)])^2 - \mu^2 = (E|h(X)|)^2 - \mu^2$
The lower bound is independent of $f_U(u)$. The inequality becomes an
equality if for $V = |h(U)| \cdot w(U)$ it holds that $E(V^2) = (EV)^2$, or,
$\mathrm{var}(V) = E(V^2) - (EV)^2 = 0$, thus if V is deterministic (with prob. 1).
So minimum variance is obtained if $|h(U)| \cdot w(U) = K$ for any value of U, i.e.,
$|h(u)| \cdot w(u) = K$, ∀u ∈ R.
We have $|h(u)| \cdot w(u) = K \Leftrightarrow f_U(u) = \frac{|h(u)| \cdot f_X(u)}{K}$
Imposing $\int_{-\infty}^{\infty} f_U(u)\,du = 1$, we find minimum variance for
$f_U(u) = \frac{|h(u)| \cdot f_X(u)}{\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du}$
Interpretation of this result
• The result is of little immediate use. Indeed, full application requires
knowledge of $\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du$
If h(u) ≥ 0, ∀u ∈ R, this is exactly the integral we are after. Otherwise,
computation of this integral is probably as difficult as the
original question.
• var(µ) can be much lower than when estimating µ with samples from
fX(x).
• The basic idea is that fU (u) should behave not just as fX(x), but it
should also “follow” |h(u)|. Regions where h(u) is large in magnitude
should be sampled more.
• Pay special attention to tails of |h(u)| · fX(u)
Example with mixture of uniform sampling
• Mixture of L uniform random variables
• Uniform on (non-convex) subdomains $I_\ell$ defined by
$I_\ell = \{x \mid |y(x)| \ge (\ell/L) \cdot \max |y(x)|\}$
• mixture probability mass function $p_\ell = |I_\ell| / \sum_{\ell=1}^{L} |I_\ell|$
Example from Bayesian statistics
Suppose we observe $X_i \sim N(M, \sigma^2)$ with $\sigma^2$ known. We want to estimate the
mean M, for which we impose a Cauchy prior model
$f_M(m) = \frac{1}{\pi (1 + (m-\mu)^2)}$,
where the hyperparameter µ may express prior knowledge of expected values
(could be zero, e.g.)
The conditional sample density is (with $\bar{x}$ the sample mean)
$f_{X|M}(x|m) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x_i-m)^2/2\sigma^2}$
$= \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-\sum_{i=1}^{n}(x_i-m)^2/2\sigma^2}$
$= \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$
Then the joint distribution is
$f_{M,X}(m,x) = f_M(m) \cdot f_{X|M}(x|m)$
$= \frac{1}{\pi(1+(m-\mu)^2)} \cdot \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$
So the marginal distribution of X becomes
$f_X(x) = \int_{-\infty}^{\infty} f_{M,X}(m,x)\,dm = \int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm$
$= C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm$
with
$C(x) = \frac{1}{\pi} \cdot \frac{1}{(2\pi)^{n/2}\sigma^n} \cdot e^{-\sum_{i=1}^{n}(x_i-\bar{x})^2/2\sigma^2}$
(Note that the integral exists thanks to the rapid decay of the normal bell
curve)
The posterior distribution of M, given the observation X, becomes
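$f_{M|X}(m|x) = \frac{f_M(m) \cdot f_{X|M}(x|m)}{f_X(x)} = \frac{f_M(m) \cdot f_{X|M}(x|m)}{\int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm}$
$= \frac{C(x) \cdot \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$
$= \frac{\frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}}{\int_{-\infty}^{\infty} \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$
Possible values of interest are
the posterior mean
$E(M|X=x) = \frac{\int_{-\infty}^{\infty} \frac{m}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$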
and the posterior variance which is
$\mathrm{var}(M|X=x) = E(M^2|X=x) - [E(M|X=x)]^2$ with
$E(M^2|X=x) = \frac{\int_{-\infty}^{\infty} \frac{m^2}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1+(m-\mu)^2} \cdot e^{-(\bar{x}-m)^2/(2\sigma^2/n)}\,dm}$
Computation of the integrals
At least two possibilities:
• Sample from the normal density with expected value $\bar{x}$ (the sample mean) and variance $\sigma^2/n$:
$X_n \sim N(\bar{x}, \sigma^2/n)$,
Then
$E(M^k|X=x) = \frac{E\left(\frac{X_n^k}{1+(X_n-\mu)^2}\right)}{E\left(\frac{1}{1+(X_n-\mu)^2}\right)}$
• Sample from the Cauchy density with center (median) µ
$f_U(u) = 1/[\pi(1+(u-\mu)^2)]$
Then
$E(M^k|X=x) = \frac{E\left(U^k \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}{E\left(e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)}$
• Sample from another distribution “as close as possible” to the integrand.
In this case, the normal density is a much better choice than the Cauchy
• The tails of the integrand are lighter than the normal tail, so the heavy
tails of the Cauchy produce a lot of large samples whose values are not
representative for the integral
• The normal density has sample size (n) dependent variance, so that
samples get more concentrated for large n, which corresponds to the
true shape of the integrand
As an illustration, we plot the estimates of the standard errors in estimating
the following parameter
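$I = \int_{-\infty}^{\infty} \frac{1}{\pi (1+(u-\mu)^2)} \cdot \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(u-\bar{x})^2/(2\sigma^2/n)}\,du$
$= E\left(\frac{1}{\pi (1+(X_n-\mu)^2)}\right)$
$= E\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)$ with
$X_n \sim N(\bar{x}, \sigma^2/n)$ and $U \sim \mathrm{Cauchy}(\mu, 1)$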
Then we simulate $X_{n,i}$ and $U_i$ for $i = 1, \ldots, n_{MC}$ and define
$\hat{I}_1 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\pi (1+(X_{n,i}-\mu)^2)}$
$\hat{I}_2 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U_i-\bar{x})^2/(2\sigma^2/n)}$
We can easily estimate the variances
$\mathrm{var}(\hat{I}_1) = \frac{1}{n_{MC}} \mathrm{var}\left(\frac{1}{\pi (1+(X_n-\mu)^2)}\right)$
$\mathrm{var}(\hat{I}_2) = \frac{1}{n_{MC}} \mathrm{var}\left(\frac{1}{\sqrt{2\pi}\sigma/\sqrt{n}} \cdot e^{-(U-\bar{x})^2/(2\sigma^2/n)}\right)$
The estimates of the standard errors of a single observation (to be divided by $\sqrt{n_{MC}}$) are depicted below (together with the log of the estimates, to better
show the behavior)
[Figure, two panels: estimated st.dev. of one observation, and log(estimated st.dev.) of one observation, both as a function of n, for Cauchy samples and normal samples]
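The comparison between the two samplers can be reproduced in outline with the sketch below. The parameter values (sample mean 1, σ = 1, n = 100, µ = 0, 50,000 Monte Carlo draws) and the name `compare_samplers` are assumptions for illustration; they are not the values behind the figure.

```python
import math
import random

def compare_samplers(xbar, sigma, n, mu, n_mc, rng):
    """Return (I1, sd1, I2, sd2): two estimates of I and the st.dev. of one observation."""
    s = sigma / math.sqrt(n)  # st.dev. of the normal factor
    # Sampler 1: X_n ~ N(xbar, sigma^2/n), average the Cauchy density.
    v1 = [1.0 / (math.pi * (1.0 + (rng.gauss(xbar, s) - mu) ** 2)) for _ in range(n_mc)]
    # Sampler 2: U ~ Cauchy(mu, 1) by inversion, average the normal density.
    v2 = []
    for _ in range(n_mc):
        u = mu + math.tan(math.pi * (rng.random() - 0.5))
        v2.append(math.exp(-((u - xbar) ** 2) / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s))

    def mean_sd(v):
        m = sum(v) / len(v)
        return m, math.sqrt(sum((t - m) ** 2 for t in v) / (len(v) - 1))

    i1, sd1 = mean_sd(v1)
    i2, sd2 = mean_sd(v2)
    return i1, sd1, i2, sd2

rng = random.Random(1)
i1, sd1, i2, sd2 = compare_samplers(xbar=1.0, sigma=1.0, n=100, mu=0.0, n_mc=50_000, rng=rng)
print(i1, sd1)  # normal samples
print(i2, sd2)  # Cauchy samples
```

Both estimates target the same integral, but with these values the standard deviation of one observation is much larger for the Cauchy sampler, because most Cauchy draws fall where the narrow normal factor is essentially zero.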
Importance function must have sufficiently heavy tails
Previous examples have illustrated that it is of little use for the sampling
density function to have heavier tails than the integrand.
The opposite is, however, much worse (so if no perfect match can be realized,
a slightly too heavy tail is preferable)
We have
$\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}[h(U) \cdot w(U)] = \frac{1}{n} \cdot \left(E[(h(U) \cdot w(U))^2] - \mu^2\right)$
Herein
$E[(h(U) \cdot w(U))^2] = \int_{-\infty}^{\infty} [h(u)]^2 [w(u)]^2 f_U(u)\,du$
$= \int_{-\infty}^{\infty} [h(u)]^2 \frac{f_X(u)^2}{f_U(u)}\,du$
$= \int_{-\infty}^{\infty} \frac{h(u) f_X(u)}{f_U(u)} \cdot h(u) f_X(u)\,du$
If $h(u) f_X(u)$
has a heavier tail than $f_U(u)$, then the first factor tends to infinity for u → ∞.
The integral may then be large or even infinite, depending on the tail of
$h(u) f_X(u)$
Conclusions about importance sampling
Importance sampling allows us to
• estimate expected values (integrals) with respect to a random variable X whose
distribution does not allow easy simulation, by drawing from another
random variable U which is easier to simulate, followed by proper re-weighting.
• optimize (to some extent) the choice of sampling distribution to estimate
integrals.
We will later discuss rejection sampling, which also samples from an auxiliary distribution. The
outcome is then rejected or accepted with an appropriate probability such that, a posteriori,
(given the event of rejection or acceptance) the variable takes the aimed distribution. Unlike
importance sampling, the correction of rejection sampling thus proceeds at the level of the
random number generator itself (and not at the level of computing the integral). We therefore
discuss random number generators.
1.2 Random number generators
Importance sampling assumes that we can generate numbers from a given
distribution. How can we do that?
1.2.1 Quantile or inversion method
Theorem If U ∼ uniform[0, 1] and $Q_X(p)$ is the quantile function of X, then
$Q_X(U)$ has the same distribution as X, i.e.:
$U \sim \mathrm{uniform}[0,1] \Rightarrow Q_X(U) \stackrel{d}{=} X$
Proof uses monotonicity of $Q_X(p)$ or its inverse $F_X(x)$
$P(Q_X(U) \le x) = P(U \le Q_X^{-1}(x))$
$= F_U(Q_X^{-1}(x))$
$= Q_X^{-1}(x)$
$= F_X(x)$
Example 1: Let X ∼ exp(λ), then $F_X(x) = 1 - e^{-\lambda x}$, so
$Q_X(p) = -\log(1-p)/\lambda$, so if U ∼ uniform[0, 1], then take
$X = -\log(1-U)/\lambda$,
or, because 1 − U is also uniform (by symmetry), we can take
$Y = -\log(U)/\lambda$
Quantile or inversion method (2)
Example 2: Let X ∼ Cauchy with median µ, i.e., $f_X(x) = \frac{1}{\pi [1+(x-\mu)^2]}$
If µ ≠ 0, then X can be generated by adding µ to a Cauchy random variable
with median 0.
So, we assume that µ = 0.
Then $F_X(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x)$, and $Q_X(U) = \tan[\pi(U - 1/2)]$
Note: if X ∼ Cauchy(µ = 0) then −X ∼ Cauchy(µ = 0) and
1/X ∼ Cauchy(µ = 0).
Indeed, for Y = 1/X, $f_Y(y) = f_X(x(y)) \left|\frac{dx(y)}{dy}\right| = \frac{1}{\pi} \cdot \frac{1}{1+1/y^2} \cdot \frac{1}{y^2} = \frac{1}{\pi} \cdot \frac{1}{1+y^2}$
So, if X ∼ Cauchy(µ = 0) then Y = −1/X ∼ Cauchy(µ = 0).
And if $X = \tan[\pi(U-1/2)]$ ∼ Cauchy(µ = 0), then
$Y = -1/X = \tan(\pi U)$ ∼ Cauchy(µ = 0)
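The two inversion examples (exponential and Cauchy) can be sketched in a few lines. The function names and the parameter values in the check are illustrative assumptions. The code uses 1 − U rather than U inside the logarithm because Python's `random()` returns values in [0, 1), so 1 − U is never zero.

```python
import math
import random

def sample_exponential(lam, rng):
    """X = -log(1 - U) / lam has the exp(lam) distribution."""
    return -math.log(1.0 - rng.random()) / lam

def sample_cauchy(mu, rng):
    """X = mu + tan(pi (U - 1/2)) is Cauchy with median mu."""
    return mu + math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(2)
exp_draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
cauchy_draws = sorted(sample_cauchy(3.0, rng) for _ in range(100_000))
exp_mean = sum(exp_draws) / len(exp_draws)            # theoretical mean 1/lambda = 0.5
cauchy_median = cauchy_draws[len(cauchy_draws) // 2]  # theoretical median mu = 3
print(exp_mean, cauchy_median)
```

The Cauchy sample is summarised by its median because the Cauchy distribution has no finite mean.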
Example: Box-Muller transform for normal
Problem: the normal CDF $F_Z(z) = \Phi(z)$ has no closed formula, so working with
the quantile $Q_Z(U)$ is not possible, unless software provides detailed tables of
$Q_Z(p)$
Solution: we go for two independent normal RV: $(Z_1, Z_2) \sim N_2(0, I_2)$, then
we know:
• $Z_1^2 + Z_2^2 \sim \chi^2(2) = \exp(1/2)$
Indeed
1. $P(Z^2 < x) = P(-\sqrt{x} < Z < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})$
Hence $f_{Z^2}(x) = [\phi(\sqrt{x}) + \phi(-\sqrt{x})]/(2\sqrt{x}) = e^{-x/2}/\sqrt{2\pi x}$
which is: $Z^2 \sim \chi^2(1) = \Gamma(1/2, 1/2)$
2. Let $Y = Z_1^2 + Z_2^2$, then
$f_Y(y) = \int_0^y f_{Z_1^2}(z) f_{Z_2^2}(y-z)\,dz = \int_0^y \frac{e^{-z/2}}{\sqrt{2\pi z}} \cdot \frac{e^{-(y-z)/2}}{\sqrt{2\pi(y-z)}}\,dz$
$= \frac{e^{-y/2}}{2\pi} \int_0^y \frac{dz}{\sqrt{z(y-z)}} = \frac{e^{-y/2}}{2\pi} \int_0^1 \frac{dt}{\sqrt{t(1-t)}}$
$= \frac{e^{-y/2}}{2\pi} B(1/2,1/2) = \frac{e^{-y/2}}{2\pi} \cdot \frac{\Gamma(1/2)\Gamma(1/2)}{\Gamma(1/2+1/2)} = e^{-y/2}/2$
• $\sqrt{Z_1^2 + Z_2^2} \sim$ Rayleigh
Indeed, let $Y = \sqrt{Z_1^2 + Z_2^2}$, then
$F_Y(y) = P(Y \le y) = P(Y^2 \le y^2) = 1 - e^{-y^2/2}$ because $F(x) = 1 - e^{-\lambda x}$
is the CDF of the exponential distribution
• $Z_1/Z_2 \sim$ Cauchy(µ = 0)
Indeed, let $X = Z_1/Z_2$, so, $Z_1 = X Z_2$, then
$F_X(x) = P(X \le x) = \int_{-\infty}^{\infty} f_{Z_2}(z) P(X \le x \mid Z_2 = z)\,dz$
$= \int_{-\infty}^{0} f_{Z_2}(z) P(Z_1 \ge zx)\,dz + \int_0^{\infty} f_{Z_2}(z) P(Z_1 \le zx)\,dz$
$= 2 \int_0^{\infty} f_{Z_2}(z) P(Z_1 \le zx)\,dz$
and so,
$f_X(x) = 2\int_0^\infty f_{Z_2}(z) f_{Z_1}(zx)\, z\,dz = \frac{2}{2\pi} \int_0^\infty z e^{-(1+x^2)z^2/2}\,dz = \frac{1}{\pi} \int_0^\infty \frac{e^{-u}\,du}{1+x^2} = \frac{1}{\pi(1+x^2)}$
We propose to generate a Cauchy RV $X_1$ and an exponential RV
$X_2 \sim \exp(1/2)$, using $X_1 = \tan(2\pi U_1)$ and $X_2 = -2\log(U_2)$
Note that $X_1 = \tan(2\pi U_1)$ is Cauchy with the same parameters as
$X_1' = \tan(\pi U_1)$, since tan(πu) has period 1. We take $X_1 = \tan(2\pi U_1)$ instead
of $X_1' = \tan(\pi U_1)$ for reasons explained below.
Then solve the system
$Z_2/Z_1 = X_1 = \tan(2\pi U_1)$ (the ratio $Z_2/Z_1$ is Cauchy, just like $Z_1/Z_2$)
$Z_1^2 + Z_2^2 = X_2 = -2\log(U_2) = \log(1/U_2^2)$
So suppose $U_1$ and $U_2$ are 2 independent, uniform r.v. on [0, 1] and let
$Z_1 = \sqrt{\log(1/U_2^2)} \cos(2\pi U_1)$
$Z_2 = \sqrt{\log(1/U_2^2)} \sin(2\pi U_1)$.
Here $\sin(2\pi U_1)$ and $\cos(2\pi U_1)$ have the same distribution on [−1, 1]. This
would not be the case with the angle $\pi U_1$, since $\sin(\pi U_1) \in [0, 1]$. This is why we take
$X_1 = \tan(2\pi U_1)$.
This is a $R^2 \to R^2$ transformation $Z = g(U) = \sqrt{\log(1/U_2^2)} \cdot (\cos(2\pi U_1), \sin(2\pi U_1))^T$.
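The inverse $g^{-1}$:
$U_2 = e^{-\frac{1}{2}(Z_1^2+Z_2^2)}$
$U_1 = \frac{1}{2\pi} \arctan\left(\frac{Z_2}{Z_1}\right)$.
Using $\frac{d}{dx}\arctan(x) = \frac{1}{1+x^2}$,
$f_{Z_1,Z_2}(z_1,z_2) = f_{U_1,U_2}\left(\frac{1}{2\pi}\arctan\left(\frac{z_2}{z_1}\right), e^{-\frac12(z_1^2+z_2^2)}\right) |J|$
$= 1 \cdot \left|\det \begin{pmatrix} \partial u_1/\partial z_1 & \partial u_1/\partial z_2 \\ \partial u_2/\partial z_1 & \partial u_2/\partial z_2 \end{pmatrix}\right|$
$= \left|\det \begin{pmatrix} \frac{1}{2\pi}\cdot\frac{1}{1+(z_2/z_1)^2}\cdot\frac{-z_2}{z_1^2} & \frac{1}{2\pi}\cdot\frac{1}{1+(z_2/z_1)^2}\cdot\frac{1}{z_1} \\ e^{-\frac12(z_1^2+z_2^2)}\cdot(-z_1) & e^{-\frac12(z_1^2+z_2^2)}\cdot(-z_2) \end{pmatrix}\right|$
$= \frac{1}{2\pi} \cdot e^{-\frac12(z_1^2+z_2^2)} \cdot \frac{1}{1+(z_2/z_1)^2} \cdot \left(\frac{z_2^2}{z_1^2}+1\right)$
$= \frac{1}{\sqrt{2\pi}} e^{-z_1^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z_2^2/2}$.
Hence $(Z_1, Z_2) \sim NID(0,1)$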
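A minimal sketch of the Box-Muller transform derived above, using only the standard library. The function name `box_muller` and the sample size are assumptions; the radius uses 1 − U so that the logarithm stays finite with Python's half-open uniform generator.

```python
import math
import random

def box_muller(rng):
    """Return two independent N(0, 1) draws built from two independent uniforms."""
    u1 = rng.random()
    u2 = 1.0 - rng.random()                   # in (0, 1], keeps log(u2) finite
    radius = math.sqrt(-2.0 * math.log(u2))   # equals sqrt(log(1 / u2^2))
    angle = 2.0 * math.pi * u1
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(3)
pairs = [box_muller(rng) for _ in range(100_000)]
mean1 = sum(z1 for z1, _ in pairs) / len(pairs)
var1 = sum(z1 * z1 for z1, _ in pairs) / len(pairs) - mean1 ** 2
cross = sum(z1 * z2 for z1, z2 in pairs) / len(pairs)
print(mean1, var1, cross)  # close to 0, 1 and 0 respectively
```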
1.2.2 Rejection sampling
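Suppose we want to generate random numbers with density f(x) and
cumulative distribution F(x).
Theorem
Let $X \sim g_X$ and ∀x ∈ R: $f(x) \le M \cdot g_X(x)$ and let U ∼ uniform[0, 1],
independent from X, then $F(x) = P\left(X \le x \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)}\right)$
Proof
$P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right) = \int_A g_X(x) \cdot P\left(U \le \frac{f(X)}{M \cdot g_X(X)} \,\middle|\, X = x\right) dx$
$= \int_A g_X(x) \cdot P\left(U \le \frac{f(x)}{M \cdot g_X(x)}\right) dx$
$= \int_A g_X(x) \cdot \frac{f(x)}{M \cdot g_X(x)}\,dx$
$= \int_A \frac{f(x)}{M}\,dx = \frac{\int_A f(x)\,dx}{M}$
Hence, if A = ]−∞, x], then
$P\left(X \in A \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)}\right) = \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}{P\left(U \le \frac{f(X)}{M \cdot g_X(X)}\right)}$
$= \frac{P\left(X \in A \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}{P\left(X \in R \;\&\; U \le \frac{f(X)}{M \cdot g_X(X)}\right)}$
$= \frac{\int_A f(x)\,dx / M}{1/M} = \int_A f(x)\,dx = F(x)$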
Algorithm
Situation and aim
We want a random number X with density function f(x). We have no expression for F(x) that
we can invert. We can generate numbers according to a different law $g_X(x)$ and we know that
$f(x) \le M \cdot g_X(x)$, for all values of x.
Pseudo-code
continue-search = TRUE
While continue-search
• Generate $X \sim g_X$
• Generate U ∼ uniform[0, 1]
• If $U \le f(X)/[M \cdot g_X(X)]$
then continue-search = FALSE
The output is X
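The pseudo-code translates almost line by line into Python. In this sketch the function name `rejection_sample`, its argument names, and the example target (the Beta(2, 2) density 6x(1 − x) on [0, 1] with a uniform proposal and M = 1.5) are assumptions for illustration. The function also returns the number of trials so that the cost of rejection can be observed.

```python
import random

def rejection_sample(f, g_density, g_sampler, M, rng):
    """Draw one value with density f, given f(x) <= M * g_density(x) for all x."""
    trials = 0
    while True:
        trials += 1
        x = g_sampler(rng)                   # X ~ g_X
        u = rng.random()                     # U ~ uniform[0, 1]
        if u * M * g_density(x) <= f(x):     # same as U <= f(X) / (M g_X(X))
            return x, trials

def beta22(x):
    """Target density 6 x (1 - x) on [0, 1]; its maximum is 1.5."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0

rng = random.Random(4)
results = [rejection_sample(beta22, lambda x: 1.0, lambda r: r.random(), 1.5, rng)
           for _ in range(50_000)]
mean_x = sum(x for x, _ in results) / len(results)
mean_trials = sum(t for _, t in results) / len(results)
print(mean_x, mean_trials)  # about 0.5 and about M = 1.5
```

The average number of trials per accepted value is close to M, which is why M should be as small as the bound allows.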
How to choose gX(x)?
• X ∼ gX should be easy to generate
• gX(x) should be as close as possible to f(x), such that M can be close
to 1, and rejection probabilities are low. Otherwise, computational efforts
increase.
• Some combinations don’t work: for instance, one can never generate
Cauchy variables by rejection sampling applied to normal variables,
simply because there is no M satisfying the condition.
Example 1: generating Gamma-distributed r.v.
Let X ∼ Gamma(λ, α), i.e., $f_X(x) = \frac{x^{\alpha-1} \lambda^\alpha e^{-\lambda x}}{\Gamma(\alpha)}$
If α is integer, then we can write $X = \sum_{i=1}^{\alpha} X_i$ with independent
$X_i \sim \exp(\lambda)$
If α is not integer, denote δ = α − ⌊α⌋ and r = ⌊α⌋
Then we can decompose or generate X as
$X = (X_r + X_\delta)/\lambda$ with $X_r \sim$ Gamma(1, r) and $X_\delta \sim$ Gamma(1, δ) and
both independent.
We can generate $X_r$ as a sum of exponentials, but for $X_\delta$, the quantile method
does not work, so we need another direction.
Generating Gamma values with small α
The density function of X ∼ Gamma(1, δ) is $f_X(x) = \frac{x^{\delta-1} e^{-x}}{\Gamma(\delta)}$
It is depicted below for δ = 0.23
[Figure: density of Gamma(1, δ) for δ = 0.23]
Not straightforward to bound by some $M \cdot g_X(x)$
A mixture distribution as upper bound
We will use a mixture distribution. Suppose
V ∼ uniform[0, 1]
$W = \chi_{[0,p]}(V)$ for some value p
$X_1 \sim g_1$ with $g_1(x) = \delta \cdot x^{\delta-1}$ on [0, 1]
$X_2 - 1 \sim \exp(1)$, hence $g_2(x) = e^{-(x-1)} = e \cdot e^{-x}$ on [1, ∞[
$X = W \cdot X_1 + (1-W) \cdot X_2$
In other words, $X = X_1$ with probability p and $X = X_2$ with probability 1 − p.
Therefore, generate two uniform RV: U and V. If V < p, then $X = Q_{X_1}(U)$,
otherwise $X = Q_{X_2}(U)$. In one formula:
$X = I(V < p)\, Q_{X_1}(U) + I(V \ge p)\, Q_{X_2}(U)$
where $Q_{X_1}(U) = U^{1/\delta}$ and $Q_{X_2}(U) = 1 - \log(1-U)$ (the exponential quantile of Section 1.2.1, shifted by 1)
The mixture distribution and density
The cumulative distribution of X, denoted as $G_X(x)$, is then
$G_X(x) = P(W=0) \cdot P(X \le x \mid W = 0) + P(W=1) \cdot P(X \le x \mid W=1)$
$= (1-p) \cdot P(X_2 \le x \mid W=0) + p \cdot P(X_1 \le x \mid W = 1)$
$= (1-p) \cdot G_2(x) + p \cdot G_1(x)$
and from there $g_X(x) = p \cdot g_1(x) + (1-p) \cdot g_2(x)$
In our case $g_X(x) = p \cdot \delta \cdot x^{\delta-1} \cdot \chi_{[0,1]}(x) + (1-p) \cdot e \cdot e^{-x} \cdot \chi_{[1,\infty[}(x)$
Optimizing the parameters in the function gX(x)
The value of p can be chosen to minimize the number of rejections, i.e., to minimize M .
• We need that $M \cdot g_X(x) \ge f_X(x)$
• For x ∈ [0, 1], this means that
$M \cdot p \cdot \delta \cdot x^{\delta-1} \ge \frac{e^{-x} \cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{e^{-x}}{p\delta\Gamma(\delta)}$
The maximum of the right hand side is reached at x = 0, hence $M \ge \frac{1}{p\delta\Gamma(\delta)}$
• For x ≥ 1, this becomes
$M \cdot (1-p) \cdot e \cdot e^{-x} \ge \frac{e^{-x} \cdot x^{\delta-1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{x^{\delta-1}}{(1-p)e\Gamma(\delta)}$
The maximum of the right hand side is reached at x = 1, hence $M \ge \frac{1}{(1-p)e\Gamma(\delta)}$
• The minimum M is obtained if both lower bounds for M are equal, i.e., if
$p\delta\Gamma(\delta) = (1-p)e\Gamma(\delta) \Leftrightarrow p = \frac{e}{e+\delta}$
The resulting algorithm
We have $M = \frac{1}{p\delta\Gamma(\delta)} = \frac{e+\delta}{e\delta\Gamma(\delta)}$
So, for x ∈ [0, 1], we find
$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta-1}/\Gamma(\delta)}{M \cdot p \cdot \delta \cdot x^{\delta-1}} = e^{-x}$ which is smaller than 1
and for x > 1, we find
$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta-1}/\Gamma(\delta)}{M \cdot (1-p) \cdot e^{1-x}} = x^{\delta-1}$ which is smaller than 1 because δ − 1 is negative.
While search == true,
• Generate independent U, V, W ∼ unif([0, 1])
• if V < p = e/(e + δ),
then
– $X = W^{1/\delta}$
– If $U < e^{-X}$, then search ← false
else
– $X = 1 - \log(W)$ (so that X − 1 ∼ exp(1) and X ≥ 1)
– If $U < X^{\delta-1}$, then search ← false
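The algorithm above can be sketched in Python as follows. The function name `sample_gamma_small_shape` and the check against the theoretical moments are assumptions added for illustration. The second branch draws from the exponential shifted to [1, ∞[, consistent with the mixture component $g_2$.

```python
import math
import random

def sample_gamma_small_shape(delta, rng):
    """Draw X ~ Gamma(1, delta), 0 < delta < 1, by rejection from the mixture bound."""
    p = math.e / (math.e + delta)
    while True:
        u = rng.random()
        v = rng.random()
        w = 1.0 - rng.random()          # in (0, 1], keeps log(w) finite
        if v < p:
            x = w ** (1.0 / delta)      # quantile of g1(x) = delta x^(delta-1) on [0, 1]
            if u < math.exp(-x):
                return x
        else:
            x = 1.0 - math.log(w)       # 1 plus an exp(1) draw, so x >= 1
            if u < x ** (delta - 1.0):
                return x

rng = random.Random(5)
delta = 0.23
draws = [sample_gamma_small_shape(delta, rng) for _ in range(100_000)]
mean_draws = sum(draws) / len(draws)                     # theoretical mean: delta
second_moment = sum(x * x for x in draws) / len(draws)   # theoretical: delta (delta + 1)
print(mean_draws, second_moment)
```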
Example 2: computing the integrals of the Bayesian example (Cauchy prior)
In Bayesian inference with a Cauchy prior and normal errors, we have to
compute a ratio of the form
$r = \frac{\int_{-\infty}^{\infty} x^m \cdot \frac{1}{b\pi \left[1+\left(\frac{x-a}{b}\right)^2\right]} \cdot \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}{\int_{-\infty}^{\infty} \frac{1}{b\pi \left[1+\left(\frac{x-a}{b}\right)^2\right]} \cdot \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-(x-\mu)^2/2\sigma^2}\,dx}$
Using rejection sampling, we will generate data X from a distribution
with density
$f_X(x) = K \cdot \frac{1}{1+\left(\frac{x-a}{b}\right)^2} \cdot g_X(x)$,
where $g_X(x)$ is the normal density function.
It then holds that $r = E(X^m)$
We can draw observations from X even if we know $f_X(x)$ only up to a constant.
Indeed, let $f_X(x) = K \cdot f(x)$ with K unknown, then
$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot f(x)}{M \cdot g_X(x)}$
Herein f(x) and $g_X(x)$ are known. If $f(x)/g_X(x)$ is bounded by C, then take
M ≥ KC.
In the example above
$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot g_X(x) \cdot \frac{1}{1+\left(\frac{x-a}{b}\right)^2}}{M \cdot g_X(x)} = \frac{K}{M \left[1+\left(\frac{x-a}{b}\right)^2\right]}$
Take M = K, then the result is bounded by 1.
So generate $X \sim g_X$ and U ∼ uniform[0, 1], then accept X if
$U \le \frac{1}{1+\left(\frac{X-a}{b}\right)^2}$
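A sketch of this rejection scheme, estimating the ratio r as a sample mean of accepted draws. The function name `posterior_moment`, its argument names and the parameter values are assumptions. The check uses the symmetric case a = µ, where the target density is symmetric around µ and the first moment must equal µ.

```python
import random

def posterior_moment(m, a, b, mu, sigma, n_accept, rng):
    """Estimate r = E(X^m) for the density proportional to
    1 / (1 + ((x - a) / b)^2) times the N(mu, sigma^2) density."""
    total = 0.0
    accepted = 0
    while accepted < n_accept:
        x = rng.gauss(mu, sigma)                      # X ~ g_X (normal proposal)
        u = rng.random()                              # U ~ uniform[0, 1]
        if u <= 1.0 / (1.0 + ((x - a) / b) ** 2):     # acceptance probability f_X / (M g_X)
            total += x ** m
            accepted += 1
    return total / n_accept

rng = random.Random(6)
mean_est = posterior_moment(1, a=2.0, b=1.0, mu=2.0, sigma=1.0, n_accept=50_000, rng=rng)
print(mean_est)  # close to 2 by symmetry
```

The unknown constant K never appears in the code, which is the point made in the text: only the known factor of the acceptance ratio is needed.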
2. Markov Chain Monte Carlo Methods
• Monte Carlo Methods are based on independent sampling, law of large
numbers, central limit theorem
• Independent sampling may be difficult to realize, especially when we
sample from a large dimensional vector X
• Markov Chain Monte Carlo (MCMC) Methods simulate a sequence of
dependent observations
2.1 Markov Chains
Discrete time Markov Chain←→ continuous time Markov Chain
We consider discrete time MC
Discrete state space MC←→ general state space MC
A discrete state space MC is a sequence of RV’s $(X_n; n \in N)$ for which
$X_n \in E$. The state space E is countable and can thus be identified with (a subset of) Z. (We
can take E = Z.) The sequence satisfies the Markov condition, i.e.,
$P(X_{n+1} = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_{n+1} = j \mid X_n = i_n)$
Define $P^{(n)}_{ij} = P(X_{n+1} = j \mid X_n = i)$
The Markov Chain is stationary or homogeneous if $P^{(n)}_{ij}$ does not depend
on n. We can write $P_{ij} = P(X_{n+1} = j \mid X_n = i)$
The matrix with elements $P_{ij}$ is called the transition matrix
Irreducibility
(We further assume stationary MC, unless otherwise stated)
n-step Transitions
If P is the transition matrix of a discrete space Markov process, then
$P(X_{m+n} = j \mid X_m = i) = (P^n)_{ij}$
Accessibility
A state j is accessible from a state i if ∃n ∈ N, such that $(P^n)_{ij} > 0$. We
denote i → j
Two states are communicating if they are mutually accessible from each
other. We denote i ↔ j
If all states communicate, the MC is said to be irreducible
Period
The period $d_i$ of state i is the greatest common divisor of all n > 0 for which $(P^n)_{ii} > 0$. It follows
that $(P^n)_{ii} > 0 \Rightarrow n = k \cdot d_i$ with k ∈ N
If i ↔ j, then $d_i = d_j$
Proof
∃n ∈ N for which $(P^n)_{ij} > 0$ and ∃m ∈ N for which $(P^m)_{ji} > 0$
Now suppose that $(P^r)_{jj} > 0$, then $(P^{n+r+m})_{ii} \ge (P^n)_{ij} \cdot (P^r)_{jj} \cdot (P^m)_{ji} > 0$, hence we
know that $n + r + m = k_i' \cdot d_i$.
We also have $(P^{2r})_{jj} \ge (P^r)_{jj} \cdot (P^r)_{jj} > 0$, hence $n + 2r + m = k_i'' \cdot d_i$, and so
$r = (n+2r+m) - (n+r+m) = (k_i'' - k_i') \cdot d_i = k_i \cdot d_i$.
So, any r with $(P^r)_{jj} > 0$ is a multiple of $d_i$, hence $d_j$ is a multiple of $d_i$.
A similar argument leads to the conclusion that $d_i$ is a multiple of $d_j$. This
is only possible if $d_j = d_i$
Transient states
Denote $T_{ii}$ the first n > 0 so that $X_n = i$ given that $X_0 = i$
We know that $T_{ii} = k \cdot d_i$ and $P(T_{ii} = k \cdot d_i) > 0$.
Denote $V_{ii} = \sum_{k=1}^{\infty} I(X_{k \cdot d_i} = i) \mid \{X_0 = i\} = \sum_{n=1}^{\infty} I(X_n = i) \mid \{X_0 = i\}$
(with I(A) the indicator function of event A)
A state i is transient if $E(V_{ii}) < \infty$, that is, if an infinite number of steps in
the Markov Chain leads at most to a finite number of visits to state i.
$E(V_{ii}) = \sum_{n=1}^{\infty} E(I(X_n = i) \mid X_0 = i) = \sum_{n=1}^{\infty} P(X_n = i \mid X_0 = i)$
So
$E(V_{ii}) < \infty \Leftrightarrow \sum_{n=1}^{\infty} P(X_n = i \mid X_0 = i) < \infty$
This is equivalent to $P(T_{ii} < \infty) < 1$
Indeed, suppose that $P(T_{ii} < \infty) = 1$, and denote $T^{(r)}_{ii}$ the number of steps until the rth
occurrence of state i. Then, because of the Markov condition, $T^{(r)}_{ii} = \sum_{\ell=1}^{r} T_{ii,\ell}$ with $T_{ii,\ell}$ IID
observations from $T_{ii}$. $P(T^{(r)}_{ii} < \infty) = P\left(\bigcap_{\ell=1}^{r} (T_{ii,\ell} < \infty)\right) = \prod_{\ell=1}^{r} P(T_{ii,\ell} < \infty) = 1$ for any
finite r. So $V_{ii} \ge r$, a.s. for any r ∈ N.
Transience, $P(T_{ii} < \infty) < 1$, implies that $\mu_{ii} = E(T_{ii}) = \infty$, but the converse does not hold (see
below).
Recurrent states
A state is called recurrent if it is not transient, i.e., if it is visited an infinite
number of times.
If the expected time until the first visit is infinite, i.e., if µii = E(Tii) =∞, then
the state is called null-recurrent, otherwise it is called positive or ergodic.
A null-recurrent state is visited an infinite number of times, but the relative
number of visits tends to zero: $E(V_{ii}) = \sum_{n=1}^{\infty} (P^n)_{ii} = \infty$ and
$\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to 0$
A positive state has $\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to \frac{1}{\mu_{ii}}$
Proof (as of yet incomplete)
We prove that in a positive state, it holds that $\frac{1}{N} \sum_{n=1}^{N} P(X_n = i \mid X_0 = i) \to \frac{1}{E(T_{ii})}$
• Law of total probability + Markov condition for n > 0:
$P(X_n = i \mid X_0 = i) = \sum_{k=1}^{n} P(X_n = i \mid X_k = i) \cdot P(T_{ii} = k)$
$= \sum_{k=1}^{n} P(X_{n-k} = i \mid X_0 = i) \cdot P(T_{ii} = k)$
Also, $P(X_n = i \mid X_0 = i) = 1$ for n = 0.
• If we define $t_k = P(T_{ii} = k)$, with $t_0 = 0$ and $p_n = P(X_n = i \mid X_0 = i)$, then we have
$p_n = \sum_{k=1}^{n} p_{n-k} \cdot t_k = \sum_{k=0}^{n} p_{n-k} \cdot t_k$ for n > 0 and $p_0 = 1 \ne p_0 \cdot t_0 = 0$.
• Denoting $t = (t_k, k \in N)$ and $p = (p_n, n \in N)$, then the sum above is the convolution of
the sequences t and p: t ∗ p. Since the expression does not hold for n = 0, we have to
correct with a Kronecker sequence $\delta_0 = (1, 0, 0, 0, \ldots)$. We get: $p = t * p + \delta_0$
• Denote $a(s) = \sum_{k=0}^{\infty} a_k s^k$, then the equation above becomes
$p(s) = p(s) \cdot t(s) + 1 \Leftrightarrow p(s) = \frac{1}{1 - t(s)}$
• Since $t(1) = \sum_{k=1}^{\infty} P(T_{ii} = k) = P(T_{ii} < \infty)$, for a recurrent Markov process p(s) has a
singularity at s = 1. Further, $\lim_{s \to 1} (1-s) \cdot p(s) = \lim_{s \to 1} \frac{1-s}{1-t(s)} = \frac{1}{t'(1)}$ and
$t'(1) = \sum_{k=1}^{\infty} k \cdot P(T_{ii} = k) = E(T_{ii}) = \mu_{ii}$
• On the other hand,
$\lim_{s \to 1} (1-s) \cdot p(s) = \lim_{u \to \infty} \frac{1}{u} \cdot p(1 - 1/u) = \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{\infty} p_k (1 - 1/n)^k$
$= \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k \left[(1 - 1/n)^k - 1\right] + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=n+1}^{\infty} p_k (1 - 1/n)^k$
• ...
Equilibrium distribution
Theorem In an irreducible discrete time, discrete state space MC the states
are either all transient, all null-recurrent, or all positive (ergodic).
All irreducible finite state MC are positive
Denote $p_{n,i} = P(X_n = i)$, and row vector $p_n = (\ldots, p_{n,i}, \ldots)$, then
$p_{n+1} = p_n \cdot P$, i.e., $p_{n+1,i} = \sum_{\ell \in Z} P_{\ell i}\, p_{n,\ell}$
P · 1 = 1 (because transition probabilities sum to one)
λ = 1 is an eigenvalue
The left eigenvector is an invariant or stationary or equilibrium distribution
p · P = p
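For a small chain, the invariant distribution can be approximated by iterating the update of the row vector. The sketch below assumes a hypothetical three-state transition matrix that is irreducible and aperiodic (it has self-loops), so that the iteration converges; the helper names `step` and `stationary` are illustrative.

```python
def step(p, P):
    """One step p_{n+1} = p_n . P for a row vector p and transition matrix P."""
    size = len(P)
    return [sum(p[l] * P[l][i] for l in range(size)) for i in range(size)]

def stationary(P, n_steps=1000):
    """Iterate p_{n+1} = p_n . P, starting from the uniform distribution."""
    p = [1.0 / len(P)] * len(P)
    for _ in range(n_steps):
        p = step(p, P)
    return p

# Hypothetical 3-state chain: rows sum to one.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
p = stationary(P)
print(p)  # close to (0.25, 0.5, 0.25), which satisfies p . P = p
```

For a periodic chain this plain iteration need not converge, even though an invariant distribution exists; solving p · P = p directly would be the alternative.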
Reversed Markov Processes
If (Xn;n = 0, . . .) is a Markov Chain with transition matrix P and equilibrium
distribution p, then
$P(X_n = j \mid X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)$
$= P(X_n = j \mid X_{n+1} = i) = \frac{p_j}{p_i} \cdot P_{ji}$
Proof (Bayes, Chain rule for conditional probabilities and Markov Condition)
$P(X_n = j \mid X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)$
$= \frac{P(X_n = j) \cdot P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m \mid X_n = j)}{P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)}$
$= \frac{P(X_n = j) \cdot P(X_{n+1} = i \mid X_n = j) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m \mid X_{n+1} = i)}{P(X_{n+1} = i) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m \mid X_{n+1} = i)}$
$= \frac{P(X_n = j) \cdot P(X_{n+1} = i \mid X_n = j)}{P(X_{n+1} = i)} = P(X_n = j \mid X_{n+1} = i)$
$= \frac{p_j \cdot P_{ji}}{p_i}$
Reversible Markov Processes
If there exists a distribution pi = P (X = i) that satisfiespj · Pji
pi= Pij then
the Markov chain is called Reversible
Remark Reversibility thus means not that the reversed Markov process exists (it always exists),
but that its transition probabilities for i→ j are the same as the forward probabilities for the same
transitions i→ j (so NOT for j → i)
The distribution is then the equilibrium distribution. Indeed, summing
$p_j \cdot P_{ji} = p_i \cdot P_{ij}$ over j gives $\sum_j p_j \cdot P_{ji} = p_i \sum_j P_{ij} = p_i$, which is, in matrix
form, $p \cdot P = p$, the invariant distribution equation.
The reversed process then has the same transition probabilities as the forward process.
The condition $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ is called the detailed balance equation (since it
implies the “global” balance equation)
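Detailed balance is strictly stronger than invariance, and a small check makes the difference visible. The following sketch uses two hypothetical 3-state chains (the matrices and function names are assumptions for illustration): a birth-death chain that satisfies detailed balance, and a mostly clockwise chain whose uniform distribution is invariant but not reversible.

```python
def is_invariant(p, P, tol=1e-12):
    """Global balance: sum_j p_j P_ji = p_i for every state i."""
    n = len(P)
    return all(abs(sum(p[j] * P[j][i] for j in range(n)) - p[i]) <= tol
               for i in range(n))


def satisfies_detailed_balance(p, P, tol=1e-12):
    """Detailed balance: p_j P_ji = p_i P_ij for every pair (i, j)."""
    n = len(P)
    return all(abs(p[j] * P[j][i] - p[i] * P[i][j]) <= tol
               for i in range(n) for j in range(n))


# Birth-death chain: reversible with respect to p_rev.
P_rev = [[0.50, 0.50, 0.00],
         [0.25, 0.50, 0.25],
         [0.00, 0.50, 0.50]]
p_rev = [0.25, 0.50, 0.25]

# Mostly clockwise chain: doubly stochastic, so the uniform law is invariant.
P_cyc = [[0.0, 0.9, 0.1],
         [0.1, 0.0, 0.9],
         [0.9, 0.1, 0.0]]
p_cyc = [1 / 3, 1 / 3, 1 / 3]

print(is_invariant(p_rev, P_rev), satisfies_detailed_balance(p_rev, P_rev))  # True True
print(is_invariant(p_cyc, P_cyc), satisfies_detailed_balance(p_cyc, P_cyc))  # True False
```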
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.59
2.2 Models for multivariate random variables
1. In the next slides we consider vectors X with multivariate distributions.
We discuss two ways to define/fix any multivariate distribution
• Markov Random Field (MRF), which is a special case of a graphical
model
• Gibbs Random Field (GRF)
2. The Markov property (dependence through adjacency) plays a role
both at the level of the sampling process and at the level of the sampled
multivariate random variable: Markov Chains for the sampling, Markov
Random Fields for the sampled variable
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.60
2.2.1 Markov Random Field (MRF)
Given a multivariate random variable X, a graphical model can be used to
represent the intra-dependencies.
An undirected graph is an ordered pair of sets G = (V,E), where
V = {1, . . . , p} is the set of vertices, sites or nodes, which are here indices
into X. The set E contains the (undirected) edges in the graph, where an
undirected edge is an unordered pair of vertices.
In a Markov Random Field, two vertices i and j are connected by an edge if
and only if the corresponding components of X are conditionally dependent,
i.e., dependent given the values of all the other components:
$P(X_i = x_i \mid \{X_1, \dots, X_p\} \setminus \{X_i\}) \neq P(X_i = x_i \mid \{X_1, \dots, X_p\} \setminus \{X_i, X_j\})$
The two sites are then called neighbours.
The neighbourhood of i is defined as $\partial i = \{j \mid \{i,j\} \in E\}$.
Formally, denoting by $2^V$ the set of all subsets of V, we have
$\partial : V \to 2^V : i \mapsto \partial i = \{j \mid \{i,j\} \in E\}$
Markov property: it holds that $P(X_i = x_i \mid X_{V \setminus \{i\}}) = P(X_i = x_i \mid X_{\partial i})$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.61
Examples of MRFs
• In principle any multidimensional probability distribution can be seen as
a MRF. In general, all components are conditionally dependent, so
$\partial i = \{1, \dots, p\} \setminus \{i\}$
• A (finite sample from a) Markov Chain is also a MRF. Indeed, (thanks to
the notion of reversed Markov Processes) $\partial i = \{i-1, i+1\}$
– Forward Markov Chain: • → • → • → •
– Reversed MC: • ← • ← • ← •
– MRF representation: • − • − • − •
• A two-dimensional MRF:
• − • − • − •
|   |   |   |
• − • − • − •
|   |   |   |
• − • − • − •
|   |   |   |
• − • − • − •
– Dimension of random vector X is p = 16
– X has a 2D-geometric background
– Components of X can be represented
with a 2D index: $X_s = X_{(i,j)}$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.62
A short note on graphical models
Markov Random Fields are an example of graphical models
Graphical models are used to define or represent multivariate random
variables X
MRF are undirected graphs, edges define neighbourhoods ∂i
MC (Markov Chains) are an example of Bayesian networks: directed,
acyclic graphs: Edges define parents of nodes par(i)
The construction of the joint probability in a directed graph is immediate
$f_X(x) = \prod_{i=1}^{p} f_{X_i \mid X_{par(i)}}(x_i \mid x_{par(i)})$
• when par(i) = Ø, then the conditional distribution should be interpreted
as the marginal distribution
• The construction is always possible because the graph is acyclic
• For MRF’s/undirected graphs, the joint pdf/pmf is not so straightforward,
we need the concept of Gibbs Random Fields (see slide 65)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.63
Example of modelling by Bayesian networks
Let X = (X1, X2, X3), then
• the graph • ← • → • represents the situation where X1 and X3 are
dependent, but, given the value of X2 (= conditionally), they are
independent. The dependence occurs through X2
• the graph • → • ← • represents the situation where X1 and X3 are
independent, but X2 depends on both. If X2 is observed, this gives
information on both X1 and X3, so X1 and X3 are conditionally
dependent. (By observation of X2, we learn about both X1 and X3)
These models are used, for instance, in studies of causality, and are popular
in several (other) domains of statistical learning
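The second graph (• → • ← •) can be made concrete by enumeration. The sketch below is a toy illustration under assumed numbers: X1 and X3 are independent fair coins, and the conditional table `p_x2_given_parents` is hypothetical. The joint pmf is built from the directed factorisation and then queried with exact fractions.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def p_x2_given_parents(x2, x1, x3):
    """Hypothetical conditional: X2 is likely 1 when at least one parent is 1."""
    p_one = Fraction(9, 10) if (x1 or x3) else Fraction(1, 10)
    return p_one if x2 == 1 else 1 - p_one


# Joint pmf from the directed factorisation f(x1) * f(x3) * f(x2 | x1, x3).
joint = {(x1, x2, x3): HALF * HALF * p_x2_given_parents(x2, x1, x3)
         for x1, x2, x3 in product((0, 1), repeat=3)}


def prob(event):
    """Probability of the outcomes (x1, x2, x3) for which event(...) is true."""
    return sum(p for x, p in joint.items() if event(*x))


# Marginally, X1 and X3 are independent.
p13 = prob(lambda x1, x2, x3: x1 == 1 and x3 == 1)
p1 = prob(lambda x1, x2, x3: x1 == 1)
p3 = prob(lambda x1, x2, x3: x3 == 1)
print(p13 == p1 * p3)  # True

# Given X2 = 1, they become dependent.
p2 = prob(lambda x1, x2, x3: x2 == 1)
c13 = prob(lambda x1, x2, x3: x1 == 1 and x2 == 1 and x3 == 1) / p2
c1 = prob(lambda x1, x2, x3: x1 == 1 and x2 == 1) / p2
c3 = prob(lambda x1, x2, x3: x3 == 1 and x2 == 1) / p2
print(c13 == c1 * c3)  # False
```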
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.64
2.2.2 Gibbs Random Field (GRF)
Let X be a multivariate random variable of dimension p, and let E be a set of
edges defined on V = {1, . . . , p}.
Unlike in MRF, the edges in a GRF are not defined on the basis of a
conditional probability. They are used to define the global probability, as
follows:
A clique (or complete subset) is defined as
$C \subset V$ is a clique $\Leftrightarrow \forall i \in C : C \subset \{i\} \cup \partial i$
The set of cliques is denoted as $\mathcal{C} = \{C \subset V \mid \forall i \in C : C \subset \{i\} \cup \partial i\}$
A probability distribution that can be decomposed into factors associated with
the cliques is called a Gibbs Random Field (GRF)
$f_X(x)$ is a GRF $\Leftrightarrow f_X(x) = \prod_{C \in \mathcal{C}} f_C(x_C) = \frac{1}{Z} \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$
The functions $H_C(x_C)$ are (up to a constant) minus the logarithms of $f_C(x_C)$. They
are called clique potentials. The normalizing constant Z is called the partition
function.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.65
Gibbs Random Field - further discussion
Use of GRF’s
• GRFs can be used to define a joint probability on an undirected graph
• MRFs represent local, conditional probabilities
• The Hammersley-Clifford theorem (slide 69) establishes the connection between GRFs and MRFs
Examples of GRF’s
• In principle any multidimensional probability distribution can be seen as
a GRF. In general, all components are conditionally dependent, and the
cliques are all subsets of V . All clique potentials are zero, except for
C = V , whose potential is HV (x) = − log(fX(x)).
• Ising model (see slide 67)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.66
Example of GRF: Ising model
A two-dimensional lattice $\{(i,j) \mid 0 \le i \le m, 0 \le j \le n\}$ (see slide 62) can be
equipped with a neighbourhood system by defining for each internal site
$\partial(i,j) = \{(i-1,j), (i+1,j), (i,j-1), (i,j+1)\}$
The cliques are then singletons and (horizontal and vertical) pairs of sites
$\mathcal{C} = \{\{(i,j)\}\} \cup \{\{(i,j), (i+1,j)\}\} \cup \{\{(i,j), (i,j+1)\}\}$
In the case where the observations are binary, say $X_{(i,j)} \in \{-1, 1\}$, a popular
GRF model is the Ising model
$H_C(x_C) = \tau \cdot x_{C,1} \cdot x_{C,2}$ for the pairs and $H_s(x_s) = \gamma \cdot x_s$ for the singletons.
The pair potentials express the interaction between adjacent sites, while the
singleton potentials express a drift towards one of the two states.
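As a rough illustration of how the clique potentials add up, the sketch below evaluates the total energy (the sum of all pair and singleton potentials) of a configuration stored as a list of rows with entries ±1. The function names and the 2×2 configurations are assumptions for this example; the partition function Z is deliberately left out.

```python
import math


def ising_energy(x, tau, gamma):
    """Sum of pair potentials tau*x_s*x_t and singleton potentials gamma*x_s."""
    rows, cols = len(x), len(x[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            energy += gamma * x[i][j]                  # singleton clique
            if i + 1 < rows:
                energy += tau * x[i][j] * x[i + 1][j]  # vertical pair
            if j + 1 < cols:
                energy += tau * x[i][j] * x[i][j + 1]  # horizontal pair
    return energy


def unnormalised_prob(x, tau, gamma):
    """exp(-H(x)); dividing by Z would turn this into a probability."""
    return math.exp(-ising_energy(x, tau, gamma))


aligned = [[1, 1], [1, 1]]
checker = [[1, -1], [-1, 1]]
print(ising_energy(aligned, tau=-1.0, gamma=0.0))  # -4.0
print(ising_energy(checker, tau=-1.0, gamma=0.0))  # 4.0
```

With a negative τ, aligned neighbours lower the energy, so the aligned configuration receives the larger unnormalised probability.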
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.67
2.2.3 The Hammersley-Clifford Theorem: conditions
MRFs are defined by conditional probabilities, based on a neighbourhood
system.
GRFs are defined by a joint probability, decomposed into clique potentials.
The Hammersley-Clifford Theorem states that under mild conditions, both
definitions are equivalent, i.e., a MRF is also a GRF and vice versa.
Two important conditions: existence of joint pdf + positivity
Existence of fX(x): See slide 77
Positivity condition
A probability distribution is said to satisfy the positivity condition if the
following holds: whenever $f_{X_i}(x_i) > 0$ for all $i = 1, \dots, p$, then
$f_X(x) > 0$ for $x = (x_1, \dots, x_i, \dots, x_p)$.
A distribution that violates this condition is the uniform distribution on the unit
disk: $f_X(0.9, 0.8) = 0$ (because $0.9^2 + 0.8^2 > 1$) although $f_{X_1}(0.9) > 0$ and $f_{X_2}(0.8) > 0$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.68
The Hammersley-Clifford Theorem
Theorem
If fX(x) exists and satisfies the positivity condition, then X is a MRF with
neighbourhood system ∂ if and only if it is a GRF whose cliques C follow from
the neighbourhood system ∂.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.69
⇐: GRF → MRF
Suppose X is a GRF with cliques C based on neighbourhood system ∂. Further denote
I = {1, . . . , p}, and i∂i = {i} ∪ ∂i. Let Ci = {C ∈ C|i ∈ C} be the cliques that contain site i.
Then
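$P(X_i = x_i \mid X_{I \setminus \{i\}} = x_{I \setminus \{i\}}) = \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{P(X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}$
$= \frac{P(X_i = x_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}{\sum_{y_i} P(X_i = y_i, X_{I \setminus \{i\}} = x_{I \setminus \{i\}})}$
$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i}) \cdot \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_C)}$
$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})}$
$= \frac{\prod_{C \in \mathcal{C}_i} f_C(x_i, x_{C \cap \partial i})}{\sum_{y_i} \prod_{C \in \mathcal{C}_i} f_C(y_i, x_{C \cap \partial i})} \cdot \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C} \setminus \mathcal{C}_i} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}$
$= \frac{\sum_{y_{I \setminus i\partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap i\partial i}, y_{C \setminus i\partial i})}{\sum_{y_{I \setminus \partial i}} \prod_{C \in \mathcal{C}} f_C(x_{C \cap \partial i}, y_{C \setminus \partial i})}$
$= \frac{P(X_{i\partial i} = x_{i\partial i})}{P(X_{\partial i} = x_{\partial i})}$
$= P(X_i = x_i \mid X_{\partial i} = x_{\partial i})$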
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.70
The construction of a GRF out of a MRF
For the other direction (from MRF to GRF) we need a few auxiliary definitions and results.
Let $g : \mathbb{R}^p \to \mathbb{R} : x \mapsto g(x)$ be a given function and let $o \in \mathbb{R}^p$ be a reference state for which g(o) > 0.
Then define for each $A \subset I = \{1, \dots, p\}$ the function
$G_A(x) = g(u(x))$ where $u : \mathbb{R}^p \to \mathbb{R}^p$ with $u_i = x_i$ if $i \in A$ and $u_i = o_i$ if $i \in I \setminus A$
Further define $H_A(x) = \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x)$
Then we have the following results
• $H_\emptyset(x)$ is a constant: $H_\emptyset(x) = g(o), \forall x$
• $H_A(x)$ does not depend on the components of x with index outside A:
if $x_A = y_A$, then $H_A(x) = H_A(y)$
• If one of the components of x with index in A takes the corresponding reference value,
then $H_A(x) = 0$: for $A \neq \emptyset$, if $x_i = o_i$ for at least one $i \in A$, then $H_A(x) = 0$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.71
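Proof. Define
$\mathcal{B}_i = \{B \subset A \mid i \notin B\}$, $\bar{B} = B \cup \{i\}$, $\bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\}$,
then $\mathcal{B}_i$ and $\bar{\mathcal{B}}_i$ constitute a partition of $2^A = \{B \subset A\}$ into two equally large parts. For a pair
$\{B, \bar{B} = B \cup \{i\}\}$, and for any x with $x_i = o_i$, we have that $G_B(x) = G_{\bar{B}}(x)$. Since $\#(A \setminus \bar{B}) = \#(A \setminus B) - 1$, the two terms of each pair have opposite signs, and so
$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right] = 0$
• (Möbius inversion) $g(x) = G_I(x) = \sum_{A \subseteq I} H_A(x)$
Proof
$\sum_{A \subseteq I} H_A(x) = \sum_{A \subseteq I} \sum_{B \subseteq A} (-1)^{\#(A \setminus B)} G_B(x) = \sum_{B \subseteq I} G_B(x) \sum_{A : B \subseteq A \subseteq I} (-1)^{\#(A \setminus B)}$
(We have switched the order of summations and moved $G_B(x)$ forward)
Denote $D = A \setminus B$, then $B \subseteq A \subseteq I \Leftrightarrow \emptyset \subseteq D \subseteq I \setminus B$, and so we get
$\sum_{A \subseteq I} H_A(x) = \sum_{B \subseteq I} G_B(x) \sum_{D \subseteq I \setminus B} (-1)^{\#D}$
Unless B = I, the number of subsets $D \subseteq I \setminus B$ is even, and exactly half of those subsets
have an even #D, and the other half have an odd #D, hence all but one of the terms in the outer
sum are zero, leading to
$\sum_{A \subseteq I} H_A(x) = G_I(x) = g(x)$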
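The three properties and the Möbius inversion can be verified by brute force for a small p. In the sketch below, the test function `g`, the point `x` and the reference state `o` are arbitrary choices made for illustration; integer arithmetic keeps every comparison exact.

```python
from itertools import combinations


def subsets(s):
    """All subsets of the index set s, as frozensets."""
    items = sorted(s)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)


def G(g, A, x, o):
    """G_A(x) = g(u) with u_i = x_i for i in A and u_i = o_i otherwise."""
    u = tuple(x[i] if i in A else o[i] for i in range(len(x)))
    return g(u)


def H(g, A, x, o):
    """H_A(x) = sum over B subset of A of (-1)^#(A minus B) * G_B(x)."""
    return sum((-1) ** len(A - B) * G(g, B, x, o) for B in subsets(A))


def g(u):
    # Arbitrary test function with interactions {0,1} and {1,2} only.
    return u[0] * u[1] + 2 * u[1] * u[2] + u[0] + 3


x, o = (2, -1, 5), (0, 0, 0)
I = frozenset(range(3))

print(H(g, frozenset(), x, o))                    # 3, which is g(o)
print(sum(H(g, A, x, o) for A in subsets(I)))     # -7, which is g(x)
print(H(g, frozenset({0, 2}), x, o))              # 0: g has no {0,2} interaction
print(H(g, frozenset({0, 1}), (0, -1, 5), o))     # 0: x_0 equals o_0
```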
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.72
Proof of Hammersley-Clifford ⇒ MRF → GRF
Theorem
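If $g(x) = -\log f_X(x)$ where $f_X(x)$ is the joint probability distribution of a MRF on x with cliques
$\mathcal{C}$, then in the construction above $H_A(x) = 0$ if $A \notin \mathcal{C}$.
Proof
Suppose that $A \notin \mathcal{C}$, then there must be two elements, say $i, j \in A$, so that $i \notin \partial j$ and $j \notin \partial i$.
For the given i, define as before
$\mathcal{B}_i = \{B \subset A \mid i \notin B\}$, $\bar{B} = B \cup \{i\}$, $\bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\}$
Then
$H_A(x) = \sum_{B \in \mathcal{B}_i} \left[(-1)^{\#(A \setminus B)} G_B(x) + (-1)^{\#(A \setminus \bar{B})} G_{\bar{B}}(x)\right]$
$= \sum_{B \in \mathcal{B}_i} (-1)^{\#(A \setminus B)} \left[G_B(x) - G_{\bar{B}}(x)\right]$
Denoting $\bar{u} = (x_{\bar{B}} \, o_{I \setminus \bar{B}})$, we have
$G_{\bar{B}}(x) = -\log f_X(\bar{u})$
$= -\log\left[f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid \bar{u}_{I \setminus \{i\}})\right]$
$= -\log f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) - \log f_{X_i \mid X_{\partial i}}(x_i \mid \bar{u}_{\partial i})$
Denoting $u = (x_B \, o_{I \setminus B})$, we see that $u$ and $\bar{u}$ differ only in i, so $u_{I \setminus \{i\}} = \bar{u}_{I \setminus \{i\}}$, and so we
can write
$G_B(x) = -\log f_{X_{I \setminus \{i\}}}(\bar{u}_{I \setminus \{i\}}) - \log f_{X_i \mid X_{\partial i}}(o_i \mid \bar{u}_{\partial i})$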
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.73
The difference between both is then (with $\bar{B} = B \cup \{i\}$ and $\bar{u} = (x_{\bar{B}} \, o_{I \setminus \bar{B}})$)
$G_{\bar{B}}(x) - G_B(x) = -\log f_{X_i \mid X_{\partial i}}(x_i \mid \bar{u}_{\partial i}) + \log f_{X_i \mid X_{\partial i}}(o_i \mid \bar{u}_{\partial i})$
The common term that was annihilated depended on index j, but what remains does not, as
$j \notin \partial i$; hence none of the terms in $H_A(x)$ depends on the value of $x_j$. Hence, $H_A(x) = H_A(y)$,
where $y_\ell = x_\ell$ for $\ell \neq j$ and $y_j = o_j$. We have seen that for such an argument $H_A(y) = 0$, from
which the proof follows.
The proof assumes positivity because the annihilations that take place are implicitly based on
ratios of probabilities (differences of log-probabilities), which are all assumed to be nonzero.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.74
Importance of Hammersley-Clifford in MCMC
The constructive proof of Hammersley-Clifford shows that the
conditional probabilities in a Markov model allow us to construct the joint
distribution as
$f_X(x) = \frac{1}{Z} \cdot \exp\left(-\sum_{C \in \mathcal{C}} H_C(x_C)\right)$
where for a chosen $i \in C$, and a reference state o
$H_C(x_C) = \sum_{B \subseteq C, \, i \in B} (-1)^{\#(C \setminus B)} \log\left(\frac{f_{X_i \mid X_{\partial i}}(o_i \mid u_{\partial i})}{f_{X_i \mid X_{\partial i}}(x_i \mid u_{\partial i})}\right)$
where, in the term for B, $u_j = x_j$ if $j \in B$ and $u_j = o_j$ if $j \notin B$. The partition function Z follows
from the choices of o and $i \in C$.
HC states that conditional probabilities in a Markov model are sufficient to
define the joint probability of a random vector.
This is unlike marginal probabilities: they do not uniquely fix the joint
probability (as they contain no information about the dependence structure)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.75
A construction without cliques
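In some applications (such as the one we will need), the clique potentials are just an intermediate
result. It is possible to construct the joint distribution directly from the conditional distributions,
however, without proving that it factorizes into clique potential functions.
Simplified theorem (no cliques)
$\frac{f_X(x)}{f_X(o)} = \prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}$
Or, otherwise stated
$f_X(x) \propto \prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}$
Proof
We start from the right-hand side
$\prod_{i=1}^{p} \frac{f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}{f_{X_i \mid X_{I \setminus \{i\}}}(o_i \mid o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})} = \prod_{i=1}^{p} \frac{f_X(o_{\{1,\dots,i-1\}} x_{\{i,\dots,p\}}) \, / \, f_{X_{I \setminus \{i\}}}(o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}{f_X(o_{\{1,\dots,i\}} x_{\{i+1,\dots,p\}}) \, / \, f_{X_{I \setminus \{i\}}}(o_{\{1,\dots,i-1\}} x_{\{i+1,\dots,p\}})}$
Within each factor the marginal densities cancel. Each numerator $f_X(\cdot)$ then cancels against the denominator of the previous factor, leaving us
with the first numerator $f_X(x)$ and the last denominator $f_X(o)$, which is exactly the expression on the left-hand
side.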
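The telescoping product can be checked exactly on a small discrete example. The sketch below assumes a hypothetical strictly positive pmf on {0,1}³ (the `weights` are arbitrary) and computes the full conditionals from it; `conditional` and `brook_ratio` are illustrative names, not notation from the slides.

```python
from fractions import Fraction
from itertools import product

states = list(product((0, 1), repeat=3))
weights = [3, 1, 4, 1, 5, 9, 2, 6]          # arbitrary positive weights
total = sum(weights)
f = {s: Fraction(w, total) for s, w in zip(states, weights)}


def conditional(i, value, rest):
    """f(X_i = value | other components); entry i of `rest` is ignored."""
    num = f[rest[:i] + (value,) + rest[i + 1:]]
    den = sum(f[rest[:i] + (v,) + rest[i + 1:]] for v in (0, 1))
    return num / den


def brook_ratio(x, o):
    """Product over i of f(x_i | o_1..o_{i-1}, x_{i+1}..x_p) / f(o_i | same)."""
    ratio = Fraction(1)
    for i in range(len(x)):
        mixed = o[:i] + (x[i],) + x[i + 1:]  # reference values before i, x after i
        ratio *= conditional(i, x[i], mixed) / conditional(i, o[i], mixed)
    return ratio


o = (0, 0, 0)
print(all(brook_ratio(x, o) == f[x] / f[o] for x in states))  # True
```

Only the conditionals enter `brook_ratio`; the joint pmf is used here solely to generate them and to verify the result.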
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.76
Note on the existence of a joint distribution
Note Hammersley-Clifford does not guarantee existence of the joint
distribution, but if it exists, it is well defined by the conditional probabilities.
Example Consider $X_1 | X_2 \sim \exp(\lambda X_2)$ and $X_2 | X_1 \sim \exp(\lambda X_1)$, then
according to the construction above, we find that
$f_{(X_1,X_2)}(x_1, x_2) \propto \frac{f_{X_1|X_2}(x_1|x_2)}{f_{X_1|X_2}(o_1|x_2)} \cdot \frac{f_{X_2|X_1}(x_2|o_1)}{f_{X_2|X_1}(o_2|o_1)}$
$= \frac{\lambda x_2 e^{-\lambda x_2 \cdot x_1}}{\lambda x_2 e^{-\lambda x_2 \cdot o_1}} \cdot \frac{\lambda o_1 e^{-\lambda o_1 \cdot x_2}}{\lambda o_1 e^{-\lambda o_1 \cdot o_2}}$
$\propto e^{-\lambda x_2 \cdot x_1}$
The function $\exp(-\lambda x_2 x_1)$ has no finite integral on $[0,\infty[ \times [0,\infty[$, and
therefore it cannot be normalized to be a (2D) density function.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.77
From HC to Markov Chain Monte Carlo
• Sample from conditional distributions in MRF X (= any multivariate
random variable)
• Creates sequence of samples X1,X2,X3, . . . that are a Markov chain of
Markov Random Fields
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.78
2.3 MCMC samplers for integration
2.3.1 The Gibbs sampler
Suppose X is a p-dimensional random vector, and we can sample from the
conditional densities $f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{I \setminus \{i\}}) = f_{X_i \mid X_{\partial i}}(x_i \mid x_{\partial i})$
Then we construct the following sampler
Set initial values $x_0 = (x_{0,1}, \dots, x_{0,p})$
for n = 1, 2, . . .
  for i = 1, . . . , p
    Draw $X_{n,i} \sim f_{X_i \mid X_{I \setminus \{i\}}}(x \mid x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p})$
The Gibbs-sampler consists of loops defined by conditional distributions.
Therefore, the sampler is based on the description of fX(x) as a Markov
random field. Moreover, the sequence can be seen as a Markov Chain.
So, the Gibbs sampler does NOT rely on the description of fX(x) as a Gibbs
random field. GRF will be at the basis of the Metropolis-Hastings sampler on
slide 90
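A minimal sketch of the two nested loops, under an assumption that is not in the slides: the target is a bivariate standard normal with correlation `rho`, for which the full conditionals are the known normals X1 | X2 = x2 ~ N(rho·x2, 1 − rho²) and symmetrically for X2. The function name and the seed handling are illustrative choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_sweeps, x0=(0.0, 0.0), seed=0):
    """Systematic-scan Gibbs sampler for a standard bivariate normal (assumed target)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    x1, x2 = x0
    samples = []
    for _ in range(n_sweeps):            # outer loop over n
        x1 = rng.gauss(rho * x2, sd)     # component 1 given the old x2
        x2 = rng.gauss(rho * x1, sd)     # component 2 given the fresh x1
        samples.append((x1, x2))
    return samples


samples = gibbs_bivariate_normal(rho=0.5, n_sweeps=20000, seed=1)
mean1 = sum(s[0] for s in samples) / len(samples)
cross = sum(s[0] * s[1] for s in samples) / len(samples)
print(round(mean1, 2), round(cross, 2))  # should be near 0 and near rho
```

Each inner update uses the most recent values of the other components, exactly as in the draw step above; only the state after a full sweep is stored.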
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.79
Invariant distribution
On slide 81, we prove:
The joint distribution fX(x) is invariant under the loops of a Gibbs-sampler
We consider the sequence of states after each outer loop (i.e., iterations over n), not the inner
loops (over the vector components).
We consider the case of a discrete state space.
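Lemma The transition probabilities over the outer loops satisfy
$f_{X_{n+1} \mid X_n}(x \mid v) = \prod_{i=1}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\dots,i-1} \, v_{i+1,\dots,p})$
Proof (discrete case)
This follows from the chain rule
$P\left(\bigcap_{i=1}^{p} A_i \,\Big|\, B\right) = \prod_{i=1}^{p} P\left(A_i \,\Big|\, \bigcap_{j=1}^{i-1} A_j \cap B\right)$
where in our case
$A_i = \{X_{n+1;i} = x_i\}$ and $B = \{X_n = v\}$ □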
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.80
Invariant distribution: proof
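We now consider the case of a discrete state space, and suppose that $f_{X_n}(x) = f_X(x)$, then
$f_{X_{n+1}}(x) = \sum_v f_{X_{n+1} \mid X_n}(x \mid v) \cdot f_{X_n}(v)$
$= \sum_v \prod_{i=1}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\dots,i-1} \, v_{i+1,\dots,p}) \cdot f_X(v)$
$= \sum_{v_p} \cdots \sum_{v_1} f_X(v) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\dots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\dots,i-1} \, v_{i+1,\dots,p})$
$= \sum_{v_p} \cdots \sum_{v_2} f_{X_{2,\dots,p}}(v_{2,\dots,p}) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\dots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\dots,i-1} \, v_{i+1,\dots,p})$
$= \sum_{v_p} \cdots \sum_{v_2} f_X(x_1 v_{2,\dots,p}) \cdot \prod_{i=2}^{p} f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid x_{1,\dots,i-1} \, v_{i+1,\dots,p})$
$= \cdots = f_X(x)$
In the expressions above, we used that $\sum_{v_1} f_X(v) = f_{X_{2,\dots,p}}(v_{2,\dots,p})$. (The notation $X_{2,\dots,p}$
refers to the components of X, not to successive Markov Chain realisations like in $X_n$.)
We then used $f_{X_{2,\dots,p}}(v_{2,\dots,p}) \cdot f_{X_1 \mid X_{I \setminus \{1\}}}(x_1 \mid v_{2,\dots,p}) = f_X(x_1 v_{2,\dots,p})$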
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.81
Reversibility
The proof of the invariance property of fX(x) w.r.t. the Gibbs sampler
established a global balance equation, not a detailed balance equation. A
detailed balance equation is necessary for reversibility.
The Gibbs sampler as a whole is not reversible, meaning
$f_{X_{n-1} \mid X_n}(x_{n-1} \mid x_n) \neq f_{X_{n+1} \mid X_n}(x_{n-1} \mid x_n)$
In words: given that we are in $x_n$, the probability that we came from $x_{n-1}$ differs from the
probability that we move to $x_{n-1}$ in the next step.
Each substep (inner loop) on its own is reversible. That is, if we have
generated a new ith component xi, we could “undo” that step (“undo” in
probabilistic sense, that is). In order to undo the complete Gibbs iteration
step, the substeps have to be followed in reverse order.
One can prove that a reversible Gibbs sampler can be constructed by
randomizing the order of substeps.
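Both statements, invariance of the joint distribution and lack of detailed balance for a full sweep, can be confirmed on a two-component binary example. The pmf `f` below is a hypothetical choice with dependent components; `sweep_kernel` implements the product of full conditionals for a sweep in the fixed order i = 1, 2.

```python
from fractions import Fraction
from itertools import product

states = list(product((0, 1), repeat=2))
f = dict(zip(states, (Fraction(w, 10) for w in (1, 2, 3, 4))))  # assumed pmf


def cond(i, value, rest):
    """f(X_i = value | other component); entry i of `rest` is ignored."""
    num = f[rest[:i] + (value,) + rest[i + 1:]]
    den = sum(f[rest[:i] + (v,) + rest[i + 1:]] for v in (0, 1))
    return num / den


def sweep_kernel(x, v):
    """P(X_{n+1} = x | X_n = v) = prod_i f(x_i | x_1..x_{i-1}, v_{i+1}..v_p)."""
    prob = Fraction(1)
    for i in range(len(x)):
        mixed = x[:i] + (v[i],) + v[i + 1:]  # new values before i, old after i
        prob *= cond(i, x[i], mixed)
    return prob


invariant = all(sum(f[v] * sweep_kernel(x, v) for v in states) == f[x]
                for x in states)
reversible = all(f[v] * sweep_kernel(x, v) == f[x] * sweep_kernel(v, x)
                 for x in states for v in states)
print(invariant, reversible)  # True False
```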
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.82
Convergence
Under a mild assumption (positivity of $f_X(x)$), the Gibbs sampler creates a
Markov chain for which $X_n \xrightarrow{dist} X \sim f_X$
If the Gibbs sampler Markov chain is irreducible and recurrent, then for any
integrable function h(x) we have
$\frac{1}{M} \sum_{n=1}^{M} h(X_n) \xrightarrow{P} E[h(X)]$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.83
Foundations for MCMC
MCMC is used for sampling from multidimensional random variables. It has
two aspects
• Sampling proceeds through conditional probabilities/densities
• The subsequent samples are dependent→ Markov Chain
We have to make sure that
• Conditionals define the correct joint distribution in a unique way:
Hammersley-Clifford
• Convergence of the Markov chain takes over the role of the law of large numbers
– The target joint distribution is invariant under the Gibbs sampler
Markov Chain
– The chain converges to the invariant distribution
– Although convergence is a limit property, all generated samples of a
Gibbs sampler can be used in estimating the expected value of
h(X).
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.84
An example from Bayesian statistics
A Hidden Markov Random Field (HMM - Hidden Markov Model)
Suppose that we have the following graphical model for observations Y
Y  •   •   •   •   •   •
   |   |   |   |   |   |
M  •   •   •   •   •   •
   |   |   |   |   |   |
X  • − • − • − • − • − •
• We observe Y , where Yi and Yj are dependent, but conditioned on the
hidden or latent states Xi and Xj they are independent.
• The observation consists of two parts: the real signal (expression) M
and the noise Y −M . Goal: inference on fM |Y (m|y)
• The latent state is a binary label: Xi = +1 means that Mi is probably
large, Xi = −1 means that Mi is probably small.
• Large values of Mi are clustered
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.85
A formalisation of the graphical model
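Suppose $X \in \{-1, 1\}^p \sim \mathrm{Ising}(\tau, \gamma)$, that is
$P(X = x) = \frac{1}{T} \cdot \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$
with partition function $T = \sum_{x \in \{-1,1\}^p} \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$
We observe $Y_i = M_i + V_i$, with $V_i$ independent normal observational errors
with zero mean and common variance $\sigma^2$, and $M_i$ a mixture:
$M_i = \frac{1 - X_i}{2} \cdot R_i + \frac{1 + X_i}{2} \cdot S_i$ with $R_i \sim N(0, \kappa^2)$ and $S_i \sim N(0, K^2)$
and all these are independent.
The hyperparameters $\gamma, \tau, K^2, \kappa^2, \sigma^2$ are assumed to be known.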
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.86
Bayesian inference 1: posterior law of total probability
We want to know E(M |Y )
The posterior total probability is
$f_{M_i|Y}(m|y) = f_{M_i|X_i=-1;Y}(m|y) \cdot P(X_i = -1|Y = y) + f_{M_i|X_i=1;Y}(m|y) \cdot P(X_i = 1|Y = y)$
The only dependence between components of Y lies in the Hidden Gibbs/Ising random field, so
$f_{M_i|X_i=\pm 1;Y}(m|y) = f_{M_i|X_i=\pm 1;Y_i}(m|y_i)$
Filling in leads to
$f_{M_i|Y}(m|y) = f_{M_i|X_i=-1;Y_i}(m|y_i) \cdot P(X_i = -1|Y = y) + f_{M_i|X_i=1;Y_i}(m|y_i) \cdot P(X_i = 1|Y = y)$
For $X_i = -1$, $Y_i = R_i + V_i \sim N(0, \sigma^2 + \kappa^2)$.
Hence $\mathrm{cov}(M_i, Y_i) = \mathrm{cov}(R_i, R_i + V_i) = \mathrm{var}(R_i) + 0 = \kappa^2$
And from properties of the multivariate normal distribution (See slides Chapter 1, page 20) we
know
$(M_i \mid Y_i = y, X_i = -1) = (R_i \mid Y_i = y) \sim N\left(\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot y, \; \frac{\kappa^2 \sigma^2}{\kappa^2 + \sigma^2}\right)$
The same holds for $X_i = 1$, replacing $\kappa^2$ by $K^2$.
This leads to
$E(M_i \mid Y = y) = y_i \cdot \left[\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot P(X_i = -1|Y = y) + \frac{K^2}{K^2 + \sigma^2} \cdot P(X_i = 1|Y = y)\right]$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.87
Bayesian inference 2: posterior label probabilities
We still need $P(X_i = -1|Y = y)$. We compute the marginal posterior probabilities of $X_i$ from
the joint posterior: $P(X = x|Y = y)$
A Gibbs sampler for this posterior probability would draw from
$P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p}; Y = y)$
$= P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p}; Y_i = y_i)$
$= \frac{P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p}) \cdot f_{Y_i|X}(y_i \mid x_{n;1,\dots,i-1} \, x_{n,i} \, x_{n-1;i+1,\dots,p})}{f_{Y_i|X_{I \setminus \{i\}}}(y_i \mid x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p})}$
$= \frac{P(X_i = x_{n,i} \mid X_{I \setminus \{i\}} = x_{n;1,\dots,i-1} \, x_{n-1;i+1,\dots,p}) \cdot f_{Y_i|X_i}(y_i \mid x_{n,i})}{f_{Y_i|X_i}(y_i|1) \cdot P(X_i = 1) + f_{Y_i|X_i}(y_i|-1) \cdot P(X_i = -1)}$
This expression has three components
1. Conditional probabilities
$Y_i \mid X_i = -1 \sim N(0, \kappa^2 + \sigma^2)$ and $Y_i \mid X_i = 1 \sim N(0, K^2 + \sigma^2)$
2. Marginal probabilities
The prior marginal probabilities $P(X_i = 1)$ and $P(X_i = -1)$ have to be computed from
(Markov Chain) Monte Carlo sampling of the prior probability model.
3. The transition probabilities
We know (from the proof of Hammersley-Clifford)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.88
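$P(X_i = x_i \mid X_{I \setminus \{i\}} = x_{I \setminus \{i\}}) = P(X_i = x_i \mid X_{\partial i} = x_{\partial i}) = \frac{\exp\left(-\sum_{C | i \in C} H_C(x_C)\right)}{\sum_{y_i} \exp\left(-\sum_{C | i \in C} H_C(y_i \, x_{C \setminus \{i\}})\right)}$
which is in our case
$P(X_i = x_i \mid X_{i-1} = x_{i-1}, X_{i+1} = x_{i+1})$
$= \frac{\exp(-\tau (x_i x_{i-1} + x_i x_{i+1})) \cdot \exp(-\gamma x_i)}{\sum_{y_i \in \{-1,1\}} \exp(-\tau (y_i x_{i-1} + y_i x_{i+1})) \exp(-\gamma y_i)}$
$= \frac{\exp(-\tau (x_i x_{i-1} + x_i x_{i+1})) \cdot \exp(-\gamma x_i)}{\exp(-\tau (x_{i-1} + x_{i+1})) \exp(-\gamma) + \exp(+\tau (x_{i-1} + x_{i+1})) \exp(\gamma)}$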
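The conditional label probabilities are simple enough to sketch in a few lines. In the code below, `neighbour_sum` stands for the sum of the labels of the existing neighbours (one neighbour at the ends of the chain), which is an assumption about boundary handling that the slides do not spell out. Since any factor that does not depend on the candidate label cancels when normalising over the two labels, `posterior_conditional` simply multiplies the prior conditional by the normal likelihood of y_i and renormalises; all function names are illustrative.

```python
import math


def ising_conditional(x_i, neighbour_sum, tau, gamma):
    """P(X_i = x_i | neighbours) for the chain Ising prior with labels -1, +1."""
    def weight(label):
        return math.exp(-tau * label * neighbour_sum - gamma * label)
    return weight(x_i) / (weight(-1) + weight(1))


def normal_pdf(y, var):
    """Density of N(0, var) at y."""
    return math.exp(-0.5 * y * y / var) / math.sqrt(2.0 * math.pi * var)


def posterior_conditional(x_i, neighbour_sum, y_i, tau, gamma, kappa2, K2, sigma2):
    """P(X_i = x_i | neighbours, Y_i = y_i), normalised over the two labels."""
    def weight(label):
        var = (K2 if label == 1 else kappa2) + sigma2   # Y_i | X_i is N(0, var)
        return ising_conditional(label, neighbour_sum, tau, gamma) * normal_pdf(y_i, var)
    return weight(x_i) / (weight(-1) + weight(1))


# With tau < 0, two neighbours labelled +1 make the label +1 more likely.
print(ising_conditional(1, 2, tau=-1.0, gamma=0.0) > 0.5)   # True
# A large observation favours the large-variance label +1 (here K2 > kappa2).
print(posterior_conditional(1, 0, 5.0, 0.0, 0.0, 0.1, 10.0, 1.0) > 0.5)  # True
```

A Gibbs sweep over the labels would draw each X_i as +1 with probability `posterior_conditional(1, ...)` and as −1 otherwise.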
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.89
2.3.2 Metropolis-Hastings sampler
Gibbs-sampler
1. Based on conditional probabilities in a multidimensional random vector ⇒ Markov random field
2. If vector components are highly correlated, conditional sampling leads to
values that are close to old ones: slow move through range of possible
values, hence slow convergence
Metropolis-Hastings sampler
1. Local update of previous sample: Markov Chain of samples (like Gibbs
sampler)
2. Based on joint probabilities ⇒ Gibbs random field
3. One or more dimensions (↔ Gibbs sampler is always for random
vectors)
4. Uses rejection sampling: new sample has to be accepted
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.90
A proposal/transition distribution
Given a state $X_n$, a possible new state $X_{n'}$ is generated from a distribution
$q(x \mid X_n = x_n)$
In principle, any proposal distribution can be used. It should be easy to
sample from and to evaluate. It typically describes local updates.
The new state is accepted, i.e., $X_{n+1} = X_{n'}$, if and only if
$U \le \frac{f_X(X_{n'}) \cdot q(X_n \mid X_{n'})}{f_X(X_n) \cdot q(X_{n'} \mid X_n)}$
where $U \sim$ uniform[0, 1]. Otherwise the chain stays where it is: $X_{n+1} = X_n$.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.91
The acceptance probability
Given a proposal $X_{n'} = x_{n'}$ the probability that it is accepted equals
$\alpha(x_{n'}; x_n) = \min\left(1, \frac{f_X(x_{n'}) \cdot q(x_n \mid x_{n'})}{f_X(x_n) \cdot q(x_{n'} \mid x_n)}\right)$
Remark If the distribution has the form $f_X(x) = \frac{1}{Z} \exp[-H(x)]$ then the
acceptance probability does not depend on Z. Often Z is very hard to find
(integration/summation over all possible configurations).
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.92
Transition probabilities from one state to the next
The transition probability becomes (case of discrete states)
• For $x_{n+1} \neq x_n$:
$P(X_{n+1} = x_{n+1} \mid X_n = x_n)$
$= P(X_{n'} = x_{n+1} \mid X_n = x_n) \cdot P(x_{n+1} \text{ accepted} \mid X_n = x_n, X_{n'} = x_{n+1})$
$= q(x_{n+1} \mid x_n) \cdot \alpha(x_{n+1}; x_n)$
• The probability that the proposed state (whatever the proposal is) will be
rejected, given that the current state is $X_n = x_n$, equals
$r(x_n) := P(\text{rejected} \mid X_n = x_n)$
$= \sum_x P(X_{n'} = x \mid X_n = x_n) \cdot P(x \text{ rejected} \mid X_n = x_n, X_{n'} = x)$
$= \sum_x q(x \mid x_n) \cdot (1 - \alpha(x; x_n))$
$= 1 - \sum_x q(x \mid x_n) \cdot \alpha(x; x_n)$
$=: 1 - a(x_n)$
• For $x_{n+1} = x_n$ we obtain
$P(X_{n+1} = x_n \mid X_n = x_n) = q(x_n \mid x_n) \cdot \alpha(x_n; x_n) + (1 - a(x_n))$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.93
Equilibrium distribution
The objective distribution $f_X(x)$ is an invariant distribution of a
Metropolis-Hastings sampler
Proof
Denote the transition probabilities $P_{xy} = P(X_{n+1} = y \mid X_n = x)$
It holds that $P_{xy} = q(y|x) \cdot \alpha(y;x) + \delta(x,y) \cdot (1 - a(x))$
where $\delta(x,y)$ is the Kronecker-delta.
We consider $P_{xy} \cdot f_X(x) = q(y|x) \cdot \alpha(y;x) \cdot f_X(x) + \delta(x,y) \cdot (1 - a(x)) \cdot f_X(x)$
We have
$\alpha(y;x) \cdot q(y|x) \cdot f_X(x) = \min\left(1, \frac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) \cdot q(y|x) \cdot f_X(x)$
$= \min\left(q(y|x) \cdot f_X(x), \; f_X(y) \cdot q(x|y)\right)$
$= \min\left(\frac{f_X(x) \cdot q(y|x)}{f_X(y) \cdot q(x|y)}, 1\right) \cdot q(x|y) \cdot f_X(y)$
$= \alpha(x;y) \cdot q(x|y) \cdot f_X(y)$
The Kronecker-delta term is only active if x = y, so formally, one can always
write
$\delta(x,y) \cdot (1 - a(x)) \cdot f_X(x) = \delta(y,x) \cdot (1 - a(y)) \cdot f_X(y)$
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.95
We may conclude that Pxy · fX(x) = Pyx · fX(y)
This is a detailed balance equation. Not only is the objective distribution
invariant under the Metropolis-Hastings sampler, but also
The Metropolis-Hastings sampler is reversible
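The detailed balance equation can be verified exactly on a small state space. The sketch below assumes a hypothetical three-state target `f` and an asymmetric, strictly positive proposal matrix `q` (with `q[x][y]` standing for q(y|x)); it assembles the full Metropolis-Hastings transition matrix, including the rejection mass on the diagonal, using exact fractions.

```python
from fractions import Fraction as F

f = [F(1, 6), F(2, 6), F(3, 6)]            # assumed target pmf on states 0, 1, 2
q = [[F(1, 2), F(1, 4), F(1, 4)],           # q[x][y] = q(y | x); rows sum to one
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 6), F(1, 3), F(1, 2)]]
n = len(f)


def alpha(y, x):
    """Acceptance probability of proposal y from current state x."""
    return min(F(1), f[y] * q[y][x] / (f[x] * q[x][y]))


def transition(x, y):
    """P_xy = q(y|x) * alpha(y;x) + delta(x,y) * (1 - a(x))."""
    move = q[x][y] * alpha(y, x)
    if x != y:
        return move
    a_x = sum(q[x][z] * alpha(z, x) for z in range(n))  # total acceptance mass
    return move + (1 - a_x)


rows_sum_to_one = all(sum(transition(x, y) for y in range(n)) == 1
                      for x in range(n))
detailed_balance = all(f[x] * transition(x, y) == f[y] * transition(y, x)
                       for x in range(n) for y in range(n))
invariant = all(sum(f[x] * transition(x, y) for x in range(n)) == f[y]
                for y in range(n))
print(rows_sum_to_one, detailed_balance, invariant)  # True True True
```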
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.96
A special case: the original Metropolis sampler
Suppose that the proposal distribution is symmetric in the sense that
q(x|y) = q(y|x)
This is realized by choosing Y = X + η where η has a zero-mean symmetric
distribution g(η), hence q(y|x) = g(y − x).
Then the acceptance probability for a proposal y given a current state x
becomes
$\alpha(y;x) = \min\left(1, \frac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) = \min\left(1, \frac{f_X(y)}{f_X(x)}\right)$
This was the original procedure, proposed by Metropolis. It was later refined
by Hastings for arbitrary proposal distributions.
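A minimal random-walk Metropolis sketch, assuming a one-dimensional target known only through its log-density up to a constant (here the standard normal, log f(x) = −x²/2 + const, chosen purely for illustration) and a Gaussian increment η. The function name, step size and seed are illustrative assumptions. Working with log-ratios avoids overflow and shows that the normalising constant never enters.

```python
import math
import random


def metropolis(log_target, x0, step_sd, n, seed=0):
    """Random-walk Metropolis with symmetric Gaussian proposals."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(n):
        y = x + rng.gauss(0.0, step_sd)             # proposal Y = X + eta
        log_ratio = log_target(y) - log_target(x)   # log of f(y) / f(x)
        # Accept with probability min(1, f(y) / f(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
            accepted += 1
        chain.append(x)                             # a rejection repeats the old state
    return chain, accepted / n


chain, rate = metropolis(lambda x: -0.5 * x * x, x0=0.0, step_sd=2.4, n=50000, seed=1)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(round(mean, 2), round(var, 2), round(rate, 2))
```

The printed mean and variance should be close to 0 and 1 for this target; the acceptance rate depends on `step_sd`.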
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.97
An example of a local update
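Consider $q(y|x) = f_{X_i \mid X_{I \setminus \{i\}}}(y_i \mid x_{I \setminus \{i\}})$ if $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$ (and otherwise $q(y|x) = 0$)
Configurations with $y_{I \setminus \{i\}} \neq x_{I \setminus \{i\}}$ have probability (density) zero.
It then holds
$f_X(x) \cdot q(y|x) = f_X(x) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(y_i \mid x_{I \setminus \{i\}})$
$= f_X(x) \cdot \frac{f_X(x_{I \setminus \{i\}} y_i)}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})}$
and, keeping in mind that $y_{I \setminus \{i\}} = x_{I \setminus \{i\}}$,
$f_X(y) \cdot q(x|y) = f_X(y) \cdot f_{X_i \mid X_{I \setminus \{i\}}}(x_i \mid y_{I \setminus \{i\}})$
$= f_X(y) \cdot \frac{f_X(y_{I \setminus \{i\}} x_i)}{f_{X_{I \setminus \{i\}}}(y_{I \setminus \{i\}})}$
$= f_X(y_i x_{I \setminus \{i\}}) \cdot \frac{f_X(x)}{f_{X_{I \setminus \{i\}}}(x_{I \setminus \{i\}})}$
$= f_X(x) \cdot q(y|x)$
From this, it follows
1. The acceptance probability $\alpha(y;x) = \min\left(1, \frac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) = 1$
2. The process is reversible
In fact, this is one step of a Gibbs sampler.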
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.98
1. In a general Metropolis-Hastings sampler, there is a free proposal, which is evaluated: the
evaluation uses the joint distribution
2. In the specific case of the Gibbs sampler, the proposal uses the conditional distribution and
there is no evaluation afterwards (so no joint distribution)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.99
Convergence
We have seen that the objective distribution is invariant under a
Metropolis-Hastings sampler.
This is not enough for good convergence.
Indeed, the case of one step of a Gibbs sampler illustrates that updates may
be too local: in this case, only one component of the vector is subject to
possible change. That implies that many states x are unreachable. The
Markov Chain is then reducible.
Irreducibility is obtained if q(y|x) > 0 for all pairs (x,y).
As this condition is sometimes too restrictive for a single proposal distribution,
one may consider a combination of different proposal distributions, e.g., a
sequence of one-component-at-a-time Metropolis-Hastings updates (of which
the Gibbs sampler is an example).
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.100
Choice of the proposal distribution
The speed of convergence in a Metropolis-Hastings sampler depends on the correlation between
subsequent samples.
High correlation→ slow convergence
Subsequent samples should be as independent as possible. (Fully independent samples are
optimal, in the sense that the limiting distribution is reached instantaneously, but they are often
difficult to realize or difficult to sample from)
Inter-sample dependence is governed by two competing objectives
• Acceptance probability: low acceptance probability means high probability that two
subsequent samples are identical, hence, high correlation.
Acceptance probability is enhanced by proposal distributions with small variance, i.e., a
distribution that favours very local updates.
• Correlation between current state and proposed state. This source of correlation is
reduced by proposals that favour large updates.
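The trade-off between the two objectives can be observed in a small experiment. The sketch below assumes a standard normal target and Gaussian random-walk proposals with three hypothetical step sizes; it reports the acceptance rate and the lag-1 autocorrelation of the chain. The specific step sizes, chain length and function names are illustrative choices.

```python
import math
import random


def rw_metropolis(step_sd, n=20000, seed=0):
    """Random-walk Metropolis for a standard normal target (assumed example)."""
    rng = random.Random(seed)
    x, chain, accepted = 0.0, [], 0
    for _ in range(n):
        y = x + rng.gauss(0.0, step_sd)
        log_ratio = 0.5 * (x * x - y * y)           # log f(y) - log f(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n


def lag1_autocorrelation(chain):
    """Sample correlation between consecutive states of the chain."""
    m = sum(chain) / len(chain)
    c0 = sum((v - m) ** 2 for v in chain)
    c1 = sum((a - m) * (b - m) for a, b in zip(chain, chain[1:]))
    return c1 / c0


results = {}
for step_sd in (0.1, 2.4, 50.0):
    chain, rate = rw_metropolis(step_sd)
    results[step_sd] = (rate, lag1_autocorrelation(chain))
    print(step_sd, round(rate, 2), round(results[step_sd][1], 2))
```

Very small steps are almost always accepted but barely move, very large steps are almost always rejected, and both extremes leave consecutive samples highly correlated; an intermediate step size gives the lowest autocorrelation of the three.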
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 3: Monte Carlo methods p.101