Talk at the 2013 WSC/ISI conference in Hong Kong, August 26, 2013
DESCRIPTION
These are the slides for my conference talk at the 2013 WSC, in the "Jacob Bernoulli's 'Ars Conjectandi' and the emergence of probability" session organised by Adam Jakubowski.
TRANSCRIPT
An [under]view of Monte Carlo methods, from importance sampling to MCMC, to ABC
(& kudos to Bernoulli)
Christian P. Robert, Université Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
Outline
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
Bernoulli as founding father of Monte Carlo methods
The weak law of large numbers (or Bernoulli's [Golden] theorem) provides the justification for Monte Carlo approximations:
if x_1, . . . , x_n are i.i.d. rv's with density f,
lim_{n→∞} {h(x_1) + . . . + h(x_n)}/n = ∫_X h(x) f(x) dx
Stigler's Law of Eponymy: Cardano (1501–1576) first stated the result
Bernoulli as founding father of Monte Carlo methods
...and indeed
{h(x_1) + . . . + h(x_n)}/n converges to I = ∫_X h(x) f(x) dx
...meaning that, provided we can simulate x_i ∼ f(·) long and fast "enough", the empirical mean will be a good "enough" approximation to I
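The Monte Carlo approximation above can be sketched in a few lines. This is a minimal illustration with an arbitrary choice of h and f (an assumption for the example): h(x) = cos(x) and f the standard normal density, for which the integral is known exactly, exp(-1/2).

```python
import math
import random

random.seed(42)

# Approximate I = E[h(X)] for h(x) = cos(x) and X ~ N(0, 1);
# the exact value is exp(-1/2) ~ 0.6065.
n = 100_000
estimate = sum(math.cos(random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(estimate)  # close to exp(-0.5) for large n
```

The standard error decreases as 1/√n, so "long enough" here means n large relative to the variance of h(X).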
Early implementations of the LLN
I While Jakob Bernoulli himself apparently did not engage in simulation,
I Buffon (1707–1788) resorted to a (not-yet-Monte-Carlo) experiment in 1735 to estimate the value of the Saint Petersburg game (even though he did not perform a similar experiment for estimating π)
I De Forest (1834–1888) found the median of a log-Cauchy distribution, using normal simulations approximated to the second digit (in 1876)
I followed closely by the ubiquitous Galton, using "normal" dice in 1890, after developing the Quincunx, used both for checking the CLT and simulating from a posterior distribution as early as 1877
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
Importance Sampling
When focussing on integral approximation, a very loose principle, in that any proposal distribution with pdf q(·) leads to the alternative representation
I = ∫_X h(x) {f/q}(x) q(x) dx
Principle of importance
Generate an iid sample x_1, . . . , x_n ∼ q(·) and estimate I by
Î_IS = n^{-1} ∑_{i=1}^n h(x_i) {f/q}(x_i) .
...provided q is positive on the right set
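The importance identity can be sketched as follows, with illustrative choices that are assumptions of the example: target f = N(0, 1), heavier-tailed proposal q = N(0, 2), and h(x) = x².

```python
import math
import random

random.seed(1)

def phi(x, sd):
    # normal N(0, sd^2) density
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Target f = N(0,1); proposal q = N(0,2) has heavier tails, so f/q is bounded.
n = 100_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 2.0)        # x ~ q
    w = phi(x, 1.0) / phi(x, 2.0)     # importance weight {f/q}(x)
    total += (x * x) * w              # h(x) {f/q}(x)
i_is = total / n
print(i_is)  # approximates E_f[X^2] = 1
```

Because q dominates f in the tails, the weights stay bounded and the estimator is well behaved; the next slide shows what happens when this fails.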
things aren't all rosy...
The LLN is not sufficient to justify Monte Carlo methods: if
n^{-1} ∑_{i=1}^n h(x_i) {f/q}(x_i)
has an infinite variance, the estimator Î_IS is useless.
[Figure: importance sampling estimation of P(2 ≤ Z ≤ 6), where Z is Cauchy and the importance distribution is normal, compared with the exact value, 0.095]
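The failure mode in the figure can be reproduced directly (a sketch of the slide's own example): estimating P(2 ≤ Z ≤ 6) for a standard Cauchy Z with a N(0, 1) importance distribution, where the weight f/q explodes in the tails.

```python
import math
import random

random.seed(7)

def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Exact value: (arctan(6) - arctan(2)) / pi ~ 0.095
exact = (math.atan(6.0) - math.atan(2.0)) / math.pi

# The weight f/q at the right end of the interval is already enormous,
# which is the source of the infinite variance.
max_weight = cauchy_pdf(6.0) / normal_pdf(6.0)

n = 100_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    if 2.0 <= x <= 6.0:               # h is the indicator of [2, 6]
        total += cauchy_pdf(x) / normal_pdf(x)
estimate = total / n
print(exact, max_weight, estimate)
```

Typical runs underestimate badly, with occasional huge jumps when a rare draw lands far in the tail: the hallmark of an infinite-variance importance sampler.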
The harmonic mean estimator
Bayesian posterior distribution defined as
π(θ|x) = π(θ) L(θ|x) / m(x)
When θ_t ∼ π(θ|x),
(1/T) ∑_{t=1}^T 1/L(θ_t|x)
is an unbiased estimator of 1/m(x)
[Gelfand & Dey, 1994; Newton & Raftery, 1994]
Highly hazardous material: most often leads to an infinite variance!!!
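The identity can be checked on a toy conjugate model where m(x) is available exactly. The model below is an assumption for illustration, tuned (via a flat-ish likelihood) so that the variance stays finite; sharpening the likelihood makes it infinite.

```python
import math
import random

random.seed(3)

# Toy conjugate model (illustrative assumption): one observation
# x | theta ~ N(theta, 4), prior theta ~ N(0, 1), so the marginal
# likelihood is known exactly: m(x) = N(x; 0, 5).
x = 1.0
m_exact = math.exp(-x * x / 10.0) / math.sqrt(10.0 * math.pi)

def likelihood(theta):
    return math.exp(-0.5 * (x - theta) ** 2 / 4.0) / math.sqrt(8.0 * math.pi)

# The posterior is N(x/5, 4/5); simulate from it directly and average 1/L.
T = 200_000
post_mean, post_sd = x / 5.0, math.sqrt(0.8)
s = sum(1.0 / likelihood(random.gauss(post_mean, post_sd)) for _ in range(T))
inv_m_hat = s / T          # unbiased for 1/m(x)
m_hat = 1.0 / inv_m_hat    # harmonic mean estimate of m(x)
print(m_exact, m_hat)
# Caveat: with a sharper likelihood, 1/L has infinite variance under the
# posterior and the harmonic mean estimate becomes useless.
```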
“The Worst Monte Carlo Method Ever”
"The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.
The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood."
[Radford Neal’s blog, Aug. 23, 2008]
Comparison with regular importance sampling
Harmonic mean: constraint opposed to the usual importance sampling constraints: the proposal ϕ(·) must have lighter (rather than fatter) tails than π(·)L(·) for the approximation
1 / [ (1/T) ∑_{t=1}^T ϕ(θ_t) / {π(θ_t) L(θ_t)} ] ,  θ_t ∼ π(θ|x) ,
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov's kernel) for ϕ
HPD indicator as ϕ
Use the convex hull of the MCMC simulations (θ_t)_{t=1,...,T} corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:
ϕ(θ) = (10/T) ∑_{t∈HPD} I_{d(θ,θ_t) ≤ ε}
[X & Wraith, 2009]
Bayesian computing (R)evolution
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
computational jam
In the 1970's and early 1980's, the theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools.
I restriction to conjugate priors
I limited complexity of models
I small sample sizes
The field was desperately in need of a new computing paradigm!
[X & Casella, STS, 2012]
MCMC as in Markov Chain Monte Carlo
Notion that i.i.d. simulation is definitely not necessary: all that matters is the ergodic theorem.
Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986).
Reasons:
I lack of computing machinery
I lack of background on Markov chains
I lack of trust in the practicality of the method
pre-Gibbs/pre-Hastings era
In the early 1970's, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]
"What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?"
[Besag, 1972]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
The joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e.,
(F) ≡ (G)
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Under the positivity condition, the joint distribution g satisfies
g(y_1, . . . , y_p) ∝ ∏_{j=1}^p [ g_{ℓ_j}(y_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) / g_{ℓ_j}(y′_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) ]
for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y.
[Cressie, 1993; Lauritzen, 1996]
Clicking in
After Peskun (1973), MCMC lay mostly dormant in the mainstream statistical world for about 10 years, until several papers/books highlighted its usefulness in specific settings:
I Geman and Geman (1984)
I Besag (1986)
I Strauss (1986)
I Ripley (Stochastic Simulation, 1987)
I Tanner and Wong (1987)
I Younes (1988)
[Re-]Enters the Gibbs sampler
Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field, without completion.
Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein, and ergodicity is proven on the collection of global maxima
Removing the jam
In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:
I linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;Wang & al., 1993, 1994)
I generalized linear mixed models (Albert & Chib, 1993)
I mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994;Escobar & West, 1993)
I changepoint analysis (Carlin & al., 1992)
I point processes (Grenander & Møller, 1994)
I &tc
Removing the jam
In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:
I genomics (Stephens & Smith, 1993; Lawrence & al., 1993;Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,2000)
I ecology (George & X, 1992)
I variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000)
I spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
I longitudinal studies (Lange & al., 1992)
I &tc
MCMC and beyond
I reversible jump MCMC, which considerably impacted Bayesian model choice (Green, 1995)
I adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal,2009)
I exact approximations to targets (Tanner & Wong, 1987; Beaumont,2003; Andrieu & Roberts, 2009)
I comp'al stats catching up with comp'al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)
I sequential Monte Carlo (SMC) for non-sequential problems (Chopin,2002; Neal, 2001; Del Moral et al 2006)
I retrospective sampling
I intractability: EP – GIMH – PMCMC – SMC2 – INLA
I QMC[MC] (Owen, 2011)
Particles
Iterating/sequential importance sampling is about as old as MonteCarlo methods themselves!
[Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50's, with self-avoiding random walks, and in signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle filter”.
pMC & pMCMC
I Recycling of past simulations is legitimate to build better importance sampling functions, as in population Monte Carlo
[Iba, 2000; Cappe et al, 2004; Del Moral et al., 2007]
I synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel p̂_θ(y_{1:T}) in state space models p(x_{1:T}) p(y_{1:T}|x_{1:T})
I importance sampling on discretely observed diffusions[Beskos et al., 2006; Fearnhead et al., 2008, 2010]
Metropolis-Hastings revisited
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited: Reinterpretation and Rao-Blackwellisation; Russian roulette
Approximate Bayesian computation (ABC)
Metropolis-Hastings algorithm
1. We wish to approximate
I = ∫ h(x)π(x) dx / ∫ π(x) dx = ∫ h(x) π̃(x) dx ,
with π̃ the normalised version of π.
2. π(x) is known, but not ∫ π(x) dx.
3. Approximate I with δ = (1/n) ∑_{t=1}^n h(x^(t)), where (x^(t)) is a Markov chain with limiting distribution π̃.
4. Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.
Metropolis-Hastings algorithm
Suppose that x^(t) is drawn.
1. Simulate y_t ∼ q(·|x^(t)).
2. Set x^(t+1) = y_t with probability
α(x^(t), y_t) = min{ 1, [π(y_t)/π(x^(t))] [q(x^(t)|y_t)/q(y_t|x^(t))] }
Otherwise, set x^(t+1) = x^(t).
3. α is such that the detailed balance equation is satisfied:
π(x) q(y|x) α(x, y) = π(y) q(x|y) α(y, x) ,
so that π is the stationary distribution of (x^(t)).
I The accepted candidates are simulated with the rejection algorithm.
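The two steps above can be sketched as a random-walk Metropolis-Hastings sampler. The target and proposal scale below are illustrative assumptions: an unnormalised N(0, 1) target, for which the symmetric proposal makes the q-ratio cancel.

```python
import math
import random

random.seed(0)

# Random-walk Metropolis-Hastings: pi is known only up to a constant
# (here an unnormalised N(0,1) log-density).
def log_target(x):
    return -0.5 * x * x

n, scale = 50_000, 2.5
chain = [0.0]
for _ in range(n - 1):
    x = chain[-1]
    y = x + random.uniform(-scale, scale)          # symmetric proposal q
    if math.log(random.random()) < log_target(y) - log_target(x):
        chain.append(y)                             # accept y_t
    else:
        chain.append(x)                             # reject: repeat x^(t)
mean = sum(chain) / n
var = sum(c * c for c in chain) / n - mean ** 2
print(mean, var)  # ergodic averages: near 0 and 1
```

Note that rejected proposals repeat the current value, which is exactly the multiplicity n_i of the accepted candidates discussed on the next slides.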
Some properties of the HM algorithm
Alternative representation of the estimator δ is
δ = (1/n) ∑_{t=1}^n h(x^(t)) = (1/n) ∑_{i=1}^{M_n} n_i h(z_i) ,
where
I the z_i's are the accepted y_j's,
I M_n is the number of accepted y_j's till time n,
I n_i is the number of times z_i appears in the sequence (x^(t))_t.
The "accepted candidates"
q̃(·|z_i) = α(z_i, ·) q(·|z_i) / p(z_i) ≤ q(·|z_i) / p(z_i) ,
where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy. To simulate from q̃(·|z_i):
1. Propose a candidate y ∼ q(·|z_i)
2. Accept with probability
q̃(y|z_i) / {q(y|z_i)/p(z_i)} = α(z_i, y)
Otherwise, reject it and start again.
I this is the transition of the HM algorithm. The transition kernel q̃ enjoys π̃ as a stationary distribution:
π̃(x) q̃(y|x) = π̃(y) q̃(x|y) ,
"accepted" Markov chain
Lemma (Douc & X., AoS, 2011)
The sequence (z_i, n_i) satisfies
1. (z_i, n_i)_i is a Markov chain;
2. z_{i+1} and n_i are independent given z_i;
3. n_i is distributed as a geometric random variable with probability parameter
p(z_i) := ∫ α(z_i, y) q(y|z_i) dy ; (1)
4. (z_i)_i is a Markov chain with transition kernel Q̃(z, dy) = q̃(y|z) dy and stationary distribution π̃ such that
q̃(·|z) ∝ α(z, ·) q(·|z) and π̃(·) ∝ π(·) p(·) .
Importance sampling perspective
1. A natural idea:
δ* = (1/n) ∑_{i=1}^{M_n} h(z_i)/p(z_i) ,
or, in self-normalised form,
δ* ≃ [∑_{i=1}^{M_n} h(z_i)/p(z_i)] / [∑_{i=1}^{M_n} 1/p(z_i)] = [∑_{i=1}^{M_n} {π(z_i)/π̃(z_i)} h(z_i)] / [∑_{i=1}^{M_n} π(z_i)/π̃(z_i)] .
2. But p is not available in closed form.
3. The geometric n_i is the replacement, an obvious solution that is used in the original Metropolis-Hastings estimate since E[n_i] = 1/p(z_i).
The Bernoulli factory
The crude estimate of 1/p(z_i),
n_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)} ,
can be improved:
Lemma (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity
ξ_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}
is an unbiased estimator of 1/p(z_i) whose variance, conditional on z_i, is lower than the conditional variance of n_i, {1 − p(z_i)}/p²(z_i).
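The two estimators of 1/p(z_i) can be compared empirically. The acceptance probability below is an illustrative assumption: α(y) = exp(−y) with y ∼ Exp(1), so that p = E[α(Y)] = 1/2 and 1/p = 2.

```python
import math
import random

random.seed(5)

# Compare the geometric count n and the Rao-Blackwellised xi as unbiased
# estimators of 1/p. Toy setting (an assumption): alpha(y) = exp(-y),
# y ~ Exp(1), hence p = E[alpha(Y)] = 1/2.
R = 100_000
ns, xis = [], []
for _ in range(R):
    # n: number of proposals until acceptance (u < alpha)
    n = 1
    while random.random() >= math.exp(-random.expovariate(1.0)):
        n += 1
    ns.append(n)
    # xi = 1 + sum_j prod_{l<=j} (1 - alpha(y_l)), truncated when negligible
    xi, prod = 1.0, 1.0
    while prod > 1e-12:
        prod *= 1.0 - math.exp(-random.expovariate(1.0))
        xi += prod
    xis.append(xi)

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / len(v)

mean_n, var_n = mean_var(ns)
mean_xi, var_xi = mean_var(xis)
print(mean_n, mean_xi, var_n, var_xi)  # both means near 2; var_xi < var_n
```

Both averages sit near 1/p = 2, while the empirical variance of ξ is markedly smaller than the geometric variance (1 − p)/p² = 2, as the lemma predicts.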
Rao-Blackwellised, for sure?
ξ_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}
1. Infinite sum, but finite with positive probability:
α(x^(t), y_t) = min{ 1, [π(y_t)/π(x^(t))] [q(x^(t)|y_t)/q(y_t|x^(t))] }
For example, take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Finite horizon k version:
ξ_i^k = 1 + ∑_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)}
which Bernoulli factory?!
Not the spice warehouse of Leon Bernoulli!
Query:
Given an algorithm delivering iid B(p) rv's, is it possible to derive an algorithm delivering iid B(f(p)) rv's when f is known and p unknown?
[von Neumann, 1951; Keane & O'Brien, 1994]
I existence (e.g., impossible for f(p) = min(2p, 1))
I condition: for some n,
min{f(p), 1 − f(p)} ≥ min{p, 1 − p}^n
I implementation (polynomial vs. exponential time)
I use of sandwiching polynomials/power series
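The simplest instance is von Neumann's (1951) procedure for f(p) = 1/2: it turns a coin of unknown bias p into an exactly fair coin. A minimal sketch (the bias value 0.3 is an arbitrary assumption):

```python
import random

random.seed(11)

def biased_coin(p=0.3):
    # stand-in for the black-box B(p) generator; p is "unknown" to the factory
    return random.random() < p

def von_neumann_fair_coin(coin):
    # von Neumann (1951): draw pairs until they differ; HT -> 1, TH -> 0.
    # Both outcomes have probability p(1-p), so the result is exactly B(1/2).
    while True:
        a, b = coin(), coin()
        if a != b:
            return 1 if a else 0

n = 100_000
heads = sum(von_neumann_fair_coin(biased_coin) for _ in range(n))
freq = heads / n
print(freq)  # close to 0.5 despite p = 0.3
```

The expected number of B(p) draws per output is 1/{p(1 − p)}, illustrating the implementation (running time) issue raised above.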
Variance improvement
Theorem (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity
ξ_i^k = 1 + ∑_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)}
is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Moreover, for k ≥ 1,
V[ξ_i^k | z_i] = {1 − p(z_i)}/p²(z_i) − [1 − {1 − 2p(z_i) + r(z_i)}^k] / [2p(z_i) − r(z_i)] × [{2 − p(z_i)}/p²(z_i)] {p(z_i) − r(z_i)} ,
where p(z_i) := ∫ α(z_i, y) q(y|z_i) dy and r(z_i) := ∫ α²(z_i, y) q(y|z_i) dy. Therefore,
V[ξ_i | z_i] ≤ V[ξ_i^k | z_i] ≤ V[ξ_i^0 | z_i] = V[n_i | z_i] .
motivation for Russian roulette
I prior π(θ), data density p(y|θ) = f(y; θ)/Z(θ) with
Z(θ) = ∫ f(x; θ) dx
intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc)
I doubly-intractable posterior follows as
π(θ|y) = p(y|θ) × π(θ) × 1/Z(y) = [f(y; θ)/Z(θ)] × π(θ) × 1/Z(y)
where Z(y) = ∫ p(y|θ) π(θ) dθ
I both Z(θ) and Z(y) are intractable, with massively different consequences
[thanks to Mark Girolami for his Russian slides!]
motivation for Russian roulette
I If Z(θ) is intractable, the Metropolis-Hastings acceptance probability
α(θ′, θ) = min{ 1, [f(y; θ′)π(θ′) / {f(y; θ)π(θ)}] × [q(θ|θ′)/q(θ′|θ)] × [Z(θ)/Z(θ′)] }
is not available
I Use instead biased approximations, e.g. pseudo-likelihoods or plug-in Z(θ′) estimates, without sacrificing exactness of MCMC
Existing solution
I Unbiased plug-in estimate
Z(θ)/Z(θ′) ≈ f(x; θ)/f(x; θ′) where x ∼ f(x; θ′)/Z(θ′)
[Møller et al, Bka, 2006; Murray et al, 2006]
I auxiliary variable method
I removes Z(θ)/Z(θ′) from the picture
I requires simulations from the model (e.g., via perfect sampling)
Exact approximate methods
Pseudo-marginal construction that allows for the use of unbiased, positive estimates π̂ of the target in the acceptance probability
α(θ′, θ) = min{ 1, [π̂(θ′|y)/π̂(θ|y)] × [q(θ|θ′)/q(θ′|θ)] }
[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al, 2012]
The transition kernel has an invariant distribution with the exact target density π(θ|y)
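The pseudo-marginal construction can be sketched on a toy problem. Everything below is an illustrative assumption: the unnormalised target is N(0, 1), and the "unbiased positive estimate" is the exact density corrupted by lognormal noise with mean exactly 1.

```python
import math
import random

random.seed(2)

def noisy_target(theta):
    # unbiased positive estimate of an unnormalised N(0,1) density:
    # multiplied by lognormal noise with mean exactly 1 (illustrative choice)
    sigma = 0.5
    noise = math.exp(random.gauss(-0.5 * sigma ** 2, sigma))
    return math.exp(-0.5 * theta * theta) * noise

n, scale = 100_000, 2.5
theta, est = 0.0, noisy_target(0.0)
chain = []
for _ in range(n):
    prop = theta + random.uniform(-scale, scale)   # symmetric proposal
    prop_est = noisy_target(prop)
    # key point: the estimate attached to the current state is stored and
    # reused, never refreshed; this is what keeps the exact target invariant
    if random.random() < prop_est / est:
        theta, est = prop, prop_est
    chain.append(theta)
mean = sum(chain) / n
print(mean)  # ergodic average near 0
```

Refreshing `est` at every iteration would break exactness; reusing it embeds the noise as an auxiliary variable whose marginal is the true target.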
Infinite series estimator
I For each (θ, y), construct rv's {V_θ^(j), j ≥ 0} such that
π̂(θ, {V_θ^(j)}|y) := ∑_{j=0}^∞ V_θ^(j)
is a.s. finite with finite expectation
E[π̂(θ, {V_θ^(j)}|y)] = π(θ|y)
I Introduce a random stopping time τ_θ such that, with ξ := (τ_θ, {V_θ^(j), 0 ≤ j ≤ τ_θ}), the estimate
π̂(θ, ξ|y) := ∑_{j=0}^{τ_θ} V_θ^(j)
satisfies
E(π̂(θ, ξ|y) | {V_θ^(j), j ≥ 0}) = π̂(θ, {V_θ^(j)}|y)
I Warning: the unbiased estimate π̂(θ, ξ|y) obtained from this series construction carries no general guarantee of positivity
Russian roulette
Method that requires unbiased truncation of a series
S(θ) = ∑_{i=0}^∞ φ_i(θ)
Russian roulette is employed extensively in the simulation of neutron scattering and in computer graphics:
I Assign probabilities {q_j, j ≥ 1}, q_j ∈ (0, 1], and generate U(0, 1) i.i.d. rv's {U_j, j ≥ 1}
I Find the first time k ≥ 1 such that U_k ≥ q_k
I Russian roulette estimate of S(θ) is
Ŝ(θ) = ∑_{j=0}^k [ φ_j(θ) / ∏_{i=1}^{j−1} q_i ] ,
I If lim_{n→∞} ∏_{j=1}^n q_j = 0, Russian roulette terminates with probability one
I E{Ŝ(θ)} = S(θ)
I variance finite under certain known conditions
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
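The roulette truncation can be checked on a series with a known sum. The choices below are assumptions of the example: a geometric series φ_j = r^j with S = 1/(1 − r), and constant survival probabilities q_j = q (so ∏ q_j → 0 and the roulette stops a.s.).

```python
import random

random.seed(9)

# Russian roulette truncation of S = sum_{j>=0} r^j = 1/(1-r)
r, q = 0.5, 0.8
true_S = 1.0 / (1.0 - r)  # = 2

def roulette_estimate():
    est, weight = 1.0, 1.0          # term j = 0, empty product in denominator
    j = 0
    while True:
        j += 1
        if random.random() >= q:    # first U_k >= q_k: include term k and stop
            return est + r ** j / weight
        est += r ** j / weight      # term j survives, weighted by 1/prod q_i
        weight *= q                 # prod_{i=1}^{j-1} q_i for the next term

R = 100_000
mean = sum(roulette_estimate() for _ in range(R)) / R
print(mean)  # unbiased: near 2
```

Term j is included exactly when the first j − 1 spins survive, an event of probability ∏_{i=1}^{j−1} q_i, so dividing by that product restores the expectation term by term.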
towards ever more complexity
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
New challenges
Novel statistical issues that force a different Bayesian answer:
I very large datasets
I complex or unknown dependence structures with maybe p ≫ n
I multiple and involved random effects
I missing data structures containing most of the information
I sequential structures involving most of the above
New paradigm?
"Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization."
[Lange et al., ISR, 2013]
New paradigm?
I sad reality constraint that size does matter
I focus on much smaller dimensions and on sparse summaries
I many (fast if non-Bayesian) ways of producing those summaries
I Bayesian inference can kick in almost automatically at this stage
Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y_1, . . . , y_n|θ)
is out of reach!
Empirical approximations to the original Bayesian inference problem:
I Degrading the data precision down to a tolerance ε
I Replacing the likelihood with a non-parametric approximation
I Summarising/replacing the data with insufficient statistics
ABC methodology
Bayesian setting: target is π(θ) f(x|θ)
When the likelihood f(x|θ) is not in closed form, a likelihood-free rejection technique:
Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
ABC algorithm
In most implementations, a degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
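The rejection sampler above can be sketched on a toy model where the exact posterior is known. The model is an assumption for illustration: y | θ ∼ N(θ, 1), prior θ ∼ N(0, 1), one observation y = 1, so the exact posterior is N(y/2, 1/2) and the posterior mean tends to 0.5 as ε → 0.

```python
import random

random.seed(4)

# ABC rejection sampler: accept theta' whenever the simulated data z falls
# within epsilon of the observation y (here eta is the identity).
y, eps = 1.0, 0.1
accepted = []
for _ in range(200_000):
    theta = random.gauss(0.0, 1.0)     # theta' ~ prior pi(.)
    z = random.gauss(theta, 1.0)       # z ~ likelihood f(.|theta')
    if abs(z - y) <= eps:              # rho{eta(z), eta(y)} <= eps
        accepted.append(theta)
post_mean = sum(accepted) / len(accepted)
print(len(accepted), post_mean)  # posterior mean close to 0.5
```

Shrinking ε reduces the bias of the approximation but also the acceptance rate, which is the computational trade-off discussed on the following slides.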
Comments
I role of the distance paramount (because ε ≠ 0)
I scaling of the components of η(y) also capital
I ε matters little if "small enough"
I representative of the "curse of dimensionality"
I small is beautiful!, i.e. the data as a whole may be weakly informative for ABC
I non-parametric method at its core
ABC simulation advances
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x ’s within the vicinity of y ...
[Marjoram et al, 2003; Beaumont et al., 2009, Del Moral et al., 2012]
...or view the problem as conditional density estimation and develop techniques allowing for a larger ε [Beaumont et al., 2002; Blum & Francois, 2010; Biau et al., 2013]
...or even include ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
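One concrete way to raise the density of simulations near y is the MCMC variant of Marjoram et al. (2003): move θ by a local random walk instead of drawing it from the prior. A hedged sketch, with illustrative names and the same assumed toy normal-mean model:

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_mcmc(y_obs, log_prior, simulate, eta, rho, eps, step, theta0, n_iter):
    """ABC-MCMC in the spirit of Marjoram et al. (2003): random-walk
    proposals on theta, accepted only when the simulated summary falls
    within eps of the observed one, so the chain concentrates where
    acceptable pseudo-data are likely."""
    s_obs = eta(y_obs)
    theta = theta0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()   # local random-walk proposal
        z = simulate(prop)
        # with a symmetric proposal, the MH ratio reduces to the prior
        # ratio times the indicator of the eps-ball around eta(y)
        if rho(eta(z), s_obs) <= eps and \
           rng.uniform() < np.exp(log_prior(prop) - log_prior(theta)):
            theta = prop
        chain[i] = theta
    return chain

# toy normal-mean model, chain started at the sample mean
y = rng.normal(1.0, 1.0, size=50)
chain = abc_mcmc(
    y,
    log_prior=lambda t: -0.5 * (t / 10.0) ** 2,   # N(0, 10^2) prior
    simulate=lambda th: rng.normal(th, 1.0, size=50),
    eta=np.mean,
    rho=lambda a, b: abs(a - b),
    eps=0.2, step=0.3, theta0=float(np.mean(y)), n_iter=2000,
)
```

Started from a θ0 already inside the ε-ball (e.g. one accepted rejection draw), the chain targets the same ABC posterior as the rejection sampler, but wastes far fewer simulations on hopeless regions of the prior.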
ABC as an inference machine
Starting point is the summary statistic η(y), either chosen for computational realism or imposed by external constraints
▶ ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)
▶ inference based on ABC may be consistent or not, so it needs to be validated on its own
▶ the choice of the tolerance level ε is dictated by both computational and convergence constraints
How Bayesian is aBc..?
At best, ABC approximates π(θ|η(y)):
▶ approximation error unknown (w/o massive simulation)
▶ pragmatic or empirical Bayes (there is no other solution!)
▶ many calibration issues (tolerance, distance, statistics)
▶ the NP side should be incorporated into the whole Bayesian picture
▶ the approximation error should also be part of the Bayesian inference
Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, a convolution of the true posterior with a kernel function:
πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ,
with Kε a kernel parameterised by bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ Kε, and an acceptance probability of
Kε(ỹ − z)/M
gives draws from the posterior distribution π(θ|ỹ).
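A sketch of the noisy-ABC accept step, assuming a Gaussian kernel Kε = N(0, ε²), so that M = Kε(0) and Kε(u)/M = exp(−u²/2ε²); names and the conjugate toy model are illustrative. Accepted θ′ are exact draws from the posterior under the kernel-convolved likelihood, given the jittered observation ỹ:

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_abc(y_obs, prior_sample, simulate, eps, n_keep):
    """Noisy ABC with a Gaussian kernel: jitter the observation once,
    ytilde = y + xi with xi ~ N(0, eps^2), then accept (theta', z) with
    probability K_eps(ytilde - z)/M = exp(-(ytilde - z)^2 / (2 eps^2))."""
    y_tilde = y_obs + eps * rng.normal()   # randomised observation
    kept = []
    while len(kept) < n_keep:
        theta = prior_sample()
        z = simulate(theta)
        if rng.uniform() < np.exp(-0.5 * ((y_tilde - z) / eps) ** 2):
            kept.append(theta)
    return y_tilde, np.array(kept)

# conjugate toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), scalar y
y_t, draws = noisy_abc(
    1.0,
    prior_sample=lambda: rng.normal(0.0, 1.0),
    simulate=lambda th: rng.normal(th, 1.0),
    eps=0.5,
    n_keep=400,
)
```

In this toy case the convolved likelihood is N(θ, 1 + ε²), so the draws can be checked against the known Gaussian posterior given ỹ, mean ỹ/(2 + ε²)·(1 + ε²)... i.e. ỹ/2.25 for ε = 0.5.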
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]
▶ loss of statistical information balanced against gain in data roughening
▶ approximation error and information loss remain unknown
▶ choice of statistics induces choice of distance function towards standardisation
▶ borrowing tools from data analysis (LDA) and machine learning
[Estoup et al., ME, 2012]
Which summary?
▶ may be imposed for external/practical reasons
▶ may gather several non-B point estimates
▶ we can learn about efficient combination
▶ distance can be provided by estimation techniques
Which summary for model choice?
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’
[S. Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
Bη12(y) = ∫ π1(θ1) fη1(η(y)|θ1) dθ1 / ∫ π2(θ2) fη2(η(y)|θ2) dθ2,
is either consistent or not
[X et al., PNAS, 2012]
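A crude sketch of how Bη12 is estimated in practice: with equal prior model weights, the ratio of ABC acceptance frequencies under the two models estimates the Bayes factor based on η (function names are illustrative). The demo mirrors the Gauss-vs-Laplace setting with η the sample mean, a summary whose distribution is nearly identical under both matched-variance models, so the two acceptance rates barely differ and the estimate stays close to 1 whatever the data, an instance of the inconsistency just described:

```python
import numpy as np

rng = np.random.default_rng(3)

def abc_model_choice(y_obs, models, eta, eps, n_sim):
    """Estimate B12 based on eta as the ratio of acceptance rates
    under each (prior_sample, simulate) pair -- which, as the quote
    warns, need not match the Bayes factor of the full model."""
    s_obs = eta(y_obs)
    rates = []
    for prior_sample, simulate in models:
        hits = sum(
            abs(eta(simulate(prior_sample())) - s_obs) <= eps
            for _ in range(n_sim)
        )
        rates.append(hits / n_sim)
    return rates[0] / rates[1]

# Gauss vs Laplace location models with matched variances, eta = mean
y = rng.normal(0.0, 1.0, size=20)
models = [
    (lambda: rng.normal(0.0, 2.0),                       # prior on theta
     lambda th: rng.normal(th, 1.0, size=20)),           # Gauss model
    (lambda: rng.normal(0.0, 2.0),
     lambda th: rng.laplace(th, 1.0 / np.sqrt(2.0), size=20)),  # Laplace
]
bf = abc_model_choice(y, models, eta=np.mean, eps=0.3, n_sim=2000)
```

Replacing η with a scale-sensitive summary (e.g. a median absolute deviation) changes the picture entirely, which is exactly why the choice of η drives consistency.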
[Boxplots for the Gauss and Laplace models, n = 100]
Selecting proper summaries
Consistency only depends on the range of
µi(θ) = Ei[η(y)]
under both models against the asymptotic mean µ0 of η(y)
Theorem
If Pn belongs to one of the two models and if µ0 cannot be attained by the other one:
0 = min( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 )
  < max( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ),
then the Bayes factor Bη12 is consistent
[Marin et al., JRSS B, 2013]
Selecting proper summaries
[Boxplots under models M1 and M2]
[Marin et al., JRSS B, 2013]