Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

An [under]view of Monte Carlo methods, from importance sampling to MCMC, to ABC (& kudos to Bernoulli)
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
[email protected]


DESCRIPTION

These are the slides for my conference talk at the 2013 WSC, in the session “Jacob Bernoulli’s ‘Ars Conjectandi’ and the emergence of probability” organised by Adam Jakubowski.

TRANSCRIPT

Page 1: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

An [under]view of Monte Carlo methods, from importance sampling to MCMC, to ABC

(& kudos to Bernoulli)

Christian P. Robert
Université Paris-Dauphine, University of Warwick, & CREST, Paris

2013 WSC, Hong Kong

[email protected]

Page 2

Outline

Bernoulli, Jakob (1654–1705)

MCMC connected steps

Metropolis-Hastings revisited

Approximate Bayesian computation (ABC)

Page 3

Bernoulli as founding father of Monte Carlo methods

The weak law of large numbers (or Bernoulli’s [Golden] theorem) provides the justification for Monte Carlo approximations: if x_1, . . . , x_n are i.i.d. rv’s with density f,

lim_{n→∞} {h(x_1) + · · · + h(x_n)}/n = ∫_X h(x) f(x) dx

Stigler’s Law of Eponymy: Cardano (1501–1576) first stated the result

Page 5

Bernoulli as founding father of Monte Carlo methods

...and indeed

{h(x_1) + · · · + h(x_n)}/n converges to I = ∫_X h(x) f(x) dx

...meaning that, provided we can simulate x_i ∼ f(·) long and fast “enough”, the empirical mean will be a good “enough” approximation to I
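The LLN principle above can be sketched in a few lines (a hypothetical illustration, not from the slides; the target E[X²] = 1 for X ∼ N(0,1) is chosen only so the answer can be checked):

```python
import random

def mc_mean(h, sample, n=100_000, seed=42):
    """Plain Monte Carlo: approximate I = ∫ h(x) f(x) dx by the
    empirical mean of h over draws x_i ~ f, as justified by the LLN."""
    rng = random.Random(seed)
    return sum(h(sample(rng)) for _ in range(n)) / n

# Sanity check: E[X^2] = 1 for X ~ N(0, 1)
est = mc_mean(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
print(est)  # close to 1
```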

Page 6

Early implementations of the LLN

▶ While Jakob Bernoulli himself apparently did not engage in simulation,

▶ Buffon (1707–1788) resorted to a (not-yet-Monte-Carlo) experiment in 1735 to estimate the value of the Saint Petersburg game (even though he did not perform a similar experiment for estimating π)

[Stigler, STS, 1991; Stigler, JRSS A, 2010]

Page 7

Early implementations of the LLN

▶ While Jakob Bernoulli himself apparently did not engage in simulation,

▶ De Forest (1834–1888) found the median of a log-Cauchy distribution, using normal simulations approximated to the second digit (in 1876)

[Stigler, STS, 1991; Stigler, JRSS A, 2010]

Page 8

Early implementations of the LLN

▶ While Jakob Bernoulli himself apparently did not engage in simulation,

▶ followed closely by the ubiquitous Galton using “normal” dice in 1890, after developing the Quincunx, used both for checking the CLT and simulating from a posterior distribution as early as 1877

[Stigler, STS, 1991; Stigler, JRSS A, 2010]

Page 9

Importance Sampling

When focussing on integral approximation, a very loose principle, in that any proposal distribution with pdf q(·) leads to the alternative representation

I = ∫_X h(x) {f/q}(x) q(x) dx

Principle of importance

Generate an iid sample x_1, . . . , x_n ∼ q(·) and estimate I by

Î_IS = n⁻¹ Σ_{i=1}^n h(x_i) {f/q}(x_i).

...provided q is positive on the right set
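The principle can be sketched as follows (a hypothetical example, not from the slides: estimating the Gaussian tail probability P(X > 3), X ∼ N(0,1), with a N(3,1) proposal, so that draws actually land where h is nonzero):

```python
import math, random

def is_estimate(h, f_pdf, q_pdf, q_sample, n=200_000, seed=1):
    """Importance sampling: draw x_i ~ q and average h(x_i) f(x_i)/q(x_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += h(x) * f_pdf(x) / q_pdf(x)
    return total / n

norm_pdf = lambda x, m: math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
# P(X > 3) for X ~ N(0,1); exact value ≈ 0.00135
p_tail = is_estimate(h=lambda x: float(x > 3.0),
                     f_pdf=lambda x: norm_pdf(x, 0.0),
                     q_pdf=lambda x: norm_pdf(x, 3.0),
                     q_sample=lambda rng: rng.gauss(3.0, 1.0))
```

Here the weights f/q are bounded on the region where h is nonzero, so the estimator is very stable.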

Page 12

things aren’t all rosy...

LLN not sufficient to justify Monte Carlo methods: if

n⁻¹ Σ_{i=1}^n h(x_i) {f/q}(x_i)

has an infinite variance, the estimator Î_IS is useless

[Figure: importance sampling estimation of P(2 ≤ Z ≤ 6), where Z is Cauchy and the importance distribution is normal, compared with the exact value, 0.095.]
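The Cauchy/normal example in the figure can be reproduced as a sketch (hypothetical code, not from the slides): the normal proposal rarely visits [2, 6] and carries enormous weights f/q there, making the importance sampling estimate wildly unstable, while direct Cauchy sampling behaves.

```python
import math, random

rng = random.Random(3)
n = 100_000
cauchy_pdf = lambda x: 1.0 / (math.pi * (1.0 + x * x))
norm_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# direct Monte Carlo: simulate Z Cauchy via the inverse cdf
direct = sum(2.0 <= math.tan(math.pi * (rng.random() - 0.5)) <= 6.0
             for _ in range(n)) / n

# importance sampling with a N(0,1) proposal: weights f/q blow up on [2, 6]
total = 0.0
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    if 2.0 <= x <= 6.0:
        total += cauchy_pdf(x) / norm_pdf(x)
unstable = total / n
# exact value: (atan(6) - atan(2)) / pi ≈ 0.095
```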

Page 13

The harmonic mean estimator

Bayesian posterior distribution defined as

π(θ|x) = π(θ) L(θ|x) / m(x)

When θ_t ∼ π(θ|x),

(1/T) Σ_{t=1}^T 1/L(θ_t|x)

is an unbiased estimator of 1/m(x)

[Gelfand & Dey, 1994; Newton & Raftery, 1994]

Highly hazardous material: most often leads to an infinite variance!!!
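The hazard is easy to exhibit (a hypothetical conjugate toy, not from the slides): with x ∼ N(θ, 1) and θ ∼ N(0, 1), the marginal m(x) = N(x; 0, 2) is known in closed form and the posterior is N(x/2, 1/2); the harmonic mean estimator targets 1/m(x) but its variance is infinite in this very setting.

```python
import math, random

rng = random.Random(0)
x = 1.0
m_x = math.exp(-x * x / 4.0) / math.sqrt(4.0 * math.pi)  # marginal N(x; 0, 2)

# harmonic mean: average 1/L(θ_t|x) over posterior draws θ_t ~ N(x/2, 1/2)
T = 100_000
inv_lik = lambda th: math.sqrt(2.0 * math.pi) * math.exp(0.5 * (x - th) ** 2)
est_inv_m = sum(inv_lik(rng.gauss(x / 2.0, math.sqrt(0.5)))
                for _ in range(T)) / T
# E[est_inv_m] = 1/m(x) ≈ 4.55, but the heavy right tail of 1/L makes the
# variance infinite: finite runs typically undershoot 1/m(x)
```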

Page 15

“The Worst Monte Carlo Method Ever”

“The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.”

[Radford Neal’s blog, Aug. 23, 2008]

Page 16

Comparison with regular importance sampling

Harmonic mean: constraint opposed to usual importance sampling constraints: the proposal ϕ(·) must have lighter (rather than fatter) tails than π(·)L(·) for the approximation

1 / [ (1/T) Σ_{t=1}^T ϕ(θ_t) / {π(θ_t) L(θ_t)} ],   θ_t ∼ π(θ|x),

to have a finite variance. E.g., use finite-support kernels (like Epanechnikov’s kernel) for ϕ

Page 18

HPD indicator as ϕ

Use the convex hull of MCMC simulations (θ_t)_{t=1,...,T} corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:

ϕ(θ) = (10/T) Σ_{t∈HPD} 1{d(θ, θ_t) ≤ ε}

[X & Wraith, 2009]

Page 19

Bayesian computing (R)evolution

Bernoulli, Jakob (1654–1705)

MCMC connected steps

Metropolis-Hastings revisited

Approximate Bayesian computation (ABC)

Page 20

computational jam

In the 1970’s and early 1980’s, the theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools.

▶ restriction to conjugate priors

▶ limited complexity of models

▶ small sample sizes

The field was desperately in need of a new computing paradigm!

[X & Casella, STS, 2012]

Page 21

MCMC as in Markov Chain Monte Carlo

Notion that i.i.d. simulation is definitely not necessary: all that matters is the ergodic theorem. Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986). Reasons:

▶ lack of computing machinery

▶ lack of background on Markov chains

▶ lack of trust in the practicality of the method

Page 22

pre-Gibbs/pre-Hastings era

Early 1970’s: Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions, and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.

[Hammersley and Clifford, 1971]

Page 23

pre-Gibbs/pre-Hastings era

Early 1970’s: Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions, and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.

“What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?”

[Besag, 1972]

Page 24

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

Joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.

[Cressie, 1993; Lauritzen, 1996]

Page 25

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e.,

(F) ≡ (G)

[Cressie, 1993; Lauritzen, 1996]

Page 26

Hammersley-Clifford[-Besag] theorem

Theorem (Hammersley-Clifford)

Under the positivity condition, the joint distribution g satisfies

g(y_1, . . . , y_p) ∝ ∏_{j=1}^p [ g_{ℓ_j}(y_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) / g_{ℓ_j}(y′_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) ]

for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y.

[Cressie, 1993; Lauritzen, 1996]

Page 27

Clicking in

After Peskun (1973), MCMC remained mostly dormant in the mainstream statistical world for about 10 years; then several papers/books highlighted its usefulness in specific settings:

▶ Geman and Geman (1984)

▶ Besag (1986)

▶ Strauss (1986)

▶ Ripley (Stochastic Simulation, 1987)

▶ Tanner and Wong (1987)

▶ Younes (1988)

Page 28

[Re-]Enters the Gibbs sampler

Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field, without completion.

Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein, and ergodicity is proven on the collection of global maxima

Page 30

Removing the jam

In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:

▶ linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991; Wang & al., 1993, 1994)

▶ generalized linear mixed models (Albert & Chib, 1993)

▶ mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994; Escobar & West, 1993)

▶ changepoint analysis (Carlin & al., 1992)

▶ point processes (Grenander & Møller, 1994)

▶ &tc

Page 31

Removing the jam

In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:

▶ genomics (Stephens & Smith, 1993; Lawrence & al., 1993; Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly, 2000)

▶ ecology (George & X, 1992)

▶ variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000)

▶ spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)

▶ longitudinal studies (Lange & al., 1992)

▶ &tc

Page 32

MCMC and beyond

▶ reversible jump MCMC, which impacted Bayesian model choice considerably (Green, 1995)

▶ adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal, 2009)

▶ exact approximations to targets (Tanner & Wong, 1987; Beaumont, 2003; Andrieu & Roberts, 2009)

▶ comp’al stats catching up with comp’al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)

▶ sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al., 2006)

▶ retrospective sampling

▶ intractability: EP – GIMH – PMCMC – SMC² – INLA

▶ QMC[MC] (Owen, 2011)

Page 33

Particles

Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves!

[Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]

Found in the molecular simulation literature of the 50’s with self-avoiding random walks, and in signal processing

[Marshall, 1965; Handschin and Mayne, 1969]

Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.

Page 35

pMC & pMCMC

▶ Recycling of past simulations legitimate to build better importance sampling functions, as in population Monte Carlo

[Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]

▶ synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel p̂_θ(y_{1:T}) in state space models p(x_{1:T}) p(y_{1:T}|x_{1:T})

▶ importance sampling on discretely observed diffusions

[Beskos et al., 2006; Fearnhead et al., 2008, 2010]

Page 36

Metropolis-Hastings revisited

Bernoulli, Jakob (1654–1705)

MCMC connected steps

Metropolis-Hastings revisited
  Reinterpretation and Rao-Blackwellisation
  Russian roulette

Approximate Bayesian computation (ABC)

Page 37

Metropolis–Hastings algorithm

1. We wish to approximate

I = ∫ h(x) π(x) dx / ∫ π(x) dx = ∫ h(x) π̄(x) dx  (π̄ the normalised density)

2. π(x) is known but not ∫ π(x) dx.

3. Approximate I with δ = (1/n) Σ_{t=1}^n h(x^{(t)}), where (x^{(t)}) is a Markov chain with limiting distribution π.

4. Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.

Page 40

Metropolis–Hastings algorithm

Suppose that x^{(t)} is drawn.

1. Simulate y_t ∼ q(·|x^{(t)}).

2. Set x^{(t+1)} = y_t with probability

α(x^{(t)}, y_t) = min{ 1, [π(y_t)/π(x^{(t)})] [q(x^{(t)}|y_t)/q(y_t|x^{(t)})] }

Otherwise, set x^{(t+1)} = x^{(t)}.

3. α is such that the detailed balance equation is satisfied:

π(x) q(y|x) α(x, y) = π(y) q(x|y) α(y, x),

so π is the stationary distribution of (x^{(t)}).

▶ The accepted candidates are simulated with the rejection algorithm.
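The two steps above can be sketched as a random-walk sampler (a hypothetical illustration, not from the slides; with a symmetric proposal the q-ratio cancels from α):

```python
import math, random

def rw_metropolis(log_pi, x0, scale, n=50_000, seed=7):
    """Random-walk Metropolis-Hastings on an unnormalised log-density:
    propose y = x + scale*N(0,1), accept with prob min{1, pi(y)/pi(x)}."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        y = x + scale * rng.gauss(0.0, 1.0)
        if math.log(rng.random()) < log_pi(y) - log_pi(x):
            x = y                 # accept the candidate
        chain.append(x)           # otherwise x^{(t+1)} = x^{(t)}
    return chain

# Target: unnormalised N(0,1); the chain mean should settle near 0
chain = rw_metropolis(lambda t: -0.5 * t * t, x0=3.0, scale=2.4)
mean = sum(chain) / len(chain)
```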

Page 42

Some properties of the HM algorithm

An alternative representation of the estimator δ is

δ = (1/n) Σ_{t=1}^n h(x^{(t)}) = (1/n) Σ_{i=1}^{M_n} n_i h(z_i),

where

▶ the z_i’s are the accepted y_j’s,

▶ M_n is the number of accepted y_j’s till time n,

▶ n_i is the number of times z_i appears in the sequence (x^{(t)})_t.

Page 43

The ”accepted candidates”

q̃(·|z_i) = α(z_i, ·) q(·|z_i) / p(z_i) ≤ q(·|z_i) / p(z_i),

where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy. To simulate from q̃(·|z_i):

1. Propose a candidate y ∼ q(·|z_i)

2. Accept with probability

[α(z_i, y) q(y|z_i) / p(z_i)] / [q(y|z_i) / p(z_i)] = α(z_i, y)

Otherwise, reject it and start again.

▶ this is the transition of the HM algorithm. The transition kernel q̃ enjoys π̃ as a stationary distribution:

π̃(x) q̃(y|x) = π̃(y) q̃(x|y) ,

Page 45

”accepted” Markov chain

Lemma (Douc & X., AoS, 2011)

The sequence (z_i, n_i) satisfies

1. (z_i, n_i)_i is a Markov chain;

2. z_{i+1} and n_i are independent given z_i;

3. n_i is distributed as a geometric random variable with probability parameter

p(z_i) := ∫ α(z_i, y) q(y|z_i) dy ; (1)

4. (z_i)_i is a Markov chain with transition kernel Q̃(z, dy) = q̃(y|z) dy and stationary distribution π̃ such that

q̃(·|z) ∝ α(z, ·) q(·|z) and π̃(·) ∝ π(·) p(·) .

Page 49

Importance sampling perspective

1. A natural idea:

δ* = (1/n) Σ_{i=1}^{M_n} h(z_i) / p(z_i) ,

Page 52

Importance sampling perspective

1. A natural idea:

δ* ≈ [ Σ_{i=1}^{M_n} h(z_i)/p(z_i) ] / [ Σ_{i=1}^{M_n} 1/p(z_i) ] = [ Σ_{i=1}^{M_n} {π(z_i)/π̃(z_i)} h(z_i) ] / [ Σ_{i=1}^{M_n} π(z_i)/π̃(z_i) ] .

2. But p not available in closed form.

3. The geometric n_i is the replacement, an obvious solution that is used in the original Metropolis–Hastings estimate since E[n_i] = 1/p(z_i).

Page 53

The Bernoulli factory

The crude estimate of 1/p(z_i),

n_i = 1 + Σ_{j=1}^∞ ∏_{ℓ≤j} 1{u_ℓ ≥ α(z_i, y_ℓ)} ,

can be improved:

Lemma (Douc & X., AoS, 2011)

If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity

ξ_i = 1 + Σ_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}

is an unbiased estimator of 1/p(z_i) whose variance, conditional on z_i, is lower than the conditional variance of n_i, {1 − p(z_i)}/p²(z_i).
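A toy check of this Rao–Blackwellisation (a hypothetical setup, not from the slides: the acceptance probability of a fresh proposal is taken as α = y with y ∼ U(0,1), so p = E[α] = 1/2 and both estimators should average 1/p = 2, with ξ the less variable):

```python
import random

rng = random.Random(11)

def n_geometric():
    """n_i: count fresh proposals until a uniform u falls below α(z, y)."""
    count = 1
    while rng.random() >= rng.random():  # u >= α: rejection, keep going
        count += 1
    return count

def xi_rao_blackwell(tol=1e-12):
    """ξ_i: the uniforms are integrated out, multiplying factors (1 - α)."""
    total, prod = 1.0, 1.0
    while prod > tol:
        prod *= 1.0 - rng.random()       # 1 - α(z, y_ℓ) for a fresh y_ℓ
        total += prod
    return total

reps = 20_000
ns = [n_geometric() for _ in range(reps)]
xs = [xi_rao_blackwell() for _ in range(reps)]
mean_n, mean_xi = sum(ns) / reps, sum(xs) / reps
var_n = sum((v - mean_n) ** 2 for v in ns) / reps
var_xi = sum((v - mean_xi) ** 2 for v in xs) / reps
# both estimate 1/p = 2; ξ has the smaller variance (Rao-Blackwellised)
```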

Page 54

Rao-Blackwellised, for sure?

ξ_i = 1 + Σ_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}

1. An infinite sum, but with positive probability only finitely many terms are nonzero:

α(x^{(t)}, y_t) = min{ 1, [π(y_t)/π(x^{(t)})] [q(x^{(t)}|y_t)/q(y_t|x^{(t)})] }

For example: take a symmetric random walk as a proposal.

2. What if we wish to be sure that the sum is finite?

Finite horizon k version:

ξ_i^k = 1 + Σ_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} 1{u_ℓ ≥ α(z_i, y_ℓ)}

Page 56

which Bernoulli factory?!

Not the spice warehouse of Leon Bernoulli!

Query:

Given an algorithm delivering iid B(p) rv’s, is it possible to derive an algorithm delivering iid B(f(p)) rv’s when f is known and p unknown?

[von Neumann, 1951; Keane & O’Brien, 1994]

▶ existence (e.g., impossible for f(p) = min(2p, 1))

▶ condition: for some n,

min{f(p), 1 − f(p)} ≥ min{p, 1 − p}ⁿ

▶ implementation (polynomial vs. exponential time)

▶ use of sandwiching polynomials/power series
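The simplest member of the family (a hypothetical illustration, not from the slides): f(p) = p² admits a trivial Bernoulli factory — toss the p-coin twice and output the product, giving a B(p²) draw without ever knowing p:

```python
import random

def squared_coin_factory(coin):
    """Bernoulli factory for f(p) = p^2: AND of two independent p-coins."""
    return coin() & coin()

rng = random.Random(0)
p = 0.3
coin = lambda: int(rng.random() < p)
freq = sum(squared_coin_factory(coin) for _ in range(100_000)) / 100_000
# freq ≈ p^2 = 0.09
```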

Page 59

Variance improvement

Theorem (Douc & X., AoS, 2011)

If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity

ξ_i^k = 1 + Σ_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} 1{u_ℓ ≥ α(z_i, y_ℓ)}

is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Moreover, for k ≥ 1,

V[ξ_i^k | z_i] = (1 − p(z_i))/p²(z_i) − [1 − (1 − 2p(z_i) + r(z_i))^k] / [2p(z_i) − r(z_i)] × [(2 − p(z_i))/p²(z_i)] (p(z_i) − r(z_i)) ,

where p(z_i) := ∫ α(z_i, y) q(y|z_i) dy and r(z_i) := ∫ α²(z_i, y) q(y|z_i) dy.

Page 60

Variance improvement

Theorem (Douc & X., AoS, 2011)

If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity

ξ_i^k = 1 + Σ_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} 1{u_ℓ ≥ α(z_i, y_ℓ)}

is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Therefore, we have

V[ξ_i | z_i] ≤ V[ξ_i^k | z_i] ≤ V[ξ_i^0 | z_i] = V[n_i | z_i] .

Page 61

B motivation for Russian roulette

▶ prior π(θ), data density p(y|θ) = f(y;θ)/Z(θ) with

Z(θ) = ∫ f(x;θ) dx

intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc)

▶ doubly-intractable posterior follows as

π(θ|y) = p(y|θ) × π(θ) × 1/Z(y) = [f(y;θ)/Z(θ)] × π(θ) × 1/Z(y)

where Z(y) = ∫ p(y|θ) π(θ) dθ

▶ both Z(θ) and Z(y) are intractable, with massively different consequences

[thanks to Mark Girolami for his Russian slides!]

Page 63

B motivation for Russian roulette

▶ If Z(θ) is intractable, the Metropolis–Hastings acceptance probability

α(θ′, θ) = min{ 1, [f(y;θ′)π(θ′) / f(y;θ)π(θ)] × [q(θ|θ′)/q(θ′|θ)] × [Z(θ)/Z(θ′)] }

is not available

▶ Use instead biased approximations, e.g. pseudo-likelihoods or plug-in estimates of Z(θ′), without sacrificing the exactness of MCMC

Page 65

Existing solution

▶ Unbiased plug-in estimate

Z(θ)/Z(θ′) ≈ f(x;θ)/f(x;θ′) where x ∼ f(x;θ′)/Z(θ′)

[Møller et al., Bka, 2006; Murray et al., 2006]

▶ auxiliary variable method

▶ removes Z(θ)/Z(θ′) from the picture

▶ requires simulations from the model (e.g., via perfect sampling)

Page 66

Exact approximate methods

Pseudo-marginal construction that allows for the use of unbiased, positive estimates of the target in the acceptance probability

α(θ′, θ) = min{ 1, [π̂(θ′|y)/π̂(θ|y)] × [q(θ|θ′)/q(θ′|θ)] }

[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al., 2012]

The transition kernel has an invariant distribution with the exact target density π(θ|y)
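A pseudo-marginal sketch (a hypothetical toy, not from the slides): the unnormalised N(0,1) target density is replaced by an unbiased noisy estimate (mean-one lognormal multiplicative noise), yet the θ-marginal of the chain is still exactly N(0,1):

```python
import math, random

def pseudo_marginal(n=100_000, noise=0.3, seed=5):
    """Pseudo-marginal MH: carry the density *estimate* along with the
    state and accept with min{1, pi_hat(y)/pi_hat(x)}."""
    rng = random.Random(seed)
    target = lambda t: math.exp(-0.5 * t * t)      # unnormalised N(0,1)
    # unbiased estimate: multiply by mean-one lognormal noise
    noisy = lambda t: target(t) * rng.lognormvariate(-0.5 * noise ** 2, noise)
    x, pi_hat_x, chain = 0.0, noisy(0.0), []
    for _ in range(n):
        y = x + rng.gauss(0.0, 2.0)
        pi_hat_y = noisy(y)
        if rng.random() < pi_hat_y / pi_hat_x:     # keep the old estimate!
            x, pi_hat_x = y, pi_hat_y
        chain.append(x)
    return chain

chain = pseudo_marginal()
mean = sum(chain) / len(chain)
```

The crucial design choice is that the estimate π̂(x) is *not* refreshed when a proposal is rejected; refreshing it would break exactness.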

Page 69

Infinite series estimator

▶ For each (θ, y), construct rv’s {V_θ^{(j)}, j ≥ 0} such that

π̂(θ, {V_θ^{(j)}} | y) := Σ_{j=0}^∞ V_θ^{(j)}

is a.s. finite with finite expectation

E[ π̂(θ, {V_θ^{(j)}} | y) ] = π(θ|y)

▶ Introduce a random stopping time τ_θ, such that with

ξ := (τ_θ, {V_θ^{(j)}, 0 ≤ j ≤ τ_θ}) the estimate

π̂(θ, ξ|y) := Σ_{j=0}^{τ_θ} V_θ^{(j)}

satisfies

E( π̂(θ, ξ|y) | {V_θ^{(j)}, j ≥ 0} ) = π̂(θ, {V_θ^{(j)}} | y)

Page 70

Infinite series estimator

▶ For each (θ, y), construct rv’s {V_θ^{(j)}, j ≥ 0} such that

π̂(θ, {V_θ^{(j)}} | y) := Σ_{j=0}^∞ V_θ^{(j)}

is a.s. finite with finite expectation

E[ π̂(θ, {V_θ^{(j)}} | y) ] = π(θ|y)

▶ Warning: for an unbiased estimate π̂(θ, ξ|y) using the series construction, there is no general guarantee of positivity

Page 73: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Russian roulette

Method that requires unbiased truncation of a series

S(θ) =∑∞

i=0φi (θ)

Russian roulette employed extensively in simulation of neutronscattering and computer graphics

I Assign probabilities {qj , j > 1} qj ∈ (0, 1] and generateU(0, 1) i.i.d. r.v’s {Uj , j > 1}

I Find the first time k > 1 such that Uk > qk

I Russian roulette estimate of S(θ) is

S(θ) =∑k

j=0φj(θ)

/∏j−1

i=1qi ,

I If limn→∞∏nj=1 qj = 0, Russian roulette terminates with

probability one

[Girolami, Lyne, Strathman, Simpson, & Atchade, arXiv:1306.4032]

Page 74: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Russian roulette

Method that requires unbiased truncation of a series

S(θ) =∑∞

i=0φi (θ)

Russian roulette employed extensively in simulation of neutronscattering and computer graphics

I Assign probabilities {qj , j > 1} qj ∈ (0, 1] and generateU(0, 1) i.i.d. r.v’s {Uj , j > 1}

I Find the first time k > 1 such that Uk > qk

I Russian roulette estimate of S(θ) is

S(θ) =∑k

j=0φj(θ)

/∏j−1

i=1qi ,

I E{S(θ)} = S(θ)

I variance finite under certain known conditions

[Girolami, Lyne, Strathman, Simpson, & Atchade, arXiv:1306.4032]
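As a concrete illustration, the Russian roulette truncation can be sketched in a few lines of Python. The geometric series and the constant survival probability used below are made-up examples for demonstration, not from the talk.

```python
import random

def russian_roulette(phi, q, rng=random):
    """Unbiased Russian-roulette truncation of S(θ) = Σ_{j≥0} φ_j(θ).

    phi(j): the j-th series term; q(j): survival probability q_j ∈ (0, 1].
    Terms j = 0, 1 are always kept (the first kill test occurs at k = 1);
    term j > 1 is kept iff U_1 ≤ q_1, ..., U_{j-1} ≤ q_{j-1}, and is
    reweighted by 1/(q_1 ... q_{j-1}) to preserve unbiasedness.
    """
    total = phi(0) + phi(1)
    weight = 1.0                      # running product q_1 ... q_{j-1}
    j = 2
    while rng.random() <= q(j - 1):   # survived round j-1: keep term j
        weight *= q(j - 1)
        total += phi(j) / weight
        j += 1
    return total
```

Averaging many replicates recovers the full sum without ever evaluating infinitely many terms, at the price of an inflated variance.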

Page 75: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

towards ever more complexity

Bernoulli, Jakob (1654–1705)

MCMC connected steps

Metropolis-Hastings revisited

Approximate Bayesian computation (ABC)

Page 76: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

New challenges

Novel statistical issues that force a different Bayesian answer:

I very large datasets

I complex or unknown dependence structures, possibly with p ≫ n

I multiple and involved random effects

I missing data structures containing most of the information

I sequential structures involving most of the above

Page 77: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

New paradigm?

“Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.”

[Lange et al., ISR, 2013]

Page 78: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

New paradigm?

I sad reality constraint that size does matter

I focus on much smaller dimensions and on sparse summaries

I many (fast if non-Bayesian) ways of producing those summaries

I Bayesian inference can kick in almost automatically at this stage

Page 79: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Approximate Bayesian computation (ABC)

Case of a well-defined statistical model where the likelihood function

`(θ|y) = f (y1, . . . , yn|θ)

is out of reach!

Empirical approximations to the original Bayesian inference problem:

I Degrading the data precision down to a tolerance ε

I Replacing the likelihood with a non-parametric approximation

I Summarising/replacing the data with insufficient statistics


Page 83: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

ABC methodology

Bayesian setting: target is π(θ)f (x|θ)

When the likelihood f (x|θ) is not in closed form, a likelihood-free rejection technique:

Foundation

For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps jointly simulating

θ′ ∼ π(θ) , z ∼ f (z|θ′) ,

until the auxiliary variable z is equal to the observed value, z = y, then the selected

θ′ ∼ π(θ|y)

[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]


Page 86: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

ABC algorithm

In most implementations, some degree of approximation:

Algorithm 1 Likelihood-free rejection sampler

for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f (·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θi = θ′
end for

where η(y) defines a (not necessarily sufficient) statistic
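Algorithm 1 translates directly into code. Below is a minimal Python sketch; the toy model it runs on (a normal sample with unknown mean, flat-ish normal prior, sample mean as summary, absolute difference as distance) is a hypothetical illustration, not an example from the talk.

```python
import random
import statistics

def abc_rejection(y_obs, prior_sim, model_sim, summary, dist, eps, n):
    """Likelihood-free rejection sampler (Algorithm 1).

    Repeats (θ′ ~ π, z ~ f(·|θ′)) until ρ{η(z), η(y)} ≤ ε,
    and returns n accepted parameter draws.
    """
    eta_y = summary(y_obs)
    accepted = []
    while len(accepted) < n:
        theta = prior_sim()                  # θ′ from the prior π(·)
        z = model_sim(theta)                 # z from the likelihood f(·|θ′)
        if dist(summary(z), eta_y) <= eps:   # ρ{η(z), η(y)} ≤ ε
            accepted.append(theta)
    return accepted

# Hypothetical toy model: y_i ~ N(θ, 1), prior θ ~ N(0, 5),
# summary η = sample mean, distance ρ = absolute difference.
rng = random.Random(1)
y = [rng.gauss(1.5, 1.0) for _ in range(50)]
post = abc_rejection(
    y, lambda: rng.gauss(0.0, 5.0),
    lambda th: [rng.gauss(th, 1.0) for _ in range(50)],
    statistics.fmean, lambda a, b: abs(a - b),
    eps=0.2, n=100,
)
```

Note how the likelihood is only ever simulated from, never evaluated, which is the whole point of the method.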

Page 87: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Comments

I role of distance paramount (because ε ≠ 0)

I scaling of the components of η(y) also capital

I ε matters little if “small enough”

I representative of the “curse of dimensionality”

I small is beautiful!, i.e. the data as a whole may be weakly informative for ABC

I non-parametric method at its core

Page 88: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

ABC simulation advances

Simulating from the prior is often poor in efficiency

Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y ...

[Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]

...or view the problem as conditional density estimation and develop techniques that allow for a larger ε

[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]

...or even include ε in the inferential framework [ABCµ]

[Ratmann et al., 2009]


Page 92: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

ABC as an inference machine

Starting point is the summary statistic η(y), either chosen for computational realism or imposed by external constraints

I ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)

I inference based on ABC may be consistent or not, so it needs to be validated on its own

I the choice of the tolerance level ε is dictated by both computational and convergence constraints


Page 94: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

How Bayesian aBc is..?

At best, ABC approximates π(θ|η(y)):

I approximation error unknown (w/o massive simulation)

I pragmatic or empirical Bayes (there is no other solution!)

I many calibration issues (tolerance, distance, statistics)

I the non-parametric side should be incorporated into the whole Bayesian picture

I the approximation error should also be part of the Bayesian inference

Page 95: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Noisy ABC

ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, the convolution of the true posterior with a kernel function

πε(θ, z|y) = π(θ) f (z|θ) Kε(y − z) / ∫ π(θ) f (z|θ) Kε(y − z) dz dθ ,

with Kε a kernel parameterised by bandwidth ε.

[Wilkinson, 2013]

Theorem

The ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ Kε, and an acceptance probability of

Kε(ỹ − z)/M

gives draws from the posterior distribution π(θ|ỹ).
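A minimal sketch of the noisy-ABC step, assuming a uniform kernel Kε(u) ∝ 1{|u| ≤ ε} on a scalar observation, in which case the acceptance probability Kε(ỹ − z)/M reduces to a hard threshold. The single-observation normal model and prior below are made-up illustrations.

```python
import random

def noisy_abc(y_obs, prior_sim, model_sim, eps, n, rng=random):
    """Noisy ABC with a uniform kernel K_eps(u) ∝ 1{|u| <= eps}.

    The observation is jittered once, y_tilde = y + xi with xi ~ K_eps;
    the usual hard-threshold rejection step then samples exactly from
    the posterior given the randomised observation y_tilde.
    """
    y_tilde = y_obs + rng.uniform(-eps, eps)   # randomised observation
    draws = []
    while len(draws) < n:
        theta = prior_sim()
        z = model_sim(theta)
        if abs(y_tilde - z) <= eps:            # accept w.p. K_eps(y_tilde - z)/M
            draws.append(theta)
    return draws

# Hypothetical toy: single observation y ~ N(θ, 1), prior θ ~ N(0, 2)
rng = random.Random(7)
draws = noisy_abc(1.0, lambda: rng.gauss(0.0, 2.0),
                  lambda th: rng.gauss(th, 1.0), eps=0.3, n=200, rng=rng)
```

The only change from the plain rejection sampler is the one-off jitter of y, which turns the ε-approximation into exact inference for the perturbed data.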



Page 98: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Which summary?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

I Loss of statistical information balanced against gain in data roughening

I Approximation error and information loss remain unknown

I Choice of statistics induces a choice of distance function, towards standardisation

I borrowing tools from data analysis (LDA) and machine learning

[Estoup et al., ME, 2012]

Page 99: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Which summary?

Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]

I may be imposed for external/practical reasons

I may gather several non-B point estimates

I we can learn about efficient combination

I distance can be provided by estimation techniques

Page 100: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Which summary for model choice?

‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’

[S. Sisson, Jan. 31, 2011, xianblog]

Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,

B^η_12(y) = ∫ π1(θ1) f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2) f^η_2(η(y)|θ2) dθ2 ,

is either consistent or not

[X et al., PNAS, 2012]

Page 101: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Which summary for model choice?

[Figure: boxplots of the posterior probability approximations under the Gauss and Laplace models, n = 100]

Page 102: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Selecting proper summaries

Consistency only depends on the range of

µi (θ) = Ei [η(y)]

under both models against the asymptotic mean µ0 of η(y)

Theorem

If Pn belongs to one of the two models and if µ0 cannot be attained by the other one, i.e.

0 = min ( inf{ |µ0 − µi (θi )| ; θi ∈ Θi }, i = 1, 2 )
  < max ( inf{ |µ0 − µi (θi )| ; θi ∈ Θi }, i = 1, 2 ) ,

then the Bayes factor B^η_12 is consistent

[Marin et al., JRSS B, 2013]

Page 103: Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

Selecting proper summaries

[Figure: boxplots of model posterior probabilities for models M1 and M2 under various summary statistics]

[Marin et al., JRSS B, 2013]