Talk at the 2013 WSC/ISI conference in Hong Kong, August 26, 2013
DESCRIPTION
These are the slides for my conference talk at the 2013 WSC, in the "Jacob Bernoulli's 'Ars Conjectandi' and the emergence of probability" session organised by Adam Jakubowski.
TRANSCRIPT
An [under]view of Monte Carlo methods, from importance sampling to MCMC, to ABC
(& kudos to Bernoulli)
Christian P. Robert, Université Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
Outline
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
Bernoulli as founding father of Monte Carlo methods
The weak law of large numbers (or Bernoulli's [Golden] theorem) provides the justification for Monte Carlo approximations:
if x_1, . . . , x_n are i.i.d. rv's with density f,
lim_{n→∞} {h(x_1) + . . . + h(x_n)}/n = ∫_X h(x) f(x) dx
Stigler's Law of Eponymy: Cardano (1501–1576) first stated the result
Bernoulli as founding father of Monte Carlo methods
...and indeed
{h(x_1) + . . . + h(x_n)}/n converges to I = ∫_X h(x) f(x) dx
...meaning that, provided we can simulate x_i ∼ f(·) long and fast "enough", the empirical mean will be a good "enough" approximation to I
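The Monte Carlo approximation above can be sketched in a few lines. This is a minimal illustration with an arbitrary choice of h and f (an assumption for the example): h(x) = cos(x) and f the standard normal density, for which the integral is known exactly, exp(-1/2).

```python
import math
import random

random.seed(42)

# Approximate I = E[h(X)] for h(x) = cos(x) and X ~ N(0, 1);
# the exact value is exp(-1/2) ~ 0.6065.
n = 100_000
estimate = sum(math.cos(random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(estimate)  # close to exp(-0.5) for large n
```

The standard error decreases as 1/√n, so "long enough" here means n large relative to the variance of h(X).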
Early implementations of the LLN
I While Jakob Bernoulli himself apparently did not engage in simulation,
I Buffon (1707–1788) resorted to a (not-yet-Monte-Carlo) experiment in 1735 to estimate the value of the Saint Petersburg game (even though he did not perform a similar experiment for estimating π)
I De Forest (1834–1888) found the median of a log-Cauchy distribution, using normal simulations approximated to the second digit (in 1876)
I followed closely by the ubiquitous Galton, using "normal" dice in 1890, after developing the Quincunx, used both for checking the CLT and simulating from a posterior distribution as early as 1877
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
Importance Sampling
When focussing on integral approximation, a very loose principle, in that any proposal distribution with pdf q(·) leads to the alternative representation
I = ∫_X h(x) {f/q}(x) q(x) dx
Principle of importance
Generate an iid sample x_1, . . . , x_n ∼ q(·) and estimate I by
Î_IS = n^{-1} ∑_{i=1}^n h(x_i) {f/q}(x_i) .
...provided q is positive on the right set
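The importance identity can be sketched as follows, with illustrative choices that are assumptions of the example: target f = N(0, 1), heavier-tailed proposal q = N(0, 2), and h(x) = x².

```python
import math
import random

random.seed(1)

def phi(x, sd):
    # normal N(0, sd^2) density
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Target f = N(0,1); proposal q = N(0,2) has heavier tails, so f/q is bounded.
n = 100_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 2.0)        # x ~ q
    w = phi(x, 1.0) / phi(x, 2.0)     # importance weight {f/q}(x)
    total += (x * x) * w              # h(x) {f/q}(x)
i_is = total / n
print(i_is)  # approximates E_f[X^2] = 1
```

Because q dominates f in the tails, the weights stay bounded and the estimator is well behaved; the next slide shows what happens when this fails.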
things aren't all rosy...
The LLN is not sufficient to justify Monte Carlo methods: if
n^{-1} ∑_{i=1}^n h(x_i) {f/q}(x_i)
has an infinite variance, the estimator Î_IS is useless.
[Figure: importance sampling estimation of P(2 ≤ Z ≤ 6), where Z is Cauchy and the importance distribution is normal, compared with the exact value, 0.095]
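The failure mode in the figure can be reproduced directly (a sketch of the slide's own example): estimating P(2 ≤ Z ≤ 6) for a standard Cauchy Z with a N(0, 1) importance distribution, where the weight f/q explodes in the tails.

```python
import math
import random

random.seed(7)

def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Exact value: (arctan(6) - arctan(2)) / pi ~ 0.095
exact = (math.atan(6.0) - math.atan(2.0)) / math.pi

# The weight f/q at the right end of the interval is already enormous,
# which is the source of the infinite variance.
max_weight = cauchy_pdf(6.0) / normal_pdf(6.0)

n = 100_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    if 2.0 <= x <= 6.0:               # h is the indicator of [2, 6]
        total += cauchy_pdf(x) / normal_pdf(x)
estimate = total / n
print(exact, max_weight, estimate)
```

Typical runs underestimate badly, with occasional huge jumps when a rare draw lands far in the tail: the hallmark of an infinite-variance importance sampler.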
The harmonic mean estimator
Bayesian posterior distribution defined as
π(θ|x) = π(θ) L(θ|x) / m(x)
When θ_t ∼ π(θ|x),
(1/T) ∑_{t=1}^T 1/L(θ_t|x)
is an unbiased estimator of 1/m(x)
[Gelfand & Dey, 1994; Newton & Raftery, 1994]
Highly hazardous material: most often leads to an infinite variance!!!
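The identity can be checked on a toy conjugate model where m(x) is available exactly. The model below is an assumption for illustration, tuned (via a flat-ish likelihood) so that the variance stays finite; sharpening the likelihood makes it infinite.

```python
import math
import random

random.seed(3)

# Toy conjugate model (illustrative assumption): one observation
# x | theta ~ N(theta, 4), prior theta ~ N(0, 1), so the marginal
# likelihood is known exactly: m(x) = N(x; 0, 5).
x = 1.0
m_exact = math.exp(-x * x / 10.0) / math.sqrt(10.0 * math.pi)

def likelihood(theta):
    return math.exp(-0.5 * (x - theta) ** 2 / 4.0) / math.sqrt(8.0 * math.pi)

# The posterior is N(x/5, 4/5); simulate from it directly and average 1/L.
T = 200_000
post_mean, post_sd = x / 5.0, math.sqrt(0.8)
s = sum(1.0 / likelihood(random.gauss(post_mean, post_sd)) for _ in range(T))
inv_m_hat = s / T          # unbiased for 1/m(x)
m_hat = 1.0 / inv_m_hat    # harmonic mean estimate of m(x)
print(m_exact, m_hat)
# Caveat: with a sharper likelihood, 1/L has infinite variance under the
# posterior and the harmonic mean estimate becomes useless.
```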
“The Worst Monte Carlo Method Ever”
"The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution.
The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood."
[Radford Neal’s blog, Aug. 23, 2008]
Comparison with regular importance sampling
Harmonic mean: constraint opposed to the usual importance sampling constraints: the proposal ϕ(·) must have lighter (rather than fatter) tails than π(·)L(·) for the approximation
1 / [ (1/T) ∑_{t=1}^T ϕ(θ_t) / {π(θ_t) L(θ_t)} ] ,  θ_t ∼ π(θ|x) ,
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov's kernel) for ϕ
HPD indicator as ϕ
Use the convex hull of the MCMC simulations (θ_t)_{t=1,...,T} corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:
ϕ(θ) = (10/T) ∑_{t∈HPD} I_{d(θ,θ_t) ≤ ε}
[X & Wraith, 2009]
Bayesian computing (R)evolution
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
computational jam
In the 1970's and early 1980's, the theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools.
I restriction to conjugate priors
I limited complexity of models
I small sample sizes
The field was desperately in need of a new computing paradigm!
[X & Casella, STS, 2012]
MCMC as in Markov Chain Monte Carlo
Notion that i.i.d. simulation is definitely not necessary: all that matters is the ergodic theorem.
Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990), despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986).
Reasons:
I lack of computing machinery
I lack of background on Markov chains
I lack of trust in the practicality of the method
pre-Gibbs/pre-Hastings era
In the early 1970's, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]
"What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?"
[Besag, 1972]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
The joint distribution of a vector associated with a dependence graph must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e.,
(F) ≡ (G)
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Under the positivity condition, the joint distribution g satisfies
g(y_1, . . . , y_p) ∝ ∏_{j=1}^p [ g_{ℓ_j}(y_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) / g_{ℓ_j}(y′_{ℓ_j} | y_{ℓ_1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p}) ]
for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y.
[Cressie, 1993; Lauritzen, 1996]
Clicking in
After Peskun (1973), MCMC lay mostly dormant in the mainstream statistical world for about 10 years, until several papers/books highlighted its usefulness in specific settings:
I Geman and Geman (1984)
I Besag (1986)
I Strauss (1986)
I Ripley (Stochastic Simulation, 1987)
I Tanner and Wong (1987)
I Younes (1988)
[Re-]Enters the Gibbs sampler
Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field, without completion.
Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein, and ergodicity is proven on the collection of global maxima
Removing the jam
In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:
I linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;Wang & al., 1993, 1994)
I generalized linear mixed models (Albert & Chib, 1993)
I mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994;Escobar & West, 1993)
I changepoint analysis (Carlin & al., 1992)
I point processes (Grenander & Møller, 1994)
I &tc
Removing the jam
In the early 1990s, researchers found that Gibbs and then Metropolis-Hastings algorithms would crack almost any problem! A flood of papers followed applying MCMC:
I genomics (Stephens & Smith, 1993; Lawrence & al., 1993;Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,2000)
I ecology (George & X, 1992)
I variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000)
I spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
I longitudinal studies (Lange & al., 1992)
I &tc
MCMC and beyond
I reversible jump MCMC, which considerably impacted Bayesian model choice (Green, 1995)
I adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal,2009)
I exact approximations to targets (Tanner & Wong, 1987; Beaumont,2003; Andrieu & Roberts, 2009)
I comp'al stats catching up with comp'al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011)
I sequential Monte Carlo (SMC) for non-sequential problems (Chopin,2002; Neal, 2001; Del Moral et al 2006)
I retrospective sampling
I intractability: EP – GIMH – PMCMC – SMC2 – INLA
I QMC[MC] (Owen, 2011)
Particles
Iterating/sequential importance sampling is about as old as MonteCarlo methods themselves!
[Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50's, with self-avoiding random walks, and in signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle filter”.
pMC & pMCMC
I Recycling of past simulations is legitimate to build better importance sampling functions, as in population Monte Carlo
[Iba, 2000; Cappe et al, 2004; Del Moral et al., 2007]
I synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel p̂_θ(y_{1:T}) in state space models p(x_{1:T}) p(y_{1:T}|x_{1:T})
I importance sampling on discretely observed diffusions[Beskos et al., 2006; Fearnhead et al., 2008, 2010]
Metropolis-Hastings revisited
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited: Reinterpretation and Rao-Blackwellisation; Russian roulette
Approximate Bayesian computation (ABC)
Metropolis-Hastings algorithm
1. We wish to approximate
I = ∫ h(x)π(x) dx / ∫ π(x) dx = ∫ h(x) π̃(x) dx ,
with π̃ the normalised version of π.
2. π(x) is known, but not ∫ π(x) dx.
3. Approximate I with δ = (1/n) ∑_{t=1}^n h(x^(t)), where (x^(t)) is a Markov chain with limiting distribution π̃.
4. Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.
Metropolis-Hastings algorithm
Suppose that x^(t) is drawn.
1. Simulate y_t ∼ q(·|x^(t)).
2. Set x^(t+1) = y_t with probability
α(x^(t), y_t) = min{ 1, [π(y_t)/π(x^(t))] [q(x^(t)|y_t)/q(y_t|x^(t))] }
Otherwise, set x^(t+1) = x^(t).
3. α is such that the detailed balance equation is satisfied:
π(x) q(y|x) α(x, y) = π(y) q(x|y) α(y, x) ,
so that π is the stationary distribution of (x^(t)).
I The accepted candidates are simulated with the rejection algorithm.
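The two steps above can be sketched as a random-walk Metropolis-Hastings sampler. The target and proposal scale below are illustrative assumptions: an unnormalised N(0, 1) target, for which the symmetric proposal makes the q-ratio cancel.

```python
import math
import random

random.seed(0)

# Random-walk Metropolis-Hastings: pi is known only up to a constant
# (here an unnormalised N(0,1) log-density).
def log_target(x):
    return -0.5 * x * x

n, scale = 50_000, 2.5
chain = [0.0]
for _ in range(n - 1):
    x = chain[-1]
    y = x + random.uniform(-scale, scale)          # symmetric proposal q
    if math.log(random.random()) < log_target(y) - log_target(x):
        chain.append(y)                             # accept y_t
    else:
        chain.append(x)                             # reject: repeat x^(t)
mean = sum(chain) / n
var = sum(c * c for c in chain) / n - mean ** 2
print(mean, var)  # ergodic averages: near 0 and 1
```

Note that rejected proposals repeat the current value, which is exactly the multiplicity n_i of the accepted candidates discussed on the next slides.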
Some properties of the HM algorithm
Alternative representation of the estimator δ is
δ = (1/n) ∑_{t=1}^n h(x^(t)) = (1/n) ∑_{i=1}^{M_n} n_i h(z_i) ,
where
I the z_i's are the accepted y_j's,
I M_n is the number of accepted y_j's till time n,
I n_i is the number of times z_i appears in the sequence (x^(t))_t.
The "accepted candidates"
q̃(·|z_i) = α(z_i, ·) q(·|z_i) / p(z_i) ≤ q(·|z_i) / p(z_i) ,
where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy. To simulate from q̃(·|z_i):
1. Propose a candidate y ∼ q(·|z_i)
2. Accept with probability
q̃(y|z_i) / {q(y|z_i)/p(z_i)} = α(z_i, y)
Otherwise, reject it and start again.
I this is the transition of the HM algorithm. The transition kernel q̃ enjoys π̃ as a stationary distribution:
π̃(x) q̃(y|x) = π̃(y) q̃(x|y) ,
"accepted" Markov chain
Lemma (Douc & X., AoS, 2011)
The sequence (z_i, n_i) satisfies
1. (z_i, n_i)_i is a Markov chain;
2. z_{i+1} and n_i are independent given z_i;
3. n_i is distributed as a geometric random variable with probability parameter
p(z_i) := ∫ α(z_i, y) q(y|z_i) dy ; (1)
4. (z_i)_i is a Markov chain with transition kernel Q̃(z, dy) = q̃(y|z) dy and stationary distribution π̃ such that
q̃(·|z) ∝ α(z, ·) q(·|z) and π̃(·) ∝ π(·) p(·) .
Importance sampling perspective
1. A natural idea:
δ* = (1/n) ∑_{i=1}^{M_n} h(z_i)/p(z_i) ,
or, in self-normalised form,
δ* ≃ [∑_{i=1}^{M_n} h(z_i)/p(z_i)] / [∑_{i=1}^{M_n} 1/p(z_i)] = [∑_{i=1}^{M_n} {π(z_i)/π̃(z_i)} h(z_i)] / [∑_{i=1}^{M_n} π(z_i)/π̃(z_i)] .
2. But p is not available in closed form.
3. The geometric n_i is the replacement, an obvious solution that is used in the original Metropolis-Hastings estimate since E[n_i] = 1/p(z_i).
The Bernoulli factory
The crude estimate of 1/p(z_i),
n_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)} ,
can be improved:
Lemma (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity
ξ_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}
is an unbiased estimator of 1/p(z_i) whose variance, conditional on z_i, is lower than the conditional variance of n_i, {1 − p(z_i)}/p²(z_i).
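The two estimators of 1/p(z_i) can be compared empirically. The acceptance probability below is an illustrative assumption: α(y) = exp(−y) with y ∼ Exp(1), so that p = E[α(Y)] = 1/2 and 1/p = 2.

```python
import math
import random

random.seed(5)

# Compare the geometric count n and the Rao-Blackwellised xi as unbiased
# estimators of 1/p. Toy setting (an assumption): alpha(y) = exp(-y),
# y ~ Exp(1), hence p = E[alpha(Y)] = 1/2.
R = 100_000
ns, xis = [], []
for _ in range(R):
    # n: number of proposals until acceptance (u < alpha)
    n = 1
    while random.random() >= math.exp(-random.expovariate(1.0)):
        n += 1
    ns.append(n)
    # xi = 1 + sum_j prod_{l<=j} (1 - alpha(y_l)), truncated when negligible
    xi, prod = 1.0, 1.0
    while prod > 1e-12:
        prod *= 1.0 - math.exp(-random.expovariate(1.0))
        xi += prod
    xis.append(xi)

def mean_var(v):
    m = sum(v) / len(v)
    return m, sum((a - m) ** 2 for a in v) / len(v)

mean_n, var_n = mean_var(ns)
mean_xi, var_xi = mean_var(xis)
print(mean_n, mean_xi, var_n, var_xi)  # both means near 2; var_xi < var_n
```

Both averages sit near 1/p = 2, while the empirical variance of ξ is markedly smaller than the geometric variance (1 − p)/p² = 2, as the lemma predicts.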
Rao-Blackwellised, for sure?
ξ_i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(z_i, y_ℓ)}
1. Infinite sum, but finite with positive probability:
α(x^(t), y_t) = min{ 1, [π(y_t)/π(x^(t))] [q(x^(t)|y_t)/q(y_t|x^(t))] }
For example, take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Finite horizon k version:
ξ_i^k = 1 + ∑_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)}
which Bernoulli factory?!
Not the spice warehouse of Leon Bernoulli!
Query:
Given an algorithm delivering iid B(p) rv's, is it possible to derive an algorithm delivering iid B(f(p)) rv's when f is known and p unknown?
[von Neumann, 1951; Keane & O'Brien, 1994]
I existence (e.g., impossible for f(p) = min(2p, 1))
I condition: for some n,
min{f(p), 1 − f(p)} ≥ min{p, 1 − p}^n
I implementation (polynomial vs. exponential time)
I use of sandwiching polynomials/power series
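The simplest instance is von Neumann's (1951) procedure for f(p) = 1/2: it turns a coin of unknown bias p into an exactly fair coin. A minimal sketch (the bias value 0.3 is an arbitrary assumption):

```python
import random

random.seed(11)

def biased_coin(p=0.3):
    # stand-in for the black-box B(p) generator; p is "unknown" to the factory
    return random.random() < p

def von_neumann_fair_coin(coin):
    # von Neumann (1951): draw pairs until they differ; HT -> 1, TH -> 0.
    # Both outcomes have probability p(1-p), so the result is exactly B(1/2).
    while True:
        a, b = coin(), coin()
        if a != b:
            return 1 if a else 0

n = 100_000
heads = sum(von_neumann_fair_coin(biased_coin) for _ in range(n))
freq = heads / n
print(freq)  # close to 0.5 despite p = 0.3
```

The expected number of B(p) draws per output is 1/{p(1 − p)}, illustrating the implementation (running time) issue raised above.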
Variance improvement
Theorem (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity
ξ_i^k = 1 + ∑_{j=1}^∞ ∏_{1≤ℓ≤k∧j} {1 − α(z_i, y_ℓ)} ∏_{k+1≤ℓ≤j} I{u_ℓ ≥ α(z_i, y_ℓ)}
is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Moreover, for k ≥ 1,
V[ξ_i^k | z_i] = {1 − p(z_i)}/p²(z_i) − [1 − {1 − 2p(z_i) + r(z_i)}^k] / [2p(z_i) − r(z_i)] × [{2 − p(z_i)}/p²(z_i)] {p(z_i) − r(z_i)} ,
where p(z_i) := ∫ α(z_i, y) q(y|z_i) dy and r(z_i) := ∫ α²(z_i, y) q(y|z_i) dy. Therefore,
V[ξ_i | z_i] ≤ V[ξ_i^k | z_i] ≤ V[ξ_i^0 | z_i] = V[n_i | z_i] .
motivation for Russian roulette
I prior π(θ), data density p(y|θ) = f(y; θ)/Z(θ) with
Z(θ) = ∫ f(x; θ) dx
intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc)
I doubly-intractable posterior follows as
π(θ|y) = p(y|θ) × π(θ) × 1/Z(y) = [f(y; θ)/Z(θ)] × π(θ) × 1/Z(y)
where Z(y) = ∫ p(y|θ) π(θ) dθ
I both Z(θ) and Z(y) are intractable, with massively different consequences
[thanks to Mark Girolami for his Russian slides!]
motivation for Russian roulette
I If Z(θ) is intractable, the Metropolis-Hastings acceptance probability
α(θ′, θ) = min{ 1, [f(y; θ′)π(θ′) / {f(y; θ)π(θ)}] × [q(θ|θ′)/q(θ′|θ)] × [Z(θ)/Z(θ′)] }
is not available
I Use instead biased approximations, e.g. pseudo-likelihoods or plug-in Z(θ′) estimates, without sacrificing exactness of MCMC
Existing solution
I Unbiased plug-in estimate
Z(θ)/Z(θ′) ≈ f(x; θ)/f(x; θ′) where x ∼ f(x; θ′)/Z(θ′)
[Møller et al, Bka, 2006; Murray et al, 2006]
I auxiliary variable method
I removes Z(θ)/Z(θ′) from the picture
I requires simulations from the model (e.g., via perfect sampling)
Exact approximate methods
Pseudo-marginal construction that allows for the use of unbiased, positive estimates π̂ of the target in the acceptance probability
α(θ′, θ) = min{ 1, [π̂(θ′|y)/π̂(θ|y)] × [q(θ|θ′)/q(θ′|θ)] }
[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al, 2012]
The transition kernel has an invariant distribution with the exact target density π(θ|y)
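The pseudo-marginal construction can be sketched on a toy problem. Everything below is an illustrative assumption: the unnormalised target is N(0, 1), and the "unbiased positive estimate" is the exact density corrupted by lognormal noise with mean exactly 1.

```python
import math
import random

random.seed(2)

def noisy_target(theta):
    # unbiased positive estimate of an unnormalised N(0,1) density:
    # multiplied by lognormal noise with mean exactly 1 (illustrative choice)
    sigma = 0.5
    noise = math.exp(random.gauss(-0.5 * sigma ** 2, sigma))
    return math.exp(-0.5 * theta * theta) * noise

n, scale = 100_000, 2.5
theta, est = 0.0, noisy_target(0.0)
chain = []
for _ in range(n):
    prop = theta + random.uniform(-scale, scale)   # symmetric proposal
    prop_est = noisy_target(prop)
    # key point: the estimate attached to the current state is stored and
    # reused, never refreshed; this is what keeps the exact target invariant
    if random.random() < prop_est / est:
        theta, est = prop, prop_est
    chain.append(theta)
mean = sum(chain) / n
print(mean)  # ergodic average near 0
```

Refreshing `est` at every iteration would break exactness; reusing it embeds the noise as an auxiliary variable whose marginal is the true target.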
Infinite series estimator
I For each (θ, y), construct rv's {V_θ^(j), j ≥ 0} such that
π̂(θ, {V_θ^(j)}|y) := ∑_{j=0}^∞ V_θ^(j)
is a.s. finite with finite expectation
E[π̂(θ, {V_θ^(j)}|y)] = π(θ|y)
I Introduce a random stopping time τ_θ such that, with ξ := (τ_θ, {V_θ^(j), 0 ≤ j ≤ τ_θ}), the estimate
π̂(θ, ξ|y) := ∑_{j=0}^{τ_θ} V_θ^(j)
satisfies
E(π̂(θ, ξ|y) | {V_θ^(j), j ≥ 0}) = π̂(θ, {V_θ^(j)}|y)
I Warning: the unbiased estimate π̂(θ, ξ|y) obtained from this series construction carries no general guarantee of positivity
Russian roulette
Method that requires unbiased truncation of a series
S(θ) = ∑_{i=0}^∞ φ_i(θ)
Russian roulette is employed extensively in the simulation of neutron scattering and in computer graphics:
I Assign probabilities {q_j, j ≥ 1}, q_j ∈ (0, 1], and generate U(0, 1) i.i.d. rv's {U_j, j ≥ 1}
I Find the first time k ≥ 1 such that U_k ≥ q_k
I Russian roulette estimate of S(θ) is
Ŝ(θ) = ∑_{j=0}^k [ φ_j(θ) / ∏_{i=1}^{j−1} q_i ] ,
I If lim_{n→∞} ∏_{j=1}^n q_j = 0, Russian roulette terminates with probability one
I E{Ŝ(θ)} = S(θ)
I variance finite under certain known conditions
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
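The roulette truncation can be checked on a series with a known sum. The choices below are assumptions of the example: a geometric series φ_j = r^j with S = 1/(1 − r), and constant survival probabilities q_j = q (so ∏ q_j → 0 and the roulette stops a.s.).

```python
import random

random.seed(9)

# Russian roulette truncation of S = sum_{j>=0} r^j = 1/(1-r)
r, q = 0.5, 0.8
true_S = 1.0 / (1.0 - r)  # = 2

def roulette_estimate():
    est, weight = 1.0, 1.0          # term j = 0, empty product in denominator
    j = 0
    while True:
        j += 1
        if random.random() >= q:    # first U_k >= q_k: include term k and stop
            return est + r ** j / weight
        est += r ** j / weight      # term j survives, weighted by 1/prod q_i
        weight *= q                 # prod_{i=1}^{j-1} q_i for the next term

R = 100_000
mean = sum(roulette_estimate() for _ in range(R)) / R
print(mean)  # unbiased: near 2
```

Term j is included exactly when the first j − 1 spins survive, an event of probability ∏_{i=1}^{j−1} q_i, so dividing by that product restores the expectation term by term.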
towards ever more complexity
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation (ABC)
New challenges
Novel statistical issues that force a different Bayesian answer:
I very large datasets
I complex or unknown dependence structures with maybe p ≫ n
I multiple and involved random effects
I missing data structures containing most of the information
I sequential structures involving most of the above
New paradigm?
"Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization."
[Lange et al., ISR, 2013]
New paradigm?
I sad reality constraint that size does matter
I focus on much smaller dimensions and on sparse summaries
I many (fast if non-Bayesian) ways of producing those summaries
I Bayesian inference can kick in almost automatically at this stage
Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood function
ℓ(θ|y) = f(y_1, . . . , y_n|θ)
is out of reach!
Empirical approximations to the original Bayesian inference problem:
I Degrading the data precision down to a tolerance ε
I Replacing the likelihood with a non-parametric approximation
I Summarising/replacing the data with insufficient statistics
ABC methodology
Bayesian setting: target is π(θ) f(x|θ)
When the likelihood f(x|θ) is not in closed form, a likelihood-free rejection technique:
Foundation
For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
ABC algorithm
In most implementations, a degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θ_i = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
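The rejection sampler above can be sketched on a toy model where the exact posterior is known. The model is an assumption for illustration: y | θ ∼ N(θ, 1), prior θ ∼ N(0, 1), one observation y = 1, so the exact posterior is N(y/2, 1/2) and the posterior mean tends to 0.5 as ε → 0.

```python
import random

random.seed(4)

# ABC rejection sampler: accept theta' whenever the simulated data z falls
# within epsilon of the observation y (here eta is the identity).
y, eps = 1.0, 0.1
accepted = []
for _ in range(200_000):
    theta = random.gauss(0.0, 1.0)     # theta' ~ prior pi(.)
    z = random.gauss(theta, 1.0)       # z ~ likelihood f(.|theta')
    if abs(z - y) <= eps:              # rho{eta(z), eta(y)} <= eps
        accepted.append(theta)
post_mean = sum(accepted) / len(accepted)
print(len(accepted), post_mean)  # posterior mean close to 0.5
```

Shrinking ε reduces the bias of the approximation but also the acceptance rate, which is the computational trade-off discussed on the following slides.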
Comments
I role of the distance paramount (because ε ≠ 0)
I scaling of the components of η(y) also capital
I ε matters little if "small enough"
I representative of the "curse of dimensionality"
I small is beautiful!, i.e. the data as a whole may be weakly informative for ABC
I non-parametric method at its core
ABC simulation advances
Simulating from the prior is often poor in efficiencyEither modify the proposal distribution on θ to increase the densityof x ’s within the vicinity of y ...
[Marjoram et al, 2003; Beaumont et al., 2009, Del Moral et al., 2012]
...or view the problem as conditional density estimation and develop techniques allowing for a larger ε [Beaumont et al., 2002; Blum & Francois, 2010; Biau et al., 2013]
...or even include ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
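One concrete way to raise the density of simulations near y is the MCMC variant of Marjoram et al. (2003): move θ by a local random walk instead of drawing it from the prior. A hedged sketch, with illustrative names and the same assumed toy normal-mean model:

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_mcmc(y_obs, log_prior, simulate, eta, rho, eps, step, theta0, n_iter):
    """ABC-MCMC in the spirit of Marjoram et al. (2003): random-walk
    proposals on theta, accepted only when the simulated summary falls
    within eps of the observed one, so the chain concentrates where
    acceptable pseudo-data are likely."""
    s_obs = eta(y_obs)
    theta = theta0
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()   # local random-walk proposal
        z = simulate(prop)
        # with a symmetric proposal, the MH ratio reduces to the prior
        # ratio times the indicator of the eps-ball around eta(y)
        if rho(eta(z), s_obs) <= eps and \
           rng.uniform() < np.exp(log_prior(prop) - log_prior(theta)):
            theta = prop
        chain[i] = theta
    return chain

# toy normal-mean model, chain started at the sample mean
y = rng.normal(1.0, 1.0, size=50)
chain = abc_mcmc(
    y,
    log_prior=lambda t: -0.5 * (t / 10.0) ** 2,   # N(0, 10^2) prior
    simulate=lambda th: rng.normal(th, 1.0, size=50),
    eta=np.mean,
    rho=lambda a, b: abs(a - b),
    eps=0.2, step=0.3, theta0=float(np.mean(y)), n_iter=2000,
)
```

Started from a θ0 already inside the ε-ball (e.g. one accepted rejection draw), the chain targets the same ABC posterior as the rejection sampler, but wastes far fewer simulations on hopeless regions of the prior.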
ABC as an inference machine
Starting point is the summary statistic η(y), either chosen for computational realism or imposed by external constraints
▶ ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y)
▶ inference based on ABC may be consistent or not, so it needs to be validated on its own
▶ the choice of the tolerance level ε is dictated by both computational and convergence constraints
How Bayesian is aBc..?
At best, ABC approximates π(θ|η(y)):
▶ approximation error unknown (w/o massive simulation)
▶ pragmatic or empirical Bayes (there is no other solution!)
▶ many calibration issues (tolerance, distance, statistics)
▶ the NP side should be incorporated into the whole Bayesian picture
▶ the approximation error should also be part of the Bayesian inference
Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, a convolution of the true posterior with a kernel function:
πε(θ, z|y) = π(θ) f(z|θ) Kε(y − z) / ∫ π(θ) f(z|θ) Kε(y − z) dz dθ,
with Kε a kernel parameterised by bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation ỹ = y + ξ, ξ ∼ Kε, and an acceptance probability of
Kε(ỹ − z)/M
gives draws from the posterior distribution π(θ|ỹ).
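A sketch of the noisy-ABC accept step, assuming a Gaussian kernel Kε = N(0, ε²), so that M = Kε(0) and Kε(u)/M = exp(−u²/2ε²); names and the conjugate toy model are illustrative. Accepted θ′ are exact draws from the posterior under the kernel-convolved likelihood, given the jittered observation ỹ:

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_abc(y_obs, prior_sample, simulate, eps, n_keep):
    """Noisy ABC with a Gaussian kernel: jitter the observation once,
    ytilde = y + xi with xi ~ N(0, eps^2), then accept (theta', z) with
    probability K_eps(ytilde - z)/M = exp(-(ytilde - z)^2 / (2 eps^2))."""
    y_tilde = y_obs + eps * rng.normal()   # randomised observation
    kept = []
    while len(kept) < n_keep:
        theta = prior_sample()
        z = simulate(theta)
        if rng.uniform() < np.exp(-0.5 * ((y_tilde - z) / eps) ** 2):
            kept.append(theta)
    return y_tilde, np.array(kept)

# conjugate toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), scalar y
y_t, draws = noisy_abc(
    1.0,
    prior_sample=lambda: rng.normal(0.0, 1.0),
    simulate=lambda th: rng.normal(th, 1.0),
    eps=0.5,
    n_keep=400,
)
```

In this toy case the convolved likelihood is N(θ, 1 + ε²), so the draws can be checked against the known Gaussian posterior given ỹ, mean ỹ/(2 + ε²)·(1 + ε²)... i.e. ỹ/2.25 for ε = 0.5.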
Which summary?
Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic [except when done by the experimenters in the field]
▶ loss of statistical information balanced against gain in data roughening
▶ approximation error and information loss remain unknown
▶ choice of statistics induces choice of distance function towards standardisation
▶ borrowing tools from data analysis (LDA) and machine learning
[Estoup et al., ME, 2012]
Which summary?
▶ may be imposed for external/practical reasons
▶ may gather several non-B point estimates
▶ we can learn about efficient combination
▶ distance can be provided by estimation techniques
Which summary for model choice?
‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’
[S. Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
Bη12(y) = ∫ π1(θ1) fη1(η(y)|θ1) dθ1 / ∫ π2(θ2) fη2(η(y)|θ2) dθ2,
is either consistent or not
[X et al., PNAS, 2012]
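A crude sketch of how Bη12 is estimated in practice: with equal prior model weights, the ratio of ABC acceptance frequencies under the two models estimates the Bayes factor based on η (function names are illustrative). The demo mirrors the Gauss-vs-Laplace setting with η the sample mean, a summary whose distribution is nearly identical under both matched-variance models, so the two acceptance rates barely differ and the estimate stays close to 1 whatever the data, an instance of the inconsistency just described:

```python
import numpy as np

rng = np.random.default_rng(3)

def abc_model_choice(y_obs, models, eta, eps, n_sim):
    """Estimate B12 based on eta as the ratio of acceptance rates
    under each (prior_sample, simulate) pair -- which, as the quote
    warns, need not match the Bayes factor of the full model."""
    s_obs = eta(y_obs)
    rates = []
    for prior_sample, simulate in models:
        hits = sum(
            abs(eta(simulate(prior_sample())) - s_obs) <= eps
            for _ in range(n_sim)
        )
        rates.append(hits / n_sim)
    return rates[0] / rates[1]

# Gauss vs Laplace location models with matched variances, eta = mean
y = rng.normal(0.0, 1.0, size=20)
models = [
    (lambda: rng.normal(0.0, 2.0),                       # prior on theta
     lambda th: rng.normal(th, 1.0, size=20)),           # Gauss model
    (lambda: rng.normal(0.0, 2.0),
     lambda th: rng.laplace(th, 1.0 / np.sqrt(2.0), size=20)),  # Laplace
]
bf = abc_model_choice(y, models, eta=np.mean, eps=0.3, n_sim=2000)
```

Replacing η with a scale-sensitive summary (e.g. a median absolute deviation) changes the picture entirely, which is exactly why the choice of η drives consistency.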
[Boxplots for the Gauss and Laplace models, n = 100]
Selecting proper summaries
Consistency only depends on the range of
µi(θ) = Ei[η(y)]
under both models against the asymptotic mean µ0 of η(y)
Theorem
If Pn belongs to one of the two models and if µ0 cannot be attained by the other one:
0 = min( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 )
  < max( inf{|µ0 − µi(θi)|; θi ∈ Θi}, i = 1, 2 ),
then the Bayes factor Bη12 is consistent
[Marin et al., JRSS B, 2013]
Selecting proper summaries
[Boxplots under models M1 and M2]
[Marin et al., JRSS B, 2013]