TRANSCRIPT
Likelihood-free computational statistics
Pierre Pudlo
Université Montpellier 2, Institut de Mathématiques et de Modélisation de Montpellier (I3M)
Institut de Biologie Computationnelle, Labex NUMEV
17/04/2015
Contents
1 Approximate Bayesian computation
2 ABC model choice
3 Bayesian computation with empirical likelihood
Intractable likelihoods
Problem. How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?
Example 1. Gibbs random fields
f(y|φ) ∝ exp(−H(y, φ)) is known up to the normalizing constant
Z(φ) = ∑y exp(−H(y, φ))
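To make the intractability concrete, here is a small Python sketch (the Potts-style energy H and the 3×3 grid are assumptions chosen for illustration) that computes Z(φ) by brute-force enumeration; the cost grows exponentially with the number of sites, which is exactly why Z(φ) is out of reach for realistic fields.

```python
import numpy as np
from itertools import product

def brute_force_Z(phi, n_rows=3, n_cols=3, n_colors=2):
    """Brute-force Z(phi) = sum_y exp(-H(y, phi)) for a tiny Potts-type field
    with H(y, phi) = -phi * (# equal-valued neighbouring pairs) on a
    4-neighbour lattice; feasible only because the grid is tiny."""
    def minus_H(y_flat):
        y = np.array(y_flat).reshape(n_rows, n_cols)
        pairs = np.sum(y[:, :-1] == y[:, 1:]) + np.sum(y[:-1, :] == y[1:, :])
        return phi * pairs
    # sum over all n_colors**(n_rows*n_cols) configurations
    return sum(np.exp(minus_H(y)) for y in product(range(n_colors), repeat=n_rows * n_cols))
```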
Example 2. Neutral population genetics
Aim. Infer demographic parameters on the past of some populations based on the trace left in the genomes of individuals sampled from current populations.
Latent process (past history of the sample) ∈ space of high dimension.
If y is the genetic data of the sample, the likelihood is
f(y|φ) = ∫Z f(y, z|φ) dz
Typically, dim(Z) ≫ dim(Y).
No hope to compute the likelihood with clever Monte Carlo algorithms?
Coralie Merle, Raphaël Leblois and François Rousset
A detour via importance sampling
If y is the genetic data of the sample, the likelihood is
f(y|φ) = ∫Z f(y, z|φ) dz
We are trying to compute this integral with importance sampling.
Actually z = (z1, . . . , zT) is a measure-valued Markov chain, stopped at a given optional time T, and y = zT, hence
f(y|φ) = ∫Z 1{y = zT} f(z1, . . . , zT|φ) dz
Importance sampling introduces an auxiliary distribution q(dz|φ):
f(y|φ) = ∫Z 1{y = zT} [f(z|φ) / q(z|φ)] q(z|φ) dz
where f(z|φ)/q(z|φ) is the weight of z and q(z|φ) the sampling distribution.
The most efficient q is the conditional distribution of the Markov chain knowing that zT = y, but even harder to compute than f(y|φ).
Any other Markovian distribution q is inefficient, as the variance of the weight grows exponentially with T.
Need a clever q: see the seminal paper of Stephens and Donnelly (2000)
And resampling algorithms...
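A minimal sketch of the plain importance-sampling estimator above; sample_q, path_logf and path_logq are hypothetical helpers standing in for the auxiliary sampler and the two path densities.

```python
import numpy as np

def is_likelihood(y_obs, phi, sample_q, path_logf, path_logq, n_draws=10_000):
    """Importance-sampling estimate of f(y_obs | phi): sample_q(phi) draws a
    full latent path z = (z_1, ..., z_T) from q(. | phi); path_logf and
    path_logq return log f(z | phi) and log q(z | phi)."""
    weights = np.zeros(n_draws)
    for i in range(n_draws):
        z = sample_q(phi)
        if np.array_equal(z[-1], y_obs):                 # indicator 1{y = z_T}
            weights[i] = np.exp(path_logf(z, phi) - path_logq(z, phi))
    return weights.mean()                                # unbiased, possibly high-variance
```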
Approximate Bayesian computation
Idea. Infer the conditional distribution of φ given yobs based on simulations from the joint π(φ) f(y|φ)
ABC algorithm
A) Generate a large set of (φ, y) from the Bayesian model π(φ) f(y|φ)
B) Keep the particles (φ, y) such that d(η(yobs), η(y)) ≤ ε
C) Return the φ's of the kept particles
Curse of dimensionality: y is replaced by some numerical summaries η(y)
Stage A) is computationally heavy!
We end up rejecting almost all simulations, except those falling in the neighborhood of η(yobs)
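A minimal sketch of the rejection sampler in steps A)–C); prior_sample, simulate and summary are hypothetical user-supplied functions and the Euclidean norm stands in for the distance d.

```python
import numpy as np

def abc_rejection(y_obs, prior_sample, simulate, summary, eps, n_sim=100_000):
    """Rejection ABC: keep the parameters whose simulated summaries fall
    within distance eps of the observed summaries."""
    eta_obs = summary(y_obs)
    kept = []
    for _ in range(n_sim):
        phi = prior_sample()                              # A) draw phi from the prior pi(phi)
        y = simulate(phi)                                 #    and a dataset from f(y | phi)
        if np.linalg.norm(summary(y) - eta_obs) <= eps:   # B) compare the summaries
            kept.append(phi)
    return np.array(kept)                                 # C) kept phi's approximate pi(phi | eta(yobs))
```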
Sequential ABC algorithms try to avoid drawing φ in areas of low π(φ|y).
An auto-calibrated ABC-SMC sampler with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert
ABC sequential sampler
How to calibrate ε1 ≥ ε2 ≥ · · · ≥ εT and T to be efficient? The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert
ABC target
Three levels of approximation of the posterior π(φ | yobs):
1 the ABC posterior distribution π(φ | η(yobs))
2 approximated with a kernel of bandwidth ε (or with k-nearest neighbours): π(φ | d(η(y), η(yobs)) ≤ ε)
3 a Monte Carlo error: sample size N < ∞
See, e.g., our review with J.-M. Marin, C. Robert and R. Ryder
If η(y) are not sufficient statistics,
π(φ | yobs) ≠ π(φ | η(yobs))
Information regarding yobs might be lost!
Curse of dimensionality: cannot have both ε small and N large when η(y) is of large dimension
Post-processing of Beaumont et al. (2002) with local linear regression.
But the lack of sufficiency might still be problematic. See Robert et al. (2011) for model choice.
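For illustration, a sketch of a local linear regression adjustment in the spirit of Beaumont et al. (2002), assuming a scalar parameter, an (n, p) array eta of simulated summaries and Epanechnikov weights; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def beaumont_adjustment(phi, eta, eta_obs, eps):
    """Among accepted simulations, regress phi on the summaries with
    Epanechnikov weights and shift the draws towards eta_obs."""
    d = np.linalg.norm(eta - eta_obs, axis=1)
    keep = d <= eps
    w = 1.0 - (d[keep] / eps) ** 2                          # Epanechnikov kernel weights
    X = np.column_stack([np.ones(keep.sum()), eta[keep] - eta_obs])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ phi[keep])
    phi_adj = phi[keep] - (eta[keep] - eta_obs) @ beta[1:]  # adjusted posterior draws
    return phi_adj, w
```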
ABC model choice
ABC model choice
A) Generate a large set of (m, φ, y) from the Bayesian model π(m) πm(φ) fm(y|φ)
B) Keep the particles (m, φ, y) such that d(η(y), η(yobs)) ≤ ε
C) For each m, return pm(yobs) = proportion of m among the kept particles
Likewise, if η(y) is not sufficient for the model choice issue,
π(m | y) ≠ π(m | η(y))
It might be difficult to design informative η(y).
Toy example.
Model 1. yi iid∼ N(φ, 1)
Model 2. yi iid∼ N(φ, 2)
Same prior on φ (whatever the model) & uniform prior on the model index
η(y) = y1 + · · · + yn is sufficient to estimate φ in both models
But η(y) carries no information regarding the variance (hence the model choice issue)
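A quick numerical illustration of this toy example (the N(0, 10²) prior on φ and the reading of "N(φ, 2)" as variance 2 are assumptions of the sketch): whatever the observed data, the accepted proportions stay close to the prior 1/2, since η(y) tells us nothing about the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_abc_model_choice(y_obs, n_sim=200_000, accept_rate=0.01):
    """ABC model choice between N(phi, 1) and N(phi, var=2) using the
    insufficient summary eta(y) = sum(y); assumed prior phi ~ N(0, 10^2)."""
    n = len(y_obs)
    eta_obs = np.sum(y_obs)
    m = rng.integers(1, 3, size=n_sim)                  # uniform prior on the model index
    phi = rng.normal(0.0, 10.0, size=n_sim)             # same prior on phi whatever the model
    sd = np.where(m == 1, 1.0, np.sqrt(2.0))
    eta = n * phi + np.sqrt(n) * sd * rng.normal(size=n_sim)   # eta(y) | m, phi ~ N(n*phi, n*sd^2)
    d = np.abs(eta - eta_obs)
    keep = d <= np.quantile(d, accept_rate)             # keep the closest simulations
    return {k: float(np.mean(m[keep] == k)) for k in (1, 2)}

# e.g. toy_abc_model_choice(rng.normal(0.0, np.sqrt(2.0), size=50)) returns
# proportions near 1/2 for both models, whatever the true variance.
```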
Other examples in Robert et al. (2011)
In population genetics. It might be difficult to find summary statistics that help discriminate between models (= possible historical scenarios on the sampled populations)
ABC model choice
If ε is tuned so that the number of kept particles is k, then pm is a k-nearest neighbor estimate of
E(1{M = m} | η(yobs))
Approximating the posterior probabilities of model m is a regression problem where
- the response is 1{M = m},
- the co-variables are the summary statistics η(y),
- the loss is L2 (conditional expectation)
The preferred method to approximate the posterior probabilities in DIYABC is a local multinomial regression.
Tricky if dim(η(y)) is large, or if the summary statistics are highly correlated.
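As an illustration only, a sketch of a local multinomial (logistic) regression estimate of π(m | η(yobs)) in the spirit of that post-processing; the Epanechnikov weighting and the scikit-learn call are assumptions, not the DIYABC implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_multinomial_probs(eta, m, eta_obs, k=1000):
    """Estimate pi(m | eta_obs) by a weighted multinomial logistic regression
    fitted on the k simulations whose summaries are closest to eta_obs."""
    d = np.linalg.norm(eta - eta_obs, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 - (d[idx] / d[idx].max()) ** 2                      # Epanechnikov weights
    clf = LogisticRegression(max_iter=1000)
    clf.fit(eta[idx] - eta_obs, m[idx], sample_weight=w)        # centred at eta_obs
    probs = clf.predict_proba(np.zeros((1, eta.shape[1])))[0]   # evaluated at eta_obs
    return dict(zip(clf.classes_, probs))
```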
Choosing between hidden random fields
Choosing between dependency graphs: 4 or 8 neighbours?
Models. α, β ∼ prior
z | β ∼ Potts on G4 or G8 with interaction β
y | z, α ∼ ∏i P(yi | zi, α)
How to sum up the noisy y?
Without noise (directly observed field), there are sufficient statistics for the model choice issue.
With Julien Stoehr and Lionel Cucala
a method to design new summary statistics
Based on a clustering of the observed data on possible dependency graphs:
- number of connected components,
- size of the largest connected component,
- ...
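A possible sketch of such summaries, where a simple threshold stands in for the clustering step of the actual method and scipy's labelling provides the 4- and 8-neighbour connected components.

```python
import numpy as np
from scipy import ndimage

def graph_summaries(y, threshold):
    """Summary statistics for a noisy field y (2-D array): binarize by a
    threshold, then count connected components and the size of the largest
    one under both 4- and 8-neighbour adjacency."""
    mask = y > threshold
    s4 = ndimage.generate_binary_structure(2, 1)   # 4-neighbour connectivity
    s8 = ndimage.generate_binary_structure(2, 2)   # 8-neighbour connectivity
    stats = []
    for s in (s4, s8):
        labels, n_comp = ndimage.label(mask, structure=s)
        sizes = np.bincount(labels.ravel())[1:]    # drop the background label 0
        stats += [n_comp, int(sizes.max()) if n_comp else 0]
    return np.array(stats)
```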
Machine learning to analyse machine simulated data
ABC model choice
A) Generate a large set of (m, φ, y) from π(m) πm(φ) fm(y|φ)
B) Infer (anything?) about m | η(y) with machine learning methods
In this machine learning perspective:
- the (iid) simulations of A) form the training set,
- yobs becomes a new data point
With J.-M. Marin, J.-M. Cornuet, A. Estoup, M. Gautier and C. P. Robert
- Predicting m is a classification problem
- Computing π(m|η(y)) is a regression problem
It is well known that classification is much simpler than regression (because of the dimension of the object we infer).
Why compute π(m|η(y)) if we know that π(m|y) ≠ π(m|η(y))?
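A minimal sketch of this machine-learning view with a random forest classifier; the array names and the forest size are illustrative, not the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_model_choice(eta_train, m_train, eta_obs, n_trees=500):
    """Train a random forest on the reference table of simulations
    (eta_train: summary statistics, m_train: model indices) and classify
    the observed summaries eta_obs."""
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(eta_train, m_train)                         # training set = simulations of A)
    selected = rf.predict(eta_obs.reshape(1, -1))[0]   # which model for yobs?
    return selected, rf.feature_importances_           # and which summaries mattered
```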
An example with random forest on human SNP data
Out of Africa
6 scenarios, 6 models
Observed data. 4 populations, 30 individuals per population; 10,000 genotyped SNPs from the 1000 Genomes Project
Random forest trained on 40,000 simulations (112 summary statistics) predicts the model which supports
- a single out-of-Africa colonization event,
- a secondary split between European and Asian lineages, and
- a recent admixture for Americans with African origin
Confidence in the selected model?
Example (continued)
Benefits of random forests?
1 Can find the relevant statistics in a large set of statistics (112) to discriminate between models
2 Lower prior misclassification error (≈ 6%) than other methods (ABC, i.e. k-nn: ≈ 18%)
3 Supply a similarity measure to compare η(y) and η(yobs)
Confidence in the selected model? Compute the average of the misclassification error over an ABC approximation of the predictive (∗). Here, ≤ 0.1%
(∗) π(m, φ, y | ηobs) = π(m | ηobs) πm(φ | ηobs) fm(y | φ)
Another approximation of the likelihood
What if both
- the likelihood is intractable, and
- we are unable to simulate a dataset in a reasonable amount of time to resort to ABC?
First answer: use pseudo-likelihoods such as the pairwise composite likelihood
fPCL(y|φ) = ∏i<j f(yi, yj|φ)
Maximum composite likelihood estimators φ(y) are suitable estimators
But they cannot substitute for a true likelihood in a Bayesian framework: this leads to credible intervals which are too narrow, i.e. over-confidence in φ(y), see e.g. Ribatet et al. (2012)
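For reference, the pairwise composite log-likelihood can be sketched as follows, with pair_loglik a hypothetical function returning the tractable bivariate log-density log f(yi, yj | φ).

```python
from itertools import combinations

def pairwise_composite_loglik(y, phi, pair_loglik):
    """Pairwise composite log-likelihood: sum the bivariate log-densities
    log f(y_i, y_j | phi) over all pairs of observations."""
    return sum(pair_loglik(y[i], y[j], phi)
               for i, j in combinations(range(len(y)), 2))
```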
Our proposal with Kerrie Mengersen and Christian P. Robert: use the empirical likelihood of Owen (2001, 2011)
It relies on iid blocks in the dataset y to reconstruct a likelihood & permits likelihood ratio tests, so the confidence intervals are correct
Original aim of Owen: remove parametric assumptions
Bayesian computation via empirical likelihood
With empirical likelihood, the parameter φ is defined as
(∗) E(h(yb, φ)) = 0
where
- yb is one block of y,
- E is the expected value according to the true distribution of the block yb,
- h is a known function
E.g., if φ is the mean of an iid sample, h(yb, φ) = yb − φ
In population genetics, what is (∗) with
- dates of population splits,
- population sizes, etc.?
A block = genetic data at a given locus
h(yb, φ) is the pairwise composite score function, which we can explicitly compute in many situations:
h(yb, φ) = ∇φ log fPCL(yb|φ)
Benefits.
- much faster than ABC (no need to simulate fake data),
- same accuracy as ABC, or even much more precise: no loss of information with summary statistics
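A sketch of Owen's empirical log-likelihood for a scalar estimating function (the population-genetics case uses the multivariate composite score, which needs a multivariate Lagrange multiplier instead of the one-dimensional root below). In the Bayesian computation, prior draws of φ can then be weighted by the exponential of this quantity.

```python
import numpy as np
from scipy.optimize import brentq

def log_empirical_likelihood(h_values):
    """Empirical log-likelihood for a scalar estimating equation
    E[h(yb, phi)] = 0, given h_values = (h(y1, phi), ..., h(yn, phi)):
    maximise sum(log wb) over weights summing to 1 with sum(wb * hb) = 0,
    which reduces to a one-dimensional search for the Lagrange multiplier."""
    h = np.asarray(h_values, dtype=float)
    n = len(h)
    if h.min() >= 0.0 or h.max() <= 0.0:
        return -np.inf                      # zero outside the convex hull of the hb
    def grad(lam):                          # derivative of the dual objective
        return np.sum(h / (1.0 + lam * h))
    eps = 1e-10
    lo = (-1.0 + eps) / h.max()             # keep 1 + lam*hb > 0 for every block
    hi = (-1.0 + eps) / h.min()
    lam = brentq(grad, lo, hi)              # unique root: grad is decreasing in lam
    return -np.sum(np.log1p(lam * h)) - n * np.log(n)
```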
An experiment
Evolutionary scenario (tree diagram): three populations POP 0, POP 1 and POP 2 descending from a common ancestor (MRCA), with split times τ1 and τ2.
Dataset:
- 50 genes per population,
- 100 microsat. loci
Assumptions:
- Ne identical over all populations,
- φ = log10(θ, τ1, τ2),
- non-informative prior
Comparison of ABC and EL
histogram = EL, curve = ABC, vertical line = “true” parameter