Likelihood free computational statistics


Pierre Pudlo

Université Montpellier 2, Institut de Mathématiques et Modélisation de Montpellier (I3M)

Institut de Biologie Computationnelle, Labex NUMEV

17/04/2015


Contents

1 Approximate Bayesian computation

2 ABC model choice

3 Bayesian computation with empirical likelihood


Intractable likelihoods

Problem. How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?

Example 1. Gibbs random fields

f(y|φ) ∝ exp(−H(y, φ))

is known up to the normalizing constant

Z(φ) = Σ_y exp(−H(y, φ))

Example 2. Neutral population genetics

Aim. Infer demographic parameters on the past of some populations, based on the trace left in the genomes of individuals sampled from current populations.

Latent process (past history of the sample) ∈ space of high dimension.

If y is the genetic data of the sample, the likelihood is

f(y|φ) = ∫_Z f(y, z|φ) dz

Typically, dim(Z) ≫ dim(Y).

No hope of computing the likelihood, even with clever Monte Carlo algorithms?

With Coralie Merle, Raphaël Leblois and François Rousset


A detour via importance sampling

If y is the genetic data of the sample, the likelihood is

f(y|φ) = ∫_Z f(y, z|φ) dz

We are trying to compute this integral with importance sampling.

Actually z = (z1, …, zT) is a measure-valued Markov chain, stopped at a given optional time T, with y = zT; hence

f(y|φ) = ∫_Z 1{y = zT} f(z1, …, zT |φ) dz

Importance sampling introduces an auxiliary distribution q(dz|φ):

f(y|φ) = ∫_Z 1{y = zT} [f(z|φ) / q(z|φ)] q(z|φ) dz

where f(z|φ)/q(z|φ) is the weight of z and q(z|φ) is the sampling distribution.

The most efficient q is the conditional distribution of the Markov chain knowing that zT = y, but it is even harder to compute than f(y|φ).

Any other Markovian q is inefficient, as the variance of the weights grows exponentially with T.

Need a clever q: see the seminal paper of Stephens and Donnelly (2000).

And resampling algorithms...
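
To make the reweighting identity concrete, here is a minimal importance-sampling sketch on an unrelated toy problem (a Gaussian tail probability, not a coalescent model); the auxiliary q is deliberately centred on the rare region and every draw is reweighted by f/q:

```python
# A minimal importance-sampling sketch: estimate a small probability under f
# by sampling from an auxiliary q and reweighting each draw by f/q.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target: p = P(X > 4) under f = N(0, 1); plain Monte Carlo almost never hits it.
f = stats.norm(0, 1)
q = stats.norm(4, 1)            # auxiliary distribution centred on the rare region

z = q.rvs(size=100_000, random_state=rng)
weights = f.pdf(z) / q.pdf(z)   # weight of z, exactly the ratio in the integrand
estimate = np.mean((z > 4) * weights)

print(estimate, f.sf(4))        # importance-sampling estimate vs exact ~3.17e-5
```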


Approximate Bayesian computation

Idea. Infer the conditional distribution of φ given yobs from simulations drawn from the joint π(φ) f(y|φ)

ABC algorithm
A) Generate a large set of (φ, y) from the Bayesian model π(φ) f(y|φ)

B) Keep the particles (φ, y) such that d(η(yobs), η(y)) ≤ ε

C) Return the φ's of the kept particles
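
As an illustration only, a minimal sketch of this rejection scheme on a hypothetical Gaussian model, with η(y) taken to be the sample mean (all names and tuning values below are invented):

```python
# ABC rejection sketch: y_i ~ N(phi, 1) with prior phi ~ N(0, 10).
import numpy as np

rng = np.random.default_rng(1)
n, N, eps = 20, 200_000, 0.05

y_obs = rng.normal(1.5, 1.0, size=n)          # stands in for the observed data
eta_obs = y_obs.mean()                        # eta(y): the sample mean

# A) generate a large set of (phi, y) from pi(phi) f(y | phi)
phi = rng.normal(0.0, np.sqrt(10.0), size=N)  # prior draws
eta = rng.normal(phi, 1.0 / np.sqrt(n))       # shortcut: exact law of the mean
                                              # of n draws from N(phi, 1)

# B) keep the particles with d(eta(yobs), eta(y)) <= eps
kept = phi[np.abs(eta - eta_obs) <= eps]

# C) the kept phi's form an approximate sample from pi(phi | eta(yobs))
print(kept.size, kept.mean(), kept.std())
```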

Curse of dimensionality: y is replaced by some numerical summaries η(y)

Stage A) is computationally heavy!

We end up rejecting almost all simulations, except those that fall in the neighborhood of η(yobs)

Sequential ABC algorithms try to avoid drawing φ in areas of low π(φ|y).

An auto-calibrated ABC-SMC sampler, with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert


ABC sequential sampler

How to calibrate ε1 ≥ ε2 ≥ · · · ≥ εT and T to be efficient? The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert


ABC target

Three levels of approximation of the posterior π(φ | yobs):

1 the ABC posterior distribution

π(φ | η(yobs))

2 approximated with a kernel of bandwidth ε (or with k-nearest neighbours)

π(φ | d(η(y), η(yobs)) ≤ ε)

3 a Monte Carlo error: sample size N < ∞

See, e.g., our review with J.-M. Marin, C. Robert and R. Ryder

If η(y) are not sufficient statistics,

π(φ | yobs) ≠ π(φ | η(yobs))

Information regarding yobs might be lost!

Curse of dimensionality: we cannot have both ε small and N large when η(y) is of large dimension.

Post-processing of Beaumont et al. (2002) with local linear regression.

But the lack of sufficiency might still be problematic. See Robert et al. (2011) for model choice.


ABC model choice

ABC model choice
A) Generate a large set of (m, φ, y) from the Bayesian model π(m) πm(φ) fm(y|φ)

B) Keep the particles (m, φ, y) such that d(η(y), η(yobs)) ≤ ε

C) For each m, return pm(yobs) = proportion of m among the kept particles

Likewise, if η(y) is not sufficient for the model choice issue,

π(m | y) ≠ π(m | η(y))

It might be difficult to design informative η(y).

Toy example.
Model 1. yi iid∼ N(φ, 1)
Model 2. yi iid∼ N(φ, 2)

Same prior on φ (whatever the model) & uniform prior on the model index

η(y) = y1 + · · · + yn is sufficient to estimate φ in both models

But η(y) carries no information regarding the variance (hence none regarding the model choice issue)
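
A quick numerical check of this toy example, as a sketch: the law of η(y) = y1 + · · · + yn under each model is simulated directly, and the ABC posterior probabilities come out near 1/2 whatever the data (all tuning values below are illustrative):

```python
# ABC model choice on the toy example: eta(y) = sum(y_i) cannot separate
# N(phi, 1) from N(phi, 2), because the prior variability of phi swamps the
# small difference in the variance of eta.
import numpy as np

rng = np.random.default_rng(2)
n, N, eps = 50, 200_000, 0.5

y_obs = rng.normal(0.0, np.sqrt(2.0), size=n)   # data actually drawn from model 2
eta_obs = y_obs.sum()

m = rng.integers(1, 3, size=N)                  # uniform prior on the model index
phi = rng.normal(0.0, 2.0, size=N)              # same prior on phi in both models
sd = np.where(m == 1, 1.0, np.sqrt(2.0))
eta = rng.normal(n * phi, np.sqrt(n) * sd)      # law of eta(y) = sum of the y_i

kept = m[np.abs(eta - eta_obs) <= eps]
# Both proportions hover around 1/2: eta is uninformative for model choice
print((kept == 1).mean(), (kept == 2).mean())
```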

Other examples in Robert et al. (2011)

In population genetics. It might be difficult to find summary statistics that help discriminate between models (= possible historical scenarios for the sampled populations)


ABC model choice


If ε is tuned so that the number of kept particles is k, then pm is a k-nearest-neighbour estimate of

E( 1{M = m} | η(yobs) )

Approximating the posterior probabilities of model m is a regression problem where
- the response is 1{M = m},
- the covariates are the summary statistics η(y),
- the loss is L2 (conditional expectation)
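
As a sketch of this k-nearest-neighbour reading (reusing the toy Gaussian models above), scikit-learn's k-nn classifier returns exactly the class proportions among the k nearest simulations, i.e. the ABC estimates of the posterior probabilities:

```python
# k-nn view of ABC model choice: regress 1{M = m} on eta(y).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
n, N, k = 50, 100_000, 500

# Training set: (model index, summary statistic) pairs from the toy simulation
m = rng.integers(1, 3, size=N)
phi = rng.normal(0.0, 2.0, size=N)
sd = np.where(m == 1, 1.0, np.sqrt(2.0))
eta = rng.normal(n * phi, np.sqrt(n) * sd).reshape(-1, 1)

# predict_proba returns the class frequencies among the k nearest neighbours,
# i.e. the ABC estimates pm(yobs) with eps tuned to keep exactly k particles
knn = KNeighborsClassifier(n_neighbors=k).fit(eta, m)
print(knn.predict_proba([[0.0]]))      # [p1, p2] for an observed eta(yobs) = 0
```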

The preferred method to approximate the posterior probabilities in DIYABC is a local multinomial regression.

Ticklish if dim(η(y)) is large, or if the summary statistics are highly correlated.


Choosing between hidden random fields

Choosing between dependency graphs: 4 or 8 neighbours?

Models. α, β ∼ prior
z | β ∼ Potts model on G4 or G8 with interaction β
y | z, α ∼ ∏i P(yi | zi, α)

How to sum up the noisy y?

Without noise (directly observed field), there exist sufficient statistics for the model choice issue.

With Julien Stoehr and Lionel Cucala: a method to design new summary statistics,

based on a clustering of the observed data on the possible dependency graphs:
- number of connected components,
- size of the largest connected component,
- ...


Machine learning to analyse machine simulated data

ABC model choice
A) Generate a large set of (m, φ, y) from π(m) πm(φ) fm(y|φ)

B) Infer (anything?) about m | η(y) with machine learning methods

In this machine learning perspective:
- the (iid) simulations of A) form the training set
- yobs becomes a new data point

With J.-M. Marin, J.-M. Cornuet, A. Estoup, M. Gautier and C. P. Robert

- Predicting m is a classification problem
- Computing π(m|η(y)) is a regression problem

It is well known that classification is much simpler than regression (because of the dimension of the object we infer).

Why compute π(m|η(y)) if we know that

π(m|y) ≠ π(m|η(y))?
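
A sketch of this classification route, using a random forest as on the next slide; the reference table below is a random placeholder standing in for real simulations from π(m) πm(φ) fm(y|φ), and the out-of-bag error plays the role of the prior misclassification error:

```python
# Classification view of ABC model choice with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
N, d = 40_000, 112                     # training-set size, number of summaries

# Placeholder reference table; in practice each row is a simulation from
# pi(m) pi_m(phi) f_m(y | phi), summarised by eta(y)
models = rng.integers(0, 6, size=N)    # 6 competing scenarios
summaries = rng.normal(size=(N, d)) + 0.1 * models[:, None]

forest = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1)
forest.fit(summaries, models)

eta_obs = rng.normal(size=(1, d))      # stands in for the observed eta(yobs)
print(forest.predict(eta_obs))         # the selected scenario
print(1 - forest.oob_score_)           # out-of-bag (prior) misclassification error
```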


An example with random forest on human SNP data

Out of Africa

6 scenarios, 6 models

Observed data. 4 populations, 30 individuals per population; 10,000 genotyped SNPs from the 1000 Genomes Project

A random forest trained on 40,000 simulations (112 summary statistics) predicts the model which supports
- a single out-of-Africa colonization event,
- a secondary split between European and Asian lineages, and
- a recent admixture for Americans with African origin

Confidence in the selected model?


Example (continued)


Benefits of random forests?
1 Can find the relevant statistics, within a large set of statistics (112), to discriminate between the models
2 Lower prior misclassification error (≈ 6%) than other methods (ABC, i.e. k-nn: ≈ 18%)
3 Supply a similarity measure to compare η(y) and η(yobs)

Confidence in the selected model? Compute the average of the misclassification error over an ABC approximation of the predictive (∗). Here, ≤ 0.1%

(∗) π(m, φ, y | ηobs) = π(m | ηobs) πm(φ | ηobs) fm(y | φ)


Another approximation of the likelihood

What if both
- the likelihood is intractable, and
- we are unable to simulate a dataset in a reasonable amount of time to resort to ABC?

First answer: use pseudo-likelihoods, such as the pairwise composite likelihood

fPCL(y|φ) = ∏_{i<j} f(yi, yj |φ)

Maximum composite likelihood estimators φ(y) are suitable estimators,

but they cannot substitute for a true likelihood in a Bayesian framework:

this leads to credible intervals which are too narrow, i.e. over-confidence in φ(y); see e.g. Ribatet et al. (2012)

Our proposal with Kerrie Mengersen and Christian P. Robert: use the empirical likelihood of Owen (2001, 2011)

It relies on iid blocks in the dataset y to reconstruct a likelihood & permits likelihood ratio tests; the resulting confidence intervals are correct

Original aim of Owen: remove parametric assumptions


Bayesian computation via empirical likelihood


With empirical likelihood, the parameter φ is defined through an estimating equation

(∗) E( h(yb, φ) ) = 0

where
- yb is one block of y,
- E is the expected value under the true distribution of the block yb,
- h is a known function

E.g., if φ is the mean of an iid sample, h(yb, φ) = yb − φ
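
For this mean example, the empirical likelihood can be profiled in a few lines; the sketch below uses the standard dual formulation of Owen (2001) and is not the population-genetics implementation:

```python
# Empirical likelihood of a mean: maximise prod(n w_i) over weights w_i >= 0
# with sum(w_i) = 1 and sum(w_i (y_i - phi)) = 0, via the dual in lambda.
import numpy as np
from scipy.optimize import minimize_scalar

def log_el(y, phi, eps=1e-8):
    """Log empirical likelihood ratio of the mean phi, with the dual weights
    w_i = 1 / (n (1 + lambda (y_i - phi)))."""
    z = y - phi                            # h(y_b, phi) = y_b - phi, one obs per block
    if z.min() >= 0 or z.max() <= 0:
        return -np.inf                     # phi outside the convex hull of the data
    # Feasible lambdas keep every 1 + lambda * z_i strictly positive
    res = minimize_scalar(lambda lam: -np.sum(np.log1p(lam * z)),
                          bounds=(-1 / z.max() + eps, -1 / z.min() - eps),
                          method="bounded")
    return res.fun                         # = sum_i log(n w_i) <= 0

y = np.random.default_rng(5).normal(2.0, 1.0, size=100)
for phi in (1.5, 2.0, 2.5):
    print(phi, log_el(y, phi))             # peaks (at 0) near the sample mean
```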

In population genetics, what is (∗) with
- dates of population splits,
- population sizes, etc.?


A block = the genetic data at a given locus

h(yb, φ) is the pairwise composite score function, which we can compute explicitly in many situations:

h(yb, φ) = ∇φ log fPCL(yb |φ)

Benefits.
- much faster than ABC (no need to simulate fake data)
- same accuracy as ABC, or even much more precise: no loss of information through summary statistics


An experiment

Evolutionary scenario: three populations POP 0, POP 1 and POP 2 diverging from a most recent common ancestor (MRCA), with split times τ1 and τ2.

Dataset:
- 50 genes per population,
- 100 microsat. loci

Assumptions:
- Ne identical over all populations
- φ = log10(θ, τ1, τ2)
- non-informative prior

Comparison of ABC and EL

[Figure: posterior approximations; histogram = EL, curve = ABC, vertical line = “true” parameter]
