TRANSCRIPT
Likelihood-free computational statistics
Pierre Pudlo
Université Montpellier 2, Institut de Mathématiques et de Modélisation de Montpellier (I3M)
Institut de Biologie Computationnelle, Labex NUMEV
17/04/2015
Contents
1 Approximate Bayesian computation
2 ABC model choice
3 Bayesian computation with empirical likelihood
Intractable likelihoods
Problem. How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?
Example 1. Gibbs random fields
f(y|φ) ∝ exp(−H(y, φ)) is known up to the normalizing constant
Z(φ) = ∑y exp(−H(y, φ))
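To make the intractability concrete, here is a small Python sketch (the Potts-style energy H and the 3×3 grid are assumptions chosen for illustration) that computes Z(φ) by brute-force enumeration; the cost grows exponentially with the number of sites, which is exactly why Z(φ) is out of reach for realistic fields.

```python
import numpy as np
from itertools import product

def brute_force_Z(phi, n_rows=3, n_cols=3, n_colors=2):
    """Brute-force Z(phi) = sum_y exp(-H(y, phi)) for a tiny Potts-type field
    with H(y, phi) = -phi * (# equal-valued neighbouring pairs) on a
    4-neighbour lattice; feasible only because the grid is tiny."""
    def minus_H(y_flat):
        y = np.array(y_flat).reshape(n_rows, n_cols)
        pairs = np.sum(y[:, :-1] == y[:, 1:]) + np.sum(y[:-1, :] == y[1:, :])
        return phi * pairs
    # sum over all n_colors**(n_rows*n_cols) configurations
    return sum(np.exp(minus_H(y)) for y in product(range(n_colors), repeat=n_rows * n_cols))
```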
Example 2. Neutral population genetics
Aim. Infer demographic parameters on the past of some populations based on the trace left in the genomes of individuals sampled from current populations.
Latent process (past history of the sample) ∈ space of high dimension.
If y is the genetic data of the sample, the likelihood is
f(y|φ) = ∫Z f(y, z|φ) dz
Typically, dim(Z) ≫ dim(Y).
No hope to compute the likelihood with clever Monte Carlo algorithms?
Coralie Merle, Raphaël Leblois and François Rousset
A detour via importance sampling
If y is the genetic data of the sample, the likelihood is
f(y|φ) = ∫Z f(y, z|φ) dz
We are trying to compute this integral with importance sampling.
Actually z = (z1, . . . , zT) is a measure-valued Markov chain, stopped at a given optional time T, and y = zT, hence
f(y|φ) = ∫Z 1{y = zT} f(z1, . . . , zT|φ) dz
Importance sampling introduces an auxiliary distribution q(dz|φ):
f(y|φ) = ∫Z 1{y = zT} [f(z|φ) / q(z|φ)] q(z|φ) dz
where f(z|φ)/q(z|φ) is the weight of z and q(z|φ) the sampling distribution.
The most efficient q is the conditional distribution of the Markov chain knowing that zT = y, but even harder to compute than f(y|φ).
Any other Markovian distribution q is inefficient, as the variance of the weight grows exponentially with T.
Need a clever q: see the seminal paper of Stephens and Donnelly (2000)
And resampling algorithms...
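A minimal sketch of the plain importance-sampling estimator above; sample_q, path_logf and path_logq are hypothetical helpers standing in for the auxiliary sampler and the two path densities.

```python
import numpy as np

def is_likelihood(y_obs, phi, sample_q, path_logf, path_logq, n_draws=10_000):
    """Importance-sampling estimate of f(y_obs | phi): sample_q(phi) draws a
    full latent path z = (z_1, ..., z_T) from q(. | phi); path_logf and
    path_logq return log f(z | phi) and log q(z | phi)."""
    weights = np.zeros(n_draws)
    for i in range(n_draws):
        z = sample_q(phi)
        if np.array_equal(z[-1], y_obs):                 # indicator 1{y = z_T}
            weights[i] = np.exp(path_logf(z, phi) - path_logq(z, phi))
    return weights.mean()                                # unbiased, possibly high-variance
```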
Approximate Bayesian computation
Idea. Infer the conditional distribution of φ given yobs based on simulations from the joint π(φ) f(y|φ)
ABC algorithm
A) Generate a large set of (φ, y) from the Bayesian model π(φ) f(y|φ)
B) Keep the particles (φ, y) such that d(η(yobs), η(y)) ≤ ε
C) Return the φ's of the kept particles
Curse of dimensionality: y is replaced by some numerical summaries η(y)
Stage A) is computationally heavy!
We end up rejecting almost all simulations, except those falling in the neighborhood of η(yobs)
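A minimal sketch of the rejection sampler in steps A)–C); prior_sample, simulate and summary are hypothetical user-supplied functions and the Euclidean norm stands in for the distance d.

```python
import numpy as np

def abc_rejection(y_obs, prior_sample, simulate, summary, eps, n_sim=100_000):
    """Rejection ABC: keep the parameters whose simulated summaries fall
    within distance eps of the observed summaries."""
    eta_obs = summary(y_obs)
    kept = []
    for _ in range(n_sim):
        phi = prior_sample()                              # A) draw phi from the prior pi(phi)
        y = simulate(phi)                                 #    and a dataset from f(y | phi)
        if np.linalg.norm(summary(y) - eta_obs) <= eps:   # B) compare the summaries
            kept.append(phi)
    return np.array(kept)                                 # C) kept phi's approximate pi(phi | eta(yobs))
```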
Sequential ABC algorithms try to avoid drawing φ in areas of low π(φ|y).
An auto-calibrated ABC-SMC sampler with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert
ABC sequential sampler
How to calibrate ε1 ≥ ε2 ≥ · · · ≥ εT and T to be efficient? The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki, Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert
ABC target
Three levels of approximation of the posterior π(φ | yobs):
1 the ABC posterior distribution π(φ | η(yobs))
2 approximated with a kernel of bandwidth ε (or with k-nearest neighbours): π(φ | d(η(y), η(yobs)) ≤ ε)
3 a Monte Carlo error: sample size N < ∞
See, e.g., our review with J.-M. Marin, C. Robert and R. Ryder
If η(y) are not sufficient statistics,
π(φ | yobs) ≠ π(φ | η(yobs))
Information regarding yobs might be lost!
Curse of dimensionality: cannot have both ε small and N large when η(y) is of large dimension
Post-processing of Beaumont et al. (2002) with local linear regression.
But the lack of sufficiency might still be problematic. See Robert et al. (2011) for model choice.
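For illustration, a sketch of a local linear regression adjustment in the spirit of Beaumont et al. (2002), assuming a scalar parameter, an (n, p) array eta of simulated summaries and Epanechnikov weights; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def beaumont_adjustment(phi, eta, eta_obs, eps):
    """Among accepted simulations, regress phi on the summaries with
    Epanechnikov weights and shift the draws towards eta_obs."""
    d = np.linalg.norm(eta - eta_obs, axis=1)
    keep = d <= eps
    w = 1.0 - (d[keep] / eps) ** 2                          # Epanechnikov kernel weights
    X = np.column_stack([np.ones(keep.sum()), eta[keep] - eta_obs])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ phi[keep])
    phi_adj = phi[keep] - (eta[keep] - eta_obs) @ beta[1:]  # adjusted posterior draws
    return phi_adj, w
```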
ABC model choice
ABC model choice
A) Generate a large set of (m, φ, y) from the Bayesian model π(m) πm(φ) fm(y|φ)
B) Keep the particles (m, φ, y) such that d(η(y), η(yobs)) ≤ ε
C) For each m, return pm(yobs) = proportion of m among the kept particles
Likewise, if η(y) is not sufficient for the model choice issue,
π(m | y) ≠ π(m | η(y))
It might be difficult to design informative η(y).
Toy example.
Model 1. yi iid∼ N(φ, 1)
Model 2. yi iid∼ N(φ, 2)
Same prior on φ (whatever the model) & uniform prior on the model index
η(y) = y1 + · · · + yn is sufficient to estimate φ in both models
But η(y) carries no information regarding the variance (hence the model choice issue)
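A quick numerical illustration of this toy example (the N(0, 10²) prior on φ and the reading of "N(φ, 2)" as variance 2 are assumptions of the sketch): whatever the observed data, the accepted proportions stay close to the prior 1/2, since η(y) tells us nothing about the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_abc_model_choice(y_obs, n_sim=200_000, accept_rate=0.01):
    """ABC model choice between N(phi, 1) and N(phi, var=2) using the
    insufficient summary eta(y) = sum(y); assumed prior phi ~ N(0, 10^2)."""
    n = len(y_obs)
    eta_obs = np.sum(y_obs)
    m = rng.integers(1, 3, size=n_sim)                  # uniform prior on the model index
    phi = rng.normal(0.0, 10.0, size=n_sim)             # same prior on phi whatever the model
    sd = np.where(m == 1, 1.0, np.sqrt(2.0))
    eta = n * phi + np.sqrt(n) * sd * rng.normal(size=n_sim)   # eta(y) | m, phi ~ N(n*phi, n*sd^2)
    d = np.abs(eta - eta_obs)
    keep = d <= np.quantile(d, accept_rate)             # keep the closest simulations
    return {k: float(np.mean(m[keep] == k)) for k in (1, 2)}

# e.g. toy_abc_model_choice(rng.normal(0.0, np.sqrt(2.0), size=50)) returns
# proportions near 1/2 for both models, whatever the true variance.
```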
Other examples in Robert et al. (2011)
In population genetics. It might be difficult to find summary statistics that help discriminate between models (= possible historical scenarios on the sampled populations)
ABC model choice
If ε is tuned so that the number of kept particles is k, then pm is a k-nearest neighbor estimate of
E(1{M = m} | η(yobs))
Approximating the posterior probabilities of model m is a regression problem where
- the response is 1{M = m},
- the co-variables are the summary statistics η(y),
- the loss is L2 (conditional expectation)
The preferred method to approximate the posterior probabilities in DIYABC is a local multinomial regression.
Tricky if dim(η(y)) is large, or if the summary statistics are highly correlated.
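As an illustration only, a sketch of a local multinomial (logistic) regression estimate of π(m | η(yobs)) in the spirit of that post-processing; the Epanechnikov weighting and the scikit-learn call are assumptions, not the DIYABC implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_multinomial_probs(eta, m, eta_obs, k=1000):
    """Estimate pi(m | eta_obs) by a weighted multinomial logistic regression
    fitted on the k simulations whose summaries are closest to eta_obs."""
    d = np.linalg.norm(eta - eta_obs, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 - (d[idx] / d[idx].max()) ** 2                      # Epanechnikov weights
    clf = LogisticRegression(max_iter=1000)
    clf.fit(eta[idx] - eta_obs, m[idx], sample_weight=w)        # centred at eta_obs
    probs = clf.predict_proba(np.zeros((1, eta.shape[1])))[0]   # evaluated at eta_obs
    return dict(zip(clf.classes_, probs))
```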
Choosing between hidden random fields
Choosing between dependency graphs: 4 or 8 neighbours?
Models. α, β ∼ prior
z | β ∼ Potts on G4 or G8 with interaction β
y | z, α ∼ ∏i P(yi | zi, α)
How to sum up the noisy y?
Without noise (directly observed field), there are sufficient statistics for the model choice issue.
With Julien Stoehr and Lionel Cucala
a method to design new summary statistics
Based on a clustering of the observed data on possible dependency graphs:
- number of connected components,
- size of the largest connected component,
- ...
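A possible sketch of such summaries, where a simple threshold stands in for the clustering step of the actual method and scipy's labelling provides the 4- and 8-neighbour connected components.

```python
import numpy as np
from scipy import ndimage

def graph_summaries(y, threshold):
    """Summary statistics for a noisy field y (2-D array): binarize by a
    threshold, then count connected components and the size of the largest
    one under both 4- and 8-neighbour adjacency."""
    mask = y > threshold
    s4 = ndimage.generate_binary_structure(2, 1)   # 4-neighbour connectivity
    s8 = ndimage.generate_binary_structure(2, 2)   # 8-neighbour connectivity
    stats = []
    for s in (s4, s8):
        labels, n_comp = ndimage.label(mask, structure=s)
        sizes = np.bincount(labels.ravel())[1:]    # drop the background label 0
        stats += [n_comp, int(sizes.max()) if n_comp else 0]
    return np.array(stats)
```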
Machine learning to analyse machine simulated data
ABC model choice
A) Generate a large set of (m, φ, y) from π(m) πm(φ) fm(y|φ)
B) Infer (anything?) about m | η(y) with machine learning methods
In this machine learning perspective:
- the (iid) simulations of A) form the training set,
- yobs becomes a new data point
With J.-M. Marin, J.-M. Cornuet, A. Estoup, M. Gautier and C. P. Robert
- Predicting m is a classification problem
- Computing π(m|η(y)) is a regression problem
It is well known that classification is much simpler than regression (because of the dimension of the object we infer).
Why compute π(m|η(y)) if we know that π(m|y) ≠ π(m|η(y))?
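A minimal sketch of this machine-learning view with a random forest classifier; the array names and the forest size are illustrative, not the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_model_choice(eta_train, m_train, eta_obs, n_trees=500):
    """Train a random forest on the reference table of simulations
    (eta_train: summary statistics, m_train: model indices) and classify
    the observed summaries eta_obs."""
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(eta_train, m_train)                         # training set = simulations of A)
    selected = rf.predict(eta_obs.reshape(1, -1))[0]   # which model for yobs?
    return selected, rf.feature_importances_           # and which summaries mattered
```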
An example with random forest on human SNP data
Out of Africa
6 scenarios, 6 models
Observed data. 4 populations, 30 individuals per population; 10,000 genotyped SNPs from the 1000 Genomes Project
Random forest trained on 40,000 simulations (112 summary statistics) predicts the model which supports
- a single out-of-Africa colonization event,
- a secondary split between European and Asian lineages, and
- a recent admixture for Americans with African origin
Confidence in the selected model?
Example (continued)
Benefits of random forests?
1 Can find the relevant statistics in a large set of statistics (112) to discriminate between models
2 Lower prior misclassification error (≈ 6%) than other methods (ABC, i.e. k-nn: ≈ 18%)
3 Supply a similarity measure to compare η(y) and η(yobs)
Confidence in the selected model? Compute the average of the misclassification error over an ABC approximation of the predictive (∗). Here, ≤ 0.1%
(∗) π(m, φ, y | ηobs) = π(m | ηobs) πm(φ | ηobs) fm(y | φ)
Another approximation of the likelihood
What if both
- the likelihood is intractable, and
- we are unable to simulate a dataset in a reasonable amount of time to resort to ABC?
First answer: use pseudo-likelihoods such as the pairwise composite likelihood
fPCL(y|φ) = ∏i<j f(yi, yj|φ)
Maximum composite likelihood estimators φ(y) are suitable estimators
But they cannot substitute for a true likelihood in a Bayesian framework: this leads to credible intervals which are too narrow, i.e. over-confidence in φ(y), see e.g. Ribatet et al. (2012)
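For reference, the pairwise composite log-likelihood can be sketched as follows, with pair_loglik a hypothetical function returning the tractable bivariate log-density log f(yi, yj | φ).

```python
from itertools import combinations

def pairwise_composite_loglik(y, phi, pair_loglik):
    """Pairwise composite log-likelihood: sum the bivariate log-densities
    log f(y_i, y_j | phi) over all pairs of observations."""
    return sum(pair_loglik(y[i], y[j], phi)
               for i, j in combinations(range(len(y)), 2))
```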
Our proposal with Kerrie Mengersen and Christian P. Robert: use the empirical likelihood of Owen (2001, 2011)
It relies on iid blocks in the dataset y to reconstruct a likelihood & permits likelihood ratio tests, so the confidence intervals are correct
Original aim of Owen: remove parametric assumptions
Bayesian computation via empirical likelihood
With empirical likelihood, the parameter φ is defined as
(∗) E(h(yb, φ)) = 0
where
- yb is one block of y,
- E is the expected value according to the true distribution of the block yb,
- h is a known function
E.g., if φ is the mean of an iid sample, h(yb, φ) = yb − φ
In population genetics, what is (∗) with
- dates of population splits,
- population sizes, etc.?
A block = genetic data at a given locus
h(yb, φ) is the pairwise composite score function, which we can explicitly compute in many situations:
h(yb, φ) = ∇φ log fPCL(yb|φ)
Benefits.
- much faster than ABC (no need to simulate fake data),
- same accuracy as ABC, or even much more precise: no loss of information with summary statistics
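A sketch of Owen's empirical log-likelihood for a scalar estimating function (the population-genetics case uses the multivariate composite score, which needs a multivariate Lagrange multiplier instead of the one-dimensional root below). In the Bayesian computation, prior draws of φ can then be weighted by the exponential of this quantity.

```python
import numpy as np
from scipy.optimize import brentq

def log_empirical_likelihood(h_values):
    """Empirical log-likelihood for a scalar estimating equation
    E[h(yb, phi)] = 0, given h_values = (h(y1, phi), ..., h(yn, phi)):
    maximise sum(log wb) over weights summing to 1 with sum(wb * hb) = 0,
    which reduces to a one-dimensional search for the Lagrange multiplier."""
    h = np.asarray(h_values, dtype=float)
    n = len(h)
    if h.min() >= 0.0 or h.max() <= 0.0:
        return -np.inf                      # zero outside the convex hull of the hb
    def grad(lam):                          # derivative of the dual objective
        return np.sum(h / (1.0 + lam * h))
    eps = 1e-10
    lo = (-1.0 + eps) / h.max()             # keep 1 + lam*hb > 0 for every block
    hi = (-1.0 + eps) / h.min()
    lam = brentq(grad, lo, hi)              # unique root: grad is decreasing in lam
    return -np.sum(np.log1p(lam * h)) - n * np.log(n)
```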
An experiment
Evolutionary scenario (tree diagram): three populations POP 0, POP 1 and POP 2 descending from a common ancestor (MRCA), with split times τ1 and τ2.
Dataset:
- 50 genes per population,
- 100 microsat. loci
Assumptions:
- Ne identical over all populations,
- φ = log10(θ, τ1, τ2),
- non-informative prior
Comparison of ABC and EL
histogram = EL, curve = ABC, vertical line = “true” parameter