
Approximate Bayesian Methods in Genetic Data Analysis

Mark A. Beaumont, University of Reading

Acknowledgements

Wenyang Zhang, University of Kent

David Balding, Imperial College, London

Dave Tallmon, Juneau, Alaska

Arnaud Estoup, Montpellier

BBSRC, NERC

General Problem

In population genetics the data we observe have many possible unobservable ‘causes’, which generally follow a hierarchical structure.

For example, genetic data depend on some unknown genealogical history, which in turn depends on the mutation model, demographic history, and the effects of selection. These, in turn, depend on the ecology of the organism.

Therefore we have many competing explanations for the data and we wish to choose among them.

How to do this?

Be pragmatic – take a Bayesian approach

Bayesian analysis offers a flexible framework for modelling uncertainty.

MCMC has made this possible for population genetic problems.

Problems with MCMC-based methods of genealogical inference

• Slow – problems of convergence.

• Difficult to code up.

• Difficult to modify flexibly to different scenarios.

• Difficulty addressing the questions that biologists want answered. (Hence the rise of cladistic, network-based methods like NCA.)

MCMC is useful, but…

Method for Sampling from Posterior Distribution

Consider parameters θ and data D. Simulate samples (θi, Di) from the joint density P(θ, D):

First simulate from the prior: θi ~ P(θ)

Then simulate from the likelihood: Di ~ P(D | θi)

The posterior distribution

$$P(\theta \mid D) = \frac{P(\theta, D)}{P(D)}$$

for any given D can be estimated by the proportion of all simulated points that correspond to that particular D and θ, divided by the proportion of points corresponding to D (ignoring θ).
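As a concrete illustration (not from the talk), here is a minimal, self-contained Python sketch of this rejection idea using a toy model of my own choosing, a Uniform(0, 1) prior with Binomial(20, p) data, so that the simulated data can hit the observed data exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
observed_D = 13                              # observed count of successes out of 20 trials

n_sims = 200_000
theta = rng.uniform(0.0, 1.0, size=n_sims)   # theta_i ~ P(theta), the prior
D = rng.binomial(20, theta)                  # D_i ~ P(D | theta_i), the likelihood

posterior_draws = theta[D == observed_D]     # keep only the points that hit the observed data
print(posterior_draws.mean())                # close to the exact Beta(14, 8) mean, 14/22
```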

[Figure: schematic of Bayes' theorem relating the data D and the parameter θ: prior p(θ), likelihood p(D | θ), marginal likelihood p(D), and posterior distribution p(θ | D).]

Replace the data with summary statistics

Key Points:

• For most problems, we can't hit the data exactly.

• But similar data may have similar posterior distributions.

• If we replace the data with summary statistics, then it is easier to decide how 'similar' data sets are to each other.
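As a hedged sketch of what this looks like in practice, the Python function below accepts parameter draws whose simulated summary statistics fall within a tolerance of the observed ones; draw_prior, simulate_data and summarise are placeholders for a model of interest, and the Euclidean distance and tolerance value are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def abc_rejection(s_obs, draw_prior, simulate_data, summarise,
                  n_sims=100_000, tolerance=0.1, rng=None):
    """Rejection ABC on summary statistics: keep theta_i with ||S_i - s_obs|| < tolerance."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array([draw_prior(rng) for _ in range(n_sims)])
    S = np.array([summarise(simulate_data(t, rng)) for t in theta])
    # Put the statistics on a common scale before measuring 'similarity'.
    scale = S.std(axis=0)
    dist = np.linalg.norm((S - np.asarray(s_obs)) / scale, axis=1)
    return theta[dist < tolerance]            # approximate draws from P(theta | s_obs)
```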

History

Tavaré et al. (1997, Genetics) – specify P(S | θ), use rejection to estimate P(θ | S).

Fu and Li (1997, MBE) – use S and rejection to estimate the posterior distribution of coalescence times (i.e. P(G | S)).

Weiss and von Haeseler (1998, Genetics) – use rejection to estimate the likelihood P(S | θ).

Pritchard et al. (1999, MBE) – use rejection to estimate P(θ, G | S).

Wall (2000, MBE) – uses rejection to estimate P(S | θ).

Beaumont et al. (2002, Genetics) – use regression/rejection to estimate P(θ | S).

Marjoram et al. (2003, PNAS) – use MCMC and rejection to estimate P(θ | S).

θ – demographic/mutational parameters; S – summary statistics; G – genealogy.

Beaumont, Zhang, and Balding (2002) Approximate Bayesian Computation in Population Genetics. Genetics 162: 2025-2035.

• This is a problem of density estimation. We want to use information about the relationship between the summary statistics and the parameters in the vicinity of the observed summary statistics.

• Keep the idea of accepting points close to those observed in the data.

• Use multiple regression to 'correct' for the relationship between summary statistics and parameter values.

• Downweight points further away from the observed values.

• The idea is that we should be able to accept many more points.

Assume we have observed a d-dimensional vector of summary statistics s, and we have n random draws of a (scalar) parameter θ1, …, θn and corresponding summary statistics S1, …, Sn. We scale s and S1, …, Sn so that S1, …, Sn have unit variance.

Local Linear Regression

Fit the local-linear model $\theta_i = \alpha + (S_i - s)^T \beta + \epsilon_i$, so that $E(\theta \mid S = s) = \alpha$.

We want to minimize

$$\sum_{i=1}^{n} \left\{ \theta_i - \alpha - (S_i - s)^T \beta \right\}^2 K_\delta(\|S_i - s\|),$$

where $\|x\| = \left( \sum_{i=1}^{d} x_i^2 \right)^{1/2}$ and $K_\delta$ is the Epanechnikov kernel

$$K_\delta(t) = \begin{cases} c\,\delta^{-1}\left(1 - (t/\delta)^2\right), & t \le \delta, \\ 0, & t > \delta. \end{cases}$$

The solution is

$$(\hat{\alpha}, \hat{\beta}^T)^T = (X_s^T W_s X_s)^{-1} X_s^T W_s \theta,$$

where

$$X_s = \begin{pmatrix} 1 & (S_1 - s)^T \\ \vdots & \vdots \\ 1 & (S_n - s)^T \end{pmatrix}, \qquad \theta = (\theta_1, \ldots, \theta_n)^T, \qquad W_s = \mathrm{diag}\{K_\delta(\|S_1 - s\|), \ldots, K_\delta(\|S_n - s\|)\}.$$

Our best estimate of the posterior mean is then

$$\hat{E}(\theta \mid S = s) = \hat{\alpha} = e_1^T (X_s^T W_s X_s)^{-1} X_s^T W_s \theta,$$

where $e_1$ is a vector of length d + 1 given by $(1, 0, \ldots, 0)^T$.
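A Python sketch of the weighted local-linear fit above; it follows the formulas as reconstructed here from Beaumont, Zhang and Balding (2002), but the function and variable names are mine, and the statistics are assumed to have been scaled to unit variance already, as in the text.

```python
import numpy as np

def epanechnikov(t, delta):
    """K_delta(t), proportional to (1 - (t/delta)^2) for t <= delta and 0 otherwise."""
    u = t / delta
    return np.where(u <= 1.0, 1.0 - u**2, 0.0)

def local_linear_abc(theta, S, s_obs, delta):
    """theta: (n,) prior draws; S: (n, d) simulated statistics; s_obs: (d,) observed statistics.
    Returns (alpha_hat, beta_hat, weights); alpha_hat estimates E(theta | S = s_obs)."""
    n, d = S.shape
    dist = np.linalg.norm(S - s_obs, axis=1)
    w = epanechnikov(dist, delta)                 # diagonal of W_s; delta must keep enough
                                                  # points at non-zero weight for a stable fit
    X = np.hstack([np.ones((n, 1)), S - s_obs])   # design matrix X_s
    XtW = X.T * w                                 # X_s^T W_s
    coef = np.linalg.solve(XtW @ X, XtW @ theta)  # (alpha_hat, beta_hat^T)^T
    return coef[0], coef[1:], w
```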

[Figure: schematic of the regression adjustment, plotting parameter values against a summary statistic, with the kernel weight (between 0 and 1) indicated.]

Obtaining posterior densities and other summaries using the regression approach.

We make the assumption that the errors are constant in the interval and adjust the parameter values as

$$\theta_i^* = \theta_i - (S_i - s)^T \hat{\beta}.$$

The posterior density for θ can then be approximated as

$$\hat{p}(\theta \mid S = s) = \frac{\sum_i K_\Delta(|\theta_i^* - \theta|)\, K_\delta(\|S_i - s\|)}{\sum_i K_\delta(\|S_i - s\|)},$$

where $K_\Delta(t)$ is another Epanechnikov kernel with bandwidth Δ.

Alternatively, one can use some other density estimation method.
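Continuing the sketch above (reusing the epanechnikov helper and the output of local_linear_abc), one possible coding of the adjustment and the weighted density estimate is given below; the evaluation grid and the bandwidth Δ are left as user choices.

```python
import numpy as np

def adjusted_posterior_density(theta, S, s_obs, beta_hat, w, grid, Delta):
    """Regression-adjust the draws and evaluate the weighted kernel density on 'grid'
    (up to the Epanechnikov kernel's normalising constant)."""
    theta_star = theta - (S - s_obs) @ beta_hat        # theta_i* = theta_i - (S_i - s)^T beta_hat
    keep = w > 0                                       # points with non-zero kernel weight
    density = np.array([
        np.sum(epanechnikov(np.abs(theta_star[keep] - t), Delta) * w[keep])
        for t in grid
    ])
    return density / np.sum(w[keep])
```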

Model Comparison

As noted in Pritchard et al. (1999), we can compare two models, M1 and M2, by evaluating the marginal distribution of the summary statistics at s, i.e.

$$\frac{P_{M_1}(S = s)}{P_{M_2}(S = s)}.$$

Could use original Pritchard method (proportion of points within tolerance window).

Alternatively, use multivariate kernel methods to estimate density.
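A minimal sketch of the tolerance-window version of this comparison: simulate summary statistics under each model, estimate the marginal density at s by the proportion of points falling within the tolerance, and take the ratio. The array names and the tolerance value are illustrative.

```python
import numpy as np

def acceptance_rate(S, s_obs, tolerance):
    """Proportion of simulated statistics within 'tolerance' of the observed ones."""
    dist = np.linalg.norm(S - np.asarray(s_obs), axis=1)
    return np.mean(dist < tolerance)

def approximate_model_ratio(S_model1, S_model2, s_obs, tolerance=0.1):
    """Ratio P_M1(S ~ s) / P_M2(S ~ s) estimated from prior simulations under each model."""
    p1 = acceptance_rate(S_model1, s_obs, tolerance)
    p2 = acceptance_rate(S_model2, s_obs, tolerance)
    return p1 / p2      # undefined if no points are accepted under model 2
```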


Example – estimation of θ in a population of constant size

• Simulate 100 data sets of 445 chromosomes, 8 linked microsatellite loci (SMM).

• θ = 10.

• Summary statistics: mean heterozygosity, mean variance in allele length, number of distinct haplotypes (computed as in the sketch after this list).

• Rectangular priors (0, 50).

• Point estimate – posterior mean.

• Also use MCMC (Batwing) to estimate the posterior mean (flat prior).

• Compare the mean square error of the different methods.
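For illustration, here is one way the three summary statistics listed above might be computed from a matrix of microsatellite allele lengths (one row per chromosome, one column per locus); this data layout, and the use of expected heterozygosity, are my assumptions rather than details given here.

```python
import numpy as np

def summary_stats(alleles):
    """alleles: (n_chromosomes, n_loci) array of integer repeat lengths."""
    n, n_loci = alleles.shape
    # Expected heterozygosity per locus: 1 - sum of squared allele frequencies.
    het = []
    for j in range(n_loci):
        _, counts = np.unique(alleles[:, j], return_counts=True)
        freqs = counts / n
        het.append(1.0 - np.sum(freqs ** 2))
    mean_het = np.mean(het)
    # Mean across loci of the variance in allele length.
    mean_var_len = np.mean(np.var(alleles, axis=0))
    # Number of distinct multilocus haplotypes (unique rows).
    n_haplotypes = len(np.unique(alleles, axis=0))
    return np.array([mean_het, mean_var_len, n_haplotypes])
```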

Accuracy in the estimation of the scaled mutation rate θ = 2Nμ

[Figure: relative mean square error plotted against tolerance, comparing MCMC, standard rejection, and the regression method. Data: linked microsatellite loci. Summary statistics: mean variance in allele length, mean heterozygosity, number of haplotypes.]

Main Conclusion

The regression method allows a much larger proportion of points to be used than the rejection method.

This means that more summary statistics can be used in the regression method without compromising accuracy.

Generalisations

• You want to investigate a system which gives rise to genetical and/or ecological data.

• Construct a (complicated) model (individual-based, stage-structured, genealogical…) that gives rise to the same type of data.

• Put priors on all the parameters.

• Decide on the parameters you want to make inferences about.

• Choose summary statistics. Measure these from your data.

• Perform simulations.

• Construct posterior distributions for the parameters of interest, using e.g. the regression methods described here (see the skeleton sketched after this list).
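A compact skeleton of this workflow, reusing the local_linear_abc sketch from earlier; simulate_model and summarise stand in for whatever (possibly individual-based) simulator and summary statistics a particular study would use, and all settings are illustrative.

```python
import numpy as np

def abc_pipeline(s_obs, draw_prior, simulate_model, summarise,
                 n_sims=100_000, delta=0.5, rng=None):
    """Prior simulation, scaling, and regression adjustment in one pass."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array([draw_prior(rng) for _ in range(n_sims)])
    S = np.array([summarise(simulate_model(t, rng)) for t in theta])
    scale = S.std(axis=0)
    S, s = S / scale, np.asarray(s_obs) / scale          # unit-variance statistics
    alpha_hat, beta_hat, w = local_linear_abc(theta, S, s, delta)
    theta_star = theta - (S - s) @ beta_hat              # regression-adjusted draws
    return alpha_hat, theta_star[w > 0]                  # posterior-mean estimate and adjusted sample
```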

Some Papers using Approximate Bayesian approaches

Pritchard method:

Estoup et al. (2002, Genetics) – Demographic history of invasion of islands by cane toads. 10 microsatellite loci, 22 allozyme loci. 4/3 summary statistics, 6 demographic parameters.

Estoup and Clegg (2003, Molecular Ecology) – Demographic history of colonisation of islands by silvereyes.

Regression method:

Tallmon et al. (2004, Genetics) – Estimating effective population size by the temporal method. One main parameter of interest (Ne), 4 summary statistics, tested on up to

Estoup et al. (2004, Evolution, in press) – Demographic history of invasion of Australia by cane toads. 75/63 summary statistics, model comparison, up to 5 demographic parameters.

[Figure: results from coalescent simulations, from Tallmon, Luikart, and Beaumont (Genetics, 2004).]

Future Work

How to choose suitable summary statistics?

Need for ‘Data Mining’ techniques. Projection pursuit. Orthogonalisation. Stepwise regression.

Because the method is quick, one can use e.g. MSE, integrated squared error, coverage, etc., as an ultimate criterion.

Improve conditional density estimation.

Improve choice of bandwidth in kernel.

Use of transformations (e.g. log-linear modelling). Quantile regression.