Niklas Wahlberg University of Turku. Jarno Tuimala Free researcher / Finnish Tax Administration

Introduction to model based methods Niklas Wahlberg University of Turku

Post on 12-Jan-2016

Page 1

Introduction to model based methods

Niklas Wahlberg, University of Turku

Page 2

Introduction to model based methods

Jarno Tuimala, Free researcher / Finnish Tax Administration

Page 3

Schedule

14.4. Tue  Introduction to models (Jarno)
16.4. Thu  Distance-based methods (Jarno)
17.4. Fri  ML analyses (Jarno)
20.4. Mon  Assessing hypotheses (Jarno)
21.4. Tue  Problems with molecular data (Jarno)
23.4. Thu  Problems with molecular data (Jarno); Phylogenomics
24.4. Fri  Search algorithms, visualization, and other computational aspects (Jarno)

Page 4

DNA evolves through mutation

With >100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve

Mitochondrial and nuclear genes differ in mutation dynamics

Different genes have their own mutation dynamics

Page 5

Hidden evolution in DNA sequences

[Figure: number of changes at a single site. Seq 1: C → A (1 change). Seq 2: C → G → T → A (3 changes).]

Ancest  GGCGCG
Seq 1   AGCGAG
Seq 2   GCGGAC
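The three sequences on this slide can be compared directly to see how pairwise differences hide changes. The following is a minimal sketch; the helper name `hamming` is introduced here for illustration only. It counts the sites at which each descendant differs from the ancestor and the sites at which the two descendants differ from each other.

```python
def hamming(a: str, b: str) -> int:
    """Count the sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


ancestor = "GGCGCG"
seq1 = "AGCGAG"
seq2 = "GCGGAC"

changes_seq1 = hamming(ancestor, seq1)  # differences on the lineage to Seq 1
changes_seq2 = hamming(ancestor, seq2)  # differences on the lineage to Seq 2
observed = hamming(seq1, seq2)          # what we see when only Seq 1 and Seq 2 are known

print("ancestor vs Seq 1:", changes_seq1)
print("ancestor vs Seq 2:", changes_seq2)
print("Seq 1 vs Seq 2:   ", observed)
```

The two lineages together carry at least as many changes as the sum of the first two counts, yet the observed difference between Seq 1 and Seq 2 is smaller, because parallel changes at the same site cancel out. Multiple changes at one site along a single lineage would be hidden even from the comparison with the ancestor.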

Page 6

Evolutionary model

[Figure: distance plotted against time]

Correction for the difference between the true and the observed distance.
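As one concrete instance of such a correction, the sketch below uses the Jukes-Cantor distance formula, d = -3/4 ln(1 - 4p/3), where p is the observed proportion of differing sites. Choosing Jukes-Cantor here is an assumption made for illustration (the slide does not name a model), and the function name `jc69_distance` is hypothetical.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)
    from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction grows as the observed distance approaches saturation (0.75).
for p in (0.05, 0.20, 0.40, 0.60):
    print(f"observed {p:.2f} -> corrected {jc69_distance(p):.3f}")
```

For small observed distances the corrected value is close to the observed one; the gap widens as more sites have been hit more than once.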

Page 7

Modeling evolution

Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
◦ For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)

Other model parameters may include:
◦ Site by site rate variation - often modelled as a statistical distribution - for example a gamma distribution

Page 8

Parameters we are interested in

The mean instantaneous substitution rate (= the general mutation rate + rate of fixation in population)

The relative rates of substitution between each base pair

The average frequencies of each base in the dataset

Branch lengths

Topology!

Page 9

Purines (A, G)    Pyrimidines (C, T)

Page 10

A general model of sequence evolution

[Figure: the four bases with frequencies πA, πC, πG and πT, connected by twelve directed substitution arrows labelled a to l]

Page 11

A general model of sequence evolution

[Figure: the four bases with frequencies πA, πC, πG and πT, connected by substitution arrows a to l; changes between A and G and between C and T are labelled as transitions, all other changes as transversions]

Page 12

Transition / transversion rate

If all substitutions were equally likely, the expected ratio (R) of transitions (P) to transversions (Q) would be about 0.5:
◦ Re = P / Q ~ 0.5

In reality, this is not the case, and the ratio is usually higher.

Some models of sequence evolution take this ratio into account, some don't.
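The expected ratio of 0.5 follows from simply counting the possible substitutions. A small sketch of that count (the function name `is_transition` is introduced here for illustration):

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x: str, y: str) -> bool:
    """A transition stays within the purines or within the pyrimidines."""
    return {x, y} <= PURINES or {x, y} <= PYRIMIDINES


# All ordered substitutions between two distinct bases.
substitutions = list(permutations("ACGT", 2))
transitions = [s for s in substitutions if is_transition(*s)]
transversions = [s for s in substitutions if not is_transition(*s)]

expected_ratio = len(transitions) / len(transversions)
print(len(transitions), "transitions,", len(transversions), "transversions")
print("expected ratio if all are equally likely:", expected_ratio)
```

There are twice as many kinds of transversion as kinds of transition, so an observed ratio well above 0.5 indicates that transitions occur at a higher rate.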

Page 13

A general model of molecular evolution

$$
Q = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{array}
$$

μ = mean instantaneous substitution rate

πA = frequency of A

a, b, c, ... l = relative rate of substitution

The product of a base frequency and a relative rate (e.g. aπC) is the rate parameter.
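To make the structure of Q concrete, here is a minimal sketch that assembles the matrix from μ, the twelve relative rates and the base frequencies. The function name `general_q`, the dictionary layout and all numeric values are assumptions chosen for illustration only. Each off-diagonal entry is μ times a relative rate times the frequency of the target base, and each diagonal entry is minus the sum of the rest of its row.

```python
BASES = "ACGT"


def general_q(mu, rel, freqs):
    """Build the 4x4 rate matrix Q (rows = from-base, columns = to-base).

    mu    -- mean instantaneous substitution rate
    rel   -- dict mapping (from_base, to_base) to a relative rate (a..l)
    freqs -- dict mapping each base to its frequency (pi)
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = mu * rel[(x, y)] * freqs[y]
        # Diagonal: minus the sum of the other entries in the row.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Hypothetical values, not taken from any dataset.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
rel = {(x, y): 1.0 for x in BASES for y in BASES if x != y}
rel[("A", "G")] = rel[("G", "A")] = 4.0  # faster transitions
rel[("C", "T")] = rel[("T", "C")] = 4.0

Q = general_q(1.0, rel, freqs)
for base, row in zip(BASES, Q):
    print(base, [round(v, 3) for v in row])
```

Every row sums to zero, which is what makes Q a valid rate matrix for a continuous-time Markov process.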

Page 14

Time-homogeneous time-continuous stationary Markov models

Rate of change from base i to base j is independent of the base that occupied a site prior to i (Markov property)

Substitution rate does not change over time (homogeneity)

Relative frequencies of A, G, C, and T are at equilibrium (stationarity)

Page 15

The Jukes and Cantor model is the simplest model

The JC model is a one parameter model
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

$$
Q = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -3\alpha & \alpha & \alpha & \alpha \\
C & \alpha & -3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & -3\alpha & \alpha \\
T & \alpha & \alpha & \alpha & -3\alpha
\end{array}
$$

Page 16

Jukes-Cantor model

• α = the rate of substitution (α changes from A to G every t)
• The rate of substitution for each nucleotide is 3α
• In t steps there will be 3αt changes

[Figure: the four bases A, G, T and C connected by substitution arrows]
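Under the Jukes-Cantor model the probability that a site shows the same base, or one particular other base, after time t has a well-known closed form. The sketch below writes α for the rate of change to each of the three other bases (the symbol is an assumption, since it did not survive in the transcript) and uses a hypothetical value for it.

```python
import math


def jc_probabilities(alpha: float, t: float):
    """Return (P_same, P_diff) after time t under Jukes-Cantor.

    P_same -- probability that the site shows the same base as at time 0
    P_diff -- probability that it shows one particular other base
    alpha  -- rate of change to each of the three other bases
    """
    decay = math.exp(-4.0 * alpha * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


alpha = 0.1  # hypothetical rate
for t in (0.0, 1.0, 5.0, 50.0):
    p_same, p_diff = jc_probabilities(alpha, t)
    print(f"t={t:5.1f}  P(same)={p_same:.4f}  P(each other base)={p_diff:.4f}")
```

As t grows, all four probabilities approach 0.25, the equal base frequency that the model assumes, which is why very divergent sequences carry little information about the number of changes.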

Page 17

Kimura model

[Figure: the four bases A, G, T and C connected by substitution arrows]

α = transitions
β = transversions

Page 18

The Kimura model has 2 parameters

The K2P model is more realistic, but still
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

$$
Q = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & - & \beta & \alpha & \beta \\
C & \beta & - & \beta & \alpha \\
G & \alpha & \beta & - & \beta \\
T & \beta & \alpha & \beta & -
\end{array}
$$

(Each diagonal entry, shown as -, is minus the sum of the other entries in its row.)

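Page 19

The Hasegawa-Kishino-Yano model

The HKY model takes into account variable base frequencies, but still
1) unless modified it assumes all sites can change and that they do so at the same rate

$$
Q = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & - & \pi_C\beta & \pi_G\alpha & \pi_T\beta \\
C & \pi_A\beta & - & \pi_G\beta & \pi_T\alpha \\
G & \pi_A\alpha & \pi_C\beta & - & \pi_T\beta \\
T & \pi_A\beta & \pi_C\alpha & \pi_G\beta & -
\end{array}
$$

(Each diagonal entry, shown as -, is minus the sum of the other entries in its row.)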

Page 20

The GTR model

[Figure: the four bases with frequencies πA, πC, πG and πT, connected by six reversible substitution arrows labelled a to f]

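Page 21

The most general time-reversible model

$$
Q = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
C & \mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
G & \mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
T & \mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{array}
$$

μ = mean instantaneous substitution rate

πA = frequency of A

a, b, c, ... f = relative rate of substitution

The product of a base frequency and a relative rate (e.g. aπC) is the rate parameter.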
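What makes this matrix time-reversible is that the same relative rate is used in both directions between a pair of bases, so that πi·q(i→j) = πj·q(j→i) for every pair. The sketch below checks this numerically. The function name `gtr_q`, the key names ('AC' for a, 'AG' for b, 'AT' for c, 'CG' for d, 'CT' for e, 'GT' for f) and all numeric values are assumptions made for illustration.

```python
BASES = "ACGT"


def gtr_q(mu, rates, freqs):
    """GTR rate matrix: q[i][j] = mu * r(i, j) * pi_j, with r symmetric.

    rates -- dict with keys 'AC', 'AG', 'AT', 'CG', 'CT', 'GT'
    freqs -- list of base frequencies in the order A, C, G, T
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                # Same relative rate in both directions for a pair of bases.
                pair = "".join(sorted(BASES[i] + BASES[j]))
                q[i][j] = mu * rates[pair] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Hypothetical parameter values.
rates = {"AC": 1.0, "AG": 4.0, "AT": 0.8, "CG": 1.2, "CT": 5.0, "GT": 1.0}
freqs = [0.3, 0.2, 0.2, 0.3]

Q = gtr_q(1.0, rates, freqs)

# Detailed balance: the flow from i to j equals the flow from j to i.
reversible = all(
    abs(freqs[i] * Q[i][j] - freqs[j] * Q[j][i]) < 1e-12
    for i in range(4)
    for j in range(4)
)
print("time-reversible:", reversible)
```

Setting all six relative rates equal with equal base frequencies gives the Jukes-Cantor matrix, which is one way to see that simpler models are special cases of this one.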

Page 22

The most commonly used models

Almost all models used are special cases of one model:
◦ The general time reversible model

The next three slides are from: https://code.google.com/p/jmodeltest2/wiki/TheoreticalBackground

Page 23

Page 24

Page 25

Hypotheses tested are: F = base frequencies; S = substitution type; I = proportion of invariable sites; G = gamma rates.

Page 26

Substitution types         Equal base frequencies   Variable base frequencies
Single substitution type   JC                       F81
2 substitution types       K2P                      HKY85, F84
3 substitution types       K3ST                     TrN
6 substitution types       SYM                      GTR

Page 27

Models

Model parameters can be:
◦ estimated from the data (using a likelihood function)
◦ pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely - e.g. the Jukes and Cantor Model)
◦ wherever possible avoid assumptions which are violated by the data because they can lead to incorrect trees

Page 28

Page 29

Models can be made more parameter rich to increase their realism

The most common additional parameters are:
◦ A correction for the proportion of sites which are invariable (parameter I)
◦ A correction for variable site rates at those sites which can change (parameter gamma, G)

All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)

Page 30

Invariable sites

Page 31

A gamma distribution can be used to model site rate heterogeneity

α = shape parameter

Page 32

Gamma distribution computationally costly

Computational difficulties in using continuous distribution

Most programs use discrete categories

[Figure: axes labelled Rate and Frequency]
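The idea of replacing the continuous gamma distribution with a few discrete rate categories can be sketched as follows. This is a Monte Carlo approximation written for illustration, not the analytical procedure that phylogenetics programs use: it draws many rates from a gamma distribution with shape α and mean 1, sorts them, splits them into k equally probable groups and takes the mean rate of each group. The function name and the default values are assumptions.

```python
import random


def discrete_gamma_rates(alpha: float, k: int = 4, n: int = 100_000, seed: int = 1):
    """Approximate k equal-probability rate categories for a gamma
    distribution with shape alpha and mean 1, using sorted random draws."""
    rng = random.Random(seed)
    # gammavariate(shape, scale); a scale of 1/alpha gives a mean of 1.
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n))
    size = n // k
    # Mean rate within each consecutive block of sorted draws.
    return [sum(draws[c * size:(c + 1) * size]) / size for c in range(k)]


for alpha in (0.5, 2.0):
    rates = discrete_gamma_rates(alpha)
    print(f"alpha={alpha}:", [round(r, 3) for r in rates])
```

A small shape parameter gives categories that are spread widely (most sites nearly invariant, a few very fast), while a large shape parameter pulls all categories towards 1, that is, towards equal rates across sites.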

Page 33

Difficulties in estimating parameters

The parameters I and G covary!
◦ (I + G) can be estimated, but the values of I and G are not easily teased apart
◦ Parameter G takes I into account, so I is not needed

Usually though, a certain proportion of sites (estimated from the data) is assumed invariant, and the rest (the varying sites) are allowed to follow rates drawn from the discrete gamma distribution.

Page 34

Models can be made more parameter rich to increase their realism

But the more parameters you estimate from the data the more time needed for an analysis and the more sampling error accumulates
◦ One might have a realistic model but large sampling errors
◦ Realism comes at a cost in time and precision!
◦ Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
◦ In general use the simplest model which fits the data

Page 35

Choosing your model

When models are nested
◦ Likelihood ratio test (LRT)
◦ Test statistic: -2*ln(likelihood for model 1 / likelihood for model 2), where model 1 is the simpler model
  Compared to a chi-square distribution with degrees of freedom equal to the difference in the number of free parameters

When models are not nested
◦ Akaike Information Criterion (AIC)
  2k-2ln(likelihood), where k is the number of parameters estimated in the model
  The best model has the lowest AIC
◦ Bayesian Information Criterion (BIC)
  Similar to AIC

Page 40

Estimation of likelihood of substitution model parameters

Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not “too wrong”.

Thus one can obtain a tree using a quick method, such as neighbor-joining, and then estimate parameters on that tree.

These parameters can then be used to calculate the likelihood of the tree.

When the likelihood of the tree is calculated under all the to-be-compared models, the model giving the highest likelihood or the lowest AIC value can be selected.

The final tree is then estimated using this model.

Page 41

Need to know the likelihood of a model

For both tests, one needs to compute the likelihood of the trees under the models.

For now, assume we know the likelihood of the models we want to compare.

Page 42

Likelihood ratio test (LRT)

LR = 2*(lnL1-lnL0)

L1 = likelihood under the alternative hypothesis (more parameter-rich model)
L0 = likelihood under the null hypothesis (less parameter-rich model)

LRT statistic approximately follows a chi-square distribution

Degrees of freedom equal to the number of extra parameters in the more complex model

Page 43

Example

HKY85   -lnL = 1787.08
GTR     -lnL = 1784.82

Then, LR = 2 (1787.08 - 1784.82) = 4.52
degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
critical value (P = 0.05) = 9.49

GTR does not fit significantly better!
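The example can be recomputed from the two -lnL values with a few lines of code. This is a sketch under stated assumptions: the helper `chi2_sf_even_df` is written here for illustration and uses the closed form of the chi-square upper tail that holds only for an even number of degrees of freedom, which happens to cover this case (df = 4).

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable for even df,
    using exp(-x/2) * sum over k < df/2 of (x/2)^k / k!."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


lnL_hky85 = -1787.08  # null hypothesis (less parameter-rich model)
lnL_gtr = -1784.82    # alternative hypothesis (more parameter-rich model)

lr = 2.0 * (lnL_gtr - lnL_hky85)
p_value = chi2_sf_even_df(lr, df=4)

print(f"LR = {lr:.2f}, p = {p_value:.3f}")
print("GTR fits significantly better" if p_value < 0.05 else "GTR does not fit significantly better")
```

Computing the p-value gives the same decision as comparing the statistic with the tabulated critical value: the improvement in likelihood is too small to justify the four extra parameters.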

Page 44

Akaike Information Criterion

A measure of the goodness of fit of a model
◦ information lost when model M is used to approximate the process of molecular evolution
◦ AIC is an estimate of the expected relative distance between a fitted model, M, and the unknown true mechanism that generated the data

AIC(M) = - 2*Log(Likelihood(M)) + 2*K(M)
◦ K(M) is number of estimable parameters of model M

Given a dataset, models can be ranked according to their AIC

The model with the lowest AIC is selected
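As an illustration, the AIC formula can be applied to the HKY85 and GTR log-likelihoods from the likelihood ratio test example. The parameter counts below are an assumption: they include only the free substitution-model parameters (one rate ratio plus three base frequencies for HKY85, five relative rates plus three base frequencies for GTR), because branch lengths are shared by both models and cancel out of the comparison.

```python
def aic(ln_likelihood: float, k: int) -> float:
    """AIC(M) = -2 * lnL(M) + 2 * K(M)."""
    return -2.0 * ln_likelihood + 2.0 * k


# lnL values from the HKY85 / GTR example; k values are assumed counts of
# free substitution-model parameters only.
models = {
    "HKY85": {"lnL": -1787.08, "k": 4},
    "GTR": {"lnL": -1784.82, "k": 8},
}

scores = {name: aic(m["lnL"], m["k"]) for name, m in models.items()}
best = min(scores, key=scores.get)

# Rank the models from lowest (best) to highest AIC.
for name in sorted(scores, key=scores.get):
    print(f"{name:6s} AIC = {scores[name]:.2f}")
print("selected:", best)
```

The ranking agrees with the likelihood ratio test on the same values: the small gain in likelihood under GTR does not outweigh the penalty for its extra parameters.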

Page 45

Bayesian Information Criterion

BIC also takes sample size n into account

BIC(M) = - 2*Log(Likelihood(M)) + K(M)*Log(n)
◦ K(M) is number of estimable parameters of model M and n is the number of characters
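A short sketch of the BIC formula follows, using the natural logarithm. The alignment length `n_sites` and the parameter counts are hypothetical values chosen for illustration; the log-likelihoods are the HKY85 and GTR values from the likelihood ratio test example.

```python
import math


def bic(ln_likelihood: float, k: int, n: int) -> float:
    """BIC(M) = -2 * lnL(M) + K(M) * ln(n), with n the number of characters."""
    return -2.0 * ln_likelihood + k * math.log(n)


n_sites = 1000  # hypothetical alignment length

# Assumed parameter counts: free substitution-model parameters only.
bic_hky85 = bic(-1787.08, 4, n_sites)
bic_gtr = bic(-1784.82, 8, n_sites)

print(f"HKY85 BIC = {bic_hky85:.2f}")
print(f"GTR   BIC = {bic_gtr:.2f}")
print("selected:", "HKY85" if bic_hky85 < bic_gtr else "GTR")
```

Each parameter costs ln(n) under BIC compared with 2 under AIC, so for any alignment longer than seven characters BIC penalises extra parameters more heavily and tends to prefer simpler models.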

Page 46

Output of a model testing program

Page 47

Kelchner & Thomas 2007, TREE 22:87-94

Page 48

Latest of the latest!

Model jumping
◦ Allow the data to determine which model is optimal during the analysis

Only available in MrBayes 3.2

[Figure: JC, K2P, GTR]

Page 49

Partitioned models

A priori separation of characters into different partitions

Each partition analyzed with a different model

In addition to allowing heterogeneity across data subsets in overall rate and in substitution model parameters, several programs also allow the user to unlink topology and branch lengths

“Different data subsets can thus have independent branch lengths or even different topologies.” (Ronquist and Huelsenbeck, 2003:1573)

Page 50

Protein models

20 amino acids

Models are based largely on empirical aa replacement matrices

Examples: JTT, WAG, MtREV, Blosum62

Page 51

Models have parameters

Parameters include topology and branch lengths!

How to estimate values for those parameters?
◦ Distance methods
◦ Maximum likelihood methods
◦ Bayesian methods

Page 52

Optimality Criteria

Objective function (score) that quantifies how well the data fit a tree

Used to evaluate and rank alternative trees

Two logical steps for phylogenetic methods that rely on optimality criteria
◦ Definition of optimality criterion
◦ Maximization (or minimization) of criterion for alternative trees for their evaluation and ranking