Download - Phylogenetic Inference - Florida State Universitystevet/BSC5936/Wilgenbusch.2003.pdf•Replicates represent some fraction of the total data set. •Jackknife tree is also displayed


Phylogenetic Inference: More applications

• Conservation biology
  – Faith, 1992 (phylogeny and conservation priorities)
  – Baker and Palumbi, 1994 (illegal whale hunting)
• Epidemiology
  – Bush et al., 1999 (predictive evolution)
• Forensics
  – Ou et al., 1992 (dental practice HIV transmission)
• Gene function prediction
  – Chang and Donoghue, 2000; Bader et al., 2001
• Drug development
  – Halbur et al., 1994

Phylogenetic Inference: in a nutshell

[Figure: phylogram of Gorilla, Pan, Homo sapiens, Hylobates, and Pongo with nucleotide states (A, G, C, T) labeled on the tree; scale bar = 0.1 substitutions/site]

Inside phylogenetic inference

• What do we want to get out of the analyses?
  – Accurate picture of the evolutionary relationships among the sequences
  – Comparison among competing hypotheses
  – Estimates of divergence dates
  – Robustness of the data set
• Select an optimality criterion
  – Maximum parsimony (and character weighting scheme)
  – Maximum likelihood (and model of evolution)
  – Distance (Least-squares and Minimum Evolution) (and distance measure)
• Select a search strategy
  – Exact
  – Heuristic
  – Stochastic
• Test the reliability of the search results
  – Support for individual tree nodes
  – Support for complete tree topologies
  – Support for branch length estimates

Parsimony Methods
Attempt to minimize the number of changes on the tree

• Equal Weighted Parsimony
  – Change from one base to another = 1
• Transversion Parsimony
  – Transversions (purines {A, G} ↔ pyrimidines {C, T}) = 1
  – Transitions (A ↔ G or C ↔ T) = 0
• Generalized or Weighted Parsimony
  – A “step matrix” or cost matrix for transformations from one base to another:

        A   G   C   T
    A   -   1   3   3
    G   1   -   3   3
    C   3   3   -   1
    T   3   3   1   -
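
The three weighting schemes above differ only in the cost assigned to each change, so one scoring routine can serve all of them. The following is a minimal sketch, assuming a single site and a rooted tree written as nested tuples; it uses the Sankoff dynamic-programming recursion, which is one standard way to score a step matrix and is not taken from the slides. The names `step_cost`, `sankoff`, `parsimony_score`, and the example site are hypothetical.

```python
from math import inf

BASES = "AGCT"
PURINES = {"A", "G"}

def step_cost(a, b, tv=3, ti=1):
    """Cost of changing base a into base b for given transversion/transition costs."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)  # both purines or both pyrimidines
    return ti if same_class else tv

def sankoff(tree, tv=3, ti=1):
    """Return {base: minimum cost of the subtree if its root carries that base}.

    `tree` is either a leaf (a one-letter string) or a tuple (left, right).
    """
    if isinstance(tree, str):
        return {b: (0 if b == tree else inf) for b in BASES}
    left, right = (sankoff(child, tv, ti) for child in tree)
    return {
        b: min(left[x] + step_cost(b, x, tv, ti) for x in BASES)
        + min(right[y] + step_cost(b, y, tv, ti) for y in BASES)
        for b in BASES
    }

def parsimony_score(tree, tv=3, ti=1):
    """Minimum total cost over all possible root states."""
    return min(sankoff(tree, tv, ti).values())

# Hypothetical single site observed in five taxa on a fixed tree.
site = (("A", "G"), ("C", ("T", "A")))
equal = parsimony_score(site, tv=1, ti=1)         # equal weighted parsimony
transversion = parsimony_score(site, tv=1, ti=0)  # transversion parsimony
weighted = parsimony_score(site, tv=3, ti=1)      # Tv:Ti = 3:1 step matrix
print(equal, transversion, weighted)
```

The same site yields a different score under each scheme, which is why scores from different weighting schemes cannot be compared directly.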

Comparing Parsimony Scores

[Figure: three phylograms of Gorilla, Pan, Homo sapiens, Hylobates, and Pongo, one per weighting scheme, with branch lengths drawn in numbers of changes]

• Unweighted parsimony score: 356
• Transversion parsimony score: 73
• Generalized parsimony (Tv:Ti of 3:1) score: 504

No common currency

Choosing among Likelihood Methods

Choose the model that maximizes a “goodness of fit” statistic without adding unnecessary parameters.

• Likelihood Ratio Test (LRT)
• Parametric Bootstrap
• Akaike Information Criterion (AIC)

Choosing a model

[Figure: Choosing a model]

Maximum likelihood inference: Choosing a DNA substitution model

[Figure: tree of Gorilla, Pan, Homo sapiens, Hylobates, and Pongo scored under the K2P and JC models]

• K2P: lnL1 = -2625.73859
• JC: lnL0 = -2664.43013
• Difference: 38.69 ± ?, with one additional parameter

Maximum likelihood inference for nested models: use the LRT

• $\lambda = L_0 / L_1$; choose the simpler model when $\lambda$ is large
  – $L_0$ = maximized likelihood of the simpler model (fewer free parameters, lower likelihood; e.g., without rate heterogeneity).
  – $L_1$ = maximized likelihood of the more complicated model (more free parameters, higher likelihood; e.g., with rate heterogeneity).
• $-2 \ln \lambda$ is generally $\chi^2$ distributed with $k$ degrees of freedom,
  – where $k$ is the difference between the number of free parameters used to calculate $L_0$ and $L_1$ ($L_0$ has $k$ fewer parameters than $L_1$).

Maximum likelihood inference for nested models

[Figure: tree of Gorilla, Pan, Homo sapiens, Hylobates, and Pongo scored under the K2P and JC models]

• K2P: lnL1 = -2625.73859
• JC: lnL0 = -2664.43013
• $-2 \ln \lambda = 2(\ln L_1 - \ln L_0) = 77.38308$
• $\chi^2_{df=1} = 77.38308$, $P < 0.0001$
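
The arithmetic of this test can be checked in a few lines. The sketch below assumes one degree of freedom, as in the K2P versus JC comparison, and uses the identity that the upper tail of a chi-square distribution with 1 df at x equals erfc(sqrt(x/2)); the function name `lrt_df1` is hypothetical.

```python
from math import erfc, sqrt

def lrt_df1(lnL0, lnL1):
    """Likelihood ratio test for nested models differing by one free parameter.

    Returns the statistic 2(lnL1 - lnL0) and its chi-square (df = 1) tail probability.
    """
    stat = 2.0 * (lnL1 - lnL0)
    # For df = 1, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value

# Values from the slide: JC is the simpler model (lnL0), K2P the richer one (lnL1).
stat, p_value = lrt_df1(-2664.43013, -2625.73859)
print(f"statistic = {stat:.5f}, P = {p_value:.3g}")
```

For more than one degree of freedom a general chi-square tail function would be needed; this shortcut only covers the single-parameter case.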

Maximum likelihood inference: nested models

Each restriction leads from a more general model to a special case:

• GTR (general time-reversible)
  – Equal base frequencies → SYM
  – 3 substitution types (transversions, 2 transition classes) → TrN (Tamura-Nei)
• SYM
  – 3 substitution types (transitions, 2 transversion classes) → K3ST (Kimura 3-subst. type)
• TrN
  – 2 substitution types (transitions vs. transversions) → HKY85 (Hasegawa-Kishino-Yano), F84 (Felsenstein)
• K3ST
  – 2 substitution types (transitions vs. transversions) → K2P (Kimura 2-parameter)
• HKY85, F84
  – Single substitution type → F81 (Felsenstein)
  – Equal base frequencies → K2P
• F81
  – Equal base frequencies → JC (Jukes-Cantor)
• K2P
  – Single substitution type → JC

Parametric Bootstrap: Monte Carlo simulations

[Figure: tree of Homo sapiens, Pan, Gorilla, Pongo, and Hylobates estimated from the alignment below]

Lemur  AAGCTTCATAG…TTACATCATCCA
Homo   AAGCTTCACCG…TTACATCCTCAT
Pan    AAGCTTCACCG…TTACATCCTCAT
Goril  AAGCTTCACCG…CCCACGGACTTA
Pongo  AAGCTTCACCG…GCAACCACCCTC
Hylo   AAGCTTTACAG…TGCAACCGTCCT
Maca   AAGCTTTTCCG…CGCAACCATCCT

Generate data sets on the tree, given branch lengths and substitution parameters … → Simulated data sets

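Simulating phylogenetic data sets

[Figure: four-taxon tree (T1, T2, T3, T4) with an unknown ancestral state “?”; random numbers are mapped to bases on a 0 to 1 scale]

• Choosing the ancestral state from equal base frequencies:
    A: 0 to .25    C: .25 to .5    G: .5 to .75    T: .75 to 1
• Change probabilities along a branch starting from A:
    $p(A \to A \mid t = .5) = .4$
    $p(A \to C \mid t = .5) = .1$
    $p(A \to G \mid t = .5) = .3$
    $p(A \to T \mid t = .5) = .2$
• The same probabilities as cumulative intervals:
    A: 0 to .4    C: .4 to .5    G: .5 to .8    T: .8 to 1
• Random numbers shown on the tree: .5, .4, .7, .9, .8
• Simulated data:

    T1 T2 T3 T4
    GAAC
    TAAC
    ACAG
    AAAC
    AAAC
    …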
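
The mapping from a uniform random number to a base can be sketched as follows. This is a minimal illustration that uses only the probabilities shown on the slide (equal base frequencies at the root, and the four probabilities for a branch of length t = .5 starting from A); the function name `draw_base` and the decision to treat each interval as closed on the left are assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate

BASES = "ACGT"
ROOT_FREQS = [0.25, 0.25, 0.25, 0.25]  # equal base frequencies
P_FROM_A = [0.4, 0.1, 0.3, 0.2]        # p(A->A), p(A->C), p(A->G), p(A->T) at t = .5

def draw_base(u, probs):
    """Map a uniform number u in [0, 1) to a base via cumulative probabilities."""
    cumulative = list(accumulate(probs))  # e.g. [.4, .5, .8, 1.0]
    index = bisect_right(cumulative, u)
    return BASES[min(index, len(BASES) - 1)]  # guard against u == 1.0

# One simulated step: pick a root state, then evolve an A along a branch of t = .5.
rng = random.Random(2003)  # arbitrary seed so the run is repeatable
root_state = draw_base(rng.random(), ROOT_FREQS)
descendant_of_a = draw_base(rng.random(), P_FROM_A)
print(root_state, descendant_of_a)
```

A full simulator would hold one row of probabilities per starting base and per branch length, and would repeat the draw down every branch of the tree for every site.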

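Parametric bootstrap for non-nested models: Monte Carlo Simulations

[Figure: four-taxon tree (T1, T2, T3, T4) from which replicate data sets are simulated; each replicate is scored under GTR+SS and GTR]

• Simulate under the substitution model
• For replicate $i$, obtain $\ln L'_1$ and $\ln L'_0$
• $\delta'_i = \ln L'_1 - \ln L'_0$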

Parametric bootstrap for non-nested models: Monte Carlo Simulations

[Figure: histogram of the simulated null distribution. x-axis: $\delta'_i = \ln L'_1 - \ln L'_0$ (bins from about 2.58 to 6.75); y-axis: number of pseudoreplicates (0 to 18); the observed $\delta$ is marked for comparison]
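
Reading a P value off this histogram amounts to counting how many simulated differences are at least as large as the observed one. A minimal sketch follows; the function name `monte_carlo_p_value` and the example numbers are hypothetical and are not the values behind the slide's figure.

```python
def monte_carlo_p_value(observed, simulated):
    """Fraction of simulated differences that are >= the observed difference."""
    if not simulated:
        raise ValueError("need at least one simulated replicate")
    exceed = sum(1 for d in simulated if d >= observed)
    return exceed / len(simulated)

# Hypothetical differences d'_i = lnL'_1 - lnL'_0 from ten simulated data sets.
simulated_deltas = [2.6, 3.1, 3.3, 3.9, 4.0, 4.4, 4.7, 5.2, 5.4, 6.1]
p_value = monte_carlo_p_value(6.0, simulated_deltas)
print(p_value)
```

Some authors add one to both the count and the number of replicates so that the estimate is never exactly zero; that variant is a design choice and is not shown here.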

Maximum likelihood inference (AIC) (Akaike, 1974)

• $\mathrm{AIC} = -2 \ln L + 2n$
• Where,
  – $\ln L$ is the maximized log-likelihood of a specific model of nucleotide sequence evolution and tree topology given the data.
  – $n$ = the number of parameters free to vary
• Smaller AIC indicates a better model
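
As a worked example, the formula can be applied to the JC and K2P log-likelihoods quoted earlier in these slides. The parameter counts below are an assumption made for illustration: 7 branch lengths for an unrooted five-taxon tree, plus 0 substitution parameters for JC and 1 for K2P. The function name `aic` is hypothetical.

```python
def aic(lnL, n_params):
    """Akaike Information Criterion: -2 lnL + 2n."""
    return -2.0 * lnL + 2.0 * n_params

# Assumed parameter counts: 7 branch lengths (+1 ts/tv ratio for K2P).
aic_jc = aic(-2664.43013, 7)
aic_k2p = aic(-2625.73859, 8)

best = "K2P" if aic_k2p < aic_jc else "JC"
print(f"AIC(JC) = {aic_jc:.5f}, AIC(K2P) = {aic_k2p:.5f}, preferred: {best}")
```

Because both models share the same branch lengths here, only the difference in substitution parameters affects which model is preferred.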

Phylogeny Inference

• What do I want to get out of the analyses?
  – Accurate picture of the evolutionary relationships among the sequences
  – Comparison among competing hypotheses
  – Estimate of divergence dates
  – Robustness of the data set
• Select an optimality criterion
  – Maximum parsimony
  – Maximum likelihood
  – Distance (Least-squares and Minimum Evolution)
• Select a search strategy
• Assessing the reliability of a phylogeny
  – Support for individual tree nodes
  – Support for complete tree topologies
  – Support for branch length estimates

Evaluating Trees: Unrooted bifurcating, (2N-5)!!

Sequences   Number of Trees
3           1
4           3
5           15
6           105
7           945
8           10,395
9           135,135
10          2,027,025
11          34,459,425
12          654,729,075
13          13,749,310,575
14          316,234,143,225
15          7,905,853,580,625

[Figure: the possible unrooted trees for four sequences (1, 2, 3, 4)]

Evaluating Trees: Rooted bifurcating, (2N-3)!!

Sequences   Number of Trees
3           3
4           15
5           105
6           945
7           10,395
8           135,135
9           2,027,025
10          34,459,425
11          654,729,075
12          13,749,310,575
13          316,234,143,225
14          7,905,853,580,625
15          213,458,046,676,875

[Figure: rooted trees for four sequences (1, 2, 3, 4)]
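
Both tables follow from the double factorial, the product of every second integer down to 1. A short sketch that reproduces the counts is given below; the function names are hypothetical.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted_trees(n_sequences):
    """Number of unrooted bifurcating trees, (2N - 5)!!, for N >= 3."""
    return double_factorial(2 * n_sequences - 5)

def n_rooted_trees(n_sequences):
    """Number of rooted bifurcating trees, (2N - 3)!!, for N >= 2."""
    return double_factorial(2 * n_sequences - 3)

for n in range(3, 16):
    print(f"{n:>2}  {n_unrooted_trees(n):>22,}  {n_rooted_trees(n):>24,}")
```

The rooted count for N sequences equals the unrooted count for N + 1, which is why the two tables are shifted copies of each other.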

How believable is the optimal tree?

[Figure: histogram of the distribution of tree scores. x-axis: tree score (1289 to 1673); y-axis: frequency (0 to 14); ML tree = 1300 ± ?]

Consensus Trees

[Figure: three input trees (1, 2, 3) on taxa A to F and their strict, semistrict, and 50% majority-rule consensus trees; a second example with two input trees (1, 2) on taxa A to G and their Adams consensus tree]

Ways of assessing support for a tree topology

• Bootstrap/Jackknife analyses
• Parametric bootstrap
• KH-test and others
• Bayesian Posterior Probabilities

Nonparametric Bootstrap (Efron, 1979)

Suppose we’re interested in the parameter $\Theta$ of a population.

• Question 1: What estimator $\hat{\Theta}$ of $\Theta$ will we use?
• Question 2: How accurate is the estimator $\hat{\Theta}$?

If $\Theta$ is the mean of some distribution, then we could use an estimate of variance:

$$\hat{\delta} = \left[ \frac{\sum_i (X_i - \bar{X})^2}{n^2} \right]$$

Not all parameters have this property. Instead, take repeated samples from the empirical distribution of F.


Nonparametric Bootstrap: Phylogenetic inference (Felsenstein, 1985)

[Figure: the original data set (Lemur catta, Homo sapiens, Pan, Gorilla, Pongo, Hylobates, Macaca fuscata) is resampled into pseudoreplicates 1 … n; a tree is estimated from each pseudoreplicate, and the results are summarized as a majority-rule consensus tree with support values of 100, 91, 100, and 57 on its internal branches]

Bipartition frequencies (taxa are numbered (1) to (7) on the consensus tree):

1234567    Freq
-----------------
.***...  100.00
.*****.  100.00
.****..   91.17
..**...   57.33
.**....   42.00
.***.*.    7.83
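
The resampling step behind each pseudoreplicate can be sketched in a few lines: alignment columns are drawn with replacement until the pseudoreplicate has as many sites as the original. The sequences below are the first 11 sites of the alignment shown earlier in these slides; the function name `bootstrap_alignment` and the seed are assumptions, and tree estimation on each pseudoreplicate is left to a phylogenetics program.

```python
import random

ALIGNMENT = {
    "Lemur": "AAGCTTCATAG",
    "Homo":  "AAGCTTCACCG",
    "Pan":   "AAGCTTCACCG",
    "Goril": "AAGCTTCACCG",
    "Pongo": "AAGCTTCACCG",
    "Hylo":  "AAGCTTTACAG",
    "Maca":  "AAGCTTTTCCG",
}

def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate: columns sampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every taxon, so sites stay intact.
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}

rng = random.Random(1985)  # arbitrary seed so the run is repeatable
pseudoreplicates = [bootstrap_alignment(ALIGNMENT, rng) for _ in range(3)]
for rep in pseudoreplicates:
    print(rep["Homo"], rep["Maca"])
```

Support for a split is then the percentage of pseudoreplicate trees that contain it, as in the frequency table above.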

Jackknifing Phylogenetic Data

• Also used to assess support for splits on a given tree
• Data are sampled without replacement
• Replicates represent some fraction of the total data set.
• Jackknife tree is also displayed as a majority-rule consensus tree, where support for a split is given as the percent of jackknife replicates which contain the split.
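
The difference from the bootstrap is only in how sites are drawn: each site appears at most once, and only a fraction of the sites is kept. A minimal sketch follows; the function name `jackknife_alignment`, the toy alignment, and the 50% fraction are assumptions made for illustration.

```python
import random

def jackknife_alignment(alignment, fraction, rng):
    """Return one jackknife replicate: a subset of columns drawn without replacement."""
    n_sites = len(next(iter(alignment.values())))
    n_keep = round(fraction * n_sites)
    # random.sample never repeats an index, so no site is duplicated.
    keep = sorted(rng.sample(range(n_sites), n_keep))
    return {taxon: "".join(seq[c] for c in keep) for taxon, seq in alignment.items()}

toy_alignment = {
    "T1": "ACGTACGTAC",
    "T2": "ACGTTCGTAC",
    "T3": "ACGAACGTTC",
    "T4": "TCGAACGTTC",
}
rng = random.Random(7)  # arbitrary seed so the run is repeatable
replicate = jackknife_alignment(toy_alignment, 0.5, rng)
print(replicate)
```

Sorting the kept indices preserves the original site order, which does not matter for methods that treat sites as independent but keeps the replicate easy to inspect.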

Parametric Bootstrap

[Figure: starting from the estimated tree of Homo sapiens, Pan, Gorilla, Pongo, and Hylobates, generate data sets on the tree given branch lengths and substitution parameters, then reestimate the tree from each simulated data set]

Kishino-Hasegawa Test (KH-test)
(Kishino and Hasegawa, 1989; Hasegawa and Kishino, 1989)

[Figure: two competing topologies for the same 12 primate taxa: Lemur catta, Tarsius syrichta, Saimiri sciureus, Macaca fuscata, M. mulatta, M. fascicularis, M. sylvanus, Homo sapiens, Pan, Gorilla, Pongo, and Hylobates]

• -lnL = 5735.81631
• -lnL = 5728.06210
• Difference: 7.75420 ± ?

Kishino-Hasegawa Test (KH-test): Null Hypothesis

If we select the tree topologies a priori:

H0: $E[\delta] = 0$
H1: $E[\delta] \neq 0$

The test statistic is the score difference between the competing hypotheses.

What is the distribution of $\delta$ under the null?

[Figure: distribution of $\delta$ with 0 marked on the axis]

Kishino-Hasegawa Test (KH-test): Bootstrap null distribution
(Hasegawa and Kishino, 1989)

Nonparametric bootstrap can be used to generate the null distribution F. Sample from the empirical data set with replacement.

$\delta'_i = \ln L'_{T1} - \ln L'_{T2}$, where $i$ is a bootstrap replicate:

$\delta'_1 = \ln L'_{T1} - \ln L'_{T2}$
$\delta'_2 = \ln L'_{T1} - \ln L'_{T2}$
$\delta'_3 = \ln L'_{T1} - \ln L'_{T2}$
…
$\delta'_n = \ln L'_{T1} - \ln L'_{T2}$

where $n$ is the number of bootstrap replicates.

[Figure: histogram of the bootstrap null distribution. x-axis: $\delta'_i = \ln L'_{T1} - \ln L'_{T2}$ (bins from about 2.58 to 6.75, plus “More”); y-axis: number of replicates (0 to 25); the observed $\delta$ and 0 are marked]

Kishino-Hasegawa Test (KH-test): Time-saving methods

• RELL - Resample Estimated Log-Likelihood (Kishino et al., 1990)
  – Replicate lnL scores for each tree are obtained by resampling the sitewise log-likelihood scores from the original analysis
  – Large data sets required
• Estimate the variance of $\delta$ by estimating the variance of the sitewise log-likelihoods (Kishino and Hasegawa, 1989)
  – No resampling is required

$$\nu^2 = \sum_j \frac{(\sigma_j - \bar{\sigma})^2}{n - 1}$$
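
The RELL shortcut can be sketched as follows, assuming the sitewise log-likelihoods for two a priori trees are already available from a single likelihood analysis. The centering of the resampled differences on their mean and the two-sided comparison follow the null hypothesis $E[\delta] = 0$ stated earlier; the function name `kh_test_rell`, the seed, and the example numbers are hypothetical.

```python
import random

def kh_test_rell(site_lnL_1, site_lnL_2, n_reps=1000, seed=0):
    """Two-sided KH test using RELL resampling of sitewise log-likelihoods.

    Returns the observed difference lnL(T1) - lnL(T2) and its P value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(site_lnL_1, site_lnL_2)]
    n_sites = len(diffs)
    observed = sum(diffs)

    # RELL: resample the sitewise differences instead of re-optimizing each replicate.
    replicates = [
        sum(rng.choice(diffs) for _ in range(n_sites)) for _ in range(n_reps)
    ]

    # Center the replicates so that they represent the null hypothesis E[delta] = 0.
    mean_rep = sum(replicates) / n_reps
    centered = [d - mean_rep for d in replicates]

    extreme = sum(1 for d in centered if abs(d) >= abs(observed))
    return observed, extreme / n_reps

# Hypothetical sitewise log-likelihoods for two trees over eight sites.
tree1 = [-2.1, -1.8, -3.0, -2.4, -1.9, -2.7, -2.2, -2.0]
tree2 = [-2.3, -1.7, -3.4, -2.4, -2.1, -2.6, -2.5, -2.0]
delta, p_value = kh_test_rell(tree1, tree2)
print(f"delta = {delta:.3f}, P = {p_value:.3f}")
```

With only eight sites the result is not meaningful; the slide's caveat that large data sets are required applies to any real use of this shortcut.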

KH-test Assumptions

• Trees must be selected a priori
• Sites are independently and identically distributed
• Large number of sites are sampled
• Alternative to KH-test relaxes constraint that trees are selected a priori
  – SH-test (Shimodaira and Hasegawa, 1999)

Shimodaira-Hasegawa (SH-test)
(Shimodaira and Hasegawa, 1999)

The test statistic is the score difference between the Maximum Likelihood tree and every other tree compared: i.e., $\delta_T = \ln L_{ML} - \ln L_T$

Hypotheses that we wish to test are:

H0: all trees are equally good explanations of the data
H1: some or all trees are not equally good explanations of the data

What is the expected distribution of $\delta$ under the null? Hint: We know $\ln L_{ML} \geq \ln L_T$

[Figure: histogram of the expected null distribution. x-axis: $\delta$ (0 to 30); y-axis: frequency (0 to 25)]

Shimodaira-Hasegawa (SH-test)
(Shimodaira and Hasegawa, 1999)

• Generate nonparametric bootstrap replicates
• For each replicate $i$, evaluate each candidate tree ($T_{ML}, T_2, T_3, T_4, \ldots, T_j$) and center the likelihood scores
  – $(\ln L'_1, \ln L'_2, \ln L'_3, \ln L'_4, \ldots, \ln L'_j)_i$
  – $\mu'_{T,i} = \ln L'_{T,i} - \mathrm{mean}_i(\ln L'_T)$, i.e., subtract each tree's mean score across replicates
• Generate the null distribution: find the max score for each replicate and calculate the difference between the max and each tree
  – find $\max(\mu'_1, \mu'_2, \mu'_3, \mu'_4, \ldots, \mu'_j)_i$
  – $\delta'_T = \max(\mu') - \mu'_T$
• Test hypotheses: one-sided test, comparing the observed $\delta_T = \ln L_{ML} - \ln L_T$ with the distribution of $\delta'_T$

[Figure: histogram of the null distribution. x-axis: $\delta'_T$ (bins from about 2.58 to 6.75, plus “More”); y-axis: number of replicates (0 to 25); the observed $\delta$ is marked]
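
The steps above can be sketched as follows. One assumption is made loudly here: instead of re-optimizing every tree on each bootstrap data set, the sketch resamples precomputed sitewise log-likelihoods (the RELL shortcut described for the KH-test). The function name `sh_test`, the input layout `site_lnL[tree][site]`, and the example numbers are hypothetical.

```python
import random

def sh_test(site_lnL, n_reps=1000, seed=0):
    """SH test on a set of candidate trees (which must include the ML tree).

    site_lnL[t][j] is the log-likelihood of site j on tree t.
    Returns (observed deltas, one-sided P values), one entry per tree.
    """
    rng = random.Random(seed)
    n_trees, n_sites = len(site_lnL), len(site_lnL[0])
    totals = [sum(row) for row in site_lnL]
    observed = [max(totals) - t for t in totals]  # delta_T = lnL_ML - lnL_T

    # Bootstrap replicates via RELL: resample site indices, re-sum per tree.
    boot = []
    for _ in range(n_reps):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[j] for j in idx) for row in site_lnL])

    # Center each tree's replicate scores on that tree's mean across replicates.
    means = [sum(rep[t] for rep in boot) / n_reps for t in range(n_trees)]

    counts = [0] * n_trees
    for rep in boot:
        centered = [rep[t] - means[t] for t in range(n_trees)]
        top = max(centered)
        for t in range(n_trees):
            if top - centered[t] >= observed[t]:  # delta'_T >= observed delta_T
                counts[t] += 1
    return observed, [c / n_reps for c in counts]

# Hypothetical sitewise log-likelihoods for three candidate trees over six sites.
example = [
    [-2.0, -1.5, -3.1, -2.2, -1.9, -2.6],
    [-2.1, -1.6, -3.0, -2.5, -2.0, -2.6],
    [-2.9, -2.4, -3.8, -3.1, -2.7, -3.5],
]
deltas, p_values = sh_test(example)
print(deltas, p_values)
```

Because the observed difference for the ML tree is zero, its P value is always 1; trees whose P value falls below the chosen threshold are rejected as significantly worse explanations of the data.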

Tools of the trade

• MacClade - David and Wayne Maddison, U. Arizona and U. British Columbia
  – http://phylogeny.arizona.edu/macclade/macclade.html
• Mesquite - Wayne and David Maddison, U. British Columbia and U. Arizona
  – http://mesquiteproject.org/mesquite/mesquite.html
• PAUP* - David Swofford, Florida State U.
  – http://paup.csit.fsu.edu/
• Phylip - Joe Felsenstein, U. Washington
  – http://evolution.genetics.washington.edu/phylip.html
• Molphy - Jun Adachi & Masami Hasegawa, Inst. of Statistical Mathematics, Tokyo
  – http://www.ism.ac.jp/software/ismlib/softother.e.html
• Comprehensive list of phylogeny programs
  – http://evolution.genetics.washington.edu/phylip/software.html

PAUP* Demonstration

• PAUP* web site:
  – http://paup.csit.fsu.edu/
• Using PAUP*
  – Documentation, command reference/tutorial
    • http://paup.csit.fsu.edu/downl.html
  – Current Protocols in Bioinformatics
    • http://www3.interscience.wiley.com/c_p/index.htm
  – Chapter 6: Inferring evolutionary relationship

Bibliography

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716-723.
Atchley, W. R. and W. M. Fitch. 1991. Gene trees and the origins of inbred strains of mice. Science 254:554-558.
Bader, D. A., B. M. E. Moret, and L. Vawter. 2001. Industrial applications of high-performance computing for phylogeny reconstruction. In H. J. Siegel (ed.), Proc. SPIE Commercial Applications for High-Performance Computing. Vol. 4428, pp. 159-168. Denver, CO: SPIE.
Baker and Palumbi. 1994. Which whales are hunted? A molecular genetic approach to whaling. Science:1538-1539.
Bush et al. 1999. Predicting the evolution of human influenza A. Science:1921-1925.
Efron, B. 1985. Bootstrap confidence intervals for a class of parametric problems. Biometrika 72:45-58.
Fitch, W. M. and W. R. Atchley. 1985. Evolution in inbred strains of mice appears rapid. Science 228:1169-1175.
Fitch, W. M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-284.
Halbur, P., Lum, M. A., Meng, X., Morozov, I., and Paul, P. S. 1994. New porcine reproductive and respiratory syndrome virus DNA and proteins encoded by open reading frames of an Iowa strain of the virus are used in vaccines against PRRSV in pigs. Patent filing WO9606619-A1.
Hillis, D. M. 2000. How to resolve the debate on the origins of AIDS. Science 289:1877-1878.
Hillis, D. M. 2000. Origins of HIV. Science 288:1757-1759.
Hillis, D. M., J. J. Bull, M. E. White, M. R. Badgett, and I. J. Molineux. 1992. Experimental phylogenetics: generation of a known phylogeny. Science 255:589-592.
Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 264:671-677.
Holder, M. T. 2001. Using a Complex Model of Sequence Evolution to Evaluate and Improve Phylogenetic Methods. Ph.D. Dissertation. Univ. of Texas at Austin.
Huelsenbeck, J. P., Hillis, D. M. and Jones, R. 1996. Parametric bootstrapping in molecular phylogenetics: Applications and performance. In Ferraris, J. D. and Palumbi, S. R. (eds.), Molecular Zoology. Advances, strategies and protocols. Wiley-Liss, New York, pp. 19-45.
Huelsenbeck, J. P., Ronquist, F., Nielsen, R., Bollback, J. P. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310-2314.

Bibliography (continued)

Goldman, N. 1993. Statistical tests of models of DNA substitution. Journal of Molecular Evolution 36:182-198.
Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49:652-670.
Hasegawa, M. and H. Kishino. 1989. Confidence limits on the maximum-likelihood estimate of the hominoid tree from mitochondrial-DNA sequences. Evolution 43:627-677.
Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29:170-179.
Lewis, P. O. 2001. Phylogenetic systematics turns a new leaf. Trends in Ecology and Evolution 16:30-36.
Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, Massachusetts.
Ou, C. et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. 1165-1171.
Page, R. D. and Holmes, E. C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford.
Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299-300.
Shimodaira, H. and M. Hasegawa. 1999. Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference. Molecular Biology and Evolution 16:1114-1116.
Steel, M. and Penny, D. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution 17:839-850.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407-514 in D. M. Hillis, C. Moritz, and B. Mable (eds.), Molecular Systematics (2nd ed.). Sinauer Associates, Sunderland, Massachusetts.
