1
Phylogenetic InferenceMore applications
• Conservation biology– Faith, 1992 (phylogeny and conservation priorities)– Baker and Palumbi, 1994 (illegal whale hunting)
• Epidemiology– Bush et al. 1999 (predictive evolution)
• Forensics– Ou et al., 1992 (dental practice HIV transmission)
• Gene function prediction– Chang and Donoghue, 2000; Bader et al., 2001
• Drug Development– Halbur et al., 1994
Phylogenetic Inference:in a nutshell
GorillaPan
Homo sapiens
Hylobates
Pongo
0.1 substitutions/site
AGAT
T
CT TC
2
Inside phylogenetic inference• What do we want to get out the analyses?
– Accurate picture of evolutionary relationship among the sequences– Comparison among competing hypotheses– Estimate divergence dates– Robustness of the data set
• Select an optimality criterion– Maximum parsimony (and character weighting scheme)– Maximum likelihood (and model of evolution)– Distance (Least-squares and Minimum Evolution) (and distance)
• Select a search strategy– Exact– Heuristic– Stochastic
• Test the Reliability of the search results– Support for individual tree nodes– Support for complete tree topologies– Support for branch length estimates
Parsimony MethodsAttempt to minimize the number of changes on the tree
• Equal Weighted Parsimony– Change from one base to another = 1
• Transversion Parsimony– Transversions (purines {A, G} ´ pyrimidines {C, T}) =1– Transistions (A ´ G or C ´ T) = 0
• Generalized or Weighted Parsimony
A G C TA - 1 3 3G 1 - 3 3C 3 3 - 1T 3 3 1 -
A “step matrix” or costmatrix for transformationsfrom one base to another.
3
Comparing Parsimony Scores
Gorilla
Pan
Homo sapiens
Hylobates
Pongo
50 changes
4635
50
40
61
159
113
Homo sapiens
Gorilla
Pan
Hylobates
Pongo
50 changes
4212
51
44
39
95
73
Gorilla
Pan
Homo sapiens
Hylobates
Pongo
10 changes
333
210
32
20
Unweighted parsimonyscore: 356
Transversionparsimonyscore: 73
GeneralizedparsimonyTv:Ti of 3:1score: 504
No common currency
Choosing among LikelihoodMethods
Choose the model that maximizes a “goodness of fit”statistic without adding unnecessary parameters.
• Likelihood Ratio Test (LRT)• Parametric Bootstrap• Akaike Information Criterion (AIC)
4
Choosing a model
Choosing a model
5
Maximum likelihood inferenceChoosing a DNA substitution model
GorillaPan
Homo sapiens
Hylobates
Pongo
0.1 substitutions/site
K2P JC
lnL1 = -2625.73859 lnL0 = -2664.43013
38.67 ± ?, with one additional parameter
Maximum likelihood inferencefor nested models -- use LRT
• l = L0/ L1 choose the simpler model when l is large– L0 = ML estimate of simpler model (fewer free parameters, lower
likelihood -- e.g., without rate heterogeneity).– L1= ML estimate of more complicated model (more free
parameters, higher likelihood -- e.g., with rate heterogeneity).
• -2lnl is generally c2 distributed with k degrees of freedom.– where k is the difference between the number of free parameters
used to calculate L0 and L1 (L0 has k fewer parameters then L1).
6
Maximum likelihood inferencefor nested models
GorillaPan
Homo sapiens
Hylobates
Pongo
0.05 substitutions/site
GorillaPan
Homo sapiens
Hylobates
Pongo
0.1 substitutions/site
K2P JC
lnL1 = -2625.73859 lnL0 = -2664.43013l = 2(lnL1 - lnL0)
c2df=1=77.38308, P < 0.0001
Maximum likelihood inferencenested models
GTR
SYMTrN
F81
JC
K3ST
K2P
HKY85F84
Equal base frequencies
3 substitution types(transitions,2 transversion classes)
2 substitution types(transitions vs. transversions)
3 substitution types(transversions, 2 transition classes)
2 substitution types(transitions vs.transversions)
Single substitution type
Equal basefrequencies
Single substitution typeEqual base frequencies
(general time-reversible)
(Tamura-Nei)
(Hasegawa-Kishino-Yano)
(Felsenstein)
Jukes-Cantor
(Kimura 2-parameter)
(Kimura 3-subst. type)
(Felsenstein)
7
Parametric Bootstrap:Monte Carlo simulations
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Generate datasets on treegiven branchlengths andsubstitutionparameters …
Simulated data sets
Simulating phylogenetic data sets
T1 T2
T3 T4
?
p (AÆA t=.5) = .4p (AÆC t=.5) = .1p (AÆG t=.5) = .3p (AÆT t=.5) = .2
A T1
C G T
.25 .5 .75 10
A
A C T0 .4 .5 .8 1
G
T1T2T3T4
GAAC
TAAC
ACAG
AAAC
AAAC
…………
.5.4
.7
.9.8
8
Parametric bootstrapfor non-nested models:
Monte Carlo SimulationsT1 T2
T3T4
GTR+SS GTR
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Replicate i
lnL¢1 lnL¢0
d¢i = lnL¢1 - lnL¢0
Simulate under substitution model
Parametric bootstrap for non-nestedmodels: Monte Carlo Simulations
0
2
4
6
8
10
12
14
16
18
2.5767241 3.2724798 3.9682354 4.6639911 5.3597467 6.0555024 6.751258 6.751258 6.751258 6.751258 6.751258 6.751258
Num
ber o
f Pse
udor
eplic
ates
d¢i = lnL¢1 - lnL¢0
Observed d
9
Maximum likelihood inference (AIC) (Akaike, 1974)
• AIC = -2lnL + 2n• Where,
lnL is the maximum likelihood value of a specificmodel of nucleotide sequence evolution andtree topology given the data.
n = the number of parameters free to vary• Smaller AIC indicates a better model
Phylogeny Inference• What do I want to get out the analyses?
– Accurate picture of evolutionary relationship among the sequences– Comparison among competing hypotheses– Estimate of divergence dates– Robustness of the data set
• Select an optimality criterion– Maximum parsimony– Maximum likelihood– Distance (Least-squares and Minimum Evolution)
• Select a search strategy• Assessing the reliability of a phylogeny
– Support for individual tree nodes– Support for complete tree topologies– Support for branch length estimates
10
Evaluating TreesUnrooted bifurcating (2N-5)!!
Sequences Number of Trees3 14 35 156 1057 9458 103959 135,135
10 2,027,02511 34,459,42512 654,729,07513 13,749,310,57514 316,234,143,22515 7,905,853,580,625
1 3
421
3 4
2
1 3
4 2
1 3
42
1
3 4
2 1 3
4 2
Evaluating TreesRooted bifurcating (2N-3)!!
Sequences Number of Trees3 34 155 1056 9457 103958 135,1359 2,027,025
10 34,459,42511 654,729,07512 13,749,310,57513 316,234,143,22514 7,905,853,580,62515 2,134,580,4667,6875
1324 3124
1324 1324
1324
11
How believable is the optimal tree?
0
2
4
6
8
10
12
14
1289
1313
1328
1343
1358
1373
1388
1403
1418
1433
1448
1463
1478
1493
1508
1523
1538
1553
1568
1583
1598
1613
1628
1643
1658
1673
ML tree = 1300 ± ?
Distribution of tree scores
Consensus TreesA B C D E F A B C D E F A B C DE F
A B C D E F
strict
A B C D E F
semistrict
A B C D E F
50% majority-rule
GB A BC D E F G
Adams
1 2 3
A C D E F G
1
A BC D E F
2
12
Ways of assessing support for atree topology
• Bootstrap/Jackknife analyses• Parametric bootstrap• KH-test and others• Bayesian Posterior Probabilities
Suppose we’re interested in the parameter Q of apopulation.
• Question 1: What estimator of Q will we use?• Question 2: How accurate is the estimator Q?If Q is the mean of some distribution then could use an
estimate of variance:
Not all parameters have this propertyTake repeated samples from the empirical distribution of F
Nonparametric Bootstrap(Efron, 1979)
†
dŸ
=(Xi - X)2
n2
È
Î Í Í
˘
˚ ˙ ˙
13
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Nonparametric BootstrapPhylogenetic inference (Felsenstein, 1985)
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur catta
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Macaca fuscata
Original data set
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur catta
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Macaca fuscata
…
…
pseudo rep n
Lemur catta
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Macaca fuscata
pseudo rep 1
Majority-rule consensus tree
Lemur catta
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Macaca fuscata
100
91
10057
1234567 Freq-----------------.***... 100.00.*****. 100.00.****.. 91.17..**... 57.33.**.... 42.00.***.*. 7.83
(1)(2)
(3)(4)(5)
(6)(7)
14
Jackknifing Phylogenetic Data• Also used to assess support for splits on a given
tree• Data are sampled without replacement• Replicates represent some fraction of the total data
set.• Jackknife tree is also displayed as a majority-rule
consensus tree, where support for a split is givenas the percent of jackknife replicates whichcontain the split.
Parametric Bootstrap
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Homo sapiens
Pongo
Gorilla
Pan
Hylobates
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Generate datasets on treegiven branchlengths andsubstitutionparameters …
reestimate the tree
Simulated data sets
15
Kishino-Hasegawa Test (KH-test)(Kishino and Hasegawa, 1989 Hasegawa and Kishino, 1989)
Lemur catta
Tarsius syrichta
Saimiri sciureus
Macaca fuscata
M. mulatta
M. fascicularis
M. sylvanus
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
Lemur catta
Tarsius syrichta
Saimiri sciureus
Macaca fuscata
M. mulatta
M. fascicularis
M. sylvanus
Homo sapiens
Pan
Gorilla
Pongo
Hylobates
-lnL = 5735.81631-lnL = 5728.062107.75420 ± ?
Kishino-Hasegawa Test (KH-test)Null Hypothesis
If we select the tree topologies a priori:
H0: E[d] = 0
H1: E[d] ≠ 0
The test statistic is the scoredifference between thecompeting hypotheses.
What is the distribution of dunder the null?
0
16
Kishino-Hasegawa Test (KH-test)Bootstrap Null distribution
(Hasegawa and Kishino, 1989)
Nonparametric bootstrap can be used to generate the Null F.Sample from the empirical data set with replacement
d¢i = lnL¢T1 - lnL¢T2 , where i is a bootstrap replicate
d¢1 = lnL¢T1 - lnL¢T2d'2 = lnL¢T1 - lnL¢T2d'3 = lnL¢T1 - lnL¢T2…
d¢n = lnL¢T1 - lnL¢T2
where n is the number of bootstrap replicates
0
5
10
15
20
25
2.576724127 3.272479778 3.968235429 4.66399108 5.359746731 6.055502383 6.751258034 More
Num
ber o
f rep
licat
es
Observed d
d¢i = lnL¢T1 - lnL¢T2
0
Kishino-Hasegawa Test (KH-test)(Time saving methods)
• RELL - Resample Estimated Log-Likelood (Kishino et al.,1990)– Replicate lnL scores for each tree are obtained by resampling the
sitewise log-likelihood scores from the original analysis– large data sets required
• Estimate the variance of d by estimating the variance of thesitewise log-likelihoods (Kishino and Hasegawa, 1989)– No resampling is required
†
n 2 =(s j -s j
_)2
n -1jÂ
17
KH-test Assumptions
• Trees must be selected a priori• Sites are independently and identically
distributed• Large number of sites are sampled• Alternative to KH-test relaxes constraint
that trees are selected a priori– SH-test (Shimodaira and Hasegawa, 1999)
Shimodaira-Hasegawa (SH-test)(Shimodaira and Hasegawa, 1999)
The test statistic is the score difference between theMaximum Likelihood tree and every other tree compared: i.e., dT = lnLML- lnLT
Hypotheses that we wish to test are: H0: all trees are equally good explanations of the data
H1: some or all trees are not equallygood explanations of the data
What is the expected distribution of d under the null? Hint: We know lnLML≥ lnLT
0
5
10
15
20
25
0 5 10 15 20 25 30
freq
uenc
y
0
18
Shimodaira-Hasegawa (SH-test)(Shimodaira and Hasegawa, 1999)
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
• Generate nonparametric bootstrapreplicates
• For each replicate evaluate each candidatetree (Tml, T2, T3, T4, … Tj) and centerlikelihood scores
• Generate Null distribution -- find maxscore for each replicate and calculate thedifference between Max and each tree
(lnL¢1, lnL¢2, lnL¢3, lnL¢4, … lnL¢j)i
µ = lnL¢1 - mean(lnL¢1)i
find max(µ¢1, µ¢2, µ¢3, µ¢4, … µ¢j)i
d¢T = max(µ¢) - µ¢T
0
5
10
15
20
25
2.576724127 3.272479778 3.968235429 4.66399108 5.359746731 6.055502383 6.751258034 More
Numb
er of
repli
cate
s
• Test hypotheses -- one-sided test
d¢T = lnL¢ML- lnL¢T
Observed d
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
Lemur AAGCTTCATAG…TTACATCATCCAHomo AAGCTTCACCG…TTACATCCTCATPan AAGCTTCACCG…TTACATCCTCATGoril AAGCTTCACCG…CCCACGGACTTAPongo AAGCTTCACCG…GCAACCACCCTCHylo AAGCTTTACAG…TGCAACCGTCCTMaca AAGCTTTTCCG…CGCAACCATCCT
… i
Tools of the trade• MacClade - David and Wayne Maddison, U. Arizona and U. British
Columbia– http://phylogeny.arizona.edu/macclade/macclade.html
• Mesquite - Wayne and David Maddison, U. British Columbia and U.Arizona– http://mesquiteproject.org/mesquite/mesquite.html
• PAUP* - David Swofford, Florida State U.– http://paup.csit.fsu.edu/
• Phylip - Joe Felsenstein, U. Washington– http://evolution.genetics.washington.edu/phylip.html
• Molphy - Jun Adachi & Masami Hasegawa, Inst. of StatisticalMathematics, Tokyo– http://www.ism.ac.jp/software/ismlib/softother.e.html
• Comprehensive list of phylogeny programs– http://evolution.genetics.washington.edu/phylip/software.html
19
PAUP* Demonstration
• PAUP* web site:– http://paup.csit.fsu.edu/
• Using PAUP*– Documentation, command reference/tutorial
• http://paup.csit.fsu.edu/downl.html– Current Protocols in Bioinformatics
• http://www3.interscience.wiley.com/c_p/index.htm– Chapter 6: Inferring evolutionary relationship
BibliographyAkaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr., 19:716-723.Atchley, W. R. and W. M. Fitch. 1991. Gene trees and the origins of inbred strains of mice. Science 254: 554-558.D. A. Bader, B.M.E Moret, and L. Vawter. 2001. Industrial applications of high-performance computing for phylogeny reconstruction. In H.J. Siegel,
(ed.) Proc. SPIE Commercial Applications for High-Performance Computing. Vol. 4428, pp 159-168. Denver, CO SPIE.Baker and Palumbi 1994, Which whales are hunted? A molecular genetic approach to whaling. Science: 1538-1539.Bush et al. 1999. Predicting the evolution of human influenza A. Science:1921-1925.Efron, B. 1985 Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72, 45-58.Fitch, W. M. and W.R. Atchley. 1985. Evolution in inbred strains of mice appears rapid. Science 228:1169-1175.Fitch, W. M. and E. Margoliash. 1967. Construction of phylogenetic trees. Science155:279-284.Halbur, P., Lum, M. A., Meng, X, Morozov, I., and Paul, P.S. 1994. New porcine reproductive and respiratory syndrome virus DNA and proteins
encoded by open reading frames of an Iowa strain of the virus are used in vaccines against PRRSV in pigs. Patent filing WO9606619-A1.Hillis, D. M. 2000. How to resolve the debate on the origins of AIDS. Science 289:1877-1878.Hillis, D. M. 2000. Origins of HIV. Science 288:1757-1759.Hillis, D. M., J. J. Bull, M. E. White, M. R. Badgett, and I. J. Molineux. 1992. Experimental phylogenetics: generation of a known phylogeny.
Science 255:589-592.Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham.# 1994.# Application and accuracy of molecular phylogenies.# Science 264:671-677.Holder, M. T. 2001. Using a Complex Model of Sequence Evolution to Evaluate and Improve Phylogenetic Methods. Ph.D. Dissertation. Univ. of
Texas at Austin.Huelsenbeck, J. P., Hillis, D. M. and Jones, R. 1996. Parametric bootstrapping in molecular phylogenetics: Applications and performance. In
Ferraris, J. D. and Palumbi, S. R. (eds.), Molecular Zoology. Advances, strategies and protocols. Wiley-Liss, New York, pp. 19-45.Huelsenbeck, J. P., Ronquist, F., Nielsen, R., Bollback, J. P. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology.
Science 294: 2310- 2314.
20
Bibliography (continued)Goldman, N. 1993. Statistical tests of models of DNA substitution. Journal of Molecular Evolution 36: 182-98.Goldman, N., J. P. Anderson, and A. G. Rodrigo. 2000. Likelihood-based tests of topologies in phylogenetics. Systematic
Biology 49:652-670.Hasegawa, M and H. Kishino. 1989. Confidence limits on the maximum-likelihood estimate of the hominoid tree from
mitochondrial-DNA sequences. Evolution 43: 627-677.Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies
from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution 29:170-179.Lewis, P. O. 2001. Phylogenetic systematics turns a new leaf . Trends in Evolution and Ecology 16:30-36.Li, W. 1997. Molecular Evolution. Sinauer Associates. Sunderland, Massachusetts.Ou, C. et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. 1165-1171Page, R. D. and Holmes, E. C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford.
Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 389:299-300.Shimodaira, H. and M. Hasegawa. 1999. Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic
Inference. Molecular Biology and Evolution 16:1114-1116.Steel, M. and Penny, D. 2000. Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology
and Evolution 17:839-850.Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407-514 in D. M. Hillis,
C. Moritz, and B. Mable (eds.) Molecular Systematics (2nd ed.), Sinauer Associates, Sunderland, Massachusetts.