introduction to phylogenetic estimation algorithms · 2011-09-15 · • maximum likelihood is...
TRANSCRIPT
![Page 1: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/1.jpg)
Introduction to PhylogeneticEstimation Algorithms
Tandy Warnow
![Page 2: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/2.jpg)
Phylogeny
Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website,University of Arizona
![Page 3: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/3.jpg)
Phylogenetic Analysis• Gather data• Align sequences• Estimate phylogeny on the multiple alignment• Estimate the reliable aspects of the evolutionary
history (using bootstrapping, consensus trees, orother methods)
• Perform post-tree analyses
![Page 4: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/4.jpg)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
![Page 5: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/5.jpg)
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
![Page 6: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/6.jpg)
UVWXY
U
V W
X
Y
AGTGGATTATGCCCATATGACTTAGCCCTAAGCCCGCTT
![Page 7: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/7.jpg)
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment– Reflects historical substitution, insertion, and deletion
events– Defined using transitive closure of pairwise alignments
computed on edges of the true tree
…ACGGTGCAGTTACCA…
SubstitutionDeletion
…ACCAGTCACCTA…
Insertion
![Page 8: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/8.jpg)
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 9: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/9.jpg)
Phase 1: Multiple Sequence Alignment
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 10: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/10.jpg)
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1
S4
S2
S3
![Page 11: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/11.jpg)
Alignment methods• Clustal• POY (and POY*)• Probcons (and Probtree)• MAFFT• Prank• Muscle• Di-align• T-Coffee• Opal• FSA• Infernal• Etc.
Phylogeny methods• Bayesian MCMC• Maximum parsimony• Maximum likelihood• Neighbor joining• FastME• UPGMA• Quartet puzzling• Etc.
RAxML: best heuristic for large-scale ML optimization
![Page 12: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/12.jpg)
• How are methods evaluated?
• Which methods perform well?
• What about other evolutionary processes,such as duplications or rearrangements?
• What if the phylogeny is not a tree?
• What are the major outstanding challenges?
![Page 13: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/13.jpg)
• Part I (Basics): standard statistical models ofsubstitution-only sequence evolution, methods forphylogeny estimation, performance criteria, and basicproof techniques.
• Part II (Advanced): Alignment estimation, morecomplex models of sequence evolution, species treeestimation from gene trees and sequences, reticulateevolution, and gene order/content phylogeny.
![Page 14: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/14.jpg)
Part I: Basics• Substitution-only models of evolution• Performance criteria• Standard methods for phylogeny estimation• Statistical performance guarantees and proof techniques• Performance on simulated and real data• Evaluating support
![Page 15: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/15.jpg)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
![Page 16: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/16.jpg)
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
![Page 17: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/17.jpg)
Markov Models of Site Evolution
Jukes-Cantor (JC):• T is binary tree and has substitution probabilities p(e) on each edge e.• The state at the root is randomly drawn from {A,C,T,G}• If a site (position) changes on an edge, it changes with equal probability
to each of the remaining states.• The evolutionary process is Markovian.Generalized Time Reversible (GTR) model: general substitution matrix.Rates-across-sites models used to describe variation between sites.
![Page 18: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/18.jpg)
Performance criteria• Running time and space.• Statistical performance issues (e.g., statistical
consistency and sequence length requirements),typically studied mathematically.
• “Topological accuracy” with respect to the underlyingtrue tree. Typically studied in simulation.
• Accuracy with respect to a mathematical score (e.g.tree length or likelihood score) on real data.
![Page 19: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/19.jpg)
FN: false negative (missing edge)FP: false positive (incorrect edge)
50% error rate
FN
FP
![Page 20: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/20.jpg)
Statistical consistency and sequence lengthrequirements
![Page 21: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/21.jpg)
1. Polynomial time distance-based methods2. Hill-climbing heuristics for NP-hard optimization
problems
Phylogenetic reconstruction methods
Phylogenies
Cost
Global optimum
Local optimum
3. Bayesian methods
![Page 22: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/22.jpg)
Distance-based Methods
![Page 23: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/23.jpg)
UPGMAWhile |S|>2:
find pair x,y of closest taxa;
delete x
Recurse on S-{x}
Insert y as sibling to x
Return tree
a b c d e
![Page 24: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/24.jpg)
UPGMA
a b c d e
Works whenevolution is“clocklike”
![Page 25: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/25.jpg)
UPGMA
a
b c
d e
Fails to producetrue tree ifevolutiondeviates toomuch from aclock!
![Page 26: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/26.jpg)
Distance-based Methods
![Page 27: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/27.jpg)
Additive Distance Matrices
![Page 28: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/28.jpg)
Statistical Consistency Theorem (Steel): Logdet distances are
statistically consistent estimators of modeltree distances.
!"Sequence length
![Page 29: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/29.jpg)
Distance-based Methods
![Page 30: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/30.jpg)
Constructing quartet treesFour-Point Condition: A matrix D is additive if and only if for every four
indices i,j,k,l, the maximum and median of the three pairwise sums areidentical
Dij+Dkl < Dik+Djl = Dil+Djk
The Four-Point Method computes quartet trees using the Four-pointcondition (modified for non-additive matrices).
![Page 31: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/31.jpg)
Naïve Quartet MethodInput: estimated matrix {dij}Output: tree or FailAlgorithm: Compute the tree on each quartet using the four-point
methodMerge them into a tree on the entire set if they are compatible:
– Find a sibling pair A,B– Recurse on S-{A}– If S-{A} has a tree T, insert A into T by making A a sibling to
B, and return the tree
![Page 32: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/32.jpg)
Error tolerance of NQM• Note: every quartet tree is correctly computed if every
estimated distance dij is close enough (within f/2) tothe true evolutionary distance Aij, where f is thesmallest internal edge length.
• Hence, the NQM is guaranteed correct if maxij{|dij-Aij|} < f/2.
![Page 33: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/33.jpg)
The Naïve Quartet Method (NQM) returns thetrue tree if is small enough.),( !dL"
!"Sequence length
Hence NQM is statistically consistent under theGTR model (and any model with a statistically consistent distance estimator)
![Page 34: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/34.jpg)
Statistical consistency and sequence lengthrequirements
![Page 35: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/35.jpg)
Theorem (Erdos et al. 1999): The Naïve Quartet Method will return the truetree w.h.p. provided sequence lengths are exponential in theevolutionary diameter of the tree.
Sketch of proof:
• NQM guaranteed correct if all entries in the estimated distance matrixhave low error.
• Estimations of large distances require long sequences to have low errorwith high probability (w.h.p).
Note: Other methods have the same guarantee (various authors), andbetter empirical performance.
![Page 36: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/36.jpg)
Performance on large diameter trees
Simulation study basedupon fixed edgelengths, K2P model ofevolution, sequencelengths fixed to 1000nucleotides.
Error rates reflectproportion of incorrectedges in inferredtrees.
[Nakhleh et al. ISMB 2001]
NJ
0 400 800 16001200No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
![Page 37: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/37.jpg)
Statistical consistency and sequence lengthrequirements
![Page 38: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/38.jpg)
Chordal graph algorithms yield phylogenyestimation from polynomial length
sequences
• Theorem (Warnow etal., SODA 2001):DCM1-NJ correct withhigh probability givensequences of lengthO(ln n eO(g ln n))
• Simulation study fromNakhleh et al. ISMB2001
NJDCM1-NJ
0 400 800 16001200No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
![Page 39: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/39.jpg)
Afc methodsA method M is “absolute fast converging”, or afc, if for
all positive f, g, and ε, there is a polynomial p(n) s.t.Pr(M(S)=T) > 1- ε, when S is a set of sequencesgenerated on T of length at least p(n).
Notes:1. The polynomial p(n) will depend upon M, f, g, and ε.
2. The method M is not “told” the values of f and g.
![Page 40: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/40.jpg)
Fast converging methods (and related work)
• 1997: Erdos, Steel, Szekely, and Warnow (ICALP).
• 1999: Erdos, Steel, Szekely, and Warnow (RSA, TCS); Huson, Nettlesand Warnow (J. Comp Bio.)
• 2001: Warnow, St. John, and Moret (SODA); Cryan, Goldberg, andGoldberg (SICOMP); Csuros and Kao (SODA); Nakhleh, St. John,Roshan, Sun, and Warnow (ISMB)
• 2002: Csuros (J. Comp. Bio.)
• 2006: Daskalakis, Mossel, Roch (STOC), Daskalakis, Hill, Jaffe,Mihaescu, Mossel, and Rao (RECOMB)
• 2007: Mossel (IEEE TCBB)
• 2008: Gronau, Moran and Snir (SODA)
• 2010: Roch (Science)
![Page 41: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/41.jpg)
“Character-based” methods• Maximum parsimony• Maximum Likelihood• Bayesian MCMC (also likelihood-based)
These are more popular than distance-based methods,and tend to give more accurate trees. However,these are computationally intensive!
![Page 42: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/42.jpg)
Standard problem: Maximum Parsimony(Hamming distance Steiner Tree)
• Input: Set S of n sequences of length k• Output: A phylogenetic tree T
– leaf-labeled by sequences in S– additional sequences of length k labeling
the internal nodes of T
such that is minimized.!" )(),(
),(TEji
jiH
![Page 43: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/43.jpg)
Maximum parsimony (example)• Input: Four sequences
– ACT– ACA– GTT– GTA
• Question: which of the three trees has thebest MP scores?
![Page 44: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/44.jpg)
Maximum Parsimony
ACT
GTT ACA
GTA ACA ACT
GTAGTT
ACT
ACA
GTT
GTA
![Page 45: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/45.jpg)
Maximum Parsimony
ACT
GTT
GTT GTA
ACA
GTA
12
2
MP score = 5
ACA ACT
GTAGTT
ACA ACT3 1 3
MP score = 7
ACT
ACA
GTT
GTAACA GTA1 2 1
MP score = 4
Optimal MP tree
![Page 46: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/46.jpg)
Maximum Parsimony:computational complexity
ACT
ACA
GTT
GTAACA GTA
1 2 1
MP score = 4
Finding the optimal MP tree is NP-hard
Optimal labeling can becomputed in linear time O(nk)
![Page 47: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/47.jpg)
Local search strategies
Phylogenetic trees
Cost
Global optimum
Local optimum
![Page 48: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/48.jpg)
Local search strategies• Hill-climbing based upon topological
changes to the tree
• Incorporating randomness to exit fromlocal optima
![Page 49: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/49.jpg)
Evaluating heuristics withrespect to MP or ML scores
Time
Scoreof besttrees
Performance of Heuristic 1
Performance of Heuristic 2
Fake study
![Page 50: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/50.jpg)
Maximum ParsimonyGood heuristics are available, but can take a very long time (days
or weeks) on large datasets.Typically a consensus tree is returned, since MP can produce
many equally optimal trees.Bootstrapping can also be used to produce support estimations.
MP is not always statistically consistent, even if solved exactly.
![Page 51: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/51.jpg)
MP is not statistically consistent
• Jukes-Cantor evolution• The Felsenstein zone
A
B
C
D
A
C
B
D
![Page 52: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/52.jpg)
Maximum LikelihoodML problem: Given set S of sequences, find the
model tree (T,Θ) such that Pr{S|T,Θ} ismaximized.
Notes:• Computing Pr{S|T,Θ} is polynomial (using
Dynamic Programming) for the GTR model.
![Page 53: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/53.jpg)
GTR Maximum LikelihoodNotes (cont.):• ML is statistically consistent under the General Time Reversible
(GTR) model if solved exactly.• Finding the best GTR model tree is NP-hard (Roch).• Good heuristics exist, but can be computationally intensive
(days or weeks) on large datasets. Memory requirements canbe high.
• Bootstrapping is used to produce support estimations on edges.
Questions:What is the computational complexity of finding the best
parameters Θ on a fixed tree T?What is the sequence length requirement for ML?
![Page 54: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/54.jpg)
Maximum Integrated LikelihoodProblem: Given set S of sequences, find tree topology T such
that∫Pr{S| T,Θ} d(T,Θ) is maximized.
• Recall that computing Pr{S|T,Θ} is polynomial (using DynamicProgramming) for models like Jukes-Cantor.
• We sample parameter values to estimate ∫Pr{S|T,Θ} d(T,Θ).This allows us to estimate the maximum integrated likelihoodtree.
• Question: Can we compute this integral analytically?
![Page 55: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/55.jpg)
Bayesian MCMC• MCMC is used to perform a random walk through model tree
space. After burn-in, a distribution of trees is computed basedupon a random sample of the visited tree topologies.
• A consensus tree or the MAP (maximum a posteriori) tree canalso be returned.
• Support estimations provided as part of output.
• The MAP tree is a statistically consistent estimation for GTR (forappropriate priors).
• Computational issues can be significant (e.g., time to reachconvergence).
![Page 56: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/56.jpg)
Summary (so far)• Distance-based methods are generally polynomial time and
statistically consistent.• Maximum likelihood is NP-hard but statistically consistent if
solved exactly. Good heuristics have no guarantees but can befast on many datasets.
• Bayesian methods can be statistically consistent estimators, butare computationally intensive.
• Maximum Parsimony is not guaranteed statistically consistent,and is NP-hard to solve exactly. Heuristics are computationallyintensive.
![Page 57: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/57.jpg)
Part II• Sequence alignment• From gene trees to species trees• Gene order phylogeny• Reticulate evolution
![Page 58: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/58.jpg)
UVWXY
U
V W
X
Y
AGTGGATTATGCCCATATGACTTAGCCCTAAGCCCGCTT
![Page 59: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/59.jpg)
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 60: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/60.jpg)
Phase 1: Multiple SequenceAlignment
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
![Page 61: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/61.jpg)
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1
S4
S2
S3
![Page 62: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/62.jpg)
…ACGGTGCAGTTACCA…
…ACCAGTCACCA…
MutationDeletionThe true pairwise alignment is:
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
The true multiple alignment is defined by the transitiveclosure of pairwise alignments on the edges of the true tree.
![Page 63: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/63.jpg)
Estimating pairwise alignmentsPairwise alignment is often estimated by
computing the “edit distance”(equivalently, finding a minimum costtransformation).
This requires a cost for indels andsubstitutions.
![Page 64: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/64.jpg)
Example of pairwise alignments1 = ACAT,s2 = ATGCAT,cost indel = 2, cost substitution = 1 Optimal alignment has cost 4:
A - - C A T A T G C A T
![Page 65: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/65.jpg)
Estimating alignments
• Pairwise alignment is typicallyestimated by finding a minimum costtransformation (cost for indels andsubstitutions). Issues: local vs. global.
• Multiple alignment: minimum costmultiple alignment is NP-hard, undervarious formulations.
![Page 66: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/66.jpg)
MSA methods, in practiceTypical approach:
1. Estimate an initial “guide tree”
2. Perform “progressive alignment” up thetree, using Needleman-Wunsch (or avariant) to align alignments
![Page 67: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/67.jpg)
So many methods!!!Alignment method• Clustal• POY (and POY*)• Probcons (and Probtree)• MAFFT• Prank• Muscle• Di-align• T-Coffee• Satchmo• Etc.Blue = used by systematistsPurple = recommended by Edgar and
Batzoglou for protein alignments
Phylogeny method• Bayesian MCMC• Maximum parsimony• Maximum likelihood• Neighbor joining• UPGMA• Quartet puzzling• Etc.
![Page 68: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/68.jpg)
Simulation Studies
S1 S2
S3S4
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-CT-------GACCGC--S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCAS2 = TAGCTATCACGACCGCS3 = TAGCTGACCGCS4 = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCAS2 = TAG-CTATCAC--GACCGC--S3 = TAG-C--T-----GACCGC--S4 = T---C-A-CGACCGA----CA
Compare
True tree andalignment
S1 S4
S3S2
Estimated tree andalignment
UnalignedSequences
![Page 69: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/69.jpg)
Quantifying Error
FN: false negative (missing edge)FP: false positive (incorrect edge)
50% error rate
FN
FP
![Page 70: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/70.jpg)
1000 taxon models, ordered by difficulty (Liu et al., Science 2009)
![Page 71: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/71.jpg)
Problems with the two-phase approach• Current alignment methods fail to return reasonable alignments on
large datasets with high rates of indels and substitutions.
• Manual alignment is time consuming and subjective.
• Systematists discard potentially useful markers if they are difficult toalign.
This issues seriously impact large-scale phylogeny estimation (and Treeof Life projects)
![Page 72: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/72.jpg)
Co-estimation methods• Statistical methods (e.g., BAliPhy, StatAlign, Alifritz,
and others) have excellent statistical performance butare extremely computationally intensive.
• Steiner Tree approaches based upon edit distances(e.g., POY) are sometimes used, but these have poortopological accuracy and are also computationallyintensive.
• SATé (new method) has very good empiricalperformance and can run on large datasets, but noguarantees under any statistical models.
![Page 73: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/73.jpg)
SATé-1 and SATé-2 (“Next” SATé), on 1000 leaf models
![Page 74: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/74.jpg)
Alignment-free methods• Roch and Daskalakis (RECOMB 2010) show
that statistically consistent tree estimation ispossible under a single-indel model
• Nelesen et al. (in preparation) give practicalmethod (DACTAL) for estimating treeswithout a full MSA.
![Page 75: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/75.jpg)
DACTAL more accurate than all standardmethods, and much faster than SATé
Average results on 3 large RNA datasets(6K to 28K)
CRW: Comparative RNA database,structural alignments
3 datasets with 6,323 to 27,643 sequencesReference trees: 75% RAxML bootstrap
treesDACTAL (shown in red) run for 5 iterations
starting from FT(Part)
SATé-1 fails on the largest datasetSATé-2 runs but is not more accurate than
DACTAL, and takes longer
![Page 76: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/76.jpg)
Challenges• Large-scale MSA• Statistical estimation under “long” indels• Understanding why existing MSA
methods perform well
![Page 77: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/77.jpg)
Genomes As Signed Permutations
1 –5 3 4 -2 -6or
6 2 -4 –3 5 –1etc.
![Page 78: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/78.jpg)
Whole-Genome Phylogenetics
A
B
C
D
E
F
X
Y
ZW
A
B
C
D
E
F
![Page 79: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/79.jpg)
Genome-scale evolution
(REARRANGEMENTS)
InversionTranslocationDuplication
![Page 80: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/80.jpg)
Other types of events• Duplications, Insertions, and Deletions
(changes gene content)• Fissions and Fusions (for genomes with more
than one chromosome)
These events change the number of copies ofeach gene in each genome (“unequal genecontent”)
![Page 81: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/81.jpg)
Huge State Space
• DNA sequences : 4 states per site• Signed circular genomes with n genes:
states, 1 site
• Circular genomes (1 site)
– with 37 genes (mitochondria): states
– with 120 genes (chloroplasts): states
)!1(2 1 !! nn
521056.2 !2321070.3 !
![Page 82: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/82.jpg)
Why use gene orders?• “Rare genomic changes”: huge state space and
relative infrequency of events (compared to sitesubstitutions) could make the inference of deepevolution easier, or more accurate.
• Much research shows this is true, but accurateanalysis of gene order data is computationally veryintensive!
![Page 83: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/83.jpg)
Phylogeny reconstruction fromgene orders
• Distance-based reconstruction• Maximum Parsimony for Rearranged
Genomes• Maximum Likelihood and Bayesian methods
![Page 84: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/84.jpg)
Phylogeny reconstruction from gene orders
Distance-based reconstruction:• Compute edit distances between all pairs of
genomes• Correct for unseen changes (statistically-
based distance corrections)• Apply distance-based methods
![Page 85: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/85.jpg)
Maximum Parsimony on Rearranged Genomes (MPRG)
• The leaves are rearranged genomes.• NP-hard: Find the tree that minimizes the total number of
rearrangement events• NP-hard: Find the median of three genomes
A
B
C
D
3 6
2
3
4
A
B
C
D
EF
Total length= 18
![Page 86: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/86.jpg)
Challenges• Pairwise comparisons under complex models• Estimating ancestral genomes• Better models of genome evolution• Combining genome-scale events with
sequence-scale events• Accurate and scalable methods
![Page 87: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/87.jpg)
Estimating species trees frommultiple genes
![Page 88: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/88.jpg)
Multi-gene analysesAfter alignment of each gene dataset:
• Combined analysis: Concatenate (“combine”)alignments for different genes, and runphylogeny estimation methods
• Supertree: Compute trees on alignment andcombine gene trees
![Page 89: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/89.jpg)
. . .
Analyzeseparately
SupertreeMethod
Two competing approaches gene 1 gene 2 . . . gene k
. . . Combined Analysis
Spe
cies
![Page 90: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/90.jpg)
Supertree estimation vs. CA-ML
Scaffold Density (%)
(Swenson et al., In Press, Systematic Biology)
![Page 91: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/91.jpg)
From Gene Trees to Species Trees
• Gene trees can differ from species treesdue to many causes, including– Duplications and losses– Incomplete lineage sorting– Horizontal gene transfer– Gene tree estimation error
![Page 92: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/92.jpg)
Red gene tree ≠ species tree(green gene tree okay)
![Page 93: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/93.jpg)
Present
Past
Courtesy James Degnan
![Page 94: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/94.jpg)
Gene tree in a species tree
Courtesy James Degnan
![Page 95: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/95.jpg)
Incomplete Lineage Sorting(deep coalescence)
• Population-level process leading to gene trees differing fromspecies trees
• Factors include short times between speciation events andpopulation size
• Methods for estimating species trees under ILS includestatistical approaches (*BEAST, BUCKy, STEM, GLASS, etc.)and discrete optimization methods for MDC (minimize deepcoalescence).
![Page 96: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/96.jpg)
Species Networks
A B C D E
![Page 97: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/97.jpg)
Why Phylogenetic Networks?• Lateral gene transfer (LGT)
– Ochman estimated that 755 of 4,288 ORF’s inE.coli were from at least 234 LGT events
• Hybridization– Estimates that as many as 30% of all plant
lineages are the products of hybridization– Fish– Some frogs
![Page 98: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/98.jpg)
Species Networks
A B C D E
![Page 99: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/99.jpg)
Gene Tree I in Species Networks
A B C D E
A B C D E
![Page 100: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/100.jpg)
Gene Tree II in Species Networks
A B C D E
A B C D E A B C D E
![Page 101: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/101.jpg)
. . .
Analyzeseparately
SupernetworkMethod
Two competing approaches gene 1 gene 2 . . . gene k
. . . Combined Analysis
Spe
cies
![Page 102: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/102.jpg)
Most phylogenetic network methods donot produce “explicit” models ofreticulate evolution, but rather visualrepresentations of deviations fromadditivity. This produces a lot of falsepositives!
![Page 103: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/103.jpg)
• Current methods may be fast enough for typical (almost) wholegenome analyses (small numbers of taxa and hundreds tothousands of genes).
• New methods will need to be developed for larger numbers oftaxa.
• New methods are also needed to address the complexity of realbiological data (long indels, rearrangements, duplications,heterotachy, reticulation, fragmentary data from NGS, etc.)
• Tree-of-life scale analyses will require many algorithmicadvances, and present very interesting mathematical questions.
![Page 104: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/104.jpg)
Sequence length requirements
• The sequence length (number of sites) that aphylogeny reconstruction method M needs toreconstruct the true tree with probability atleast 1-ε depends on
• M (the method)• ε• f = min p(e),• g = max p(e), and• n, the number of leaves
![Page 105: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/105.jpg)
Better distance-based methods
• Neighbor Joining• Minimum Evolution• Weighted Neighbor Joining• Bio-NJ• DCM-NJ• And others
![Page 106: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/106.jpg)
Neighbor joining has poor performance on largediameter trees [Nakhleh et al. ISMB 2001]
Simulation studybased upon fixededge lengths, K2Pmodel of evolution,sequence lengthsfixed to 1000nucleotides.
Error rates reflectproportion ofincorrect edges ininferred trees.
NJ
0 400 800 16001200No. Taxa
0
0.2
0.4
0.6
0.8
Erro
r Rat
e
![Page 107: Introduction to Phylogenetic Estimation Algorithms · 2011-09-15 · • Maximum likelihood is NP-hard but statistically consistent if solved exactly. Good heuristics have no guarantees](https://reader033.vdocument.in/reader033/viewer/2022050309/5f7118f12ba04853640708f9/html5/thumbnails/107.jpg)
“Boosting” MP heuristics
• We use “Disk-covering methods”(DCMs) to improve heuristic searchesfor MP and ML
DCMBase method M DCM-M