Phylogenetics
Advances in Bioinformatics and Genomics
GEN 240B
Jason Stajich
April 26 & 28, 2010
Phylogenetics Slide 1/82
Outline
Introduction: Phylogenetics
Tree Basics: Tree Topology, Counting Trees, Tree Rooting Methods, Inferring Trees from Distances
Tree Building Methods: Clustering Methods, Parsimony Methods
Quality Assessment by Resampling: Bootstrap, Jackknife
Software
References
Phylogenetics Introduction Phylogenetics Slide 4/82
Phylogeny: Evolutionary Relationships Among Organisms
Systematics is the study of the diversity of organisms, while taxonomy is the classification of species. Systematics uses and applies phylogenetics and phylogenetic methods.
Phylogenetics is the study of the evolutionary relationships among organisms through time.
Phylogenies (evolutionary models or trees) provide a framework for studying evolutionary patterns and mechanisms.
Phylogenetics Introduction Phylogenetics Slide 5/82
The Tree of Life
The similarity of molecular mechanisms suggests a common ancestor for all organisms.
Thus, any set of species is related. This relationship is called phylogeny.
Phylogenetic trees are often used to represent this relationship.
Phylogenetics Introduction Phylogenetics Slide 6/82
Morphological characters used to construct and interpret phylogenies
Morphological characteristics from living and fossilised organisms have been used for inferring and interpreting phylogenies.
Phylogenetics Introduction Phylogenetics Slide 7/82
Sequence-Based Phylogenies
Molecular relationships can provide more accurate distance measures, such as:
Enzyme and Immunological data
Genetic Mapping data
DNA and protein sequence data
Because certain sequences change at an almost constant rate over time, they can serve as molecular clocks.
These clock-like sequences often allow much more reliable phylogenies to be inferred than traditional approaches.
Phylogenetics Introduction Phylogenetics Slide 8/82
Common Workflow of Phylogenetic Sequence Analyses
1. Select the Appropriate Sequences for a Phylogenetic Question
   Important: sequences should show significant similarity.
2. Create a Multiple Alignment for Chosen Sequences
   Important: unalignable sequence areas should be removed.

   S1         FMPFSAGKRICAGEGLARMELFLFLT  450
   S2         FMPFSAGKRICVGEALAGMELFLFLT  450
   S3         .LAFGCGARVCLGEPLARLELFVVLT  443
   S4         SLPFGFGKRSCMGRRLAELELQMALA  470
   S5         YTPFGSGPRNCIGMRFALMNMKLALI  457
   consensus  ..PFg.GkR.C.Ge.LA.mELfl.Lt

3. Compute a Distance Matrix for Multiple Alignment

        S1    S2    S3    S4    S5
   S1   0.0   0.43  0.71  0.71  0.48
   S2         0.0   0.57  0.57  0.39
   S3               0.0   0.29  0.21
   S4                     0.0   0.13
   S5                           0.0

4. Calculate Phylogenetic Tree
   Important: choose a tree building method.
5. Tree Post Processing
   Important: tree rooting and bootstrapping.
[Figure: example tree for sequences S1 to S5 with branch lengths and node support values (100, 78, 99, 88)]
Phylogenetics Introduction Phylogenetics Slide 9/82
The Tree of Life
Phylogenies from sequences assume that they have evolved from the same ancestral gene in a common ancestral species.
Phylogenetics Introduction Phylogenetics Slide 10/82
Incomplete Phylogenies of the Tree of Life
[Figure: unresolved, partially resolved, and fully resolved trees; bifurcation versus polytomy (multifurcation)]
Some parts of the tree of life are fully resolved, others are only partially resolved or completely unresolved.
Phylogenetics Introduction Phylogenetics Slide 11/82
Evolution of Genes
Complication: the evolution of genes is driven by
Gene duplications
Speciation
Special case in bacteria: horizontal gene transfer
Phylogenetics Introduction Phylogenetics Slide 12/82
Orthologous and Paralogous Genes
[Figure: gene tree for Species A and Species B showing a speciation event and gene duplications, with orthologues and paralogues marked]
Homologous genes: evolved from a common ancestor.
Orthologous genes: homologous genes that evolved by speciation.
Paralogous genes: homologous genes that evolved by gene duplications.
To infer phylogenies of species, orthologous genes need to be used.
To infer phylogenies of gene duplications, paralogous genes need to be used.
Phylogenetics Introduction Phylogenetics Slide 13/82
Interpreting trees: Character evolution
In comparison with its ancestor, an organism has both shared and derived characteristics.
A shared ancestral character is one that originated in an ancestor before the clade formed.
A shared derived character is an evolutionary novelty that is unique to a particular clade and originated at the time of cladogenesis (everyone in the clade has it).
Phylogenetics Introduction Phylogenetics Slide 14/82
Interpreting trees: Characters
(a) Character table (1 = character present, 0 = character absent)

CHARACTERS                   TAXA
                             Lancelet    Lamprey  Tuna  Salamander  Turtle  Leopard
                             (outgroup)
Vertebral column (backbone)  0           1        1     1           1       1
Hinged jaws                  0           0        1     1           1       1
Four walking legs            0           0        0     1           1       1
Amniotic (shelled) egg       0           0        0     0           1       1
Hair                         0           0        0     0           0       1
Phylogenetics Introduction Phylogenetics Slide 15/82
Interpreting trees: Characters on Trees
(b) Phylogenetic tree
[Figure: tree of lancelet (outgroup), lamprey, tuna, salamander, turtle, and leopard with the characters vertebral column, hinged jaws, four walking legs, amniotic egg, and hair mapped onto its branches]
Phylogenetics Tree Basics Tree Topology Slide 18/82
Basics on Trees
Usually, the leaves (end nodes) of a tree are labelled.
In some cases the labels can be swapped (see Fig 7) without changing the tree.
A tree with a given leaf labelling is called a labelled branching pattern or the tree topology T.
The lengths of the edges are denoted by ti.
Phylogenetics Tree Basics Tree Topology Slide 19/82
Basics on Trees
Typically, binary trees are used in phylogenetics, where three branches meet at each branch node.
Tree components
internal and external branches (edges)
internal and external nodes
[Figure: tree with internal nodes, external nodes, internal branches, and external branches marked]
A true phylogeny has a root, which represents the ultimate ancestor for all the other items in a tree.
Certain algorithms like parsimony or probabilistic models provide no information about the position of the root in a tree.
Phylogenetics Tree Basics Tree Topology Slide 20/82
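Tree Styles
[Figure: one tree drawn in slanted style and in rectangular style. Newick format: ((A:0.5, B:0.5):0.1,(C:0.5,D:0.5):0.1);]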
Phylogenetics Tree Basics Tree Topology Slide 21/82
Tree Styles
Circle Tree
Phylogenetics Tree Basics Tree Topology Slide 22/82
Significance of Branch Lengths in Cladograms, Phylograms and Ultrametric Trees
[Figure: taxa A to D drawn as a cladogram, a phylogram, and an ultrametric tree]
Cladogram: branch lengths have no meaning.
Phylogram: branch lengths are proportional to (genetic) change.
Ultrametric tree: branch lengths are proportional to time (axis in million years).
Phylogenetics Tree Basics Tree Topology Slide 23/82
Rotating Branches in Phylogenetic Trees
[Figure 7: Branch Rotations. Two equivalent drawings of a tree with leaves 1 to 4.]
Rotations of internal nodes yield exactly the same tree.
Phylogenetics Tree Basics Tree Topology Slide 24/82
Rooting Trees
There are five possibilities to root the unrooted tree on the top left.
Trees rooted on branches c and d are not shown.
Phylogenetics Tree Basics Counting Trees Slide 26/82
Number of Nodes and Branches in Trees
N nodes in rooted tree = (2n − 1)
N branches in rooted tree = (2n − 2)
N nodes in unrooted tree = (2n − 2)
N branches in unrooted tree = (2n − 3)
n = number of leaves (taxa)
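The four counts above can be checked with a few lines of Python. This is a minimal sketch that assumes fully resolved (binary) trees; the function name `tree_sizes` is made up for illustration.

```python
def tree_sizes(n):
    """Node and branch counts for fully resolved (binary) trees with n leaves."""
    if n < 2:
        raise ValueError("need at least two leaves")
    return {
        "rooted_nodes": 2 * n - 1,
        "rooted_branches": 2 * n - 2,
        "unrooted_nodes": 2 * n - 2,
        "unrooted_branches": 2 * n - 3,
    }

# Four taxa: a rooted tree has 7 nodes (4 leaves + 3 internal) and 6 branches.
sizes = tree_sizes(4)
print(sizes)
```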
Phylogenetics Tree Basics Counting Trees Slide 27/82
Number of Unrooted and Rooted Trees
The number of possible trees grows faster than exponentially as the number of taxa n increases:

$$ N_{\text{unrooted}} = \frac{(2n-5)!}{2^{n-3}(n-3)!} = (2n-5)!! \tag{1} $$

$$ N_{\text{rooted}} = \frac{(2n-3)!}{2^{n-2}(n-2)!} = (2n-3)!! \tag{2} $$

n = number of leaves (taxa)
N = number of possible trees

Example

n    N Unrooted Trees    N Rooted Trees
3    1                   3
4    3                   15
5    15                  105
7    945                 10,395
10   2,027,025           34,459,425
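Equations (1) and (2) reduce to double factorials, so the table can be recomputed directly. The sketch below is one way to do that; the helper names are assumptions, not part of the lecture.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted_trees(n):
    """Number of unrooted binary trees with n labelled leaves (n >= 3), eq. (1)."""
    if n < 3:
        raise ValueError("an unrooted tree needs at least three leaves")
    return double_factorial(2 * n - 5)

def n_rooted_trees(n):
    """Number of rooted binary trees with n labelled leaves (n >= 2), eq. (2)."""
    if n < 2:
        raise ValueError("a rooted tree needs at least two leaves")
    return double_factorial(2 * n - 3)

for n in (3, 4, 5, 7, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

A useful consequence visible in the output: the number of rooted trees for n taxa equals the number of unrooted trees for n + 1 taxa, because rooting adds one extra "leaf" position.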
Phylogenetics Tree Basics Tree Rooting Methods Slide 29/82
Methods for Rooting Phylogenetic Trees
Outgroup Method
Rooting by including one or more outgroup taxa/sequences.
Gene Duplication
A paralogous gene duplication predating the common ancestor of a clade is used.
Midpoint Rooting
The tree is rooted at the midpoint of the path between the two most distant taxa.
Phylogenetics Tree Basics Tree Rooting Methods Slide 30/82
Rooting by Outgroup
Rooting is accomplished by including one or more outgroups (taxa/sequences) that differ from all ingroup members more than the ingroup members differ from each other.
The main assumption of this method is that outgroup taxa fall outside of the ingroup.
Phylogenetics Tree Basics Tree Rooting Methods Slide 31/82
Rooting with Duplicated Genes
A gene duplication in an ancestral organism gives rise to paralogous genes.
Speciation processes give rise to orthologous genes.
Phylogenetics Tree Basics Tree Rooting Methods Slide 32/82
Rooting with Duplicated Genes
The root is placed between paralogous gene populations.
Gene Copies A
Gene Copies B
Phylogenetics Tree Basics Tree Rooting Methods Slide 33/82
Rooting the Tree of Life
Universal trees based on single gene orthologs cannot be rooted by the outgroup method, because of the lack of an ancestral sequence.
Solution: use an ancient gene duplication that predates the last common ancestor (cenancestor) of all living organisms [Iwabe 1989].
Cenancestor
Phylogenetics Tree Basics Tree Rooting Methods Slide 34/82
Midpoint Rooting
Choose the midpoint of the path between the two most distant taxa.
Midpoint rooting assumes that the rate of evolution is the same on the longest branches of the tree.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 36/82
Transforming Characters to Distances
DNA Alignment

G1  TTATTAA
G2  AATTTAA
G3  AAAAATA
G4  AAAAAAT

Distance Matrix

     G1   G2    G3    G4
G1        0.43  0.71  0.71
G2   3          0.57  0.57
G3   5    4           0.29
G4   5    4     2

Absolute distances in bottom triangle and uncorrected relative distances in top triangle.
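The matrix above can be reproduced by counting mismatched columns for each pair of sequences. The following is a small sketch using the four sequences G1 to G4 from the slide; the function names are illustrative assumptions, and the relative distance is the uncorrected proportion of differing sites.

```python
from itertools import combinations

def hamming(a, b):
    """Number of alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_tables(alignment):
    """Return absolute and uncorrected relative distances for every sequence pair."""
    absolute, relative = {}, {}
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        diff = hamming(s1, s2)
        absolute[(n1, n2)] = diff
        relative[(n1, n2)] = round(diff / len(s1), 2)
    return absolute, relative

alignment = {"G1": "TTATTAA", "G2": "AATTTAA", "G3": "AAAAATA", "G4": "AAAAAAT"}
absolute, relative = distance_tables(alignment)
for pair in absolute:
    print(pair, absolute[pair], relative[pair])
```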
Phylogenetics Tree Basics Inferring Trees from Distances Slide 37/82
Similarity vs. Phylogenetic Relationship
Similarity and phylogenetic relationships are not the same.
Similarity refers to likeness or resemblance.
Phylogenetic relationship refers to historical connections through common ancestry.
Similarity reflects evolutionary relationship only when distances are ultrametric (e.g. sequences are evolving in a perfectly clock-like manner).
Example Tree A: B is most similar to A and is also most closely related to A.
When distances are not ultrametric, two taxa can be most similar without being most closely related.
Example Tree B: B is more similar to C, but it is most closely related to A.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 38/82
Properties of Distances
Metric Distances: A matrix of metric distances must satisfy the following four conditions for all taxa.
1. Identity: d(A,A) = 0
2. Symmetry: d(A,B) = d(B,A)
3. Non-negativity: d(A,B) ≥ 0 if A ≠ B
4. Triangle inequality: d(A,C) ≤ d(A,B) + d(B,C)
Phylogenetics Tree Basics Inferring Trees from Distances Slide 39/82
Properties of Distances
Ultrametric Distances: Satisfy conditions 1-4 plus condition 5.
5. Ultrametric condition:
   d(A,B) ≤ max [d(A,C), d(B,C)]
   d(A,C) ≤ max [d(A,B), d(B,C)]
   d(B,C) ≤ max [d(A,B), d(A,C)]
Condition 5 can only be true if the two largest distances are equal and define the longest sides of an isosceles triangle.
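Condition 5 is easy to test mechanically: for every triplet of taxa, sort the three distances and check that the two largest agree. The sketch below assumes distances stored in a dictionary keyed by `frozenset` pairs; the two example matrices are hypothetical values chosen for illustration, not data from the lecture.

```python
from itertools import combinations

def is_ultrametric(d, taxa, tol=1e-9):
    """True if, in every triplet, the two largest distances are equal (within tol)."""
    for a, b, c in combinations(taxa, 3):
        smallest, middle, largest = sorted(
            [d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))]]
        )
        if largest - middle > tol:
            return False
    return True

# Hypothetical clock-like distances: A and B are close, C is equally far from both.
clock_like = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6}
# Hypothetical non-clock-like distances: no two of the three values agree.
not_clock_like = {frozenset("AB"): 7, frozenset("AC"): 6, frozenset("BC"): 3}

print(is_ultrametric(clock_like, "ABC"))
print(is_ultrametric(not_clock_like, "ABC"))
```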
Phylogenetics Tree Basics Inferring Trees from Distances Slide 40/82
Properties of Distances
Additive Distances: They must be metric (conditions 1-4) or ultrametric (conditions 1-5) and also satisfy condition 6.
6. A matrix is additive if and only if the following conditions apply for every combination of four taxa (A, B, C, D):
   d(A,B) + d(C,D) ≤ max [d(A,C) + d(B,D), d(A,D) + d(B,C)]
   d(A,C) + d(B,D) ≤ max [d(A,B) + d(C,D), d(A,D) + d(B,C)]
   d(A,D) + d(B,C) ≤ max [d(A,B) + d(C,D), d(A,C) + d(B,D)]
Condition 6 is also known as Buneman's four-point metric. For distances to fit perfectly into an evolutionary tree, they must satisfy this rule.
Tree additivity occurs when the evolutionary distance between each pair of taxa is equal to the sum of the branch lengths on the path between the members of the pair.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 41/82
Additivity of Distances in Trees
Additive property: the distances in the distance matrix match the relative branch lengths in the tree.
In a perfectly additive tree the branch lengths match the distances in the distance matrix perfectly. In such a case there will be only a single and unique additive tree that fits the distance matrix (perfect fit theorem).
Example for perfectly additive tree:

     A    B    C    D
A         0.3  0.6  0.5
B              0.7  0.6
C                   0.7
Phylogenetics Tree Basics Inferring Trees from Distances Slide 42/82
Application of the Buneman’s Test
Buneman's four-point metric simply means that of the three sums of distances, one sum must be smaller than the other two, and these other two must be equal.
For example:
d(A,B) + d(C,D) < d(A,C) + d(B,D) = d(A,D) + d(B,C)
(0.3 + 0.7) < (0.6 + 0.6) = (0.5 + 0.7)
1.0 < 1.2 = 1.2
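The test can be written as a short function that computes the three pairwise sums for a quartet and checks that the two largest are equal. This is a sketch using the example matrix from the slides; the dictionary layout, the function names, and the perturbed matrix `d_bad` are assumptions made for illustration.

```python
def pair_sums(d, a, b, c, e):
    """The three sums of distances for the three ways to pair up four taxa."""
    get = lambda x, y: d[frozenset((x, y))]
    return {
        ((a, b), (c, e)): get(a, b) + get(c, e),
        ((a, c), (b, e)): get(a, c) + get(b, e),
        ((a, e), (b, c)): get(a, e) + get(b, c),
    }

def is_additive_quartet(d, a, b, c, e, tol=1e-9):
    """Four-point condition: the two largest sums must be equal (within tol)."""
    smallest, middle, largest = sorted(pair_sums(d, a, b, c, e).values())
    return largest - middle <= tol

# Example matrix from the slides.
d = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.6, frozenset("AD"): 0.5,
    frozenset("BC"): 0.7, frozenset("BD"): 0.6, frozenset("CD"): 0.7,
}
# Hypothetical perturbation: inflating d(A,C) breaks additivity.
d_bad = dict(d)
d_bad[frozenset("AC")] = 0.9

print(is_additive_quartet(d, "A", "B", "C", "D"))
print(is_additive_quartet(d_bad, "A", "B", "C", "D"))
```

A tolerance is used instead of exact equality because sums such as 0.5 + 0.7 are computed in floating point.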
Phylogenetics Tree Basics Inferring Trees from Distances Slide 43/82
The Neighbor-Relations Methods
The neighbor-relations method takes advantage of Buneman's four-point metric to choose the correct tree when the distances are additive.
Neighbors are taxa that are joined through a single internal node. Non-neighbors are joined through more than one internal node.
If the tree above is taken as the true tree, then d(A,B) + d(C,D) < d(A,C) + d(B,D) = d(A,D) + d(B,C), because d(A,B) and d(C,D) are distances between neighbors and do not contain the internal branch.
In the case of four taxa with unknown phylogenetic relationships, identify the two sets of neighbors. Once the neighbors are identified, so is the topology.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 44/82
The Neighbor-Relations Methods
With four taxa three unrooted trees are possible.
Each tree contains different neighbors.
To determine which taxa are neighbors, compute the distances for all possible pairs of taxa.

     A    B    C    D
A         0.3  0.6  0.5
B              0.7  0.6
C                   0.7

The given tree is the correct one, because the sum d(A,B) + d(C,D) is the smallest of the three possible sums.
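For four taxa, choosing the topology amounts to picking the pairing with the smallest sum of distances. A minimal sketch with the matrix from the slide follows; `neighbor_pairs` is a hypothetical helper name.

```python
def neighbor_pairs(d, a, b, c, e):
    """Return the pairing of four taxa whose summed within-pair distance is minimal."""
    get = lambda x, y: d[frozenset((x, y))]
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(pairings, key=lambda p: get(*p[0]) + get(*p[1]))

d = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.6, frozenset("AD"): 0.5,
    frozenset("BC"): 0.7, frozenset("BD"): 0.6, frozenset("CD"): 0.7,
}

# The winning pairing names the two sets of neighbors, which fixes the unrooted topology.
print(neighbor_pairs(d, "A", "B", "C", "D"))
```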
Phylogenetics Tree Basics Inferring Trees from Distances Slide 45/82
Superimposed Substitutions Cause Distances to Be Non-Additive
A complete inventory of all genetic events (unique and superimposed substitutions) would constitute a set of perfectly additive distances.
However, superimposed substitutions cause observed distances to be non-additive and to underestimate true evolutionary distances.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 46/82
Corrections for Non-Additive Distances
Models of sequence evolution are employed to correct observed distances for superimposed substitutions.
Ideally, this will result in additive distances.
In reality, corrected distances are unlikely to exhibit a perfect fit to pathlength distances on a tree because of:
1. Inadequate models of sequence evolution.
2. Stochastic (random) error associated with sequences of finite length.
Phylogenetics Tree Basics Inferring Trees from Distances Slide 47/82
Additivity and Distance Matrices
Phylogenetics Tree Building Methods Slide 49/82
Desirable Features of Tree Building Methods
Consistency: will the method converge on the correct solution given enough data?
Efficiency: how fast is the method?
Robustness: will minor violations of the assumptions result in poor estimates of phylogeny?
Phylogenetics Tree Building Methods Clustering Methods Slide 51/82
Clustering Methods
Algorithmic methods in which the algorithm itself defines the tree selection criterion.
No optimality criteria applied.
Advantage: tend to be very fast (efficient) computations that produce a single tree.
Disadvantages:
Do not allow evaluation of competing hypotheses.
No objective function (e.g. likelihood, number of steps) is used to compare different trees to each other, even if numerous other trees could explain the data equally well.
Examples that construct trees from distances:
UPGMA
Neighbor joining (NJ)
Phylogenetics Tree Building Methods Clustering Methods Slide 52/82
UPGMA
UPGMA stands for unweighted pair group method using arithmetic averages [Sokal & Michener 1958].
Clusters taxa (sequences) agglomeratively and creates at the same time a hierarchical tree.
The branch (edge) lengths and node positions are determined by the average distance between clusters.
There are variants of UPGMA that define the distance between clusters (linkage method) as the minimum or maximum of the distances between clusters, rather than the average.
The average linkage seems to have the best performance record.
Phylogenetics Tree Building Methods Clustering Methods Slide 53/82
Basic Definitions for UPGMA
Initially, each sequence is assigned to its own cluster (i, j, ...) of size 1.
The distance dij between two clusters Ci and Cj is the average distance between pairs of sequences from each cluster:

$$ d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \tag{3} $$

|Ci| and |Cj| are the numbers of items in the clusters i and j.
Ck is defined as the union of the two clusters Ci and Cj. If Ck = Ci ∪ Cj and Cl is any other cluster, then

$$ d_{kl} = \frac{d_{il}|C_i| + d_{jl}|C_j|}{|C_i| + |C_j|} \tag{4} $$
Phylogenetics Tree Building Methods Clustering Methods Slide 54/82
UPGMA Algorithm
Initialization
Assign each sequence to its own cluster Ci.
Define one leaf of T (tree) for each sequence, and place it at height zero.
Iteration
Determine the two clusters i, j for which dij is minimal. If there are several equidistant choices, then pick one randomly.
Define a new cluster k by Ck = Ci ∪ Cj.
Define a node k with daughter nodes i and j, and place it at height dij/2.
Add k to the current clusters and remove clusters i and j.
Update distances by computing dkl for all other clusters l.
Termination
When only two clusters i and j remain, place the root at height dij/2.
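The steps above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the lecture's reference code: distances are held in a dictionary keyed by `frozenset` pairs, internal nodes get made-up labels `node0`, `node1`, ..., ties are broken by iteration order rather than randomly, and the example matrix is a hypothetical ultrametric one. The update of distances uses eq. (4).

```python
from itertools import combinations

def upgma(names, dist):
    """Build a UPGMA tree; returns (Newick string, root height)."""
    # Each cluster maps to (size, node height, Newick text of its subtree).
    clusters = {name: (1, 0.0, name) for name in names}
    d = dict(dist)
    next_id = 0
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        (ni, hi, ti), (nj, hj, tj) = clusters.pop(i), clusters.pop(j)
        k = f"node{next_id}"
        next_id += 1
        height = dij / 2  # the new node sits at half the cluster distance
        # Size-weighted average distance to every remaining cluster (eq. 4).
        for l in clusters:
            d[frozenset((k, l))] = (
                d[frozenset((i, l))] * ni + d[frozenset((j, l))] * nj
            ) / (ni + nj)
        # Branch length = parent height minus child height.
        newick = f"({ti}:{height - hi:g},{tj}:{height - hj:g})"
        clusters[k] = (ni + nj, height, newick)
    (_, root_height, newick), = clusters.values()
    return newick + ";", root_height

# Hypothetical ultrametric distances for four taxa.
names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
tree, root_height = upgma(names, dist)
print(tree, root_height)
```

Because the example distances are ultrametric, every leaf ends up at the same total distance from the root, which is the clock-like behaviour UPGMA assumes.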
Phylogenetics Tree Building Methods Clustering Methods Slide 55/82
Illustration of UPGMA Algorithm
[Figure: UPGMA clustering of five sequences (1 to 5) shown for iteration 1, iteration 2, and iterations 3-4; new nodes 6 to 9 are placed at heights 1/2 d12, 1/2 d45, and 1/2 d68]
Phylogenetics Tree Building Methods Clustering Methods Slide 56/82
Limitations of UPGMA Algorithm
The edge lengths of a UPGMA tree correspond roughly to the times measured by a molecular clock with constant rate.
The method assumes that the divergence of sequences occurs at all points in the tree with a constant rate and the distances are additive.
If the molecular clock assumption applies to a given distance matrix, then UPGMA constructs the tree correctly.
However, if this assumption does not apply to the underlying distance matrix, then UPGMA may construct the tree incorrectly.
Example Fig 7: correct tree on the left and incorrect UPGMA tree on the right:
[Figure: two trees over the taxa A, B, C, D; the correct tree (left) and the incorrect UPGMA tree (right)]
Solution: test if the distance matrix is ultrametric, where in any triplet of distances the two largest must be equal and the remaining one is the smallest.
Phylogenetics Tree Building Methods Clustering Methods Slide 57/82
Neighbor-Joining Method
If the molecular clock property fails for a given data set, but additivity holds, then the neighbor-joining method can construct a correct tree.
To overcome the problem that neighboring leaves can be more distant to each other than to non-neighboring leaves (see Fig 7), one can calculate the rate corrected distances Dij by subtracting from dij the averaged distances to all other leaves:

$$ D_{ij} = d_{ij} - (r_i + r_j), \quad \text{where } r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik} \tag{5} $$

|L| is the size of the leaf set L.
Consequence: i and j are neighboring leaves if their Dij is minimal!
Phylogenetics Tree Building Methods Clustering Methods Slide 58/82
Example: Rate Corrected Distance Matrix
Distance Matrix

     A     B     C     D
A          0.3   0.7   0.4
B   -1.2         0.8   0.5
C   -1.0  -1.0         0.5
D   -1.0  -1.0  -1.2

Upper right triangle: original distance values.
Lower left triangle: rate corrected distances calculated by eq 5 as:

rA = (0.3 + 0.7 + 0.4)/2 = 0.7
rB = (0.3 + 0.8 + 0.5)/2 = 0.8
rC = (0.7 + 0.8 + 0.5)/2 = 1.0
...
DAB = 0.3 − 0.7 − 0.8 = −1.2
DAC = 0.7 − 0.7 − 1.0 = −1.0
...
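The whole lower triangle can be recomputed from eq. 5 with a few lines of Python. This sketch uses the original distances from the upper triangle of the example; the function name `rate_corrected` and the dictionary layout are assumptions for illustration.

```python
from itertools import combinations

def rate_corrected(names, d):
    """Return (r, D): averaged distances r_i and rate corrected distances D_ij (eq. 5)."""
    n = len(names)
    r = {
        i: sum(d[frozenset((i, k))] for k in names if k != i) / (n - 2)
        for i in names
    }
    D = {
        frozenset((i, j)): d[frozenset((i, j))] - (r[i] + r[j])
        for i, j in combinations(names, 2)
    }
    return r, D

names = ["A", "B", "C", "D"]
d = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.7, frozenset("AD"): 0.4,
    frozenset("BC"): 0.8, frozenset("BD"): 0.5, frozenset("CD"): 0.5,
}
r, D = rate_corrected(names, d)
for i, j in combinations(names, 2):
    print(i, j, round(D[frozenset((i, j))], 2))
```

The two minimal values, for the pairs A, B and C, D, identify the neighboring leaves even though d(A,D) = 0.4 is smaller than d(C,D) = 0.5 in the original matrix.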
Phylogenetics Tree Building Methods Clustering Methods Slide 59/82
Neighbor-Joining Algorithm
Initialization
T is the set of leaf nodes, one for each item in the distance matrix; set L = T.
Iteration
Pick a pair i, j in L for which Dij is minimal as defined by eq 5.
Define a new node k and set dkm = 1/2(dim + djm − dij) for all other leaves m in L.
Add k to T with edges of lengths dik = 1/2(dij + ri − rj) and djk = dij − dik that connect i and j to node k.
Remove i and j from L and add k.
Termination
When L consists of two leaves i and j, add the remaining edge between i and j with length dij.
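A compact sketch of this procedure is shown below. It is an illustration under assumptions rather than a reference implementation: the tree is returned as a list of `(node, node, length)` edges, internal nodes are given made-up labels `node0`, `node1`, ..., and ties in Dij are broken by iteration order. The test data are the additive distances from the four-taxon example used earlier for Buneman's test.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the unrooted NJ tree as a list of (node, node, branch length) edges."""
    d = dict(dist)
    L = list(names)
    edges = []
    next_id = 0
    key = lambda a, b: frozenset((a, b))
    while len(L) > 2:
        # Averaged distances r_i over the current leaf set (eq. 5).
        r = {i: sum(d[key(i, m)] for m in L if m != i) / (len(L) - 2) for i in L}
        # Pair with minimal rate corrected distance D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2), key=lambda p: d[key(*p)] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        dij = d[key(i, j)]
        # Distances from the new node k to all other leaves m.
        for m in L:
            if m not in (i, j):
                d[key(k, m)] = 0.5 * (d[key(i, m)] + d[key(j, m)] - dij)
        # Edge lengths connecting i and j to k.
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, dik))
        edges.append((j, k, dij - dik))
        # Replace i and j by k in the leaf set.
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, d[key(i, j)]))
    return edges

names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.6, frozenset("AD"): 0.5,
    frozenset("BC"): 0.7, frozenset("BD"): 0.6, frozenset("CD"): 0.7,
}
edges = neighbor_joining(names, dist)
for a, b, length in edges:
    print(a, b, round(length, 3))
```

Because the input distances are additive, the path lengths in the returned tree reproduce the input matrix; the order in which tied pairs are joined changes the node labels but not the unrooted tree.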
Phylogenetics Tree Building Methods Clustering Methods Slide 60/82
Main Differences Between UPGMA and Neighbor-Joining
UPGMA
1. Assumes additivity and ultrametricity.
2. Does not use rate corrected distance values for tree construction.
3. Results in a rooted tree with branch pairs reflecting the distance information.
4. Tree type: cladogram.
5. May fail to generate the correct tree from distance values that violate the ultrametricity rule.
Neighbor-Joining
1. Assumes additivity, but not ultrametricity.
2. Uses rate corrected distance values for tree construction.
3. Results in an unrooted tree with the branch lengths reflecting the distance information.
4. Tree type (after rooting): phylogram or cladogram.
5. Generates the correct tree from distance values that may violate the ultrametricity rule.
Phylogenetics Tree Building Methods Parsimony Methods Slide 62/82
Parsimony Methods
Simple hypotheses are preferred over more complicated hypotheses (∼ evolution is parsimonious).
No explicit correction for superimposed substitutions.
The maximum parsimony tree is the tree that requires the minimum number of steps.
Known to be inconsistent in some situations (e.g. parsimony is susceptible to long branch attraction).
Phylogenetics Tree Building Methods Parsimony Methods Slide 63/82
Parsimony
Basic principle: find the tree that explains the observed sequences with the minimal number of substitutions.
Instead of building a tree, like with distance methods, parsimony assigns a cost to a given tree.
This requires searching through all possible trees or a subset of trees that contains the best or close to best tree topology.
Parsimony algorithms consist of two major steps:
1. Computation of the cost for a given tree T.
2. Search for the tree with minimum cost.
Phylogenetics Tree Building Methods Parsimony Methods Slide 64/82
Example: Computing the Cost of a Given Tree
Given a multiple alignment and a tree topology, one can count the number of substitutions needed for each tree.
The following figure shows two possible trees for the alignment on the left. The trees differ in the order in which the sequences are assigned to the leaves.
Hypothetical sequences have been assigned to the ancestral nodes that minimize the number of substitutions needed in the entire tree.
[Figure: alignment of four sequences (AAG, AAA, GGA, AGA) and two possible parsimony trees. Left tree: leaves AAG, AAA, GGA, AGA with inferred ancestors AAA and AGA and root AAA. Right tree: leaves AAG, AGA, AAA, GGA with inferred ancestors AAA and AAA and root AAA. The numbers of changes are marked on the branches.]
The tree on the left is more parsimonious than the one on the right because it requires only 3 instead of 4 changes.
As shown here, parsimony treats each site independently and then sums up the substitutions needed for all sites.
Phylogenetics Tree Building Methods Parsimony Methods Slide 65/82
Traditional Parsimony [Fitch 1971]
Summary
Requires simply counting the number of substitutions for a tree. To obtain the cost of a tree, generate a list of minimal cost residues Rk at each node k, along with the current cost C. To compute the minimal cost at site u, one can proceed as follows:
Initialization
Set C = 0 and k = 2n − 1.
Recursion - to obtain the set Rk:
If k is a leaf node:
  Set Rk = {x_u^k}, the residue of sequence k at site u.
If k is not a leaf node:
  Compute Ri, Rj for the daughter nodes i, j of k, and set Rk = Ri ∩ Rj if the intersection is not empty, or else set Rk = Ri ∪ Rj and increment C.
Termination
Minimal cost of tree = C.
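The recursion can be sketched in a few lines of Python. This version makes some assumptions for illustration: a rooted binary tree is written as nested 2-tuples of leaf names, the leaf names `s1` to `s4` are made up, and the sequences are the four from the example figure (AAG, AAA, GGA, AGA).

```python
def fitch_site(tree, states):
    """Return (residue set R_k, cost) for one site of the subtree `tree`."""
    if isinstance(tree, str):  # leaf: the set holds only the observed residue
        return {states[tree]}, 0
    (ri, ci), (rj, cj) = fitch_site(tree[0], states), fitch_site(tree[1], states)
    common = ri & rj
    if common:  # non-empty intersection: no extra substitution needed
        return common, ci + cj
    return ri | rj, ci + cj + 1  # empty intersection: take the union, add one step

def parsimony_cost(tree, seqs):
    """Sum the per-site Fitch costs over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[u] for name, seq in seqs.items()})[1]
        for u in range(length)
    )

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
left_tree = (("s1", "s2"), ("s3", "s4"))
right_tree = (("s1", "s4"), ("s2", "s3"))
print(parsimony_cost(left_tree, seqs), parsimony_cost(right_tree, seqs))
```

Each site is scored independently and the costs are summed, which matches the way the example counts 3 changes for one tree and 4 for the other.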
Phylogenetics Tree Building Methods Parsimony Methods Slide 66/82
Algorithm Weighted Parsimony
Definitions
S(a, b) is the cost of substituting a by b. To compute the minimal cost at site u, let S_k(a) denote the minimal cost of the subtree below node k given that residue a is assigned to node k.
Initialization
Set k = 2n − 1, the index of the root node (a rooted binary tree with n leaves has 2n − 1 nodes).
Recursion - compute S_k(a) for all a as follows:
If k is a leaf node: set S_k(a) = 0 for a = x^k_u, and S_k(a) = ∞ otherwise.
If k is not a leaf node: compute S_i(a), S_j(a) for all a at the daughter nodes i, j, and define S_k(a) = min_b(S_i(b) + S(a, b)) + min_b(S_j(b) + S(a, b)).
Termination
Minimal cost of tree = min_a S_{2n−1}(a).
Note: If S(a, a) = 0 and S(a, b) = 1 for all a ≠ b, then the algorithm is identical to traditional parsimony.
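A minimal sketch of the weighted recursion follows. The tree encoding, function names, leaf labels and the transition/transversion weights (1 and 2) are illustrative assumptions, not values from the lecture.

```python
import math


def sankoff_site(node, seqs, u, S, alphabet):
    """Return {a: S_k(a)} for the subtree rooted at `node` at site u."""
    if isinstance(node, str):   # leaf: cost 0 for the observed residue, infinity otherwise
        return {a: (0 if a == seqs[node][u] else math.inf) for a in alphabet}
    left, right = node          # daughter nodes i and j
    s_i = sankoff_site(left, seqs, u, S, alphabet)
    s_j = sankoff_site(right, seqs, u, S, alphabet)
    return {
        a: min(s_i[b] + S(a, b) for b in alphabet)
        + min(s_j[b] + S(a, b) for b in alphabet)
        for a in alphabet
    }


def weighted_parsimony(tree, seqs, S, alphabet="ACGT"):
    """Sum over sites of min_a S_root(a)."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        min(sankoff_site(tree, seqs, u, S, alphabet).values())
        for u in range(n_sites)
    )


def unit_cost(a, b):
    """S(a, a) = 0 and S(a, b) = 1: reduces to traditional parsimony."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    same_class = (a in "AG") == (b in "AG")   # both purines or both pyrimidines
    return 1 if same_class else 2


seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
left_tree = (("s1", "s2"), ("s3", "s4"))
right_tree = (("s1", "s4"), ("s2", "s3"))
print(weighted_parsimony(left_tree, seqs, unit_cost))    # 3
print(weighted_parsimony(right_tree, seqs, unit_cost))   # 4
```

With the unit cost the result matches the traditional count. A weighted scheme only changes the score when substitutions of different classes occur, for example an A/C pair costs 1 under `unit_cost` and 2 under `ts_tv_cost`.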
Phylogenetics Tree Building Methods Parsimony Methods Slide 67/82
Tree Search Methods
Several tree search methods can be considered:
1 Exhaustive searches: all trees are evaluated; only possible for trees with fewer than 20 taxa.
2 Heuristic searches: not guaranteed to find the best tree (e.g. random branch changes and re-scoring of the tree).
3 Branch and bound algorithm: does not evaluate all trees, but is guaranteed to find the best tree.
4 Many other approaches.
Branch and bound algorithm
It exploits the idea that the cost (number of substitutions) of a tree cannot decrease when an extra edge is added.
It systematically builds trees with increasing numbers of leaves and abandons an avenue of tree building whenever the cost of an incomplete tree exceeds the smallest cost found so far for a complete tree.
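The idea can be sketched as a small recursive search. This is an illustrative sketch under stated assumptions, not the lecture's implementation: trees are nested tuples of leaf indices, taxon 0 is kept attached at the root so that each unrooted topology is visited once, and partial trees are scored with the Fitch count.

```python
import math


def fitch_cost(tree, seqs):
    """Fitch parsimony cost of a rooted binary tree of leaf indices."""
    def site(node, u):
        if not isinstance(node, tuple):                  # leaf
            return {seqs[node][u]}, 0
        (r_i, c_i), (r_j, c_j) = site(node[0], u), site(node[1], u)
        common = r_i & r_j
        return (common, c_i + c_j) if common else (r_i | r_j, c_i + c_j + 1)
    return sum(site(tree, u)[1] for u in range(len(seqs[0])))


def insertions(node, leaf):
    """Yield every tree made by attaching `leaf` to an edge at or below `node`."""
    yield (node, leaf)                                   # edge above this node
    if isinstance(node, tuple):
        left, right = node
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def branch_and_bound(seqs):
    """Return (cost, tree) of one most parsimonious tree for >= 2 sequences."""
    n = len(seqs)
    best = {"cost": math.inf, "tree": None}

    def extend(subtree, next_leaf):
        tree = (0, subtree)
        cost = fitch_cost(tree, seqs)
        # Bound: adding leaves cannot lower the cost, so this avenue cannot
        # beat the best complete tree. Ties are pruned too, because a single
        # optimal tree is enough here.
        if cost >= best["cost"]:
            return
        if next_leaf == n:                               # complete tree
            best["cost"], best["tree"] = cost, tree
            return
        for bigger in insertions(subtree, next_leaf):    # branch
            extend(bigger, next_leaf + 1)

    extend(1, 2)
    return best["cost"], best["tree"]


print(branch_and_bound(["AAG", "AAA", "GGA", "AGA"]))   # (3, (0, (1, (2, 3))))
```

For the four example sequences the search returns cost 3 with taxa 0 and 1 separated from taxa 2 and 3, which is the more parsimonious of the two trees discussed earlier.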
Phylogenetics Tree Building Methods Parsimony Methods Slide 68/82
Branch and Bound Method in Phylogeny
Phylogenetics Tree Building Methods Parsimony Methods Slide 69/82
Assessing the Reliability of Tree Branches
Nonparametric resampling methods are often used to estimate the variance associated with a statistic when the underlying sampling distribution for a statistic is either unknown or difficult to derive analytically.
Resampling methods include the bootstrap and the jackknife, both of which operate by repeatedly resampling data from the original data set to estimate the variance of the sampling distribution.
Although both methods have been used for evaluating the reliability of branches, the bootstrap method is more commonly applied.
Phylogenetics Quality Assessment by Resampling Slide 71/82
Bootstrap Resampling
Data points are randomly resampled from the original data set, with replacement, until new data sets with the original number of observations are obtained.
The statistic of interest (e.g. a tree) is computed for each replicated data set.
Agreement among the resulting trees is summarized with a majority-rule consensus tree (agreement > 50%).
A bootstrap proportion (BP) is the frequency of occurrence of a clade (for all replicated data sets) and is a measure of support for a group.
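The steps above can be sketched for the smallest interesting case, four taxa, where only three unrooted topologies exist. Everything named here is an assumption for illustration: the toy alignment, the use of the Fitch count as the tree-building criterion, the number of replicates and the seed.

```python
import random
from collections import Counter

# The three unrooted topologies for taxa 0..3, written as rooted nested tuples.
QUARTETS = {
    "01|23": ((0, 1), (2, 3)),
    "02|13": ((0, 2), (1, 3)),
    "03|12": ((0, 3), (1, 2)),
}


def site_cost(node, column):
    """Fitch residue set and cost for one alignment column."""
    if not isinstance(node, tuple):
        return {column[node]}, 0
    (r_i, c_i), (r_j, c_j) = site_cost(node[0], column), site_cost(node[1], column)
    common = r_i & r_j
    return (common, c_i + c_j) if common else (r_i | r_j, c_i + c_j + 1)


def best_topology(columns):
    """Most parsimonious quartet topology; ties go to the first in QUARTETS."""
    costs = {
        name: sum(site_cost(tree, col)[1] for col in columns)
        for name, tree in QUARTETS.items()
    }
    return min(costs, key=costs.get)


def bootstrap_proportions(seqs, replicates=200, seed=0):
    """Resample columns with replacement and count how often each split wins."""
    rng = random.Random(seed)
    columns = list(zip(*seqs))                 # one tuple of residues per site
    counts = Counter()
    for _ in range(replicates):
        resampled = rng.choices(columns, k=len(columns))   # same length as original
        counts[best_topology(resampled)] += 1
    return {name: counts[name] / replicates for name in QUARTETS}


# Hypothetical toy alignment, one string per taxon.
toy = ["AAGCTTGACC", "AAGCTTGATC", "GAGCATGACT", "GAGCATCACT"]
print(bootstrap_proportions(toy))
```

A split whose proportion exceeds 0.5 would appear in the majority-rule consensus tree, and its proportion is the BP reported for that branch.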
Phylogenetics Quality Assessment by Resampling Bootstrap Slide 73/82
Example: Bootstrapping
[Figure: original matrix (alignment) and resampled matrix (alignment); majority-rule tree with bootstrap proportions (BP) at branch nodes.]
Phylogenetics Quality Assessment by Resampling Bootstrap Slide 74/82
Interpreting Bootstrap Values
Bootstrap proportions are sometimes interpreted as confidence intervals for phylogenies.
This interpretation assumes that characters (e.g. nucleotide or amino acid sites) are independent and identically distributed, and that the method is consistent.
Even in a best-case scenario, bootstrapping sometimes gives underestimates of accuracy at high bootstrap values and overestimates of accuracy at low bootstrap values.
Zharkikh and Li (1995) suggested the complete and partial bootstrap technique to reduce the bias of bootstrap proportions.
Phylogenetics Quality Assessment by Resampling Bootstrap Slide 75/82
Jackknife Resampling
The original data set, which contains n data points, is resampled by dropping k data points at a time. This results in a resampled data set with n − k data points.
The statistic of interest is computed for each replicated data set.
Agreement among the resulting trees is summarized with a majority-rule consensus tree. A jackknife proportion (JP) is the frequency of occurrence of a clade (for all replicated data sets) and is a measure of support for a group.
Like the bootstrap, the jackknife assumes that characters (e.g. nucleotides) are independent and identically distributed.
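A delete-k jackknife over alignment columns can be sketched as follows. To keep the example short, tree building is replaced by a stand-in statistic: the pair of taxa with the smallest Hamming distance, i.e. the first pair a distance-based clustering method would join. The toy alignment, function names, k, replicate count and seed are all assumptions.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_pair(seqs):
    """Stand-in for tree building: the pair of taxa (i, j), i < j, that is closest."""
    pairs = combinations(range(len(seqs)), 2)
    return min(pairs, key=lambda p: hamming(seqs[p[0]], seqs[p[1]]))


def jackknife_sample(seqs, k, rng):
    """Drop k randomly chosen columns, keeping the remaining n - k in order."""
    n = len(seqs[0])
    kept = sorted(rng.sample(range(n), n - k))   # sampling without replacement
    return ["".join(s[i] for i in kept) for s in seqs]


def jackknife_proportion(seqs, pair, k, replicates=200, seed=0):
    """Fraction of replicates in which `pair` is recovered as the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(jackknife_sample(seqs, k, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


# Hypothetical toy alignment, one string per taxon.
toy = ["AAGCTTGACC", "AAGCTTGATC", "GAGCATGACT", "GAGCATCACT"]
print(jackknife_proportion(toy, (0, 1), k=3))
```

Unlike the bootstrap, each replicate is shorter than the original alignment and contains no repeated columns.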
Phylogenetics Quality Assessment by Resampling Jackknife Slide 77/82
Phylogenetics Software (Selection!)
PAUP: complex phylogenetic tool collection (partially commercial)
PHYLIP: complex phylogenetic tool collection (free)
MrBayes: popular package for Bayesian inference of phylogenies.
BEAST: Another tool for Bayesian inference of phylogenies.
PAML: Molecular evolution software for determining sequence evolutionary rates.
HyPhy: Molecular evolution software.
Many more: see for instance this Link Collection.
Phylogenetics Software Slide 79/82
Alignment processing
trimAl: automated alignment trimming - with built-in and flexible models
Gblocks: automated alignment trimming
Phylogenetics Software Slide 81/82
References
Durbin R, Eddy S, Krogh A and Mitchison G (1999) Biological Sequence Analysis - Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. Chapters 7-8.
Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Inc, MA. Pages 1-664.
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20: 406-416.
Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T (1989) Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci U S A 86: 9355-9359. URL http://www.hubmed.org/display.cgi?uids=2531898
Sokal RR and Michener CD (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38: 1409-1438.
Phylogenetics References Slide 82/82