1 phylogenetic tree reconstruction modified version of dr. chun-chieh shih’ institute of...
Post on 15-Jan-2016
1
Phylogenetic Tree Reconstruction
Modified version of Dr. Chun-Chieh Shih’
Institute of Information Sciences
Academia Sinica
2
OUTLINE
Tree reconstruction methods
Flowchart of phylogenetic analysis
Concept of evolutionary trees
Evaluation of reconstructed trees
3
Why RECONSTRUCT phylogenetic trees
Understand evolutionary history
Map pathogen strain diversity for vaccines
Assist in epidemiology of infectious diseases
Aid in prediction of function of novel genes
Biodiversity studies
Understanding microbial ecologies
For Example
4
Concept of evolutionary trees
Rooted tree
One sequence (the root) is defined to be the common ancestor of all other sequences
If the molecular clock hypothesis holds, it is possible to predict a root
Unrooted tree
Indicates evolutionary relationships without revealing the location of the oldest ancestor
5
Image: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
6
Concept of evolutionary trees
Number of Trees
                                Unrooted trees                Rooted trees
# sequences  # pairwise      # trees        # branches    # trees        # branches
             distances                      /tree                        /tree
3            3               1              3             3              4
4            6               3              5             15             6
5            10              15             7             105            8
6            15              105            9             945            10
10           45              2,027,025      17            34,459,425     18
30           435             8.69 × 10^36   57            4.95 × 10^38   58
For N sequences:
# pairwise distances: $N(N-1)/2$
# unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$, with $2N-3$ branches per tree
# rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$, with $2N-2$ branches per tree
Taken from: http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
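The counts in the table can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it relies on the identity that (2N-5)!/(2^(N-3) (N-3)!) equals the product of the odd numbers up to 2N-5.

```python
def odd_product(k):
    """Return 1 * 3 * 5 * ... * k for odd k (and 1 when k <= 1)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result

def pairwise_distances(n):
    """Number of pairwise distances among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    return odd_product(2 * n - 5)

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    return odd_product(2 * n - 3)

for n in (3, 4, 5, 6, 10):
    print(n, pairwise_distances(n), unrooted_trees(n), rooted_trees(n))
```

Python integers are arbitrary precision, so the same functions also work for the 30-sequence row.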
7
Types of data used in phylogenetic inference
Character-based methods:
Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Distance-based methods:
Transform the sequence data into pairwise distances, and use the distance matrix during tree building.
8
Data set collection
Multiple sequence alignment
Tree construction
Character-based (optimality criteria): Parsimony, Maximum likelihood
Distance-based: UPGMA, NJ, Fitch-Margoliash, KITSCH
Test reliability of the tree by analytical and/or resampling procedures
9
Distance Methods
Calculate the changes between each pair in a group of sequences (also the first step in producing a multiple sequence alignment)
Identify the tree that correctly positions neighbors and that also has branch lengths that reproduce the original data as closely as possible
Find the closest neighbors among a group of sequences
10
Distance Methods - Example
Figure: distances between sequences, and the corresponding distance table
11
Distance Methods - Example
Distance Programs in Phylip
FITCH: estimates phylogenetic tree assuming additivity of branch lengths using the Fitch-Margoliash method
KITSCH: same as FITCH, but under the assumption of a molecular clock
NEIGHBOR: estimates phylogenies using either:
Neighbor-joining (no molecular clock assumed)
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (molecular clock assumed)
12
Distance Methods - UPGMA
Construct a distance tree
A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG
    A  B  C  D  E
B   2
C   4  4
D   6  6  6
E   6  6  6  4
F   8  8  8  8  8
Clustering
All leaves are assigned to a cluster, and the clusters are then iteratively merged according to their distance
13
Distance Methods - UPGMA
The distance between two clusters i and j is defined as:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\ q \in C_j} d_{pq}$$
where |Ci| and |Cj| denote the number of sequences in cluster i and j, respectively.
Replacing clusters i and j by their union:
$$C_k = C_i \cup C_j$$
The new distances between the new node k and all other clusters l are computed according to:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
14
Distance Methods - UPGMA
Step I: Initialization
. Assign each sequence i to its own cluster Ci.
. Define one leaf of T for each sequence, and place it at height zero.
Step II: Iteration
. Determine the two clusters i, j for which di,j is minimal.
. Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l.
. Define a node k with daughter nodes i and j, and place it at height di,j/2.
. Add k to the current clusters and remove i and j.
Step III: Termination
. When only two clusters i, j remain, place the root at height di,j/2.
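The three steps above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function and variable names are assumptions, ties are broken by whichever pair is found first, and the distance between two clusters is computed directly as the mean over all leaf pairs, which is equivalent to the size-weighted update rule. The input is the six-sequence distance table from the UPGMA example.

```python
from itertools import combinations

# Leaf-to-leaf distances from the example table (keys are unordered pairs).
LEAF_DIST = {frozenset(pair): dist for pair, dist in {
    "AB": 2,
    "AC": 4, "BC": 4,
    "AD": 6, "BD": 6, "CD": 6,
    "AE": 6, "BE": 6, "CE": 6, "DE": 4,
    "AF": 8, "BF": 8, "CF": 8, "DF": 8, "EF": 8,
}.items()}

def upgma(leaf_dist):
    """Return the merges as (cluster_i, cluster_j, node_height) in merge order."""
    leaves = sorted({leaf for pair in leaf_dist for leaf in pair})
    clusters = [(leaf,) for leaf in leaves]          # Step I: one cluster per leaf

    def cluster_distance(ci, cj):
        # Average distance over all leaf pairs drawn from the two clusters.
        total = sum(leaf_dist[frozenset((a, b))] for a in ci for b in cj)
        return total / (len(ci) * len(cj))

    merges = []
    while len(clusters) > 1:                         # Steps II and III
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        height = cluster_distance(ci, cj) / 2        # new node at d_ij / 2
        merges.append((ci, cj, height))
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci + cj]
    return merges

for ci, cj, height in upgma(LEAF_DIST):
    print(ci, cj, height)
```

Because (A,B)-C and D-E are tied at distance 4, the order of the second and third merges depends on the tie-break, but the node heights (1, 2, 2, 3 and 4 for the root) do not.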
15
Distance Methods - UPGMA
First round
Choose the most similar pair, cluster them together, and calculate the new distance matrix.
    A  B  C  D  E
B   2
C   4  4
D   6  6  6
E   6  6  6  4
F   8  8  8  8  8
A and B are the most similar pair: join them, with branch lengths A:1 and B:1.
dist((A,B),C) = (distAC + distBC)/2 = 4
dist((A,B),D) = (distAD + distBD)/2 = 6
dist((A,B),E) = (distAE + distBE)/2 = 6
dist((A,B),F) = (distAF + distBF)/2 = 8
     A,B  C  D  E
C    4
D    6    6
E    6    6  4
F    8    8  8  8
16
Distance Methods - UPGMA
Second round
     A,B  C  D  E
C    4
D    6    6
E    6    6  4
F    8    8  8  8
Join D and E, with branch lengths D:2 and E:2. Tree so far: (A:1,B:1) and (D:2,E:2)
Third round
      A,B  C  D,E
C     4
D,E   6    6
F     8    8  8
Join (A,B) and C at height 2. Tree so far: ((A:1,B:1):1,C:2) and (D:2,E:2)
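17
Distance Methods - UPGMA
Fourth round
      AB,C  D,E
D,E   6
F     8     8
Join (AB,C) and (D,E) at height 3. Tree so far: (((A:1,B:1):1,C:2):1,(D:2,E:2):1)
Fifth round
      ABC,DE
F     8
Join (ABC,DE) and F at height 4, which places the ROOT. Final tree: ((((A:1,B:1):1,C:2):1,(D:2,E:2):1):1,F:4)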
18
Distance Methods - UPGMA
The UPGMA clustering method is very sensitive to unequal evolutionary rates
Assumes that the evolutionary rate is the same for all branches
Clustering works only if the data are ultrametric
Ultrametric tree
Special kind of additive tree in which the tips of the tree are all equidistant from the root
Additive tree
A cladogram with branch lengths, also called a phylogram or metric tree
Figure: an example ultrametric tree and an example additive tree, each labeled with branch lengths
19
Distance Methods - UPGMA
UPGMA fails when rates of evolution are not constant
True tree: ((((A:1,B:4):1,C:2):1,(D:3,E:2):1):1,F:4)
    A   B   C   D   E
B   5
C   4   7
D   7   10  7
E   6   9   6   5
F   8   11  8   9   8
Wrong topology
UPGMA first joins A and C (the smallest distance, 4), then D and E, then attaches B to (A,C), giving ((((A,C),B),(D,E)),F) instead of the true grouping of A with B.
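One way to anticipate this failure is to test whether a distance matrix is ultrametric before clustering. A standard characterization, which is not stated on the slides, is the three-point condition: for every triple of taxa, the two largest of the three pairwise distances must be equal. The sketch below is a minimal illustration with assumed helper names; it checks the equal-rates matrix from the worked UPGMA example and the unequal-rates matrix from this slide.

```python
from itertools import combinations

def make_matrix(names, lower_rows):
    """Build a pair -> distance lookup from lower-triangle rows.

    Row i (starting with the second taxon) lists distances to names[0..i-1].
    """
    dist = {}
    for i, row in enumerate(lower_rows, start=1):
        for j, value in enumerate(row):
            dist[frozenset((names[i], names[j]))] = value
    return dist

def is_ultrametric(names, dist):
    """Three-point condition: in every triple the two largest distances tie."""
    for triple in combinations(names, 3):
        _, middle, largest = sorted(dist[frozenset(p)]
                                    for p in combinations(triple, 2))
        if middle != largest:
            return False
    return True

NAMES = "ABCDEF"
EQUAL_RATES = make_matrix(
    NAMES, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
UNEQUAL_RATES = make_matrix(
    NAMES, [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]])

print(is_ultrametric(NAMES, EQUAL_RATES))    # matrix from the worked example
print(is_ultrametric(NAMES, UNEQUAL_RATES))  # matrix from this slide
```

The triple A, B, C already breaks the condition in the second matrix: its distances are 5, 4 and 7, so the two largest differ.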
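20
Distance Methods – Neighbor Joining
The Four Point Condition
For four taxa in which A, B are neighbors and C, D are neighbors, with external branch lengths a, b, c, d and internal branch length x:
$$d_{AC} + d_{BD} = d_{AD} + d_{BC} = a + b + c + d + 2x = d_{AB} + d_{CD} + 2x$$
The 4-point condition (neighbors on the left, non-neighbors on the right):
$$d_{AB} + d_{CD} < d_{AC} + d_{BD}$$
$$d_{AB} + d_{CD} < d_{AD} + d_{BC}$$
• Neighbors are closer than non-neighbors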
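The condition can be tested mechanically on any distance matrix: for every quartet, the two largest of the three pairwise sums must be equal. The following Python sketch is illustrative only; the function name and tolerance are assumptions, and the input is the unequal-rates matrix used in the UPGMA failure example, which comes from an additive tree.

```python
from itertools import combinations

# Pairwise distances from the unequal-rates example (keys are unordered pairs).
DIST = {frozenset(pair): dist for pair, dist in {
    "AB": 5,
    "AC": 4, "BC": 7,
    "AD": 7, "BD": 10, "CD": 7,
    "AE": 6, "BE": 9, "CE": 6, "DE": 5,
    "AF": 8, "BF": 11, "CF": 8, "DF": 9, "EF": 8,
}.items()}

def pair_sums(dist, a, b, c, d):
    """The three ways of splitting a quartet into two pairs, with their sums."""
    return {
        ((a, b), (c, d)): dist[frozenset((a, b))] + dist[frozenset((c, d))],
        ((a, c), (b, d)): dist[frozenset((a, c))] + dist[frozenset((b, d))],
        ((a, d), (b, c)): dist[frozenset((a, d))] + dist[frozenset((b, c))],
    }

def satisfies_four_point(names, dist, tol=1e-9):
    """True if, in every quartet, the two largest pair sums are equal."""
    for quartet in combinations(names, 4):
        sums = sorted(pair_sums(dist, *quartet).values())
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

print(satisfies_four_point("ABCDEF", DIST))
# The split with the smallest sum identifies the neighbor pairs of a quartet.
sums = pair_sums(DIST, "A", "B", "C", "D")
print(min(sums, key=sums.get))
```

For the quartet A, B, C, D the sums are 12, 14 and 14, so the split (A,B) | (C,D) is the one that pairs neighbors.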
21
Distance Methods – Neighbor Joining
Sequences chosen to give best least-squares estimate of branch length
Begin with star topology – no neighbors have been joined
Figure: star topology connecting A, B, C, D, E
Tree modified by joining pairs of sequences
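22
Distance Methods – Neighbor Joining
Pair is chosen by calculating the sum of branch lengths for the corresponding tree
$$S_{mn} = \frac{\sum_{i \neq m,n} (d_{im} + d_{in})}{2(N-2)} + \frac{d_{mn}}{2} + \frac{\sum_{i<j;\ i,j \neq m,n} d_{ij}}{N-2}$$
If A and B are joined:
Figure: A and B joined at a new node, with C, D, E still attached to the center of the star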
23
Distance Methods – Neighbor Joining
Neighbor-Joining approximates the least squares tree, assuming additivity, but without resorting to the assumption of a molecular clock.
Idea: join clusters that are not only close to one another, but are also far from the rest.
In each iteration: find two neighboring leaves, i.e. two species that share a direct ancestor in the tree.
24
Distance Methods – Neighbor Joining
Example: neighboring leaves i, j with ancestor k. Join i and j, remove them from the list of leaf nodes, and add k to the list, with distances to the other leaves m defined as
$$d_{km} = \frac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$
Problem: it is not sufficient to simply pick the two closest leaves
25
Distance Methods – Neighbor Joining
Solution: For node i, define average distance ui to all other leaves:
$$u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$$
and correct distances:
$$D_{ij} = d_{ij} - u_i - u_j$$
Minimum-evolution criterion: minimize the sum of all branch lengths. Nodes i and j that are clustered next are those for which Dij is smallest.
26
Distance Methods – Neighbor Joining
Initialization:
1. Initialize n clusters with the given species, one species per cluster
2. Set the size of each cluster to 1: ni ← 1
3. In the output tree T, assign a leaf for each species
Iteration:
1. For each species i, compute $u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$
2. Choose the i and j for which dij − ui − uj is smallest.
3. Join clusters i and j to a new cluster, with corresponding node k, and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for every other cluster m. Calculate the branch lengths from i and j to the new node as: $d_{ik} = \frac{1}{2}(d_{ij} + u_i - u_j)$, $d_{jk} = \frac{1}{2}(d_{ij} + u_j - u_i)$
4. Delete clusters i and j from T and add k
5. If more than two nodes remain, go back to 1. Otherwise, connect the two remaining nodes and end
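A compact Python sketch of this iteration is shown below. It is an illustration under stated assumptions rather than a reference implementation: it uses the standard neighbor-joining formulas (u_i with an n − 2 denominator, branch lengths (d_ij + u_i − u_j)/2), the function and variable names are invented for the example, and ties are broken by whichever pair is found first. The input is the unequal-rates matrix on which UPGMA produced the wrong topology.

```python
from itertools import combinations

# Pairwise distances from the unequal-rates example (keys are unordered pairs).
DIST = {frozenset(pair): dist for pair, dist in {
    "AB": 5,
    "AC": 4, "BC": 7,
    "AD": 7, "BD": 10, "CD": 7,
    "AE": 6, "BE": 9, "CE": 6, "DE": 5,
    "AF": 8, "BF": 11, "CF": 8, "DF": 9, "EF": 8,
}.items()}

def neighbor_joining(names, dist):
    """Return (joins, last_edge).

    joins is a list of (i, j, length_i, length_j); last_edge is the final
    (node, node, length) connecting the two remaining nodes.
    """
    d = dict(dist)                 # copy: entries for new nodes are added below
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # 1. Average distance of each node to all the others.
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # 2. Pick the pair minimizing the corrected distance d_ij - u_i - u_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        d_ij = d[frozenset((i, j))]
        # 3. Create node k and its distances to every other node m.
        k = f"({i},{j})"
        for m in nodes:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - d_ij)
        joins.append((i, j, 0.5 * (d_ij + u[i] - u[j]),
                      0.5 * (d_ij + u[j] - u[i])))
        # 4. Replace i and j by k.
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    last_edge = (nodes[0], nodes[1], d[frozenset(nodes)])
    return joins, last_edge

joins, last_edge = neighbor_joining("ABCDEF", DIST)
for step in joins:
    print(step)
print(last_edge)
```

On this matrix the first join is A with B, with branch lengths 1 and 4, which matches the true tree even though A and C are the closest pair.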
27
Maximum Parsimony
Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
For each position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified
Trees producing the smallest number of changes for all sequence positions are identified
Time-consuming algorithm
Only works well if the sequences have strong sequence similarity
28
Maximum Parsimony
Step I
Input: multiple sequence alignment
Step II
For each aligned position, identify phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes
Step III
Continue analysis for every position in the sequence alignment
Step IV
Sequence variations at each site in the alignment are placed at the tips of the trees
29
Maximum Parsimony - Example
Figure: table of sequences by alignment positions
Informative sites: must favor one tree over another
site 5 is informative, but sites 1, 6, 8 are not
To be informative, a site must also have the same sequence character in at least two genomes
only sites 5, 7, and 9 are informative according to this rule
E.g. trees for position 5:
Combining sites 5, 7, and 9, the left tree is the best tree for these 4 sequences
30
Maximum Parsimony - Example
What is the parsimony score of
31
Maximum Parsimony - Example
            1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
32
Maximum Parsimony - Example
How many substitutions?
Figure: the same site mapped onto two trees, one requiring 1 change and the other 5 changes
Maximum Parsimony - Example
1 2 3 4 5 6 7 8 9 10Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T ASpecies 3 - A T A A T T G T C TSpecies 4 - A A T G T T G T C G
0 3 2 2 0 1 1 1 1 3
0 3 2 2 0 1 2 1 2 3
0 3 2 1 0 1 2 1 2 3
14
16
16
Minimum substitutions
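The per-site counts for this alignment can be recomputed with Fitch's algorithm, which finds the minimum number of changes a site needs on a fixed tree. The sketch below is a minimal illustration with assumed names; each unrooted four-taxon tree is rooted on its internal edge, which does not affect the count. Note that site 10 has three distinct characters (G, A, T) among four species and therefore needs only 2 changes on any tree, so the totals come out as 13, 15 and 14.

```python
ALIGNMENT = {
    "1": "AGGGTAACTG",
    "2": "ACGATTATTA",
    "3": "ATAATTGTCT",
    "4": "AATGTTGTCG",
}

# The three unrooted topologies for four taxa, written as nested pairs.
TREES = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}

def fitch(tree, column):
    """Return (candidate state set, number of changes) for one site."""
    if not isinstance(tree, tuple):          # leaf: its observed character
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    shared = left_set & right_set
    if shared:                               # children agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def site_scores(tree, alignment):
    """Minimum number of changes at each alignment position."""
    length = len(next(iter(alignment.values())))
    return [fitch(tree, {name: seq[pos] for name, seq in alignment.items()})[1]
            for pos in range(length)]

for label, tree in TREES.items():
    scores = site_scores(tree, ALIGNMENT)
    print(label, scores, sum(scores))
```

Only sites 4, 7 and 9 distinguish the three trees here; every other site costs the same on all of them.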
37
Maximum Parsimony – Searching for Trees
#Taxa    3   4   5    10        50         100
#Trees   1   3   15   2×10^6    2×10^74    2×10^182
Imagine how large 10^182 is ...
38
Maximum Parsimony
Where maximum parsimony fails
Parsimony can give misleading information when rates of sequence change vary in the different branches of a tree that are represented by the sequence data
Real tree: 2 long branches in which G has turned to A independently, possibly with some intermediate steps.
In parsimony analysis, rates of change along all branches of the tree are assumed equal. Therefore the tree predicted from parsimony will not be correct.
39
Maximum Parsimony - Example
Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
– leaf-labeled by sequences in S
– additional sequences of length k labeling the internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.
40
Maximum Parsimony - Example
Maximum parsimony (example)
Input: Four sequences
– ACT
– ACA
– GTT
– GTA
Question: which of the three trees has the best MP score?
41
Maximum Parsimony - Example
All possible unrooted trees
((ACT,GTT),(ACA,GTA))
((ACA,GTT),(ACT,GTA))
((ACT,ACA),(GTT,GTA))
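42
Maximum Parsimony - Example
Possible substitutions
((ACT,GTT),(ACA,GTA)) with internal nodes GTT and GTA: edge costs 2 + 1 + 2, MP score = 5
((ACA,GTT),(ACT,GTA)) with internal nodes ACA and ACT: edge costs 3 + 1 + 3, MP score = 7
((ACT,ACA),(GTT,GTA)) with internal nodes ACA and GTA: edge costs 1 + 2 + 1, MP score = 4
Optimal MP tree: ((ACT,ACA),(GTT,GTA))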
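The score of a fully labeled tree is simply the sum of Hamming distances over its edges, as in the Steiner tree formulation. The short Python sketch below evaluates the three labeled trees from this example; the edge lists are an assumed encoding of the figures, inferred from the internal labels and edge costs shown. The scores apply to the internal labels given on the slide, and a different labeling of the same topology may score differently.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def tree_cost(edges):
    """Sum of Hamming distances over all edges of a labeled tree."""
    return sum(hamming(a, b) for a, b in edges)

# Each tree: four leaf-to-internal edges plus the internal edge (listed last).
LABELED_TREES = [
    # leaves ACT, GTT | ACA, GTA with internal nodes GTT and GTA
    [("ACT", "GTT"), ("GTT", "GTT"), ("ACA", "GTA"), ("GTA", "GTA"),
     ("GTT", "GTA")],
    # leaves ACA, GTT | ACT, GTA with internal nodes ACA and ACT
    [("ACA", "ACA"), ("GTT", "ACA"), ("ACT", "ACT"), ("GTA", "ACT"),
     ("ACA", "ACT")],
    # leaves ACT, ACA | GTT, GTA with internal nodes ACA and GTA
    [("ACT", "ACA"), ("ACA", "ACA"), ("GTT", "GTA"), ("GTA", "GTA"),
     ("ACA", "GTA")],
]

for edges in LABELED_TREES:
    print(tree_cost(edges))
```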
43
Maximum Parsimony - Example
Maximum Parsimony: computational complexity
((ACT,ACA),(GTT,GTA)) with internal nodes ACA and GTA: edge costs 1 + 2 + 1, MP score = 4
Finding the optimal MP tree is NP-hard
Optimal labeling can be computed in linear time O(nk)
44
Maximum likelihood approach
Method uses probability calculations to find a tree that best accounts for the variation in a set of sequences
Similar to the maximum parsimony method in that analysis is performed on each column of a multiple sequence alignment
Start with an evolutionary model of sequence change that provides estimates of rates of substitution of one base for another (transitions and transversions).
45
Maximum likelihood approach
Statistical method - powerful and flexible, also computationally complex
Given a particular tree and a model of evolutionary change, calculate the likelihood of the tree based on the data, i.e. the given multiple sequence alignment
Likelihood (tree | data) proportional to Probability (data | tree)
46
Maximum likelihood approach
Tree with branches, vk branch lengths
Probability of character change PAC(t) for A → C in time t
Don’t know character states inside the tree (in the past), so calculate for all possibilities, e.g. A, C, G, T
47
Maximum likelihood approach
$$L = p(A)\,P_{AA}(v_1)\,P_{AA}(v_2)\,P_{AG}(v_4)\,P_{AA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$$
$$L = p(A)\,P_{AA}(v_1)\,P_{AG}(v_2)\,P_{GG}(v_4)\,P_{GA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$$
48
Maximum likelihood approach
$$L = p(s_0)\,P_{s_0 s_1}(v_1)\,P_{s_1 s_2}(v_2)\,P_{s_2 s_4}(v_4)\,P_{s_2 s_5}(v_5)\,P_{s_1 s_6}(v_6)\,P_{s_0 s_3}(v_3)\,P_{s_3 s_7}(v_7)\,P_{s_3 s_8}(v_8)$$
Maximum likelihood does best in simulation but is also the slowest method
Variety of new heuristics to find the ML tree faster
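The likelihood of one alignment column is the sum of the product above over every assignment of states to the internal nodes. The Python sketch below computes that sum for the tree shape implied by the subscripts in the formula (s0 is the root with children s1 and s3; s1 has children s2 and s6; s2 has children s4 and s5; s3 has children s7 and s8). Several things are assumptions for illustration: the Jukes-Cantor model with equal base frequencies, a branch length of 0.1 expected substitutions per site for every v_k, the leaf states G, A, A, A, A, and all function names.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69(x, y, v):
    """Jukes-Cantor probability P_xy(v) for a branch of length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def conditional(node, leaf_states):
    """Map each state s to P(observed leaves below node | node is in state s)."""
    if isinstance(node, str):                       # leaf: observed character
        return {s: 1.0 if s == leaf_states[node] else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, v in node:                           # internal node
        below = conditional(child, leaf_states)
        for s in BASES:
            result[s] *= sum(jc69(s, t, v) * below[t] for t in BASES)
    return result

def site_likelihood(tree, leaf_states):
    """Sum over root states of p(s0) times the conditional likelihood."""
    at_root = conditional(tree, leaf_states)
    return sum(0.25 * at_root[s] for s in BASES)

def total_probability(tree, leaf_names):
    """Sanity check: likelihoods of all possible columns should sum to 1."""
    return sum(site_likelihood(tree, dict(zip(leaf_names, column)))
               for column in product(BASES, repeat=len(leaf_names)))

V = 0.1                                             # assumed length of every branch
S2 = [("s4", V), ("s5", V)]
S1 = [(S2, V), ("s6", V)]
S3 = [("s7", V), ("s8", V)]
TREE = [(S1, V), (S3, V)]                           # root s0
LEAVES = {"s4": "G", "s5": "A", "s6": "A", "s7": "A", "s8": "A"}

print(site_likelihood(TREE, LEAVES))
print(total_probability(TREE, list(LEAVES)))
```

Working upward from the leaves in this way avoids enumerating every internal-state assignment explicitly, which is what makes the calculation practical for larger trees.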
49
Maximum likelihood approach
Maximum Likelihood (ML)
Given: stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences
Objective: Find tree T and probabilities p(e) of substitution on each edge, to maximize the probability of the data.
Preferred by some systematists, but even harder than MP in practice.
50
Quality of the tree
Phylogenetic trees can vary dramatically with slight changes in data
We want to know which branches are reliable, and which branches do not have strong support from the data
Bootstrapping is the most common method used
A general statistical technique for determining how much error is in a set of results
51
Confidence assessment
Bootstrapping
Original data set with n characters
Draw n characters randomly with replacement. Repeat m times.
m pseudo-replicates, each with n characters.
Original analysis, e.g. MP, ML, NJ.
Repeat the original analysis on each of the pseudo-replicate data sets.
Evaluate the results from the m analyses.
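The resampling step can be sketched in a few lines of Python. This is a minimal illustration: the function name, the fixed seed and the choice of the four-species alignment from the parsimony example are assumptions, and the tree-building analysis that would be rerun on each pseudo-replicate is left out.

```python
import random

ALIGNMENT = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}

def bootstrap_replicates(alignment, m, seed=0):
    """Return m pseudo-replicate alignments, each with n resampled columns."""
    rng = random.Random(seed)                 # seeded so runs are repeatable
    names = list(alignment)
    n = len(alignment[names[0]])
    replicates = []
    for _ in range(m):
        # Draw n column indices with replacement; whole columns are kept intact.
        columns = [rng.randrange(n) for _ in range(n)]
        replicates.append({name: "".join(alignment[name][c] for c in columns)
                           for name in names})
    return replicates

for replicate in bootstrap_replicates(ALIGNMENT, m=3):
    for name, sequence in replicate.items():
        print(name, sequence)
    print()
```

The bootstrap value of a branch would then be the fraction of the m replicate trees in which that branch appears.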
52
Confidence assessment
Bootstrap sampling of phylogenies
53
Confidence assessment
What do the bootstrap values mean?
Bootstrap values for phylogenetic trees do not follow proper statistical behavior
Bootstrap value 95%: actually close to 100% confidence in that branch
Bootstrap value 75%: often close to 95% confidence
Bootstrap value 60%: much lower confidence
Less than 50% bootstrap: no confidence in that branch over an alternative
54
Computer Software for Phylogenetics
Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.
– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.
– MacClade is a well designed cladistics program that allows the user to explore possible trees for a data set.
55
Phylogenetics on the Web
There are several phylogenetics servers available on the Web
– some of these will change or disappear in the near future
– these programs can be very slow so keep your sample sets small
The Institut Pasteur, Paris has a PHYLIP server at:
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:
http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/
The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:
http://www.genebee.msu.su/services/phtree_reduced.html
The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to gifs at 72 dpi - not publication quality.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
56
Other Web Resources
Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html
Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html
University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html
57
Software Hazards
There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
Moving sequences into different programs can be a major hassle due to incompatible file formats.
Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.
58
Which Method to Choose?
Depends upon the sequences that are being compared
Strong sequence similarity:
Maximum parsimony
Clearly recognizable sequence similarity:
Distance methods
All others:
Maximum likelihood
Best to choose at least two approaches
Compare the results – if they are similar, you can have more confidence
59
Which Method to Choose?
60
Comparison of Methods
Neighbor-joining | Maximum parsimony | Maximum likelihood
Uses only pairwise distances | Uses only shared derived characters | Uses all data
Minimizes distance between nearest neighbors | Minimizes total distance | Maximizes tree likelihood given specific parameter values
Very fast | Slow | Very slow
Easily trapped in local optima | Assumptions fail when evolution is rapid | Highly dependent on assumed evolution model
Good for generating tentative tree, or choosing among multiple trees | Best option when tractable (<30 taxa, homoplasy rare) | Good for very small data sets and for testing trees built using other methods
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
61
More Topics Related to Phylogenetics
62
More topics related to Phylogenetics
Phylogenetic epidemiology
Supertree / Tree of life
Phylogeography
63
Idea of the ‘Tree of Life’
The idea that the evolution of life can be represented as a tree, with leaves corresponding to extant species and nodes to extinct ancestors, came from Charles Darwin
The earliest trees formed by Ernst Haeckel and others were based on a general idea of a hierarchy of relationships between species and higher taxa
Gradually, quantitative criteria have been developed to measure the degree of morphological difference that was thought to reflect evolutionary distance
64
Winds of Change
In the early days of molecular phylogenetics, a gene tree was usually equated with the species tree. This view was typified using ribosomal RNA (rRNA) sequences as the principal molecular phylogenetic marker
This resulted in the discovery of a previously unrecognized domain of life, the Archaea, and in a tree topology that has been aptly called the ‘standard model’ of evolution
This model involves the early descent of the bacterial clade from the last universal common ancestor and a subsequent separation of archaea and eukaryotes.
All this was to change once comparative genomics yielded more information and multiple complete genome sequences became available for comparison
65
The three domains of Life
Identified by phylogenetic analysis of the highly conserved 16S ribosomal RNA
66
Three strategies for constructing phylogenies
Homologous single-gene data set
Rely on many taxa for a single gene
Sequence concatenation
Combine or concatenate multiple sequences for the same set of species
Need for close concordance of species sampling among genes, which is difficult because of the hit-or-miss sampling in the databases
Fewer genes and fewer samples
Large-number sequence alignment
Supertree construction
Sample multiple genes only for minimally overlapping sets of species
Tree constructed from a set of subtrees
67
Assembling the Tree of Life (ATOL)
Tree of Life (30,000 species)
What is the difficulty in computing?
With current computational tools, phylogenetic analyses for 1,000 species are possible with adequate computer resources
It is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation.
David Hillis, Science, 2003
PARALLEL ALGORITHMS FOR GENETICS
68
69
Assembling large data matrices by concatenation
Advantages
Improve the accuracy of a specific portion of a tree
The addition of species can be useful in cases of so-called ‘long-branch attraction’, in which high substitution rates or long intervals of time can mislead phylogenetic inference methods
Two potential problems
Multiple genes can mix phylogenetic signals arising from different evolutionary histories
Some sequences are usually unavailable for some species (‘missing data’), with possible deleterious effects on accuracy
Domination by biological problems
70
Reconstruction of trees from large data matrices
Two issues in constructing phylogenetic trees
Computation time
Reliability
Two time-consuming computational problems
Multiple sequence alignment
Phylogenetic inference
Domination by computational problems
Optimal methods (parsimony and maximum likelihood) are time-consuming
Even heuristic approaches
Months of processor time were devoted to a heuristic parsimony analysis of the Chase et al. dataset of ~500 sequences, and it never ran to completion (Sanderson and Driskell, 2004)
71
Synthesis of large trees: supertree
Tree constructed from a set of trees
Advantages
Independent studies can be combined into a single tree
Initial trees can be based on different kinds of data
Initial trees can be obtained by different methodologies
Initial trees often have been selected from competing trees by professional judgment
There are most likely no common data for all species
Methods such as maximum likelihood would not be computationally tractable on such a large dataset
72
Synthesis of large trees: supertree
Classification ( Wilkinson et al, 2001, Bininda-Emonds et al, 2002 )
Present Past
Supertree technique past and present ( Bininda-Emonds, 2004 )
73
Reconstructing the “Tree” of Life
Handling large datasets: millions of species
The “Tree of Life” is not really a tree: reticulate evolution
74
Phylogenetic Epidemiology
75
Infectious diseases are caused by pathogens
pathogen: microbe that causes disease
microbe: microscopic organism
The major classes of disease-causing microbes are viruses, bacteria, and eukaryotes (protists, fungi, and worms)
RNA Viruses
The RNA viruses are more often associated with epidemic and emerging diseases in humans than DNA viruses.
The gene sequences of many RNA viruses change so rapidly that it is possible to watch spatial and temporal patterns unfold on a ‘real time’ scale that is not usually visible in other organisms.
Diseases caused by RNA viruses: avian influenza, HIV, dengue...
76
The rapidity of RNA virus evolution is caused by a combination of (Holmes, 2004)
Extremely high mutation rates
Short generation times
Immense population sizes
These factors produce rates of nucleotide substitution that are, on average, some six orders of magnitude higher than those in eukaryotes and DNA viruses (Jenkins et al. 2002).
The high rates of substitution found in viruses and bacteria allow phylogenies to be reconstructed for sequences that have diverged only recently
Molecular phylogenies have come to play an increasingly important role in epidemiological studies of microbial pathogens, as they provide information about the location, timing, and mechanisms by which virulent strains arise.
77
Guan et al. (2002) Emergence of multiple genotypes of H5N1 avian influenza viruses in Hong Kong SAR. Proc Natl Acad Sci U S A, 99, 8950-8955.
78
Moya, A., Holmes, E.C., and Gonzalez-Candelas, F. (2004) The population genetics and evolutionary epidemiology of RNA viruses. Nat Rev Microbiol, 2, 279-288.
79
Maximum likelihood estimate of phylogeny of eight strains of influenza A isolated from humans, swine, and birds based on an analysis of the HA gene. The divergence years prior to 1870, estimated using a partially constrained molecular clock, are shown at the left of the branch. The branch lengths (after 1870) are calibrated in units of years (scale at bottom).
Rannala, B. 2002. Molecular phylogenies and virulence evolution. In Adaptive Dynamics of Infectious Diseases: In Pursuit of Virulence Management
80
Difficulties With Phylogenetic Analysis
Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
Garbage in, garbage out! Alignment is crucial
Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
81
Difficulties With Phylogenetic Analysis
Two sites within comparative sequences may be evolving at different rates
Rearrangements of genetic material can lead to false conclusions
Duplicated genes can evolve along separate pathways, leading to different functions
82
Phylogenetics - Issues
Gene trees vs species trees
Gene duplication can complicate phylogenetic analysis
Paralogues (duplicated genes) do not fit in the evolutionary tree
Choice of target sequence type
Ribosomal RNA (slowest change / mutation rate)
Use for very long-term evolutionary studies, spanning species boundaries & biological kingdoms
DNA / RNA (fastest change / mutation rate)
(a) Use for short-term studies of closely-related species
(b) Contains more evolutionary information than protein
Protein (medium change / mutation rate)
(a) Use for wide species comparisons
(b) More reliable alignment than DNA
83
NO HOMEWORK! Happy??
A problem will appear in the Final Exam:
Give an example and design a flowchart to show how to construct a tree
Your answer should include, at least:
(a) Where did you find the example? (Google, books, or papers)
(b) Why did you choose this example? (curiosity, simplicity, or no reason?)
(c) Where do you plan to get the sequences? (a database in the public domain)
(d) What kind of methods do you plan to use to construct your tree?
(e) Why do you plan not to use other methods?
Just go to Google and find YOUR OWN Answer!