Phylogenetic analysis in the context of multigene sequences
Sudhindra R. Gadagkar
University of Dayton
DNA Evolution
The unifying force of all life on earth is DNA
Adenine, Cytosine, Guanine, Thymine
ATGGCATACGTGCAGTTCATCGGCTAGTGTGACATGA
DNA sequence evolution
[Figure: sequence divergence between time t0 and time t1]
ATGGCATACGTGCA
ATGGTATAGGTGCA
ATGGCATACGTGAA
A phylogenetic tree
A pattern of branching events, with each branching point showing a speciation (or divergence) event
[Figure: tree of Taxon A, Taxon B, and Taxon C, with branch lengths 3.5, 3.5, 7.5, and 4]
• Nodes (extinct ancestors)
• Tips (living species)
• Branches (amount of evolution)
• Taxon (pl. Taxa)
What is phylogenetic inference?
• Reconstruction of the evolutionary relationships among “taxa”
• Representation in a graphical form.
[Figure: the same phylogeny drawn in three layouts, each with a 0.05 scale bar. Taxa: M Fin Whale, M Blue Whale, M Cow, M Rat, M Mouse, M Opossum, B Chicken, A Xenopus, F Rainbow Trout, F Loach, F Carp, L Lamprey, S Sea urchin]
Parts of a tree
• Tree size: no. of taxa in the phylogeny.
• Interior branch: partitions an unrooted tree into 2 subtrees, each containing at least 2 taxa.
• Cluster size: the smaller of the two subtree sizes partitioned by an interior branch.
• Depth of a branch: defined in terms of the no. of taxa clustered by it.
[Figure: tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum, with labels Root, Node, Internal branch, External branch, and Outgroup]
Example of a 6-sequence tree
[Figure: the taxa F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum drawn as a Rooted Tree and as an Unrooted Tree]
Phylogenetic analysis using DNA sequences
[Figure: sequence divergence between time t0 and time t1]
ATGGCATACGTGCA
ATGGTATAGGTGCA
ATGGCATACGTGAA
Gene Sequences
Homologous (orthologous) gene sequences
• D. melanogaster ATGTCGTTGACCAACAAGAACGTGATTTTCGTGGCCGGTCT...
• D. pseudoobscura ATGTCTCTCACCAACAAGAACGTCGTTTTCGTGGCCGGTCT...
• D. crassifemur ATGTTCATCGCTGGCAAGAACATCATCTTTGTCGCTGGTCT...
• D. mulleri ATGGCCATCGCTAACAAGAACATCATCTTCGTCGCTGGACT...
Distance Matrix
[       D.me  D.ps  D.cr  D.mu]
[D.me]
[D.ps]  0.14
[D.cr]  0.24  0.24
[D.mu]  0.21  0.20  0.21
[Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur]
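As a rough illustration of how a distance matrix of this kind is obtained, the sketch below computes the proportion of differing sites (p-distance) and a Jukes-Cantor correction for each pair of sequences. It uses only the 41 sites printed on the slide, so its values are not expected to match the matrix above, which was presumably computed from the full sequences with a model the slide does not name. The function names are made up for this example.

```python
import math
from itertools import combinations

# First 41 sites of each sequence, as printed on the slide (the rest is elided there).
seqs = {
    "D.me": "ATGTCGTTGACCAACAAGAACGTGATTTTCGTGGCCGGTCT",
    "D.ps": "ATGTCTCTCACCAACAAGAACGTCGTTTTCGTGGCCGGTCT",
    "D.cr": "ATGTTCATCGCTGGCAAGAACATCATCTTTGTCGCTGGTCT",
    "D.mu": "ATGGCCATCGCTAACAAGAACATCATCTTCGTCGCTGGACT",
}

def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc_distance(p: float) -> float:
    """Jukes-Cantor correction of a p-distance (defined only for p < 0.75)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Print one line per pair: raw and corrected distance.
for a, b in combinations(seqs, 2):
    p = p_distance(seqs[a], seqs[b])
    print(f"{a} vs {b}: p = {p:.3f}, JC = {jc_distance(p):.3f}")
```

The correction grows faster than the raw proportion because it allows for multiple substitutions at the same site.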
Expected or Species tree
[Figure: tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum]
Realized tree for gene X
[Figure: tree of F. Whale, B. Whale, Cow, Rat, Mouse, and Opossum]
Two-fold Challenge
• Today’s challenge is the flood of data, in two ways:
1. The increasing number of taxa (say, species) for which molecular data is available.
2. The increasing amount of molecular data that is available for each taxon.
Why is reconstructing the evolutionary history of a large number of taxa a challenge?
The number of possible trees increases enormously as the number of taxa increases
Number of rooted trees
• The number of bifurcating rooted trees is given by the following formula, where m is the number of taxa.
$$1 \cdot 3 \cdot 5 \cdots (2m-3) = \frac{(2m-3)!}{2^{m-2}\,(m-2)!}$$
Source: Nei and Kumar, 2000. Molecular Evolution and Phylogenetics
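To get a feel for how quickly this count grows, the sketch below evaluates the rooted-tree formula, (2m-3)! / (2^(m-2) (m-2)!), and checks it against the equivalent product of odd numbers 1 · 3 · 5 ··· (2m-3). The function names are illustrative only.

```python
from math import factorial

def num_rooted_trees(m: int) -> int:
    """Number of bifurcating rooted trees for m taxa, from the closed form."""
    if m < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))

def num_rooted_trees_product(m: int) -> int:
    """Same count as the product of the odd numbers 1, 3, 5, ..., 2m - 3."""
    if m < 2:
        raise ValueError("need at least 2 taxa")
    total = 1
    for odd in range(1, 2 * m - 2, 2):
        total *= odd
    return total

for m in (3, 4, 5, 10, 20):
    print(m, num_rooted_trees(m))
```

Three taxa give 3 rooted trees, four give 15, and ten already give 34,459,425.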
3 taxa
Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html
4 taxa
Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html
More taxa
Source: Brian Golding, Reconstructing Phylogenies, http://helix.biology.mcmaster.ca/721/phylo/phylo.html
So many trees!
[Figure: no. of possible trees (powers of 10) versus no. of sequences (0 to 400), with reference levels for comparison]
• Millions
• Billions
• 5 × 10^11 stars in the Milky Way
• 5 × 10^30 prokaryotes living today
• 10^37 atoms in the bodies of all humans by year 2035
• 10^79 atoms in the universe
How many trees represent the true relationship?
Only ONE out of all possible trees is the true tree!
Which is the true tree?
Choose a criterion (optimality criterion).
Score the fit of the data to a given tree for that criterion
Tree with the optimal score is chosen as the best tree.
Optimal tree found in this way is expected to be closest to the true tree.
Optimality Criteria
Minimum Evolution (ME)
Branch lengths are computed for each tree using pairwise distances obtained from the sequences. The sum of branch lengths (S) is used as the optimality score.
[Diagram labels: Data, Substitution Model, Distance Computer, Topology, Branch lengths Computer, Sum of branch lengths]
Tree with the smallest S-value is chosen.
The Neighbor-Joining method (Saitou and Nei, Mol. Biol. Evol. 4: 406-425, 1987)
Properties of the NJ method
• Computationally efficient
• Desirable statistical properties
• Accuracy
• Performance with large phylogenies?
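A minimal sketch of neighbor joining follows, assuming the commonly used Q-matrix form of the Saitou and Nei criterion and returning topology only (no branch lengths). The input is the Drosophila distance matrix shown earlier. The function name and the nested-tuple output format are choices made for this example, not the speaker's implementation.

```python
from itertools import combinations

def neighbor_joining(dist):
    """dist maps frozenset({a, b}) -> distance; returns nested tuples (topology only)."""
    d = dict(dist)
    nodes = sorted({x for pair in d for x in pair})
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all other nodes.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)

# Pairwise distances copied from the slide's distance matrix.
dist = {
    frozenset(("D.me", "D.ps")): 0.14,
    frozenset(("D.me", "D.cr")): 0.24,
    frozenset(("D.ps", "D.cr")): 0.24,
    frozenset(("D.me", "D.mu")): 0.21,
    frozenset(("D.ps", "D.mu")): 0.20,
    frozenset(("D.cr", "D.mu")): 0.21,
}
tree = neighbor_joining(dist)
print(tree)
```

With four taxa the two complementary pairs tie on Q, so either (D.me, D.ps) or (D.cr, D.mu) is joined first; both describe the same unrooted tree. Each round scans all pairs, which is what keeps the method fast compared with scoring every possible topology.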
Research Problem
Performance of NJ optimality criteria in inferring large trees
• Is performance worse with more sequences?
• Are deep branches more difficult to infer than shallow ones?
• Are branches at similar depths reconstructed with the same efficiency in large and small trees?
Model trees and their features
• 4 basic 6-taxa trees (topologies)
• Equal interior branch lengths
• Trees stacked to make larger trees (e.g., Dx = x trees of type D stacked)
[Figure: model tree topologies with branch lengths, panels labelled A to G]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 1)
Additional model topology - the rbcL tree
(From Hillis, Nature, 383:130-131, 1996)
Tree parameters
• Rate: Up to 10 fold differences in rate.
• Sequence Length: Up to 10 multiples of 100 sites.
• Tree size: Ax, Bx, Cx, Dx, where x varied from 1 to 10, 16, and 32
Simulating Evolutionary Change
• Starting point or “root” chosen.
• Random ancestral sequence generated for the root.
• Branch length randomly obtained from a Poisson distribution with mean = expected no. of substitutions (evolutionary rate × sequence length × multiplier).
[Figure: example model tree with branch lengths]
Simulating Evolutionary Change (contd.)
• Equal probability of transition from one state to another.
• Process carried out for all branches.
• Resulting data are sequences for the taxa for that “gene”.
• These sequences used to infer back the evolutionary relationships using NJ.
• 1000 replications (A to D trees; ≤ 60 taxa), 100 reps (> 60 taxa, rbcL tree).
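The simulation steps can be sketched roughly as follows. This assumes that each substitution replaces the base at a randomly chosen site with one of the other three bases with equal probability. The tree encoding, the three-taxon example tree, the rate, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random

BASES = "ACGT"

def poisson(rng, mean):
    """Draw a Poisson variate (Knuth's method; adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def evolve(seq, rate, multiplier, rng):
    """Evolve one branch: the number of substitutions is Poisson with
    mean = rate * sequence length * branch multiplier."""
    sites = list(seq)
    for _ in range(poisson(rng, rate * len(sites) * multiplier)):
        pos = rng.randrange(len(sites))
        # Equal probability of changing to each of the three other bases.
        sites[pos] = rng.choice([b for b in BASES if b != sites[pos]])
    return "".join(sites)

def simulate(node, parent_seq, rate, rng, tips):
    """node = (multiplier, tip_name) or (multiplier, [child nodes])."""
    multiplier, payload = node
    seq = evolve(parent_seq, rate, multiplier, rng)
    if isinstance(payload, str):
        tips[payload] = seq          # reached a tip: record its sequence
    else:
        for child in payload:        # internal node: pass the sequence down
            simulate(child, seq, rate, rng, tips)
    return tips

rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(200))   # random ancestral sequence
model_tree = (0, [(1, [(1, "A"), (1, "B")]), (2, "C")])     # made-up 3-taxon model tree
tips = simulate(model_tree, root_seq, 0.05, rng, {})
for name, seq in sorted(tips.items()):
    print(name, seq[:30])
```

Running `simulate` repeatedly with different random seeds gives the replicate data sets that are then passed to the tree-building method.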
Accurate Inference of Complete Trees
[Figure: % recovery of complete trees (0 to 100) versus number of sequences (0 to 200), for 200, 500, and 1000 sites]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Table 1)
Effect of 0-length branches on NJ performance
[Figure: % branches correct (0 to 100) versus sequence length (s), 0 to 1000; curves P0, PModel, and PRealized]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 3)
Reconstruction efficiency of 6 taxa monophyletic clusters
[Figure: % correct replicates (70 to 100) versus number of sequences (0 to 200), for 200, 500, and 1000 sites]
% branches inferred correctly
[Figure: percent efficiency (PBR), 40 to 100, versus tree size (6, 18, 30, 42, 54, 96, 192) and rate (r = 0.00625, 0.03125, 0.0625), for 100, 200, 500, and 1000 sites]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 4)
Branch depth and NJ efficiency
[Figure: pB (70 to 100) versus branch depth (2 to 96), for tree sizes 6, 24, 42, 60, and 192]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 5B)
Shallow versus deep branches
Results from rbcL tree
[Figure: reconstruction efficiency (70 to 100) versus branch depth (2 to 74)]
Kumar and Gadagkar, 2000, J. Mol. Evol.,51:544-553 (Fig. 8B)
Branch depth and efficiency for different inference methods (JC simulations)
Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 1)
Branch depth and efficiency for different inference methods (HKY simulations)
Rosenberg and Kumar, 2001, Mol. Biol. Evol.,18:1823-1827 (Fig. 2)
The Challenge of Multi-Gene Sequences
• Multi-Gene/Whole Genome sequences increasingly available for many taxa.
• How best to obtain phylogenetic information from these multiple sequences?
Concatenation vs Consensus
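Concatenation approach
[Figure: the sequences of two genes, ATGCTGACTG and ATGTCGTCAGTC, are joined into ATGCTGACTGATGTCGTCAGTC, and one tree of taxa A B C D E is inferred from the joined sequence]
Consensus approach
[Figure: a tree of taxa A B C D E is inferred separately from each gene, ATGCTGACTG and ATGTCGTCAGTC, and the two trees are combined into one tree of A B C D E]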
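The two strategies can be contrasted in a short sketch. Concatenation joins the per-gene alignments end to end before any tree is built; consensus works on trees that were already inferred gene by gene. The slides do not say which consensus method was used, so the majority-rule version here is an assumption, as are the function names, the sequences for taxon B, and the three example gene trees. Each tree is given as a set of clusters (one side of each interior branch).

```python
from collections import Counter

def concatenate(gene_alignments):
    """Join per-gene alignments (dict: taxon -> sequence) end to end, taxon by taxon."""
    taxa = sorted(gene_alignments[0])
    for aln in gene_alignments:
        if sorted(aln) != taxa:
            raise ValueError("every gene must cover the same taxa")
    return {t: "".join(aln[t] for aln in gene_alignments) for t in taxa}

def majority_rule(gene_trees):
    """Keep the clusters that occur in more than half of the gene trees."""
    counts = Counter(cluster for tree in gene_trees for cluster in tree)
    return {c for c, seen in counts.items() if seen > len(gene_trees) / 2}

# Taxon A carries the two sequences from the slide; taxon B is made up.
gene1 = {"A": "ATGCTGACTG", "B": "ATGCTGACTA"}
gene2 = {"A": "ATGTCGTCAGTC", "B": "ATGTCGTCAGTT"}
supermatrix = concatenate([gene1, gene2])
print(supermatrix["A"])

# Three hypothetical gene trees on taxa A to E, written as sets of clusters.
trees = [
    {frozenset("AB"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("DE")},
]
print(majority_rule(trees))
```

Concatenation pools the sites, so a long or fast-evolving gene contributes more characters, whereas consensus gives every gene tree one vote regardless of gene length.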
The worst-case scenario approach
• The worst-case scenario is when all the available genes yield highly incorrect phylogenetic reconstructions.
• When faced with such sequences, which strategy to employ: consensus or concatenation?
Simulation with estimated parameters
• Model tree based on the phylogenetic relationships among 66 mammals from Murphy et al., (Nature 409:614-618, 2001).
Source: Fig. 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Simulation with estimated parameters
• Sequences for 448 genes downloaded from HOVERGEN (Duret et al., Nucleic Acids Res. 22: 2360-2365, 1994).
• Sequence parameters (length, L; substitution rate, r; transition-transversion rate ratio, κ; and G+C content) were estimated from the data.
Simulation with estimated parameters (contd.)
• For each of the 448 genes, 100 replicate sequences generated by computer simulation, using the estimated parameters and the HKY model of evolution.
[Diagram: Computer Simulation produces a grid of replicates, with rows Gene1, Gene2, Gene3, ..., Gene448 and columns Rep1, Rep2, Rep3, ..., Rep100]
Simulation with estimated parameters (contd.)
• Phylogenetic inference was done on each of the 44,800 simulation replicates using NJ-JC and NJ-TN methods.
• The accuracy of each tree was recorded in terms of the number of incorrect branches when compared to the model tree.
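One way to count incorrect branches is to compare the interior branches of an inferred tree with those of the model tree, treating each interior branch as a partition of the taxa. The sketch below assumes that definition. The nested-tuple tree format, the function names, and both six-taxon example trees are hypothetical.

```python
def leaves(tree):
    """Set of tip names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def splits(tree):
    """Interior branches of the tree, each stored as the side of its
    partition that does not contain a fixed reference taxon."""
    all_taxa = frozenset(leaves(tree))
    ref = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if ref in clade else clade
        # Keep only interior branches: at least 2 taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found

def incorrect_branches(inferred, model):
    """Number of interior branches in the inferred tree absent from the model tree."""
    return len(splits(inferred) - splits(model))

# Made-up example: the inferred tree swaps Cow and Rat relative to the model tree.
model = (((("F. Whale", "B. Whale"), "Cow"), ("Rat", "Mouse")), "Opossum")
inferred = (((("F. Whale", "B. Whale"), "Rat"), ("Cow", "Mouse")), "Opossum")
print(incorrect_branches(inferred, model))
```

Storing each branch as the side without the reference taxon makes the comparison independent of where the tree happens to be rooted.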
Spread of sequence attributes
[Figure: four histograms of gene attributes: Log Sequence Length, Substitution rate per site (× 10^-9), log Kappa, and G+C content]
Simulation lets us play God!
• In computer simulation, evolution is simulated based on a model tree, and replicate sequences are obtained.
• These replicate sequences are then used to infer back the true tree.
• Because the true (model) tree is known, we can identify the worst-performing replicate among the 100 simulation replicates for each of the 448 genes.
[Figure: tree of D. melanogaster, D. pseudoobscura, D. mulleri, and D. crassifemur, with the starting point marked “Start”]
The Two-Gene Case
• Data: For each of NJ-JC and NJ-TN, we picked
* 10,000 pairs of worst replicates
* 10,000 pairs of randomly chosen replicates
Two-gene concatenation
Source: Table 1 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Comparison of the number of incorrect inferred branches (NJ-JC)
[Figure: two scatter plots, one for worst replicate pairs and one for random replicate pairs; horizontal axis: Gene 1 tree; both axes run from 0 to 50]
Effect of Gene Attributes
Effect of gene attributes (contd.)
[Figure: Log Length (Gene 2) versus Log Length (Gene 1), both axes from 2.00 to 4.00]
Quality of second gene
[Figure panels: Worst case, Random case]
Source: Fig. 2 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Progressive addition of genes
Source: Fig. 3 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
When all 448 genes were used
Source: Fig. 4 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Effect of neighboring branches
Source: Fig. 5 from Gadagkar, Rosenberg and Kumar, Molecular and Developmental Evolution (Accepted)
Summary & Conclusions
• Heck of a lot of data available
• Two dimensions – number of species, and number of sequences per species
• Many methods available to infer phylogenies from a large number of species
• Neighbor-joining (NJ), a fast, distance-based algorithm, works well and infers trees correctly as long as there are no polytomies (multifurcations) in the true tree
• NJ also infers shallow and deep branches with good and equal efficiency
Summary and Conclusions (contd.)
• Multigene data available for many species
• How best to obtain phylogenetic info from these sequences (consensus or concatenation)?
• Our simulation results, with biologically realistic parameters and the worst-case approach, show that concatenation is better
• However, concatenation approach appears excessively prone to certain systematic errors.
Acknowledgements
• Co-authors:
– Sudhir Kumar
– Michael Rosenberg
• Help:
– Roman Johnson
– Tushar Gadagkar
– Sankar Subramanian
– Balaji Ramanujam
Arizona State University
University of Dayton
To find out more about our graduate programs, please visit our Biology Department at:
http://biology.udayton.edu
Apply online for free at:
http://gradadmission.udayton.edu