Phylogenetic analyses
Roman Biek
Institute of Biodiversity, Animal Health & Comparative Medicine
EEID Evolution workshop 2012
0
What is phylogenetics?
Reconstructing the ancestral relationships among taxa in the form of genealogical trees
Taxa can be species, individuals or particular genes
Tree is only an estimate => “truth” usually unknown
1 Intro
A simple four taxa example
2
Who is our closest relative?
3
The overall aim
ATTTCTCTG
ATTTCCTTA
ATGTCCTTA
ATGTCCTTA
ATGTCCTCA
Analysis and Interpretation
Measure variation at the molecular level
Develop models that fit the observed patterns
Infer process from patterns
Non-taxonomic questions
4
Molecular clocks e.g. “How long ago since two groups split?”
Selection e.g. “Which sites have undergone adaptive change?”
Ancestral state change e.g. “Movement rate between population A and B?”
Demographic reconstruction e.g. “How has population size changed through time?”
5
Outline
Basic terminology and concepts
Estimating phylogenies:
• alignment
• substitution models
• methods for tree building
• quantifying uncertainty
6
The parts of a tree
7
Trees are like mobiles
8
Tree not always strictly bifurcating
9
Different ways to depict a tree
Topology only
Topology + branch lengths
Shape of the tree
10
Inferring character state change
11
Monophyly vs Non-Monophyly
All descendants derived from one ancestor AND all descendants included
Does not include all descendants
12
Rooted vs unrooted trees
Multiple options for placing the root
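(Figure: unrooted tree of taxa A, B, C and D with candidate root positions numbered 1 to 5, next to a rooted tree with tips ordered B, A, C, D)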
Rooting most commonly done using outgroup: taxon or taxa that fall just outside the group of interest
Number of possible trees rises quickly!
Taxa   Unrooted trees   Rooted trees
3      1                3
4      3                15
5      15               105
6      105              945
7      945              10,395
8      10,395           135,135
9      135,135          2,027,025
10     2,027,025        34,459,425
20     2.22E+20         8.20E+21
30     8.69E+36         4.95E+38
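The counts in this table follow a double-factorial pattern for strictly bifurcating trees: (2n - 5)!! unrooted and (2n - 3)!! rooted trees for n taxa. The short sketch below reproduces the table under that assumption; the function names are illustrative only.

```python
def double_factorial(n: int) -> int:
    """Product n * (n - 2) * (n - 4) * ... down to 1 (equals 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(taxa: int) -> int:
    """Number of strictly bifurcating unrooted trees, (2n - 5)!!, for n >= 3."""
    return double_factorial(2 * taxa - 5)


def rooted_trees(taxa: int) -> int:
    """Number of strictly bifurcating rooted trees, (2n - 3)!!, for n >= 2."""
    return double_factorial(2 * taxa - 3)


if __name__ == "__main__":
    print(f"{'Taxa':>4} {'Unrooted':>12} {'Rooted':>12}")
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 20, 30):
        print(f"{n:>4} {unrooted_trees(n):>12.3g} {rooted_trees(n):>12.3g}")
```

Each unrooted tree with n taxa has 2n - 3 branches on which a root can be placed, which is why the rooted count for n taxa equals the unrooted count for n + 1.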
13
14
The basic steps of phylogenetic analysis
1) Collect homologous sequences
2) Conduct multiple alignment
3) Fit an appropriate substitution model
4) Estimate tree(s) under that model
5) Test the reliability of the estimated tree(s)
6) Interpret and apply the phylogenetic tree
7) Potentially repeat steps 4-6 using different tree building methods and/or additional data
15
Homology requirement
Are sequences correctly aligned so that each nucleotide position has its own unbroken history?
Only an issue because sequences may contain insertions and deletions (indels)
Need algorithm that can determine the least costly alignment
Unaligned:
ATGCGTCTTCCACAGA
ATGCATCGTTCCACAAA

Aligned:
ATGCGTC-TTCCACAGA
ATGCATCGTTCCACAAA
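One classic way to find a least-cost pairwise alignment is Needleman-Wunsch dynamic programming. The sketch below is a minimal version applied to the two sequences from this slide; the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, and real multiple-alignment programs use more elaborate schemes.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of (mis)match, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_align("ATGCGTCTTCCACAGA", "ATGCATCGTTCCACAAA")
    print(total)
    print(top)
    print(bottom)
```

With these scores the algorithm places a single gap in the shorter sequence rather than accepting a long run of mismatches.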
16
Models of substitution
How to measure distance between two sequences?
Easiest measure would be number (or proportion) of different sites
=> Problem of multiple ‘hits’ at the same site
Empirical mtDNA data from bovine mammals
Jukes - Cantor model
All nucleotides undergo changes at the same rate
Nucleotide frequencies are the same: qA = qC = qG = qT = ¼

     A   T   C   G
A    -   α   α   α
T    α   -   α   α
C    α   α   -   α
G    α   α   α   -
17
Kimura 2-parameter model
Transitions (α) (purine to purine or pyrimidine to pyrimidine substitutions) are more common than transversions (β)

     A   T   C   G
A    -   β   β   α
T    β   -   α   β
C    β   α   -   β
G    α   β   β   -
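(Figure: transitions (α) occur within the pyrimidines, C ↔ T, and within the purines, A ↔ G; transversions (β) occur between a pyrimidine and a purine)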
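Under the two-parameter model, a pairwise distance is computed from the proportion of transitions (P) and transversions (Q) as d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The sketch below counts both classes of difference and applies that formula; the two short sequences and all names are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_changes(seq1: str, seq2: str):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance in expected substitutions per site."""
    transitions, transversions = count_changes(seq1, seq2)
    n = len(seq1)
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    a, b = "ATGCAGGTA", "ATGCTGCTA"
    print(count_changes(a, b), round(k2p_distance(a, b), 4))
```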
18
Variation among sites
Some sites undergo changes more frequently than others
Can be expressed using a gamma distribution
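As a rough illustration of gamma-distributed rate variation, the sketch below draws relative site rates from a gamma distribution with mean 1, so that a single shape parameter controls how unequal the rates are. The shape values, sample size and seed are arbitrary assumptions; phylogenetic software typically uses a discretised gamma rather than random draws.

```python
import random
import statistics


def sample_site_rates(shape: float, n_sites: int, seed: int = 0):
    """Draw relative substitution rates for n_sites sites.

    random.gammavariate(alpha, beta) has mean alpha * beta, so using
    beta = 1 / shape keeps the mean rate at 1 for any shape value.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


if __name__ == "__main__":
    for shape in (0.2, 1.0, 5.0):
        rates = sample_site_rates(shape, 10_000)
        print(
            f"shape = {shape}: mean rate = {statistics.fmean(rates):.2f}, "
            f"variance = {statistics.pvariance(rates):.2f}"
        )
```

Small shape values give a few fast sites and many nearly invariant ones; large values approach equal rates across sites.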
19
20
Finding a substitution model
21
Choosing the right model
jModeltest
Available from: http://darwin.uvigo.es/software/jmodeltest.html
Fits up to 88 candidate models to your sequence data
Model selection based on AIC
Model averaging
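AIC-based model selection compares AIC = 2k - 2 ln L across candidate models, where k is the number of free parameters and ln L the maximised log-likelihood. The sketch below ranks three models and computes Akaike weights, which are the basis of model averaging. The log-likelihood values are hypothetical, invented for illustration, and this is not jModeltest's own code or output.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_scores: dict) -> dict:
    """Relative support for each model, normalised to sum to 1."""
    best = min(aic_scores.values())
    raw = {name: math.exp(-0.5 * (score - best)) for name, score in aic_scores.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical results: (maximised log-likelihood, free substitution parameters).
candidates = {
    "JC69": (-2050.0, 0),
    "K80": (-2010.0, 1),
    "GTR+G": (-2005.0, 9),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
weights = akaike_weights(scores)
best_model = min(scores, key=scores.get)

if __name__ == "__main__":
    for name in sorted(scores, key=scores.get):
        print(f"{name:<6} AIC = {scores[name]:.1f}  weight = {weights[name]:.3f}")
    print("best:", best_model)
```

In this made-up example the most complex model has the highest likelihood but is penalised for its extra parameters.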
22
24
Estimating phylogenies
General approaches for building trees
• Distance based methods
• Maximum parsimony
• Maximum likelihood
• Bayesian methods
25
Estimating phylogenies
Involves two processes:
• Estimation of the topology
• Estimation of the branch lengths

Optimality criterion
• How well do the data fit a particular tree topology?
• Is used to compare and rank different trees
• Allows searching for the best tree (under the given criterion)
Distance-based methods
SpA  ATGCAGGTA
SpB  ATGCTGCTA
SpC  ATGCAGCTC
SpD  TAGCAGGAC

       SpA          SpB          SpC          SpD
SpA    -
SpB    2/9 = 0.22   -
SpC    0.22         0.22         -
SpD    0.44         0.67         4/9 = 0.44   -
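The pairwise values for this example can be recomputed as the proportion of differing sites between each pair of sequences (the uncorrected p-distance). The sketch below does this for the four sequences SpA to SpD; the function name is illustrative.

```python
from itertools import combinations

sequences = {
    "SpA": "ATGCAGGTA",
    "SpB": "ATGCTGCTA",
    "SpC": "ATGCAGCTC",
    "SpD": "TAGCAGGAC",
}


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(seq1, seq2))
    return differences / len(seq1)


if __name__ == "__main__":
    for first, second in combinations(sequences, 2):
        d = p_distance(sequences[first], sequences[second])
        print(f"{first}-{second}: {d:.2f}")
```

A corrected distance under a substitution model could be used in place of the raw proportion before building the tree.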
26 27
Distance-based methods
Basic procedure
• Calculate pairwise distances among all sequences (according to some substitution model)
• Use distances to build tree (according to some rule, e.g. “neighbour joining” method)

Important features
• Very quick way to generate tree, even for large data sets
• Usually no attempt to evaluate alternative trees
• Information about character state change is lost
28
Maximum likelihood
Basic procedure
• Optimality criterion: likelihood score
• Maximize the probability of the sequences, given a tree and its branch lengths and an evolutionary model and its parameters

Important features
• Allows full use of evolutionary models
• Relies heavily on model chosen => can be misleading if there is much variation in the substitution process among lineages
• Computationally much more demanding
29
Bayesian phylogenetics
Basic procedure
• Objective: determine the posterior distribution of trees given the sequence data
• Based on this distribution, ‘best’ tree can be identified

Important features
• Allows full use of evolutionary models
• Need to include priors
• Posterior probabilities are approximated through Markov Chain Monte Carlo methods that sample from the posterior
• Clade probabilities provide measure of uncertainty
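To give a feel for how Markov Chain Monte Carlo approximates a posterior, the sketch below runs a Metropolis sampler on a single parameter: the branch length between two sequences under Jukes-Cantor. The exponential prior, proposal step size, chain length, seed and site counts are all assumptions made for illustration; programs such as MrBayes or BEAST sample whole trees and many parameters with far more sophisticated proposals.

```python
import math
import random


def log_likelihood(d: float, n_same: int, n_diff: int) -> float:
    """Jukes-Cantor log-likelihood (up to a constant) for a branch of length d."""
    decay = math.exp(-4.0 * d / 3.0)
    return n_same * math.log(0.25 + 0.75 * decay) + n_diff * math.log(0.25 - 0.25 * decay)


def log_prior(d: float, rate: float = 1.0) -> float:
    """Exponential prior on the branch length (an arbitrary illustrative choice)."""
    return math.log(rate) - rate * d


def metropolis(n_same: int, n_diff: int, n_iter: int = 20_000, step: float = 0.2, seed: int = 1):
    """Sample branch lengths from the posterior with a random-walk Metropolis chain."""
    rng = random.Random(seed)
    d = 0.1
    current = log_likelihood(d, n_same, n_diff) + log_prior(d)
    samples = []
    for _ in range(n_iter):
        proposal = d + rng.gauss(0.0, step)
        if proposal > 0:  # proposals outside the support are always rejected
            candidate = log_likelihood(proposal, n_same, n_diff) + log_prior(proposal)
            accept_prob = math.exp(min(0.0, candidate - current))
            if rng.random() < accept_prob:
                d, current = proposal, candidate
        samples.append(d)
    return samples


if __name__ == "__main__":
    chain = metropolis(n_same=7, n_diff=2)
    kept = sorted(chain[2_000:])  # discard the first part of the chain as burn-in
    mean = sum(kept) / len(kept)
    lower, upper = kept[int(0.025 * len(kept))], kept[int(0.975 * len(kept))]
    print(f"posterior mean = {mean:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

With only nine sites the interval is wide, which illustrates how the posterior conveys uncertainty rather than a single point estimate.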
Bayes’ rule in statistics
30
Bayesian vs. ML parameter estimation
31
(Figure: likelihood or posterior probability plotted against some parameter, e.g. transition/transversion ratio)
Holder and Lewis 2003, Nature Reviews Genetics
32
How well supported is a grouping?
Non-parametric bootstrap
• Sample from the original data to create ‘new’ data sets
• Count how often a particular clade appears in the resampled data

Values > 70% considered strong support
Bootstrapping
• “New” data sets of the same size are generated from the original data by sampling columns with replacement
• Trees are built from these new data sets
• The frequency with which a node appears across replicate trees is taken as a measure of confidence for that node
123456789
ATGCAGGTA
ATGCTGCTA
ATGCAGCTC
TAGCAGGAC
ORIGINAL

516446789
AAGCCGGTA
TAGCCGCTA
AAGCCGCTC
ATGCCGGAC
REPLICATE1
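Column resampling can be sketched in a few lines: every replicate picks alignment columns with replacement and applies the same column choice to all sequences. The code below replays the column order 516446789 from this example and then draws a random replicate; the function names and seed are illustrative, and the tree-building step for each replicate is omitted.

```python
import random

alignment = ["ATGCAGGTA", "ATGCTGCTA", "ATGCAGCTC", "TAGCAGGAC"]


def resample_columns(sequences, columns):
    """Build a new alignment from the given 0-based column indices."""
    return ["".join(seq[c] for c in columns) for seq in sequences]


def bootstrap_replicate(sequences, rng):
    """Sample as many columns as the alignment has, with replacement."""
    n_columns = len(sequences[0])
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return resample_columns(sequences, columns)


if __name__ == "__main__":
    # Replay the column order shown for the first replicate (1-based on the slide).
    slide_columns = [int(digit) - 1 for digit in "516446789"]
    for row in resample_columns(alignment, slide_columns):
        print(row)
    # Draw one random replicate; a full analysis would repeat this many times.
    for row in bootstrap_replicate(alignment, random.Random(42)):
        print(row)
```

Because whole columns are resampled, each site keeps its pattern across taxa, so some sites appear several times in a replicate and others not at all.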
33
34
How well supported is a grouping?
Posterior probabilities
• Count the frequency of a clade within the posterior distribution of trees

Less conservative: values > 95% considered strong support
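Counting clade frequencies works the same way whether the trees come from a posterior sample or from bootstrap replicates. In the sketch below each tree is reduced to the set of clades it contains, and support is the fraction of trees containing a clade. The four sampled trees are hypothetical and the representation is a simplification chosen for brevity.

```python
from collections import Counter


def clade_support(trees):
    """Fraction of sampled trees in which each clade appears.

    Each tree is given as a set of clades; each clade is a frozenset of tip names.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


# Hypothetical sample of four rooted trees on taxa A, B, C and D.
sampled_trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("BD")},
]

support = clade_support(sampled_trees)

if __name__ == "__main__":
    for clade, value in sorted(support.items(), key=lambda item: -item[1]):
        print("".join(sorted(clade)), f"{value:.2f}")
```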
35
Estimating phylogenies
Approach                  Commonly used software
Distance-based methods    MEGA, Geneious, Paup*, R
Maximum parsimony         MEGA, Geneious, Paup*
Maximum likelihood        MEGA, Geneious, Paup*, R, PhyML
Bayesian methods          Geneious, MrBayes, BEAST
Program names in bold can also have capabilities for sequence viewing and alignment.
36
The basic steps of phylogenetic analysis
Holder and Lewis 2003, Nature Reviews Genetics
37
Further resources
Molecular Evolution Workshop, Woods Hole http://workshop.molecularevolution.org/
Non-taxonomic questions
38
Molecular clocks e.g. “How long ago since two groups split?”
Selection e.g. “Which sites have undergone adaptive change?”
Ancestral state change e.g. “Movement rate between population A and B?”
Demographic reconstruction e.g. “How has population size changed through time?”
Evolutionary change vs genome size
Gago et al. 2009, Science
39
Molecular clocks
Mutations that are selectively neutral should accumulate over time
• Creates expectation that genetic and temporal divergence are correlated => molecular clock
Clock rate: 0.21 (genome⁻¹ year⁻¹), 4.75 × 10⁻⁸ (site⁻¹ year⁻¹)
Mycobacterium bovis
40
Molecular clocks
Clocks are traditionally calibrated using fossil data
For measurably evolving pathogens, the evolutionary rate can be estimated from dated tips
41
Raccoon rabies in eastern North America
Raccoon rabies cases 2001, CDC
Raccoon (Procyon lotor)
Viral clocks: raccoon rabies
42
(Map: rabies invasion history 1977-1999; labelled locations: Maine, North Carolina; scale bar 500 km)
43
Sampling scheme
44
Sampling scheme
45
Estimating the evolutionary rate
Consider viruses sampled at different points in time:
Tree has two scales:
1) Time in years
2) Expected number of substitutions per site
⇒ Estimate evolutionary rate µ, which gives a linear relationship between the two scales
µ = 5 × 10⁻⁴ subst/site/yr, or 5% per hundred years
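A quick way to see the linear relationship between the two scales is a root-to-tip regression: genetic distance from the root is regressed on sampling date, and the slope estimates µ. The sketch below uses hypothetical data constructed so that the slope matches the rate quoted on the slide; it is a simplified stand-in, not the Bayesian clock analysis used for the raccoon rabies data.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical sampling years and root-to-tip distances (substitutions per site).
sampling_years = [1982, 1987, 1992, 1997, 2002]
root_to_tip = [0.0076, 0.0099, 0.0126, 0.0149, 0.0176]

rate, intercept = least_squares(sampling_years, root_to_tip)
root_year = -intercept / rate  # date at which the fitted distance reaches zero

if __name__ == "__main__":
    print(f"rate = {rate:.1e} subst/site/yr")
    print(f"divergence per hundred years = {rate * 100:.1%}")
    print(f"estimated root date = {root_year:.0f}")
```

The x-intercept of the fitted line gives a rough estimate of when the sampled lineages last shared a common ancestor.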
46
Bayesian tree estimated under a molecular clock model (Drummond et al 2002, Genetics)
Phylogenies with time scales
47
Phylogeny reflects spatial organization
48
Phylogeny reflects spatial organization
49
Genealogies and the coalescent
Coalescent theory (Kingman 1982) relates tree shape to population history
(Figure: genealogies from a declining vs. a growing population)
Emerson (2001, TREE)
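The link between tree shape and population size can be made concrete with the expected waiting times of the standard coalescent. The sketch below assumes a haploid population of constant size N, where the expected time during which k lineages remain is N / C(k, 2) generations; the sample size and population sizes are arbitrary illustrative values.

```python
def expected_coalescent_intervals(n_samples: int, pop_size: float) -> dict:
    """Expected time (in generations) spent with k lineages, for k = n .. 2.

    Assumes a constant-size haploid population: with k lineages the rate of
    coalescence is k * (k - 1) / 2 divided by the population size.
    """
    return {
        k: pop_size / (k * (k - 1) / 2)
        for k in range(n_samples, 1, -1)
    }


if __name__ == "__main__":
    for pop_size in (1_000, 10_000):
        intervals = expected_coalescent_intervals(5, pop_size)
        tree_height = sum(intervals.values())
        print(f"N = {pop_size}: intervals = {intervals}, expected height = {tree_height:.0f}")
```

Larger populations stretch every interval, so short intervals between coalescent events point to a small population at that time; this is the signal that coalescent-based methods exploit.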
50
genealogy + rate of evolution → coalescent-based estimate of population size → population size through time
Inferring the number of infected raccoons
51
Inferring the number of infected raccoons
Biek et al. 2007, PNAS
52
Inferring the number of infected raccoons
Biek et al. 2007, PNAS
53
Close correspondence to observed data
Biek et al. 2007, PNAS
54
Global spread of H1N1
55
Ancestral state changes
Ancestral state reconstruction is used to infer character state change across a phylogenetic tree
⇒ State change may refer to any kind of trait, e.g.
• movement event
• host switch
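One simple way to reconstruct ancestral states is Fitch parsimony, which finds the minimum number of state changes needed to explain the tip states on a given tree. The sketch below applies it to a hypothetical four-tip tree with a location trait; the tree, tip names and states are invented, and model-based (likelihood or Bayesian) reconstructions are common alternatives.

```python
def fitch(tree, tip_states):
    """Fitch parsimony on a rooted binary tree.

    tree is either a tip name (str) or a pair (left_subtree, right_subtree).
    Returns (possible_states_at_this_node, minimum_number_of_changes_below_it).
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, tip_states)
    right_states, right_changes = fitch(right, tip_states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change needed here.
        return shared, left_changes + right_changes
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical tree ((A,B),(C,D)) with a sampling location for every tip.
tree = (("A", "B"), ("C", "D"))
locations = {"A": "west", "B": "west", "C": "east", "D": "east"}

root_states, n_changes = fitch(tree, locations)

if __name__ == "__main__":
    print("possible root states:", sorted(root_states))
    print("minimum number of movement events:", n_changes)
```

Here the trait is location, so each inferred change corresponds to one movement event along the tree.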
56
Cougars (n=353) sampled along Rocky Mountains (USA/ Canada)
WCS/TKRuth
Landscape genetics of host and virus
57
Landscape genetics of host and virus
Bayesian genetic clustering approach: program GENELAND (Guillot et al. 2005, Genetics)
Applied to cougar microsatellite data: indicates two populations
How frequently do cougar viruses move across this boundary?
58
Viral movement rate lower than expected?
Rate estimation repeated for 75 alternative assignments
59
Minimal viral movement across boundary
Mean rate only Variability across sampled trees
60
Summary
• Molecular clocks predict that genetic divergence increases regularly with time
• In rapidly evolving pathogens, it is possible to estimate the rate of change from dated samples => allows phylogenies to be calibrated
• Can be combined with coalescent techniques to reconstruct population history through time
• Ancestral state reconstruction can reveal movement among discrete states (e.g. geographic locations)
62