Phylogenetics
Phylogenetic Trees
[Figure: a phylogenetic tree drawn along a time axis, with labels for node, branch, root, Operational Taxonomic Unit (OTU) and Hypothetical Taxonomic Unit]
Information
• Branching order (topology)
– Relative closeness of different taxa
• Branch length
– Amount of divergence
Rooted and unrooted trees
[Figure: the same five OTUs (A, B, C, D, E) drawn as a rooted tree and as an unrooted tree]
[Figure: all possible unrooted and rooted trees for 3 OTUs (A, B, C) and for 4 OTUs (A, B, C, D); … 15 rooted trees of 4 OTUs]
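The tree counts on this slide follow a standard double-factorial pattern for strictly bifurcating trees: (2n - 5)!! unrooted and (2n - 3)!! rooted trees for n OTUs. The sketch below is only an illustration of that arithmetic; the function names are made up for this example.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_otus: int) -> int:
    """Number of distinct unrooted bifurcating trees: (2n - 5)!!."""
    if n_otus < 3:
        raise ValueError("an unrooted tree needs at least 3 OTUs")
    return double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus: int) -> int:
    """Number of distinct rooted bifurcating trees: (2n - 3)!!."""
    if n_otus < 2:
        raise ValueError("a rooted tree needs at least 2 OTUs")
    return double_factorial(2 * n_otus - 3)


for n in (3, 4, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For 4 OTUs this gives 3 unrooted and 15 rooted trees, matching the slide; by 10 OTUs there are already more than 34 million rooted trees, which is why exhaustive searches become impractical.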
Monophyletic & Paraphyletic
[Figure: tree of mammals, turtles and tortoises, snakes and lizards, crocodiles, and birds, with a bracket labelled REPTILES]
• Monophyletic
– Natural clade; a common ancestor together with all of the taxa derived from it
• Paraphyletic
– Taxonomic group whose most recent common ancestor is also the ancestor of another taxon that is left out of the group
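One way to make the two definitions concrete is to test whether a set of taxa is exactly the set of tips below some node. The sketch below assumes a nested-tuple tree; the topology used (crocodiles closest to birds, turtles outside them) is an illustrative assumption, not something this transcript states, and `leaf_sets` and `is_monophyletic` are hypothetical helper names.

```python
def leaf_sets(tree):
    """Return (tips under this node, list of tip sets for every clade below it)."""
    if isinstance(tree, str):  # a tip (OTU)
        tips = frozenset([tree])
        return tips, [tips]
    tips = frozenset()
    clades = []
    for child in tree:
        child_tips, child_clades = leaf_sets(child)
        tips |= child_tips
        clades.extend(child_clades)
    clades.append(tips)  # the clade defined by this internal node
    return tips, clades


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of tips descended from one node."""
    _, clades = leaf_sets(tree)
    return frozenset(group) in clades


# Assumed topology, for illustration only.
tree = ("Mammals", ("Turtles", ("Snakes and lizards", ("Crocodiles", "Birds"))))

reptiles = {"Turtles", "Snakes and lizards", "Crocodiles"}
print(is_monophyletic(tree, reptiles))              # reptiles without birds
print(is_monophyletic(tree, reptiles | {"Birds"}))  # reptiles plus birds
```

Under the assumed topology the traditional reptiles fail the test because birds share their most recent common ancestor, while reptiles plus birds pass it.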
Reconstruct phylogeny from molecular data
[Figure: five aligned DNA sequences (each shown as ACTGTTACCGA) and an unknown tree, marked "?", to be reconstructed from them]
Types of phylogenetic analysis methods
• Phenetic: trees are constructed based on observed characteristics, not on evolutionary history (distance methods)
• Cladistic: trees are constructed based on fitting observed characteristics to some model of evolutionary history (parsimony and maximum likelihood methods)
Methods of Tree reconstruction
• Distance
• Maximum Parsimony
• Maximum Likelihood
• Bayesian
Phylogeny Estimation: Traditional and Bayesian Approaches
Nature Reviews Genetics (2003) 4:275
Genetic distance
• Distance from one sequence to another
• Hamming Distance
– Count number of differences
• Multiple hits
– Number of events is greater than number of differences
– Estimate number of events
• Infer tree from genetic distance using Neighbour-joining (NJ) method
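A minimal sketch of the two distance ideas above: the Hamming distance counts observed differences, and a correction estimates the number of events when multiple hits are possible. The slides do not name a correction; the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) is used here as one common choice, and the second sequence is a made-up variant of the one shown earlier.

```python
import math


def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Estimate substitutions per site, correcting for multiple hits.

    Assumes the Jukes-Cantor model (equal base frequencies and rates),
    which is an assumption of this sketch, not of the slides.
    """
    p = hamming_distance(seq_a, seq_b) / len(seq_a)  # observed proportion
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_1 = "ACTGTTACCGA"
seq_2 = "ACTGATACCGT"  # hypothetical variant with two changes
print(hamming_distance(seq_1, seq_2))
print(jukes_cantor_distance(seq_1, seq_2))
```

The corrected distance is always at least as large as the observed proportion of differences, which is the point of the "multiple hits" bullet.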
UPGMA is shown for illustrative purposes. Neighbour-joining is the preferred method.
• The algorithm in the text means: find the closest distance between two sequences and cluster those; then find the next closest distance and cluster those; as sequences are added to existing clusters, find the average distance between existing clusters
• Work through the notation!
• UPGMA assumes a molecular clock mechanism of evolution
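The clustering loop described above can be sketched as follows. This is a bare-bones illustration that returns only the topology as nested tuples and skips branch heights; the `upgma` function, its input layout and the distances are all assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster by repeatedly joining the closest pair (topology only).

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of sequences in it
    d = dict(dist)
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical clock-like distances.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(upgma(["A", "B", "C", "D"], example))
```

Because the example distances are consistent with a molecular clock, the clustering recovers A and B as the closest pair, then adds C, then D.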
• Neighbour-joining: corrects for UPGMA’s assumption of the same rate of evolution for each branch by modifying the distance matrix to reflect different rates of change.
• The net difference between sequence i and all other sequences is
• $r_i = \sum_k d_{ik}$
• The rate-corrected distance matrix is then
• $M_{ij} = d_{ij} - (r_i + r_j)/(n - 2)$, where n is the number of sequences
• Join the two sequences i and j whose $M_{ij}$ is minimal; then calculate the distance from this new node k to every other sequence m using
• $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$
• Again correct for rates and join nodes.
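A single neighbour-joining step, following the three formulas above, might look like the sketch below. The helper names and the example distances are hypothetical; the distances were chosen to fit a tree in which A and B are neighbours but B evolves faster, so the smallest raw distance (A to C) is misleading.

```python
def make_matrix(pairs):
    """Build a symmetric dict-of-dicts distance matrix from pairwise values."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def nj_pick_pair(names, d):
    """Return the pair (i, j) with minimal rate-corrected distance M_ij."""
    n = len(names)
    r = {i: sum(d[i][k] for k in names if k != i) for i in names}  # net differences
    best = None
    for idx, i in enumerate(names):
        for j in names[idx + 1:]:
            m_ij = d[i][j] - (r[i] + r[j]) / (n - 2)
            if best is None or m_ij < best[0]:
                best = (m_ij, i, j)
    return best[1], best[2]


def nj_distances_to_new_node(names, d, i, j):
    """Distance from the node joining i and j to every other sequence m."""
    return {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in names if m not in (i, j)}


names = ["A", "B", "C", "D"]
d = make_matrix({
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 5,
    ("B", "C"): 7, ("B", "D"): 8, ("C", "D"): 5,
})
i, j = nj_pick_pair(names, d)
print(i, j, nj_distances_to_new_node(names, d, i, j))
```

Here the rate correction selects A and B even though A and C have the smallest raw distance, which is the behaviour the first bullet describes. A full implementation would repeat these two steps until only two nodes remain.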
Maximum Parsimony (MP)
• Find topology requiring smallest number of evolutionary changes
• Consider each position (site) in the sequence alignment independently
• Not all sites are informative
• Informative
– Favours one topology over others
– Requires at least two different nucleotides at the site, each present in at least two sequences
Informative sites
a. A A G A G T T C A
b. A G C C G T T C T
c. A G A T A T C C A
d. A G A G A T C C T
[Figure: the three possible unrooted trees for sequences a, b, c and d]
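The informative sites in the four sequences a to d can be found mechanically. The sketch below uses the usual criterion (a column is informative when at least two different nucleotides each occur in at least two sequences); the function name is made up for this example.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 1-based positions of parsimony-informative columns."""
    sites = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        # States seen in at least two sequences.
        shared_states = [state for state, count in counts.items() if count >= 2]
        if len(shared_states) >= 2:
            sites.append(position)
    return sites


alignment = [
    "AAGAGTTCA",  # a
    "AGCCGTTCT",  # b
    "AGATATCCA",  # c
    "AGAGATCCT",  # d
]
print(informative_sites(alignment))  # [5, 7, 9]
```

Sites 5 and 7 group a with b and c with d, while site 9 groups a with c and b with d, so the tree ((a,b),(c,d)) is supported by two sites and is the most parsimonious of the three.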
Maximum Likelihood (ML)
• Likelihood L of a tree is the probability of observing the data given the tree: L = P(data|tree)
• Find the tree with the highest L value
• Results depend on the model of nucleotide substitution
• Computationally time-consuming
• Actually, all the other methods discussed implicitly use a simple model of evolution similar to the typical model made explicit in maximum likelihood:
• All sites selectively neutral
• All mutate independently, forward and reverse rates equal, given by μ
• Also assume discrete generations and sites change independently
• Given this model, can calculate the probability that a site with initial nucleotide i will change to nucleotide j within time t:
• $P^t_{ij} = \delta_{ij} e^{-\mu t} + (1 - e^{-\mu t}) g_j$, where $\delta_{ij} = 1$ if i = j and $\delta_{ij} = 0$ otherwise, and where $g_j$ is the equilibrium frequency of nucleotide j
• The likelihood that some site is in state i at the kth node of a tree is $L_i^{(k)}$
• The likelihoods for all states for each site for each node are calculated separately; the product of the likelihoods for each site gives the overall likelihood for the observed data
• Different tree topologies are searched to find the highest overall likelihood
• Maximum likelihood is often regarded as the “gold standard” for phylogenetic analysis, but because it is computationally intensive it can be used only for selected data, and only after much initial fine-tuning of the many parameters of the sequence alignments
• Often used to distinguish between several already generated trees
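The transition probability in the model above, and the rule that the overall likelihood is the product of the per-site likelihoods, can be illustrated with the simplest possible "tree": two sequences separated by time t. This is a sketch under that simplifying assumption, not a full tree likelihood; `mu` stands for the rate parameter, `freqs` for the equilibrium frequencies, and equal frequencies of 0.25 are assumed.

```python
import math


def transition_probability(i, j, t, mu, freqs):
    """P(i -> j within time t) = delta_ij * e^(-mu t) + (1 - e^(-mu t)) * g_j."""
    decay = math.exp(-mu * t)
    delta = 1.0 if i == j else 0.0
    return delta * decay + (1.0 - decay) * freqs[j]


def pair_log_likelihood(seq_a, seq_b, t, mu, freqs):
    """Log-likelihood of two aligned sequences separated by time t (t > 0).

    Per-site likelihoods are multiplied, so their logarithms are summed.
    """
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        total += math.log(freqs[a] * transition_probability(a, b, t, mu, freqs))
    return total


freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed equal frequencies
row_sum = sum(transition_probability("A", j, 2.0, 0.1, freqs) for j in "ACGT")
print(row_sum)
print(pair_log_likelihood("ACTGTTACCGA", "ACTGTTACCGA", 0.01, 0.1, freqs))
```

Each row of transition probabilities sums to 1, and identical sequences are more likely under a short separation time than a long one. A real analysis extends the same product over every node of every candidate topology.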
Bayesian (B) Phylogeny Estimation
• Searches for best trees consistent with both model and data
• Incorporates prior knowledge (prior probability)
• B maximises probability of tree given data and model
• Searches for best set of trees
Comparison of methods
How much information are they using?
• MP, ML, B use actual DNA whereas NJ summarises information into distance matrix
• BUT, not all sites are used by MP (“informative” sites only)
How can the nature of the data affect the methods?
• NJ better for recent divergences
• MP works well for a high number of informative sites
How do they cope with lots of sequences?
• MP requires comparison of all possible trees
– Not possible for large number of taxa
• ML is computationally intensive and very slow for large number of taxa
• NJ efficient for large number of taxa
Anything else?
• ML requires explicit assumptions about rate and pattern of substitution (model)
– ML may perform poorly if model is incorrect
• ML or B may get stuck on local maxima
Outgroup rooting of unrooted trees
• Outgroup – related sequence that definitely diverged earlier (paleontological evidence)
[Figure: unrooted tree of human, mouse and rat, and the same tree rooted with chicken as the outgroup]
Rate (r) of evolution
• K = number of substitutions per site between the two sequences
• T = time since divergence
• r = K/(2T), because substitutions accumulate along both lineages over time T
• Rate is expressed as substitutions per site per year
[Figure: Species A and Species B joined at their common ancestor, T years before the present]
Estimating species divergence times
• fossil evidence shows that T1 = 310 mya
• What is T2 ?
• Only need to have sequences and information on one divergence time
[Figure: tree of Rat (A), Human (B) and Chicken (C); T2 marks the rat-human divergence and T1 the earlier divergence of chicken]
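A worked sketch of the calculation: use the known T1 to calibrate a rate with r = K/(2T), then apply that rate to the rat-human distance to estimate T2. Only T1 = 310 mya comes from the slide. The K values below are invented purely to show the arithmetic, and averaging the two mammal-chicken distances is an assumption of this sketch (it presumes a constant rate in all lineages).

```python
def substitution_rate(k, t_years):
    """r = K / (2T): substitutions per site per year."""
    return k / (2.0 * t_years)


def divergence_time(k, rate):
    """T = K / (2r): rearrangement of the rate formula."""
    return k / (2.0 * rate)


T1 = 310e6   # years, from fossil evidence (slide)

# Hypothetical substitutions per site; NOT real data.
K_AC = 0.60  # rat vs chicken
K_BC = 0.64  # human vs chicken
K_AB = 0.16  # rat vs human

rate = substitution_rate((K_AC + K_BC) / 2.0, T1)
T2 = divergence_time(K_AB, rate)
print(rate, T2)
```

With these made-up values the rate is about one substitution per site per billion years and T2 comes out at about 80 million years; the numbers only illustrate the method, which needs sequences plus one known divergence time.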
True tree and inferred tree
• There is only one true tree of species relationships
• Inferred tree may not be correct
1. Some genes may not be representative
2. Tree inference method may have produced an incorrect tree
– e.g. parsimony method: may get several equally parsimonious results
How credible is the tree?
• The tree is a hypothesis of the true relationship
• Need some measure of the support for that hypothesis
• Note: Bayesian methods simultaneously estimate tree and measures of uncertainty for each branch
Standard Error of branches
[Figure: tree of Human, Chimp, Gorilla and Orangutan]
Bootstrap
• The bootstrap: randomly sample all positions (columns in an alignment) with replacement, meaning some columns can be repeated, but conserving the number of positions; build a large dataset of these randomized samples
• Then use your method (distance, parsimony, likelihood) to generate another tree
• Do this a thousand or so times
• Note that if the assumptions the method is based on hold, you should always get the same tree from the bootstrapped alignments as you did originally
• The frequency of some feature of your phylogeny in the bootstrapped set gives some measure of the confidence you can have for this feature
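The resampling step can be sketched as below. Columns are drawn with replacement while the alignment length is kept fixed, and the support for a feature is the fraction of replicates whose tree shows it. `build_tree` and `has_feature` are hypothetical placeholders for whichever tree method and feature test are in use; nothing here is tied to a particular phylogenetics package.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[p] for p in picks) for seq in alignment]


def bootstrap_support(alignment, build_tree, has_feature, n_replicates=1000, seed=0):
    """Fraction of bootstrap replicates whose tree contains the feature.

    build_tree:  callable taking an alignment and returning a tree (placeholder).
    has_feature: callable taking a tree and returning True or False (placeholder).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        if has_feature(build_tree(replicate)):
            hits += 1
    return hits / n_replicates


alignment = ["AAGAGTTCA", "AGCCGTTCT", "AGATATCCA", "AGAGATCCT"]
replicate = bootstrap_alignment(alignment, random.Random(1))
print(replicate)
```

Whole columns are sampled, so each replicate keeps the sequences aligned with one another; only the mix of sites changes from one replicate to the next.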
Applications of phylogenetics
• Detection of orthology and paralogy
• Estimation of divergence times
• Reconstruction of ancient proteins
• Identifying residues important to selection
• Detecting recombination points
• Identifying mutations likely to be associated with disease
• Determining the identity of new pathogens
The time will come, I believe, though I shall not live to see it, when we shall have fairly true genealogical trees of each great kingdom of Nature.
Charles Darwin
The Tree of Life
• Traditional classification of life into five kingdoms
– Bacteria (inc. cyanobacteria)
– Protista (inc. ciliates, flagellates, amoebae)
– Fungi
– Plantae
– Animalia
Archaebacteria
• Carl Woese and colleagues
• Study relationships by comparing rRNAs
• Methanogens were expected to group with other bacteria
• BUT, found to be equally distant from bacteria and eukaryotes
• Made new taxon - Archaebacteria
• Includes many extremophiles
– thermophiles
– hyperthermophiles
– halophiles (salt dependent)
Where is the root of the Tree of Life?
• No possible outgroup (by definition)
• Iwabe et al. (1989)
• Examined phylogenetic tree of pairs of genes that exist in all organisms
– derived from gene duplication that predates lineage divergences
[Figure: Gene A duplicates into Gene A1 and Gene A2 before lineages 1, 2 and 3 diverge, so each gene copy yields its own tree of the three lineages]
• Homologous elongation factor genes EF-Tu and EF-G present in all prokaryotes and eukaryotes
• Both genes show the same topology
[Figure: EF-Tu and EF-G trees, each showing Archaea, Eucarya and Bacteria]
Changing view of The Tree of Life … (Gaucher et al., 2010)
• based on morphological characteristics (Chatton, 1925)
• based on DNA sequence analysis (Woese & Fox, 1977)
• based on ancient gene duplication
• based on phylogenies of hundreds of genes
• based on membrane architecture & gene indels
Most modern view …
Phylogeny of humans and apes
• Darwin – Gorilla and Chimpanzee our closest relatives and human evolutionary origins in Africa
• Many people preferred anthropocentric idea that humans were special
[Figure: traditional view of the tree of Human, Chimp, Gorilla, Orangutan and Gibbon]
So what is the evidence?
• Serological precipitation (Goodman 1962) – human, gorilla and chimpanzee (H, G, C) constitute a natural clade, with orangutans and gibbons diverging earlier
• However, the relationships among H, G and C remained unclear
• Most DNA sequence data support ((H,C),G)
• Some genes show different relationship
[Figure: tree of Human, Chimp, Gorilla, Orangutan and Gibbon]
Conservation biology – the dusky seaside sparrow
• Last one died June 1987 (DisneyWorld)
• Discovered 1872
• Ammodramus maritimus nigrescens
• Geographically confined to small salt marsh in Florida
• 2000 individuals in 1900
• 6 individuals (all male) in 1980
• Conservation program
– artificial breeding
Conservation genetics
• Mating of remaining males with females from closest subspecies available
• Female hybrids of first generation then “back-crossed” to original males
• Continue as long as original males live
• Which subspecies to choose to take the females from?
• 8 other A. maritimus subspecies
• Geographically dispersed along coast
• Artificial breeding with Scott’s seaside sparrow (A. m. peninsulae)
• Chosen based on morphological and behavioural similarities
• Was this the best choice?
[Figure: nigrescens and peninsulae placed among the Atlantic Coast and Gulf Coast subspecies]
Whoops!
• Two subspecies diverged about 250,000 – 500,000 years ago
• A. m. nigrescens almost indistinguishable molecularly from other Atlantic Coast subspecies
• Any Atlantic Coast subspecies would have been a better choice
• Created a new species instead of saving old
• Dusky seaside sparrow officially declared extinct in 1990
Origin of angiosperms
• Flowering plants: carpel-enclosed ovules and seed
• Fossils
– began to radiate mid-Cretaceous (~115 mya)
– Dominant land plants 90 mya
• 275,000 species described
• Probably arose from gymnosperm-like ancestor up to 370-380 mya
• Gymnosperm = “naked seed” (e.g. conifers)
• Long time span of possible origin
• Why no fossils?
– Didn’t exist prior to Cretaceous?
– Lived in habitats not conducive to fossilisation?
Monocot and Dicot divergence
• Monocotyledons
• Dicotyledons
• Two major classes of angiosperm
• Date of their divergence gives minimum estimate for age of angiosperms
• Phylogenetic analysis of DNA sequences
Monocot – Dicot divergence
• Initial estimate of 300-320 mya (Martin et al. 1989)
– Glyceraldehyde-3-phosphate dehydrogenase from plants, animals and fungi
• Implied origin close (within 100 myr) to the time of origin of earliest land plants
– seems too ancient
– implies all vascular plants arose within 100 myr
• Alternative study (Wolfe et al., 1989)
• Calibrated molecular clock with maize-wheat divergence (50-70 mya)
• Monocot-dicot divergence estimated as 200 mya
• Existed long before prominence in paleoflora
Cetaceans
• Link to ungulates (hoofed mammals) suggested by comparative anatomy
• Early protein and mtDNA phylogenetic studies indicated that Cetaceans are closely related to Artiodactyls
[Figure: tree of the artiodactyls: cow, deer, hippo, pig, peccary and camel]
• Graur and Higgins (1994)
• Protein and DNA sequence from several cetaceans and from three suborders of artiodactyls
• Showed cetaceans are within artiodactyls
• Confirmed by analysis of distribution of SINE elements
Cetartiodactyls