What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
(1) Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
(2) Analyzing change in traits (phenotypes, genes): using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest
Germline and somatic evolution included!
Uses of Phylogenetics in the Study of Health & Disease
(1) Evolutionary history of humans, between and within species
(2) Analysis of evolution of phenotypic and genetic traits in humans, especially human-specific traits - evolved when, where, why, how
(3) Evolution of parasites and pathogens, in relation to their hosts (us)
(4) Evolution of cancer cell lineages, and somatic evolution more generally.
(5) Study of adaptation in humans and other taxa
What you will learn in this lecture
(1) About phylogenies, terminology, what they are, how they work, ‘tree thinking’
(2) How to infer phylogenies
(3) How we can use phylogenies to answer questions related to human adaptation, health and disease
[Figure: a tree of taxa A, B, C, D, and E with its parts labeled as follows.]
Ancestral Node or ROOT of the Tree
Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa)
Branches or Lineages
Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Common Phylogenetic Tree Terminology
Phylogenetic trees diagram the evolutionary relationships between the taxa
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
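The nested-parentheses notation can be read mechanically. The following is a minimal sketch in Python, assuming a simplified format (plain taxon names, no branch lengths); the helper names `parse_nested`, `leaves`, and `clades` are illustrative and do not come from any library.

```python
def parse_nested(text):
    """Parse a nested-parentheses tree such as '((A,(B,C)),(D,E))'.

    Taxa are returned as strings and internal nodes as tuples of children.
    Simplified on purpose: no branch lengths, no quoted names, no spaces.
    """
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def leaves(tree):
    """Return the set of taxa found below a node."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """List the taxa grouped by every internal node, starting at the root."""
    if isinstance(tree, str):
        return []
    found = [sorted(leaves(tree))]
    for child in tree:
        found.extend(clades(child))
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(clade)
```

Each printed list is one clade: all five taxa at the root, then A+B+C, then B+C, then D+E.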
[Figure: the same phylogeny drawn as a tree, with Taxon A, Taxon B, Taxon C, Taxon E, and Taxon D listed from top to bottom.]
Top-to-bottom dimension: No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
Other dimension: This dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
Both the tree and the nested parentheses say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
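Three types of trees
[Figure: the same tree of Taxon A, Taxon B, Taxon C, and Taxon D drawn three ways.]
Cladogram: branch lengths have no meaning.
Phylogram: branch lengths proportional to genetic change (branches labeled 11, 1, 6, 3, and 5).
Ultrametric tree: branch lengths proportional to time.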
All show the same evolutionary relationships, or branching orders, between the taxa.
RESOLUTION AND SUPPORT for nodes
A major goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees:
[Figure: three trees of taxa A, B, C, D, and E, from least to most resolved.]
Completely unresolved or "star" phylogeny
Partially resolved phylogeny
Fully resolved, bifurcating phylogeny
Node labels in the figure: polytomy or multifurcation (a node with more than two descendant branches); a bifurcation (a node with exactly two).
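There are three possible unrooted trees for four taxa (A, B, C, D)
[Figure: the three unrooted trees.]
Tree 1: A and B on one side of the internal branch, C and D on the other
Tree 2: A and C on one side of the internal branch, B and D on the other
Tree 3: A and D on one side of the internal branch, B and C on the other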
Phylogenetic tree building (or inference) methods are aimed at discovering which of the possible unrooted trees is "correct". We would like this to be the “true” biological tree, that is, one that accurately represents the evolutionary history of the taxa. However, we must settle for discovering the computationally correct or optimal tree for the phylogenetic method of choice.
The number of unrooted trees increases in a greater than exponential manner with number of taxa
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
. . . .
30            ≈ 8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa
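The double factorial (2N - 5)!! is the product of the odd numbers 1 x 3 x 5 x ... x (2N - 5). A short Python sketch for checking the counts follows; the function name `count_unrooted_trees` is illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of fully resolved unrooted trees for N taxa: (2N - 5)!!, N >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2N - 5 (the leading factor 1 is implicit).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, count_unrooted_trees(n))
```

Python integers do not overflow, so the 30-taxon count is computed exactly rather than approximated.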
[Figure: example unrooted trees for three taxa (A to C), four taxa (A to D), five taxa (A to E), and six taxa (A to F).]
Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root:
[Figure: an unrooted tree of taxa A, B, C, and D with the root position marked, next to the resulting rooted tree drawn along a TIME axis.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree with the root marked at a different position, next to the resulting rooted tree drawn along a TIME axis.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees
[Figure: the unrooted tree 1 (A and B on one side of the internal branch, C and D on the other) with its five branches numbered 1 to 5, and the rooted trees 1a, 1b, 1c, 1d, and 1e obtained by placing the root on branches 1, 2, 3, 4, and 5, respectively.]
These trees show five different evolutionary relationships among the taxa!
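The five rootings can be enumerated by placing the root on each branch in turn. The sketch below is a minimal Python illustration, assuming unrooted tree 1 joins A and B on one side of the internal branch and C and D on the other; the internal node labels `u` and `v` and the helper names are hypothetical, and the order of the output is not meant to match the slide's numbering 1a to 1e.

```python
# Unrooted tree 1 as an adjacency list. 'u' and 'v' are hypothetical labels
# for the two internal nodes; A and B attach to u, C and D attach to v.
NEIGHBORS = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"],
    "v": ["C", "D", "u"],
}
# The five branches of the unrooted tree.
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]


def subtree(node, parent):
    """Nested-parentheses string for everything reached from node, away from parent."""
    children = [n for n in NEIGHBORS[node] if n != parent]
    if not children:
        return node  # a terminal node (taxon)
    parts = sorted(subtree(child, node) for child in children)
    return "(" + ",".join(parts) + ")"


def root_on(edge):
    """Place the root in the middle of the given branch."""
    left, right = edge
    parts = sorted([subtree(left, right), subtree(right, left)])
    return "(" + ",".join(parts) + ")"


rooted = [root_on(edge) for edge in EDGES]
for edge, tree in zip(EDGES, rooted):
    print(edge, tree)
```

Sorting the children at every node means that two drawings of the same rooted tree always produce the same string, so the five distinct strings correspond to five distinct sets of relationships.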
All of these rearrangements show the same evolutionary relationships between the taxa
[Figure: rooted tree 1a (taxa A, B, C, and D) redrawn several times with the taxa in different top-to-bottom orders.]
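One way to check that two drawings are the same tree is to put both into a canonical form in which the order of the children at each node no longer matters. A small Python sketch follows; the three example trees are hypothetical and are not meant to reproduce the slide's figure.

```python
def canonical(tree):
    """Sort the children of every node so that rotating branches changes nothing."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Hypothetical examples, written as nested tuples.
drawing_1 = ("B", ("A", ("C", "D")))
drawing_2 = ((("D", "C"), "A"), "B")    # drawing_1 with its nodes rotated
other_tree = ("A", ("B", ("C", "D")))   # a genuinely different rooted tree

print(canonical(drawing_1) == canonical(drawing_2))   # same relationships
print(canonical(drawing_1) == canonical(other_tree))  # different relationships
```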
Main way to root trees:
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa.
Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows:

                  COMPUTATIONAL METHOD
DATA TYPE         Clustering algorithm    Optimality criterion
Characters                                PARSIMONY
                                          MAXIMUM LIKELIHOOD
Distances         UPGMA                   MINIMUM EVOLUTION
                  NEIGHBOR-JOINING        LEAST SQUARES
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATCGCTAGTCCTATAGTGCA
Species B    ATCGCTAGTCCTATATTGCA
Species C    TTCGCTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTGTGGTCCG
etc.         etc.
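Distance-based methods start from the same alignment but first reduce it to one number per pair of taxa. The sketch below is a minimal Python illustration, assuming the simplest possible distance (an uncorrected count of differing sites); real analyses usually apply a substitution model, and the names used here are illustrative.

```python
from itertools import combinations

# Aligned sequences copied from the character table above.
ALIGNMENT = {
    "Species A": "ATCGCTAGTCCTATAGTGCA",
    "Species B": "ATCGCTAGTCCTATATTGCA",
    "Species C": "TTCGCTAGACCTGTGGTCCA",
    "Species D": "TTGACCAGACCTGTGGTCCG",
    "Species E": "TTGACCAGTTCTGTGGTCCG",
}


def count_differences(seq_1, seq_2):
    """Count the aligned sites at which two sequences differ."""
    if len(seq_1) != len(seq_2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_1, seq_2))


# One distance per unordered pair of taxa.
distances = {
    (taxon_1, taxon_2): count_differences(ALIGNMENT[taxon_1], ALIGNMENT[taxon_2])
    for taxon_1, taxon_2 in combinations(ALIGNMENT, 2)
}

for pair, d in distances.items():
    print(pair, d)
```

Character-based methods keep the full columns of the alignment; the distance table discards which sites differ and keeps only how many.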
Similarity vs. Evolutionary Relationship:
Similarity and relationship are not the same thing, even thoughevolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (an historical fact)
Two taxa can be most similar without being most closely related:
[Figure: phylogram of Taxon A, Taxon B (e.g., HUMANS!), Taxon C, and Taxon D, with branches labeled 11, 1, 6, 3, and 5.]
C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).
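The distinction can be made concrete by separating two questions: how much change lies on the path between two taxa, and which ancestor they share. The Python sketch below uses hypothetical branch lengths chosen only to reproduce d = 3 and d = 7; they are not read off the slide's figure, and the node names `X` and `root` are also assumptions.

```python
# child -> (parent, branch length). All lengths are hypothetical, chosen so
# that d(C, A) = 3 and d(C, B) = 7. 'X' is the common ancestor of B and C.
PARENT = {
    "A": ("root", 1),
    "X": ("root", 1),
    "B": ("X", 6),   # long branch: much change on the lineage leading to B
    "C": ("X", 1),
}


def path_to_root(node):
    """Map node and each of its ancestors to their distance from node."""
    dist = {node: 0}
    total = 0
    while node in PARENT:
        node, length = PARENT[node]
        total += length
        dist[node] = total
    return dist


def mrca(taxon_1, taxon_2):
    """Most recent common ancestor: the shared ancestor with the shortest joint path."""
    up_1, up_2 = path_to_root(taxon_1), path_to_root(taxon_2)
    shared = [n for n in up_1 if n in up_2]
    return min(shared, key=lambda n: up_1[n] + up_2[n])


def tree_distance(taxon_1, taxon_2):
    """Total branch length on the path between two taxa."""
    ancestor = mrca(taxon_1, taxon_2)
    return path_to_root(taxon_1)[ancestor] + path_to_root(taxon_2)[ancestor]


print("d(C, A) =", tree_distance("C", "A"), "via", mrca("C", "A"))
print("d(C, B) =", tree_distance("C", "B"), "via", mrca("C", "B"))
```

In this sketch C is closer to A in amount of change, yet C and B meet at the more recent ancestor X, because most of the change between them accumulated on the long branch leading to B.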
Main computational approach:
Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist.
Warning: Finding an optimal tree is not necessarily the same as finding the “true” tree. Random data will give you an ‘optimal’ (best) tree!
Parsimony methods:
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Not explicitly statistical.
• Can be fooled by high levels of parallel evolution.
Use parsimony to infer the optimal (best) tree
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.

Taxa         Characters
Species A    ATCG CTAGACCTATAGTGCA
Species B    ATCG CTAGACCTATATTGCA
Species C    TTCG CTAGACCTGTGGTCCA
Species D    TTGA CCAGACCTGTGGTCCG
Species E    TTGA CCAGTTGTGTGGTCCG
OUTGROUP     TTAC CCATTTGTGTCCTCCG
Infer maximum parsimony tree using first four characters
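The counting behind this exercise can be sketched in Python. The slide does not name an algorithm; the sketch below uses Fitch's counting method for a fixed tree, and the two candidate trees (`tree_1`, `tree_2`) are hypothetical choices for comparison rather than the slide's answer.

```python
# First four aligned characters for each taxon, copied from the table above.
CHARACTERS = {
    "A": "ATCG", "B": "ATCG", "C": "TTCG",
    "D": "TTGA", "E": "TTGA", "OUTGROUP": "TTAC",
}


def fitch_site(tree, site):
    """Return (candidate ancestral states, minimum changes) for one character.

    The tree is a nested tuple in which every internal node has two children.
    """
    if isinstance(tree, str):
        return {CHARACTERS[tree][site]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, site)
    right_states, right_changes = fitch_site(right, site)
    changes = left_changes + right_changes
    shared = left_states & right_states
    if shared:
        return shared, changes  # children agree: no extra change needed
    return left_states | right_states, changes + 1  # disagreement costs one change


def parsimony_score(tree, n_sites=4):
    """Total number of changes the tree requires across the first n_sites characters."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


# Two hypothetical candidate trees, rooted with the outgroup.
tree_1 = ("OUTGROUP", (("D", "E"), ("C", ("A", "B"))))
tree_2 = ("OUTGROUP", (("A", "D"), ("B", ("C", "E"))))

print("tree_1 requires", parsimony_score(tree_1), "changes")
print("tree_2 requires", parsimony_score(tree_2), "changes")
```

Under the parsimony criterion the tree with the lower score is preferred; a full search would score every possible tree rather than just two.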
Quality of trees (how likely it is that they reflect the one True Tree) can be evaluated in various ways (random data will give you a low-quality ‘best’ tree)
We can statistically compare alternative trees, corresponding to specific biological hypotheses of the history of some set of lineages
Time scales on trees: molecular clocks
[Figure: % genetic divergence (y-axis, 25% to 100%) plotted against time since divergence (x-axis, 300 to 1500 Myr) for four proteins: Fibrinopeptides, Hemoglobin, Cytochrome c, and Histone IV.]
Why such different profiles? Variation in mutation rate?
Variation in selection. Genes coding for some molecules are under very strong stabilizing selection.
Dates for calibrating molecular clocks can come from geology, fossils, or historical data
[Figure: calibration from known ages of islands, for two genes.]
Calibrating using fossil data
[Figure: tree of chimps, humans, whales, and hippos, labeled with 56 mya, 60 substitutions, and 6 substitutions.]
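The arithmetic of a fossil calibration can be sketched in Python. This assumes one reading of the figure's labels: whales and hippos diverged 56 mya and differ by 60 substitutions, while chimps and humans differ by 6 substitutions in the same gene. It also assumes a strict clock (a constant rate on all lineages), and the function name is illustrative.

```python
def divergence_time(substitutions, calibration_substitutions, calibration_age_mya):
    """Date a split under a strict molecular clock, using one calibrated split.

    The rate is expressed as substitutions separating two lineages per million
    years, so it can be applied directly to another pairwise difference.
    """
    rate = calibration_substitutions / calibration_age_mya
    return substitutions / rate


# Assumed reading of the figure: whale-hippo split = 56 mya, 60 substitutions;
# chimp-human difference = 6 substitutions.
estimate = divergence_time(6, calibration_substitutions=60, calibration_age_mya=56)
print(round(estimate, 1), "million years ago")
```

With one tenth of the substitutions, the chimp-human split is dated at one tenth of the calibration age under these assumptions.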
Calibrating from known dates of the ages of samples: for very fast-evolving taxa such as HIV
Uses of Phylogenetics in the Study of Health & Disease
(1) Evolutionary history of humans, between and within species
(2) Analysis of evolution of phenotypic and genetic traits in humans, especially human-specific traits - evolved when, where, why, how
(3) Taxonomy and evolution of parasites and pathogens, and evolution in relation to their hosts
(4) Evolution of cancer cell lineages, and somatic evolution more generally.
(5) Study of adaptation in humans and other taxa, via analysis of divergence and convergence
VIRUS - what IS it? Sequence its DNA and relate the sequence to known viruses
Evolution of SIV and HIV viruses: multiple transfers to humans, from chimps and from green monkeys
EMERGING VIRUSES - THE GREATEST KNOWN HEALTH THREAT TO HUMANITY
SARS (severe acute respiratory syndrome): what causes it and where did it come from?
HIV phylogeny within humans in different regions: Haiti as stepping stone to North America
HIV evolves very rapidly WITHIN hosts, as a result of interactions with the immune system
Can do phylogenetics of:
• Pathogens within individuals
• Pathogens between individuals (e.g., in different or same regions)
How do they originate? From other species? How do they spread? How does resistance to antibiotics evolve in pathogens, and resistance to chemotherapeutic agents evolve in cancer?
Cancer evolves genetically in the body during carcinogenesis, allowing the inference of ‘oncogenetic trees’
Cytogenetic data: gains and losses of chromosomal regions during evolution of cancers; lose tumor suppressor gene copies, gain oncogene copies
Involves losses of heterozygosity and losses of imprinting
Cancer Evolutionary Phylogenomics
Compare primary cancer with metastatic tumors
What you learned in this lecture
(1) About phylogenies, terminology, what they are, how they work, ‘tree thinking’
(2) How to infer and evaluate phylogenies
(3) How to use phylogenies to answer questions related to human adaptation, health and disease (viruses, cancer, etc.)
(4) How to THINK in terms of evolutionary trees (historical patterns of evolution), within and between species