biology 4900 biocomputing. chapter 7 molecular phylogeny and evolution
Post on 02-Jan-2016
Biology 4900
Biocomputing
Chapter 7
Molecular Phylogeny and Evolution
Outline
• Introduction to evolution and phylogeny
• Nomenclature of relationship trees
• Five stages of molecular phylogeny:
1. Selecting sequences
2. Multiple sequence alignment
3. Models of substitution
4. Tree-building
5. Tree evaluation
Historical background: Evolution
• Evolution: Theory that groups of organisms change over time; descendants are structurally and functionally different from their ancestors.
• Darwin (1859) proposed the role of natural selection in hereditary change.
– Heredity is generally conservative: changes in offspring tend to be small.
• Prior to the 1950s, phylogeny (the evolutionary history of organisms) was studied by comparing living species with fossils.
– Modern techniques of molecular biology not available!
Pevsner, Bioinformatics and Functional Genomics, 2009; http://www.ucsd.tv/evolutionmatters/lesson2/study.shtml
Current View of the Tree of Life
http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734; http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg
4 main mechanisms of evolutionary change
• Natural Selection
– One or more differences in an organism that increase its likelihood of surviving in an environment long enough to reproduce and pass traits to offspring.
– Ex. Green beetles are easier for predators to spot than brown beetles.
• Mutation
– Offspring may acquire small mutations in the genetic material inherited from their 2 parents.
– Ex. Green beetle parents produce a brown beetle offspring.
• Genetic drift
– Change in the frequency of a gene variant in a population due to random sampling.
– Ex. A family of green beetles in a mixed population is killed by insecticide, leaving a higher % of brown beetles for reproduction.
• Migration
– Transfer of genes from one population to another.
– Ex. Several brown beetles move into an area full of green beetles and reproduce with the green beetles.
Pevsner, Bioinformatics and Functional Genomics, 2009
Nucleotide Substitutions
• Nonsynonymous substitution: Nucleotide mutation that alters the amino acid sequence of a protein.
– These result in a biological change in the organism, so they are subject to natural selection.
• Synonymous substitution: Nucleotide mutation that does not alter the amino acid sequence.
– i.e., the original and mutated codons encode the same AA.
Natural Selection
• Positive natural selection: Force that drives the increase in prevalence of advantageous traits.
– If a change leads to improvement, keep it
• Negative natural selection: The selective removal of harmful alleles.
– “If it ain’t broke, don’t fix it”
– Helps to explain why detrimental mutations are not passed along hereditary lines
History of Molecular Evolution
• Studies of molecular evolution began with the first sequencing of proteins in the 1950s (following the discovery of the structure of DNA in 1953 by Watson and Crick).
• Also in 1953, Frederick Sanger and colleagues determined the primary amino acid sequence of insulin (NP_000198)
– Most amino acid changes in the insulin A chain were restricted to a disulfide loop region. These were non-random, but neutral changes.
– Later studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about 6- to 10-fold higher in the C peptide (part of proinsulin, the insulin precursor) relative to the A and B chains.
Pevsner, Bioinformatics and Functional Genomics, 2009; http://en.wikipedia.org/wiki/Frederick_Sanger
[Fig. 7.3, page 220: Note the sequence divergence in the disulfide loop region of the A chain. Rates annotated in the figure, in nucleotide substitutions per site per year: 0.1 × 10^-9 and 1 × 10^-9.]
Surprisingly, insulin from the guinea pig (and from the related coypu) evolves 7 times faster than insulin from other species. Why?
The answer: Guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. The structural constraints on these molecules were relaxed, and so the genes diverged rapidly.
Page 219
Sus scrofa insulin (1ZNI.pdb)
Historical background: insulin
Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change
Arrows indicate positions at which guinea pig insulin (A and B chains) differs from human and mouse
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 220
A chain
B chain
Historical Background: Vasopressin and Oxytocin
• Peptides vasopressin and oxytocin were also sequenced in the 1950s.
• Sequences differ by only 2 residues, but functions differ significantly
• Vasopressin
– Reduces the volume of urine formed in the body to retain water
• Oxytocin
– Causes contractions in smooth muscles of the uterus during labor
Molecular clock hypothesis
• In the 1960s, sequence data accumulated for small, abundant proteins (e.g., globins, cytochromes c, fibrinopeptides).
• Some proteins appeared to evolve slowly, while others evolved rapidly.
• Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:
• For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
[Figure: Protein X in Species A, B, C, and D, accumulating mutations over time]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 221
Molecular clock hypothesis
• Richard Dickerson (1971) plotted sequence data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides.
• X-axis shows the divergence times of the species, estimated from paleontological data.
• Y-axis shows m, the corrected number of amino acid changes per 100 residues.
• n is observed number of amino acid changes per 100 residues, and it is corrected to m to account for changes that occur but are not observed.
Dickerson, 1972
[Figure axes: x, millions of years since divergence; y, corrected amino acid changes per 100 residues (m)]
Molecular clock hypothesis
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Molecular Clock: Calculating rates of nucleotide substitutions
• Rate of nucleotide substitution (r) = number of nucleotide substitutions that occur per site per year (same for AAs)
• Time of divergence (T) = how long ago the two sequences split from a common sequence.
• K = number of nonsynonymous substitutions per site separating the two sequences.
• α-globins from rat and human differ by 0.093 nonsynonymous substitutions per site
• Rate of change = 0.58 × 10^-9 nonsynonymous nucleotide substitutions per site per year
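The slide does not spell out how r, K, and T are related. A minimal sketch, assuming the standard relation r = K / (2T), where the factor 2 counts both lineages since the split. The divergence time of 80 million years for rat and human is an assumption chosen here because it reproduces the quoted rate; it is not stated in the transcript.

```python
def substitution_rate(k_per_site, divergence_years):
    """Rate r = K / (2T): substitutions per site per year.

    The factor 2 accounts for changes accumulating along both lineages.
    """
    return k_per_site / (2 * divergence_years)

K = 0.093   # nonsynonymous substitutions per site (rat vs. human alpha-globin)
T = 80e6    # ASSUMED divergence time in years; not given in the slides

r = substitution_rate(K, T)
print(f"r = {r:.2e} substitutions per site per year")
```

With these inputs the result agrees with the 0.58 × 10^-9 figure quoted above.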
Dickerson drew the following conclusions:
• For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein.
• The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides).
• The observed variations in rate of change reflect functional constraints imposed by natural selection.
Molecular clock hypothesis: conclusions
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 223
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Molecular clock hypothesis: problems
This hypothesis has largely been discarded for the following reasons:
• Rate of molecular evolution varies among different organisms (e.g., viral sequences change very rapidly).
• Clock varies among different genes and parts of genes, guided in part by selection (e.g., animals with shorter generation times may have faster clocks).
• Clock is only applicable when a gene retains its function over evolutionary time.
– Genes may become nonfunctional
– Rate of evolution may accelerate following gene duplication
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 2225
Neutral theory of molecular evolution
• Proposed after the Molecular Clock hypothesis to address inconsistencies
• E.g., significant amount of DNA polymorphism across all species that is difficult to account for by conventional natural selection.
• Kimura (1969, 1983) proposed this as an alternative model to Darwinian evolution at the DNA level.
– Observed that the rate of AA substitution averages ~1 change per 28 × 10^6 years for proteins of 100 residues.
– Corresponding rate of nucleotide substitution is 1 base pair in the population genome every 2 years.
– Kimura concluded that most DNA substitutions must be neutral and that the main cause of evolutionary change at the molecular level is random drift of mutant alleles.
Page 231
Molecular phylogeny: nomenclature of trees
• Molecular Phylogeny: Study of evolutionary relationships among organisms or molecules
– Determined via molecular biology
• Two main kinds of information inherent to any tree:
– topology
– branch length
[Figure: two example trees; one relates taxa A to I along a time axis, the other relates taxa A to E with numbered branch lengths and a one-unit scale bar]
Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
[Fig. 7.8, page 232: Cladogram (change in time) and Phylogram (change in magnitude)]
Tree Nomenclature
Node (intersection or terminating point of two or more branches)
Branch (edge) connecting 2 nodes
[Figure: tree of taxa A to I along a time axis, with a node and a branch labeled]
Tree Nomenclature: Nodes
Root Node
Internal Node (Taxon)
Operational taxonomic unit (OTU)
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
OTUs: Taxa that provide observable features (e.g., existing protein sequences, morphological features)
Tree Nomenclature: Cladogram vs. Phylogram
Cladogram: branches are unscaled; OTUs are neatly aligned, and nodes reflect time
Phylogram: branches are scaled; branch lengths are proportional to the number of amino acid changes
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
[Figure: cladogram with a time axis (left) and phylogram with numbered branch lengths and a one-unit scale bar (right)]
Tree nomenclature: Internal Nodes
bifurcating internal node
multifurcating internal node
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
Tree nomenclature: clades
Clade ABF (monophyletic group)
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232
Clade: Group of taxa derived from a common ancestor
* Groups may include sub-groups
Clade CDH (monophyletic group)
Clade ABF/CDH/G (paraphyletic)
Tree nomenclature: Newick Format
http://en.wikipedia.org/wiki/Newick_format
(,,(,)); no nodes are named
(A,B,(C,D)); leaf nodes are named (only the terminal values)
(A,B,(C,D)E)F; all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A; a tree rooted on a leaf node (rare)
Newick tree format is a way to represent graph-theoretical trees with edge lengths using parentheses and commas
Tree nomenclature: Newick Format
http://en.wikipedia.org/wiki/Newick_format
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
A, B, E are branches related to root F
C, D are branches related to node E
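To make the parenthesis-and-comma structure concrete, here is a minimal recursive-descent sketch that reads a Newick string into nested dictionaries. The function name `parse_newick` and the dictionary layout are illustrative choices, not part of any standard library, and the sketch assumes well-formed input without quoted labels or comments.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1              # another sibling follows
                    continue
                pos += 1                  # consume ")"
                break
        start = pos                       # optional node name
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos].strip()
        length = None                     # optional ":distance"
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()

tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;")
print(tree["name"], [child["name"] for child in tree["children"]])
```

For the string above, the root is F with children A, B, and E, and node E in turn has children C and D, matching the reading given on the slide.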
Examples
Ex. 1 Convert the following Newick format to a tree format:
(A:0.3,B:0.5,(C:0.7,D:0.9)E:0.6)F
Ex. 2 Convert the following Newick format to a tree format:
(A:0.1,(D:1.1,E:0.8,(I:0.1,J:0.2)F:0.4)B:0.2,(G:0.7,H:0.9)C:0.6)K
Ex. 1: A, B, E are branches from root F
C, D are branches from node E
Ex. 2: A, B, C are branches from root K
D, E, F are branches from node B
G, H are branches from node C
I, J are branches from node F
Examples
Ex. 1 Convert the following tree format to a Newick format:
((((A:0.2, B:0.3)F:0.3,(C:0.5, D:0.3)G:0.2):0.3, E:0.7)E:1.0)H
A, B are branches from node F
C, D are branches from node G
F, G, E are branches from root H
Examples
Ex. 2 Convert the following tree format to a Newick format:
(((A:1.00,B:2.00)n1:3.00,(C:4.00,D:5.00)n2:6.00)n3:7.00,E:8.00)n4
A, B are branches from node n1
C, D are branches from node n2
n1, n2 are branches from node n3
n3, E are branches from root n4
http://evolution.berkeley.edu/evosite/evo101/IIBPhylogeniesp2.shtml
Clade: Grouping that includes a common ancestor and all the descendants (living and extinct) of that ancestor. These can be represented by phylogenetic trees.
Examples of clades
Examples of clades
Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10
Tree Roots
• The root of a phylogenetic tree represents the common ancestor of the sequences. Some trees are unrooted, and thus do not specify the common ancestor.
[Figure: a rooted tree (specifies evolutionary path, drawn from past to present) and an unrooted tree]
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233
Tree Roots
• A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other Operational Taxonomic Units).
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233
[Figure: a rooted tree (past to present) and the same taxa with an added outgroup, which is used to place the root]
Cavalli-Sforza and Edwards (1967) derived the number of possible unrooted trees (NU) for n OTUs (n ≥ 3):
NU = (2n-5)! / [2^(n-3) (n-3)!]
The number of bifurcating rooted trees (NR):
NR = (2n-3)! / [2^(n-2) (n-2)!]
For 10 OTUs (e.g. 10 DNA or protein sequences), the number of possible rooted trees is about 34 million, and the number of unrooted trees is about 2 million.
Unrooted Trees Are Faster to Calculate Than Rooted Trees
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 235
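The two counting formulas can be checked directly. The following is a short sketch using exact integer arithmetic; the function names are illustrative.

```python
from math import factorial

def n_unrooted(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def n_rooted(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 4, 5, 10):
    print(n, n_unrooted(n), n_rooted(n))
```

For n = 10 this gives 2,027,025 unrooted and 34,459,425 rooted trees, consistent with the rounded values on the slide. The rooted count for n OTUs equals the unrooted count for n + 1 OTUs, since rooting a tree amounts to attaching one extra branch.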
[1] Selection of sequences for analysis (DNA vs. Protein)
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243
Five stages of phylogenetic analysis
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 240
Stage 1: Which is more useful: DNA or protein?
DNA
• DNA lets you study synonymous and nonsynonymous changes
– Synonymous changes do not have corresponding protein changes.
– synonymous substitution rate (dS)
– nonsynonymous substitution rate (dN)
• If dS > dN, the DNA sequence is under negative (purifying) selection.
– Selective removal of harmful alleles
– Limits change in the sequence (e.g. insulin A chain)
• If dS < dN, positive selection occurs.
– Ex. A duplicated gene may evolve rapidly to assume new functions.
• Rates of transitions and transversions can be measured
Evaluation of Changes: Transitions vs. Transversions
GC and AT are normal pairings
Transitions: Interchanges of two-ring purines (A ↔ G) or of one-ring pyrimidines (C ↔ T). These involve bases of similar shape.
Transversions: Interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 241, Figure 7.16
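As an illustration of the definitions above, this sketch classifies each mismatch between two aligned DNA sequences as a transition or a transversion. The two sequences are made up for the example, and the code assumes ungapped input containing only A, C, G, and T.

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # one-ring bases

def classify(a, b):
    """Return 'match', 'transition', or 'transversion' for two bases."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"

def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    kinds = [classify(a, b) for a, b in zip(seq1.upper(), seq2.upper())]
    return kinds.count("transition"), kinds.count("transversion")

# Hypothetical aligned sequences, for illustration only
ts, tv = count_ts_tv("AGCTAGCT", "GGTTACCA")
print(ts, tv)
```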
Stage 1: Use of DNA or protein
DNA
• Some substitutions in a DNA sequence alignment can be directly observed:
Hypothetical       Human        Mouse          Observed     Type of substitution
ancestral globin   globin       globin         alignment
A                  A            A              AA
G                  G → C        G → C          CC           Parallel substitution
T                  T            T              TT
C                  C            C → G          CG           Single substitution
C                  C            C → T → A      CA           Sequential substitution
T                  T            T              TT
G                  G            G              GG
T                  T → A        T → C          AC           Coincidental substitution
T                  T → G        T              GT
C                  C → G        C → T → G      GG           Convergent substitution
A                  A            A              AA
G                  G → T → G    G              GG           Back substitution
Stage 1: Use of DNA or protein
In this example, there are 2 observed amino acid mismatches and 5 observed nucleotide mismatches.
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243
Stage 1: Use of DNA or protein
Proteins
• Proteins are frequently more informative than DNA in pairwise alignment and in BLAST searching.
• Proteins have 20 states (amino acids) instead of only 4 for DNA, so there is a stronger phylogenetic signal (we can observe changes over a longer period).
• Nucleotides are unordered characters: any one nucleotide can change to any other in one step.
• Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value (i.e., a change may require more than 1 nucleotide change).
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Stage 2: Multiple sequence alignment
The fundamental basis of a phylogenetic tree is the MSA.
1. Confirm that all sequences are homologous
– Trees can still be generated even with misalignment, or if a non-homologous sequence is included in the alignment.
2. Adjust gap creation and extension penalties as needed to optimize the alignment
3. Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps; this is similar to masking).
-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Pevsner, Bioinformatics and Functional Genomics, 2009
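Step 3 can be sketched in a few lines: keep only the alignment columns in which no sequence has a gap. The helper name `drop_gap_columns` is illustrative, and treating "-" as the only gap symbol is an assumption.

```python
def drop_gap_columns(alignment, gap="-"):
    """Return the alignment restricted to columns with no gap in any row."""
    n_cols = len(alignment[0])
    keep = [i for i in range(n_cols)
            if all(seq[i] != gap for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]

# The two aligned sequences shown on the slide
alignment = [
    "-----GADEG-YFGPVILAADGEVA---",
    "dlgnvGA-EGDYFGPAI--AEGEVArpl",
]
trimmed = drop_gap_columns(alignment)
for row in trimmed:
    print(row)
```

Of the 28 original columns, 16 survive, and both trimmed sequences have data at every remaining position.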
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Stage 3: Selection of substitution model
• There are different ways to mathematically model substitutions that occur between sequences.
• The simplest approach to measuring distances between sequences is to align pairs of sequences and count the number of differences.
• The degree of divergence is called the Hamming distance.
• For an alignment of length N with n sites at which there are differences, the degree of divergence D is:
D = (n / N) × 100
• This accounts only for observed differences, which do not equal genetic distance!
– Genetic distance involves mutations that are not observed directly.
Pevsner, Bioinformatics and Functional Genomics, 2009
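The degree of divergence D = (n / N) × 100 is simple to compute. A minimal sketch follows; the two input strings are short illustrative sequences and are assumed to be already aligned, ungapped, and of equal length.

```python
def percent_divergence(seq1, seq2):
    """Hamming-based divergence D = (n / N) * 100 for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = sum(a != b for a, b in zip(seq1, seq2))   # sites that differ
    return 100 * n / len(seq1)

# Illustrative aligned sequences (16 sites, 2 differences)
D = percent_divergence("GAEGYFGPVIADGEVA", "GAEGYFGPAIAEGEVA")
print(D)
```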
Stage 3: Selection of substitution model
The Poisson correction was proposed to account for unobserved substitutions:
d = -ln(1 - p)
• Here, d is the distance and p is the proportion of residues that differ.
• The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies while correcting for multiple substitutions at the same site.
Pevsner, Bioinformatics and Functional Genomics, 2009
Poisson correction increased the distance values.
Some bootstrap values (red numbers) also decreased.
Fundamental units of clades did not change.
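A small numerical sketch of the Poisson correction d = -ln(1 - p) shows why corrected distances are larger than observed proportions, and increasingly so as sequences diverge. The sample values of p are arbitrary.

```python
import math

def poisson_distance(p):
    """Poisson-corrected distance for a proportion p of differing residues."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1 - p)

for p in (0.05, 0.25, 0.50):
    print(f"p = {p:.2f}  d = {poisson_distance(p):.3f}")
```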
Jukes and Cantor (1969) proposed a different corrective formula.
• This model describes the probability that one nucleotide will change into another.
• Still imperfect, as it assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions).
• In practice, the transition rate is typically greater than the transversion rate.
Stage 3: Selection of substitution model
Pevsner, Bioinformatics and Functional Genomics, 2009
Stage 3: Selection of substitution model
Jukes and Cantor (1969) proposed a corrective formula, where p is the proportion of aligned residues that differ:
D = -(3/4) ln(1 - (4/3) p)
Consider an alignment where 3/60 aligned residues differ. The normalized Hamming distance is p = 3/60 = 0.05. The Jukes-Cantor correction is
D = -(3/4) ln(1 - (4/3)(0.05)) ≈ 0.052
When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial:
D = -(3/4) ln(1 - (4/3)(0.5)) ≈ 0.82
Pevsner, Bioinformatics and Functional Genomics, 2009
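The two worked cases (3/60 and 30/60 differing residues) can be reproduced with a short sketch of the standard Jukes-Cantor distance for nucleotides, D = -(3/4) ln(1 - (4/3) p). The function name is illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance; p is the proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

d_small = jukes_cantor_distance(3 / 60)    # lightly diverged alignment
d_large = jukes_cantor_distance(30 / 60)   # heavily diverged alignment
print(round(d_small, 3), round(d_large, 3))
```

The correction barely changes the distance at p = 0.05 but raises it substantially at p = 0.5, because more hidden substitutions are expected at diverged sites. The formula is undefined at p = 0.75, the proportion expected between two random nucleotide sequences.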
Five stages of phylogenetic analysis
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Stage 4: Tree-building methods
Distance-based methods
• Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score.
• Examples of distance-based algorithms are UPGMA and neighbor-joining.
Character-based methods
• Include maximum parsimony and maximum likelihood.
• Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
[Figure: five proteins, numbered 1 to 5, shown as points to be clustered]
UPGMA (unweighted pair group method using arithmetic mean) is based on clustering of sequences
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
Pevsner, Bioinformatics and Functional Genomics, 2009
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
[Figure: proteins 1 and 2 are joined at new node 6]
Tree-building methods: UPGMA
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
[Figure: proteins 4 and 5 are joined at new node 7]
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
[Figure: protein 3 is joined to an existing cluster at new node 8]
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree.
[Figure: the remaining clusters are joined at node 9, completing the tree]
Pevsner, Bioinformatics and Functional Genomics, 2009
Another View of UPGMA: Pairwise Distance Matrix
Pevsner, Bioinformatics and Functional Genomics, 2009
     A     B     C     D
B    9     --    --    --
C    8     11    --    --
D    12    15    10    --
E    15    18    13    5

     A      B      C
B    dAB    --     --
C    dAC    dBC    --
D    dAD    dBD    dCD
Another View of UPGMA: Pairwise Distance Matrix
     A    B    C    D    E
B    2
C    4    4
D    6    6    6
E    6    6    6    4
F    8    8    8    8    8
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Start with the pair having the shortest distance (A, B). The branching point is positioned at a distance of 2 / 2 = 1 substitution. We thus construct a subtree as follows:
Another View of UPGMA: Pairwise Distance Matrix
       A,B    C    D    E
C      4
D      6      6
E      6      6    4
F      8      8    8    8
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Calculate a new distance matrix
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8
Another View of UPGMA: Pairwise Distance Matrix
       A,B    C    D,E
C      4
D,E    6      6
F      8      8    8
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
Third cycle of matrix
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),(D,E) = (dist(A,B),D + dist(A,B),E) / 2 = 6
distC,(D,E) = (distCD + distCE) / 2 = 6
dist(D,E),F = (distDF + distEF) / 2 = 8
Another View of UPGMA: Pairwise Distance Matrix
       AB,C    D,E
D,E    6
F      8       8
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
4th cycle of matrix
dist(AB,C),(D,E) = (dist(AB),(DE) + distC,(DE)) / 2 = 6
dist(AB,C),F = (dist(AB),F + distC,F) / 2 = 8
Another View of UPGMA: Pairwise Distance Matrix
       ABC,DE
F      8
Example: http://www.icp.ucl.ac.be/~opperd/private/upgma.html
5th cycle of matrix
dist(ABC,DE),F = (dist(ABC),F + dist(DE),F) / 2 = 8
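The matrix cycles above can be sketched as a short program. This is an illustrative implementation, not the code behind the slides: it averages distances weighted by cluster size (the usual UPGMA rule), which gives the same numbers as the simple two-way averages shown here because this example matrix is perfectly clock-like. Ties are broken by label order, so it happens to join (A,B) with C before joining D with E; the resulting tree is the same.

```python
from itertools import combinations

def upgma(names, pair_dist):
    """Cluster by UPGMA; return (topology string, list of (cluster, height))."""
    clusters = {name: 1 for name in names}            # label -> cluster size
    d = {frozenset(k): v for k, v in pair_dist.items()}
    heights = []
    while len(clusters) > 1:
        # Closest pair of current clusters
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        height = d[frozenset((a, b))] / 2             # branching point
        for c in clusters:                            # size-weighted average
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
        heights.append((merged, height))
    return merged, heights

names = list("ABCDEF")
lower = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
         "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
pair_dist = {(row, names[j]): v
             for row, vals in lower.items() for j, v in enumerate(vals)}

tree, heights = upgma(names, pair_dist)
print(tree)
for cluster, height in heights:
    print(cluster, height)
```

The first join is (A,B) at height 1, matching the 2 / 2 = 1 branching point above, and F joins last at height 8 / 2 = 4.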
UPGMA is a simple approach for making trees.
• A UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).
Distance-based methods: UPGMA trees
Pevsner, Bioinformatics and Functional Genomics, 2009
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making trees with:
• Large number of taxa
• Lineages with varying rates of evolution
Begin by placing all the taxa in a star-like structure.
Making trees using neighbor-joining
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Pevsner, Bioinformatics and Functional Genomics, 2009
Define the distance from X to Y by
dXY = 1/2(d1Y + d2Y – d12)
Pevsner, Bioinformatics and Functional Genomics, 2009
Tree-building methods: Neighbor joining
Example of a neighbor-joining tree: phylogenetic analysis of 13 RBPs
[Figure: unrooted neighbor-joining phylogenetic tree generated by TreeView, scale bar 0.1. Leaves are PDB entries grouped into families labeled S100, Calbindin d9k, Penta-EF, Parvalbumin, Polcalcin, and Osteonectin.]
Parsimony Principle
• The parsimony principle tells us to choose the simplest scientific explanation that fits the evidence. In tree-building, this means that the best hypothesis is the one that requires the fewest evolutionary changes.
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_08
Hypothesis 1 is the most parsimonious because it requires fewer evolutionary changes. Hypothesis 2 requires formation of bony skeleton at 2 different points.
• Rather than pairwise distances between proteins, aligned columns of AA residues are evaluated.
• The main idea of character-based methods is to find the tree with the shortest branch lengths. Thus we seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree.
• Select the shortest tree (or trees).
Building trees using character-based methods
Pevsner, Bioinformatics and Functional Genomics, 2009
As an example of tree-building using maximum parsimony, consider these four taxa:
AAG
AAA
GGA
AGA
How might they have evolved from a common ancestor such as AAA?
Parsimony Principle
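[Figure: three candidate trees for the taxa AAG, AAA, GGA, and AGA, each with ancestor AAA and the number of changes marked on each branch]
Tree 1 groups (AAG, AAA) and (GGA, AGA): Cost = 3
Tree 2 groups (AAG, AGA) and (AAA, GGA): Cost = 4
Tree 3 groups (AAG, GGA) and (AAA, AGA): Cost = 4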
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
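The costs of the three candidate trees can be reproduced with a small sketch of Fitch's counting procedure, a standard way to find the minimum number of changes on a fixed tree. Trees are written as nested tuples, which is an illustrative encoding. The procedure does not fix the ancestor to AAA; it reports the minimum over all possible ancestral states, which for these trees agrees with the costs shown.

```python
def fitch_cost(tree):
    """Minimum number of changes needed on a rooted binary tree.

    Leaves are equal-length strings; internal nodes are (left, right) tuples.
    """
    def first_leaf(node):
        while not isinstance(node, str):
            node = node[0]
        return node

    def visit(node, site):
        # Returns (set of possible states at this node, changes below it)
        if isinstance(node, str):
            return {node[site]}, 0
        left_set, left_cost = visit(node[0], site)
        right_set, right_cost = visit(node[1], site)
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    n_sites = len(first_leaf(tree))
    return sum(visit(tree, site)[1] for site in range(n_sites))

trees = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
]
costs = [fitch_cost(t) for t in trees]
print(costs)
```

The first tree needs the fewest changes, so it is the most parsimonious of the three.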
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Selection of a substitution model
[4] Tree building
[5] Tree evaluation
Five stages of phylogenetic analysis
Pevsner, Bioinformatics and Functional Genomics, 2009
Bootstrapping is a common approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set?
To bootstrap, create an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percentage of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. Support above 70% is commonly considered significant.
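The resampling step alone can be sketched as follows. The four-sequence alignment is a toy example, the seed is arbitrary, and tree building on each replicate (needed to compute actual support values) is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

# Toy alignment, for illustration only
alignment = ["AAGT", "AAAT", "GGAC", "AGAC"]
rng = random.Random(0)            # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate keeps every sequence in the same row and copies whole columns, so some original columns appear more than once and others not at all. Building a tree from each replicate and counting how often a clade recurs gives its bootstrap percentage.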
Stage 5: Evaluating trees via bootstrapping
Pevsner, Bioinformatics and Functional Genomics, 2009
In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.