Phylogenetic trees as a visualization tool for evolutionary classification
Post on 19-Dec-2015
Bifurcating / Multifurcating
[Figure: tree with leaves s4, s5, s1, s3, s2]
A multifurcation = Polytomy
[Figure: tree with leaves s4, s5, s1, s3, s2]
Dichotomy
There are two types of polytomies: soft (lack of information to resolve the tree) and hard (multiple divergence in short evolutionary time).
Terminology
A branch = An edge
External node - leaf
[Figure: tree with leaves Human, Chimp, Chicken, Gorilla]
The root
Internal nodes
Monophyletic groups
[Figure: tree with leaves Human, Chimp, Chicken, Gorilla]
The Gorilla+Human+Chimp are monophyletic. A clade is a monophyletic group.
Paraphyletic = Non-monophyletic groups
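[Figure: tree with leaves Whale, Chimp, Drosophila, Zebrafish]
The Zebrafish+Whale are paraphyletic: the group leaves out Chimp, which descends from their most recent common ancestor.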
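The two definitions can be checked mechanically: a group is monophyletic exactly when it is the full leaf set of some subtree. A minimal Python sketch follows. The figures did not survive extraction, so the nested-tuple topologies below are assumptions based on the conventional relationships among these species, and the helper names are illustrative only.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set of every subtree (single leaves included)."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def is_monophyletic(tree, group):
    """A group is monophyletic if it is exactly the leaf set of a subtree."""
    return frozenset(group) in set(clades(tree))

# Assumed topologies (the original figures are not available).
apes = ("Chicken", ("Gorilla", ("Human", "Chimp")))
animals = ("Drosophila", ("Zebrafish", ("Whale", "Chimp")))

print(is_monophyletic(apes, {"Gorilla", "Human", "Chimp"}))  # True
print(is_monophyletic(animals, {"Zebrafish", "Whale"}))      # False
```

Under these assumed trees, Zebrafish+Whale fails the test because the smallest subtree containing both also contains Chimp.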
Genes: 0 = absence, 1 = presence
species  g1  g2  g3  g4  g5  g6
s1       1   0   0   1   1   0
s2       0   0   1   0   0   0
s3       1   1   0   0   0   0
s4       1   1   0   1   1   1
s5       0   0   1   1   1   0
3. Tree building
Gene number 1, Option number 2.
Leaf states (s1 s4 s3 s2 s5): 1 1 1 0 0
Internal node states (as labeled in the figure): 1, 0, 0, 1
Number of changes for gene 1 (character 1) = 1
Gene number 2, Option number 3.
Leaf states (s1 s4 s3 s2 s5): 0 1 1 0 0
Internal node states (as labeled in the figure): 0, 0, 0, 0
Number of changes for gene 2 (character 2) = 2
Gene number 3, Option number 2.
Leaf states (s1 s4 s3 s2 s5): 0 0 0 1 1
Internal node states (as labeled in the figure): 0, 1, 1, 0
Number of changes for gene 3 (character 3) = 1
Gene number 4, Option number 2.
Leaf states (s1 s4 s3 s2 s5): 1 1 0 0 1
Internal node states (as labeled in the figure): 0, 0, 0, 1
Number of changes for gene 4 (character 4) = 2
Gene number 5 is the same as Gene number 4
Number of changes for gene 5 (character 5) = 2
Gene number 6, 1 option only:
Leaf states (s1 s4 s3 s2 s5): 0 1 0 0 0
Internal node states (as labeled in the figure): 0, 0, 0, 0
Number of changes for gene 6 (character 6) = 1
Sum of changes
Number of changes for gene 1 (character 1) = 1
Number of changes for gene 2 (character 2) = 2
Number of changes for gene 3 (character 3) = 1
Number of changes for gene 4 (character 4) = 2
Number of changes for gene 5 (character 5) = 2
Number of changes for gene 6 (character 6) = 1
Sum of changes for this tree topology = 9
Can we do better?
The MP (most parsimonious) tree:
[Figure: tree with leaves s1, s4, s3, s2, s5]
Sum of changes for this tree topology = 8
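The slides score each gene by trying the possible 0/1 labelings of the internal nodes ("options") and keeping the cheapest one. A minimal Python sketch of that exhaustive count follows. The tree figures did not survive extraction, so both topologies are assumptions: `TREE_A` is one topology consistent with the per-gene counts listed above, and `TREE_B` is one rearrangement that reaches 8. Neither is confirmed to be the tree drawn on the slides, and all function names are illustrative.

```python
from itertools import product

# Presence/absence matrix from the slides: one string of g1..g6 per species.
MATRIX = {
    "s1": "100110",
    "s2": "001000",
    "s3": "110000",
    "s4": "110111",
    "s5": "001110",
}

def edges_of(tree):
    """Return (edges, number of internal nodes); internal nodes get integer ids."""
    edges = []
    counter = [0]

    def walk(node):
        if isinstance(node, str):      # leaf: identified by its name
            return node
        my_id = counter[0]
        counter[0] += 1
        for child in node:
            edges.append((my_id, walk(child)))
        return my_id

    walk(tree)
    return edges, counter[0]

def min_changes(tree, gene):
    """Try every 0/1 labeling of the internal nodes and keep the cheapest."""
    edges, n_internal = edges_of(tree)
    best = None
    for labeling in product("01", repeat=n_internal):
        def state(node):
            return labeling[node] if isinstance(node, int) else MATRIX[node][gene]
        cost = sum(state(a) != state(b) for a, b in edges)
        best = cost if best is None else min(best, cost)
    return best

def tree_length(tree):
    """Sum of the minimum number of changes over all six genes."""
    return sum(min_changes(tree, gene) for gene in range(6))

TREE_A = ((("s1", "s4"), "s3"), ("s2", "s5"))   # assumed topology
TREE_B = (("s1", ("s4", "s3")), ("s2", "s5"))   # assumed rearrangement

print([min_changes(TREE_A, g) for g in range(6)])  # [1, 2, 1, 2, 2, 1]
print(tree_length(TREE_A))  # 9
print(tree_length(TREE_B))  # 8
```

Exhaustive labeling is only practical for tiny trees, since the number of labelings doubles with every internal node.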
The Fitch algorithm (1971):
[Figure: tree with leaves Human, Chimp, Chicken, Gorilla, Duck; leaf states: A, G, C, C, A; internal-node sets: {A,G}, {A,C,G}, {A,C}, {A,C}]
Postorder tree scan. At each internal node, if the intersection of its children's sets is empty, we apply a union operator; otherwise, we take the intersection.
Number of changes
Total number of changes = number of union operators.
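The postorder scan and the union count can be sketched in a few lines of Python. The topology and the pairing of species with states are assumptions: they were chosen because they reproduce the four internal-node sets listed on the slide, but the original figure is not available to confirm them. The function name is illustrative.

```python
def fitch(tree, states):
    """Return (state set at this node, number of union operations at or below it)."""
    if isinstance(tree, str):              # leaf: singleton set, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:                             # non-empty intersection: no change here
        return common, left_changes + right_changes
    # empty intersection: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1

# Assumed topology and leaf states (reconstructed, not taken from the figure).
TREE = ((("Human", "Chimp"), "Chicken"), ("Gorilla", "Duck"))
STATES = {"Human": "A", "Chimp": "G", "Chicken": "C", "Gorilla": "C", "Duck": "A"}

root_set, changes = fitch(TREE, STATES)
print(sorted(root_set), changes)  # ['A', 'C'] 3
```

Three unions are applied on the way up, so this site needs at least three changes on the assumed tree.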
Patterns:
A site with the pattern CACAG requires the same number of changes as one with CACAT, as do, in general, all positions with the pattern XYXYZ.
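One way to exploit this is to relabel each alignment column by order of first appearance, so that equivalent columns collapse to the same key and need to be scored only once. The sketch below is a hedged illustration in Python; the function names and the example columns are made up for the demonstration, and the equivalence assumes every change costs the same.

```python
from collections import defaultdict

def pattern(column):
    """Relabel states by order of first appearance, e.g. 'CACAG' -> (0, 1, 0, 1, 2)."""
    first_seen = {}
    return tuple(first_seen.setdefault(state, len(first_seen)) for state in column)

def group_by_pattern(columns):
    """Map each pattern to the indices of the columns that share it."""
    groups = defaultdict(list)
    for index, column in enumerate(columns):
        groups[pattern(column)].append(index)
    return dict(groups)

# Hypothetical columns, one character per species in a fixed species order.
columns = ["CACAG", "CACAT", "GTGTA", "AAAAC"]
print(pattern("CACAG") == pattern("CACAT"))  # True
print(group_by_pattern(columns))
# {(0, 1, 0, 1, 2): [0, 1, 2], (0, 0, 0, 0, 1): [3]}
```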
Exercise:
Sequences (Human, Chimp, Chicken, Gorilla, Duck): GACA, GGGA, CAAG, GCGA, GAAA
Find the minimum number of changes. Point out all identical patterns.
Ambiguous characters:
[Figure: tree with leaves Human, Chimp, Chicken, Gorilla, Duck; leaf states: A, G, C, C, R; internal-node sets: {A,G}, {A,C,G}, {A,G,C}, {A,C,G}]
R = {A,G} = purine.
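An ambiguous leaf simply starts the scan with a larger set. The Python sketch below starts each leaf from its IUPAC set (only a few codes are listed) and records the set computed at every internal node. As before, the topology and the species-to-state pairing are assumptions chosen to match the sets shown on the slide, and the names are illustrative.

```python
# A few IUPAC nucleotide codes; R (purine) and Y (pyrimidine) are ambiguous.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "N": {"A", "C", "G", "T"},
}

def fitch(tree, codes, internal_sets):
    """Fitch scan where each leaf starts from its IUPAC set.

    Appends the set of every internal node to internal_sets (postorder)
    and returns (set at this node, number of unions at or below it).
    """
    if isinstance(tree, str):
        return set(IUPAC[codes[tree]]), 0
    left, right = tree
    left_set, left_changes = fitch(left, codes, internal_sets)
    right_set, right_changes = fitch(right, codes, internal_sets)
    common = left_set & right_set
    node_set = common if common else left_set | right_set
    internal_sets.append(node_set)
    return node_set, left_changes + right_changes + (0 if common else 1)

# Assumed topology and leaf codes (reconstructed, not taken from the figure).
TREE = ((("Human", "Chimp"), "Chicken"), ("Gorilla", "Duck"))
CODES = {"Human": "A", "Chimp": "G", "Chicken": "C", "Gorilla": "C", "Duck": "R"}

sets = []
root_set, changes = fitch(TREE, CODES, sets)
print([sorted(s) for s in sets])
# [['A', 'G'], ['A', 'C', 'G'], ['A', 'C', 'G'], ['A', 'C', 'G']]
print(changes)  # 3
```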
The Sankoff algorithm:
Generalization: they assume a cost function Cij for changing from i to j.
If Cij = 1 for every i ≠ j (and Cii = 0), it just counts the number of changes.
We now search for the tree with the min. cost.
Definition: Si(k) = Minimum cost of the subtree of node i, given that the assignment of node i = character k.
Easy to compute for the leaves.
For example, S2(A) = 0 (leaf 2 carries A, so having A there costs nothing).
S2(C) = S2(G) = S2(T) = ∞ (they just can’t be there).
[Figure: tree with leaves 7, 8, 2, 5, 3 carrying states A, G, A, A, C; internal nodes 6, 4, 1, 0]
Leaf vectors, ordered [Si(A), Si(C), Si(G), Si(T)], for leaves 7, 8, 2, 5, 3 (states A, G, A, A, C):
[0, ∞, ∞, ∞] [∞, ∞, 0, ∞] [0, ∞, ∞, ∞] [0, ∞, ∞, ∞] [∞, 0, ∞, ∞]
[Figure: node 0 with children 1 and 2; node 1 carries the vector [S1(A), S1(C), S1(G), S1(T)] and node 2 carries [S2(A), S2(C), S2(G), S2(T)]]
Costs:
   A  C  G  T
A  0  3  1  2
C  3  0  2  1
G  1  2  0  3
T  2  1  3  0
S0(A) = min_X (C_AX + S1(X)) + min_Y (C_AY + S2(Y))
Example: node 1 has the vector [13, 17, 22, 14] and node 2 has the vector [15, 14, 21, 17], with the same costs as above.
S0(A) = min { 13, 17 + 3, 22 + 1, 14 + 2 } + min { 15, 14 + 3, 21 + 1, 17 + 2 }
= 13 + 15 = 28.
Node 0 so far: [28, x, y, z]
S0(C) = min { 13 + 3, 17, 22 + 2, 14 + 1 } + min { 15 + 3, 14, 21 + 2, 17 + 1 }
= 15 + 14 = 29.
Node 0 so far: [28, 29, y, z]
S0(G) = min { 13 + 1, 17 + 2, 22, 14 + 3 } + min { 15 + 1, 14 + 2, 21, 17 + 3 }
= 14 + 16 = 30.
Node 0 so far: [28, 29, 30, z]
S0(T) = min { 13 + 2, 17 + 1, 22 + 3, 14 } + min { 15 + 2, 14 + 1, 21 + 3, 17 }
= 14 + 15 = 29.
Node 0: [28, 29, 30, 29]
The cost of the tree is the minimum of this vector, which is 28.
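The four hand computations can be reproduced with a short Python sketch of the combination step. The cost table and the two child vectors are the ones from the slides; the function name `combine` and the dictionary layout are illustrative choices, not part of the original material.

```python
ALPHABET = "ACGT"

# Cost table from the slides: COST[i][j] is the cost of changing i to j.
COST = {
    "A": {"A": 0, "C": 3, "G": 1, "T": 2},
    "C": {"A": 3, "C": 0, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "G": 0, "T": 3},
    "T": {"A": 2, "C": 1, "G": 3, "T": 0},
}

def combine(child_vectors):
    """Cost vector of a parent: for each state k, add up over the children
    the cheapest (change cost + child subtree cost)."""
    return {
        k: sum(min(COST[k][x] + vec[x] for x in ALPHABET) for vec in child_vectors)
        for k in ALPHABET
    }

s1 = dict(zip(ALPHABET, [13, 17, 22, 14]))   # vector at node 1
s2 = dict(zip(ALPHABET, [15, 14, 21, 17]))   # vector at node 2

s0 = combine([s1, s2])
print([s0[k] for k in ALPHABET])  # [28, 29, 30, 29]
print(min(s0.values()))           # 28
```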
Dynamic programming.
This is an example of dynamic programming: you first solve small subproblems (the cost vectors of the subtrees) and then reuse those solutions to build the solution to a larger problem (the cost vector of the parent node).
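The whole bottom-up computation fits in one recursive function: leaves get their 0/∞ vectors, and every internal node is built from the vectors of its children. The Python sketch below uses unit costs, so the result should agree with a plain count of changes. The tree and leaf states are assumptions borrowed from the reconstructed five-species example, and the function names are illustrative.

```python
import math

ALPHABET = "ACGT"

def unit_cost(i, j):
    """Cost 0 for no change, 1 for any change."""
    return 0 if i == j else 1

def sankoff(tree, states, cost=unit_cost):
    """Return {k: minimum cost of the subtree when its root is assigned k}."""
    if isinstance(tree, str):
        # Leaf: only the observed state is allowed.
        return {k: 0 if k == states[tree] else math.inf for k in ALPHABET}
    # Solve the smaller problems first, then combine them.
    child_vectors = [sankoff(child, states, cost) for child in tree]
    return {
        k: sum(min(cost(k, x) + vec[x] for x in ALPHABET) for vec in child_vectors)
        for k in ALPHABET
    }

# Assumed topology and leaf states (reconstructed, not taken from the figure).
TREE = ((("Human", "Chimp"), "Chicken"), ("Gorilla", "Duck"))
STATES = {"Human": "A", "Chimp": "G", "Chicken": "C", "Gorilla": "C", "Duck": "A"}

root = sankoff(TREE, STATES)
print([root[k] for k in ALPHABET])  # [3, 3, 4, 5]
print(min(root.values()))           # 3
```

With unit costs the minimum is 3, reached with A or C at the root, which matches the union count and root set obtained with the set-based scan on the same assumed tree.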