SERGEI L KOSAKOVSKY POND [[email protected]] CSE/BIMM/BENG 181 MAY 27, 2011
MOLECULAR EVOLUTION AND PHYLOGENETICS:
AN INTRODUCTION TO PROBABILISTIC MODELS
COMPUTING DISTANCES BETWEEN SEQUENCES
The Hamming distance, or p-distance (the proportion of aligned sites at which two sequences differ), is the most obvious way to compute distances between two aligned homologous sequences.
p-distances are very simple, but make many hidden assumptions, all of which are violated by biological data. Generally, they work reasonably well only for very closely related sequences.
In order to reconstruct trees from distance matrices, accurate estimation of large distances is necessary as well.
[Figure: a tree in which every branch has length 0.1, with sequences A and B at opposite ends; scale axis from 0.1 to 0.8.]
Each branch is short (0.1).
But the distance between sequences A and B is actually quite large: 0.9.
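As a minimal sketch of the p-distance described above, the function below counts mismatching columns in two aligned sequences and divides by the alignment length. The function name `p_distance` and the two ten-base sequences are illustrative choices, not part of the original slides.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count columns where the two characters disagree.
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Two of ten sites differ, so the p-distance is 0.2.
print(p_distance("ATGAAAGCGA", "ATGAGAGTGA"))
```

The sketch treats every character, including gaps, as an ordinary state; real pipelines usually decide explicitly how gapped or ambiguous columns are handled.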
MULTIPLE HITS AND REVERSIONS
LOW DIVERGENCE
A T G A A A G C G A
A T G A G A G T G A
Substitutions = 2, p = 0.2
HIGH DIVERGENCE
A T G A A A G C G A
A G T A G A G T G A
Substitutions = 7, p = 0.4
[Slide annotations mark a reversion and multiple hits, with the unobserved intermediate bases (T, C) drawn above the affected sites.]
EFFECT ON DISTANCE ESTIMATES
[Figure: distance estimated by p-distance (Estimated) plotted against the true divergence (Correct); both axes run from 0 to 0.7.]
Simulated 100 replicates of 1000 nucleotide long sequences for various divergence levels (substitutions/site)
Plotted ‘true’ divergence vs that estimated by p-distance.
Even for a divergence of 0.25 (on average, one substitution per four sites), the p-distance already significantly underestimates the true level: 0.2125 (95% range 0.19-0.241).
Underestimation becomes progressively worse for larger divergence levels.
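The underestimation can be reproduced with a small sketch. It assumes an equal-rates substitution model (the Jukes-Cantor model introduced later in these slides), under which the expected p-distance at true divergence d is 3/4 (1 - exp(-4d/3)). The function names, the seed, and the simulation design are illustrative assumptions, not the author's simulation code.

```python
import math
import random

BASES = "ACGT"


def expected_p_distance(d: float) -> float:
    """Expected p-distance at true divergence d (substitutions/site), equal-rates model."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def simulate_p_distance(d: float, n_sites: int = 1000, rng=None) -> float:
    """Simulate n_sites independent sites and return the observed p-distance."""
    rng = rng or random.Random(0)
    differing = 0
    for _ in range(n_sites):
        start = rng.choice(BASES)
        state = start
        # Substitutions arrive as a Poisson process with rate 1, run for "time" d,
        # so the expected number of substitutions per site is d.
        elapsed = rng.expovariate(1.0)
        while elapsed < d:
            # Each substitution picks one of the three other bases uniformly;
            # later hits can overwrite or revert earlier ones.
            state = rng.choice([b for b in BASES if b != state])
            elapsed += rng.expovariate(1.0)
        differing += state != start
    return differing / n_sites


print(round(expected_p_distance(0.25), 4))
print(simulate_p_distance(0.25, n_sites=1000, rng=random.Random(2011)))
```

At d = 0.25 the formula gives roughly 0.2126, in line with the value quoted on the slide; individual simulated replicates scatter around it.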
JUKES-CANTOR 1969
The idea is to model substitutions at a site using a Markov Process.
Very much like a Markov chain, except time is now continuous, instead of being measured in discrete steps.
X(t) defines the probability distribution that the observed quantity follows at time t≥0.
Markov property (memoryless process):
$$\Pr\{X(t) = x_0 \mid X(t_1) = x_1, \ldots, X(t_n) = x_n\} = \Pr\{X(t) = x_0 \mid X(t_1) = x_1\}, \qquad t > t_1 > \ldots > t_n \ge 0$$
MARKOV PROCESSES
To completely define a Markov process, we need to specify the transition probability function: given that the process is in state A at time u, what is the probability that it will be in state B at a later time, u+t?
Often written as a matrix, T(u,t):
$$T(u,t)_{AB} = \Pr\{X(u+t) = B \mid X(u) = A\}$$
If one further assumes that the process is homogeneous, i.e. T(u,t) does not depend on u, then
$$T(t)_{AB} = \Pr\{X(t) = B \mid X(0) = A\}$$
MARKOV PROCESSES (CONT’D)
In the homogeneous case, it is easier to define the process in terms of its rate matrix Q:
$$Q = \lim_{t \downarrow 0} \frac{T(t) - I}{t}$$
Given Q, it can be shown that for t≥0,
$$T(t) = \exp(Qt)$$
where the matrix exponential is defined by the standard Taylor series
$$\exp(Qt) = I + Qt + \frac{(Qt)^2}{2!} + \frac{(Qt)^3}{3!} + \ldots$$
There are abundant numerical algorithms that compute the matrix exponential in O(C^3) time, where C is the dimension of the rate matrix.
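The Taylor series can be evaluated directly for small matrices. The sketch below is a pure-Python illustration under stated assumptions: the helper names `mat_mul` and `expm_taylor`, the 40-term truncation, and the two-state example matrix are all hypothetical choices. Truncated Taylor sums lose accuracy when the entries of Qt are large, which is why numerical libraries typically use other schemes such as scaling and squaring.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(q, t, terms=40):
    """Approximate exp(Q t) by summing the first `terms` terms of the Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # (Qt)^0 / 0! = I
    for k in range(1, terms):
        term = mat_mul(term, qt)               # multiply by Qt ...
        term = [[x / k for x in row] for row in term]  # ... and divide by k: (Qt)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Hypothetical two-state rate matrix; each row sums to 0.
Q = [[-1.0, 1.0],
     [1.0, -1.0]]
T = expm_taylor(Q, 0.5)
# For this Q the exact answer is T[0][0] = (1 + exp(-2t)) / 2.
print(T[0][0], 0.5 * (1.0 + math.exp(-1.0)))
```

Each row of the result sums to 1, as required of a transition probability matrix.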
HOW DOES THIS RELATE TO GENETIC DISTANCES?
Consider t as the evolutionary time for a mutational process that runs at a constant mutation rate r. Divergence is then obtained as d = r × t
The advantage of using a Markov process is that it automatically computes the probability of all possible paths from A to B over time t, whereas p-distance only considers the direct A to B path.
The optimal d can be inferred from the data!
JUKES CANTOR (’69) DISTANCE: JC69
The Markov process assumes that all four bases are equally probable and that nucleotides mutate to other nucleotides with equal rates.
Diagonal rates are defined by the requirement that the transition matrix forms a valid probability distribution in each row: for this to hold, each row in the rate matrix must sum to 0.
Rate matrix Q
From↓To →   A       C       G       T
A          -0.75    0.25    0.25    0.25
C           0.25   -0.75    0.25    0.25
G           0.25    0.25   -0.75    0.25
T           0.25    0.25    0.25   -0.75
Transition matrix T(t) (rows: from; columns: to; order A, C, G, T)
$$T(t) = \begin{pmatrix}
\frac{1}{4}(1 + 3e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) \\
\frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 + 3e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) \\
\frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 + 3e^{-t}) & \frac{1}{4}(1 - e^{-t}) \\
\frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 - e^{-t}) & \frac{1}{4}(1 + 3e^{-t})
\end{pmatrix}$$
T(0)
From↓To →   A       C       G       T
A           1       0       0       0
C           0       1       0       0
G           0       0       1       0
T           0       0       0       1
T(0.1)
From↓To →   A       C       G       T
A           0.753   0.082   0.082   0.082
C           0.082   0.753   0.082   0.082
G           0.082   0.082   0.753   0.082
T           0.082   0.082   0.082   0.753
T(0.5)
From↓To →   A       C       G       T
A           0.352   0.216   0.216   0.216
C           0.216   0.352   0.216   0.216
G           0.216   0.216   0.352   0.216
T           0.216   0.216   0.216   0.352
T(∞)
From↓To →   A       C       G       T
A           0.25    0.25    0.25    0.25
C           0.25    0.25    0.25    0.25
G           0.25    0.25    0.25    0.25
T           0.25    0.25    0.25    0.25
Note: with the JC69 rate matrix whose off-diagonal rates are 0.25, the closed form T(t) reproduces the entries labeled T(0.1) and T(0.5) at t = 0.4 and t = 2 respectively (e^{-0.4} ≈ 0.670, e^{-2} ≈ 0.135); the labels evidently use a time scale in which the off-diagonal rates are 1 rather than 0.25.
ML FITTING JC69
The objective is to find the optimal t, given the data.
Use the principle of maximum likelihood to select the t that maximizes the probability of observing the alignment given the model.
A T G A A A G C G A
A G T A G A G T G A
Independent sites:
$$\Pr\{data \mid t\} = \Pr\{A \to A \mid t\}^4 \times \Pr\{T \to G \mid t\} \times \Pr\{G \to T \mid t\} \times \Pr\{A \to G \mid t\} \times \Pr\{G \to G \mid t\}^2 \times \Pr\{C \to T \mid t\}$$
Simplify...
$$\Pr\{data \mid t\} = \left[\tfrac{1}{4}\left(1 + 3e^{-t}\right)\right]^6 \left[\tfrac{1}{4}\left(1 - e^{-t}\right)\right]^4$$
[Figure: Prob(data|t) plotted against t from 0 to 1 (vertical axis 0 to 1e-05); the maximum is at t = 0.76214.]
ML FITTING (CONT’D)
For any pair of sequences with n sites, the JC69 likelihood function depends only on the number of matches (m) and mismatches (n-m):
$$L(t) = \Pr\{data \mid t\} = \frac{1}{4^n}\left(1 + 3e^{-t}\right)^m \left(1 - e^{-t}\right)^{n-m}$$
Easier to deal with sums than products: use the log-likelihood function:
$$\log L(t) = -n \log 4 + m \log\left(1 + 3e^{-t}\right) + (n - m) \log\left(1 - e^{-t}\right)$$
To find the maximum, solve (D = (n-m)/n is the p-distance):
$$\frac{d \log L(t)}{dt} = 0 \implies t = -\log\left(1 - \tfrac{4}{3}D\right)$$
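As a quick numerical check of the closed-form maximizer, the sketch below evaluates the JC69 log-likelihood for the ten-site example from the slides (n = 10 sites, m = 6 matches) and compares the formula with a brute-force grid search. The function names and the grid resolution are illustrative assumptions.

```python
import math


def jc69_log_likelihood(t: float, n: int, m: int) -> float:
    """log L(t) for n aligned sites with m matches under JC69 (valid for t > 0)."""
    e = math.exp(-t)
    return (-n * math.log(4.0)
            + m * math.log(1.0 + 3.0 * e)
            + (n - m) * math.log(1.0 - e))


def jc69_ml_time(n: int, m: int) -> float:
    """Closed-form maximizer t = -log(1 - 4D/3), where D is the p-distance."""
    d = (n - m) / n
    if d >= 0.75:
        return math.inf   # the likelihood keeps increasing with t; no finite maximum
    return -math.log(1.0 - 4.0 * d / 3.0)


n, m = 10, 6
t_hat = jc69_ml_time(n, m)

# Grid search over t in (0, 3] as an independent check of the formula.
grid = [i / 10000 for i in range(1, 30001)]
t_grid = max(grid, key=lambda t: jc69_log_likelihood(t, n, m))

print(round(t_hat, 5), t_grid)
```

Both approaches agree with the maximum marked on the likelihood plot (t = 0.76214).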
JC69 DISTANCE ESTIMATE
One can show that for JC69, the distance estimator d (expected substitutions per site) is related to the time parameter t as d = 3/4 t:
$$d_{JC69} = -\frac{3}{4} \log\left(1 - \tfrac{4}{3}D\right)$$
Note that the distance is defined only for p-distances D below 0.75. Why does this make sense?
[Figure: JC69 distance estimate (Estimated) plotted against the true divergence (Correct); horizontal axis 0.1 to 0.7, vertical axis 0 to 0.8.]
JC69 correction works!
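A minimal sketch of the JC69 correction, assuming the p-distance has already been computed. The function name and the choice to return infinity for saturated p-distances (0.75 and above) are illustrative, not prescribed by the slides.

```python
import math


def jc69_distance(p: float) -> float:
    """JC69-corrected distance (expected substitutions per site) from a p-distance."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-distance must lie in [0, 1]")
    if p >= 0.75:
        # The logarithm's argument is no longer positive: the estimate is undefined,
        # reported here as infinity (an assumption of this sketch).
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.1, 0.2125, 0.4, 0.7):
    print(p, round(jc69_distance(p), 4))
```

For small p the corrected distance is close to p itself; the correction grows quickly as p approaches 0.75.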
NEXT STEP: FELSENSTEIN 81
In biological sequences, base frequencies are not equal.
JC69 can become biased (overestimate distances, see below).
This is because there are necessarily more substitutions to frequent residues to maintain the frequencies.
HIV-1 frequencies
Base   Frequency
A      0.39
C      0.17
G      0.20
T      0.24
[Figure: distance estimate (Estimated) plotted against the true divergence (Correct); horizontal axis 0.1 to 0.7, vertical axis 0 to 1.]
F81: rate matrix Q
From↓To →   A     C     G     T
A           *     πC    πG    πT
C           πA    *     πG    πT
G           πA    πC    *     πT
T           πA    πC    πG    *
Distance estimator
$$d_{F81} = -F \log\left(1 - D/F\right), \qquad F = 1 - \pi_A^2 - \pi_C^2 - \pi_G^2 - \pi_T^2$$
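The F81 estimator can be sketched the same way. The code below assumes base frequencies are supplied as a dictionary; the function name, the tolerance on the frequency sum, and the use of infinity for saturated distances are illustrative assumptions. The HIV-1 frequencies are the ones tabulated in the slides.

```python
import math


def f81_distance(p: float, freqs: dict) -> float:
    """F81 distance -F log(1 - p/F), with F = 1 - sum of squared base frequencies."""
    if abs(sum(freqs.values()) - 1.0) > 1e-6:
        raise ValueError("base frequencies must sum to 1")
    f = 1.0 - sum(x * x for x in freqs.values())
    if p >= f:
        return math.inf   # saturated: estimate undefined (assumption of this sketch)
    return -f * math.log(1.0 - p / f)


hiv1 = {"A": 0.39, "C": 0.17, "G": 0.20, "T": 0.24}
equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(round(f81_distance(0.4, hiv1), 4))
# With equal frequencies F = 3/4 and the estimator reduces to JC69.
print(round(f81_distance(0.4, equal), 4))
```

With the HIV-1 frequencies F is about 0.7214, so the estimator saturates at a smaller p-distance than JC69 does.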
NEXT STEP: DIFFERENT KINDS OF SUBSTITUTIONS
Nucleotides are split into two chemical groups:
Adenine and Guanine (purines)
Cytosine and Thymine (pyrimidines)
Substitutions within a group (e.g. A to/from G) are called transitions and are usually much more frequent than substitutions between groups, which are called transversions.
[Figure: chemical structures of adenine, guanine, cytosine and thymine.]
HIV-1 pol example
From↓To →   A     C     G     T
A           *     2     20    1
C           2     *     3     18
G           24    1     *     1
T           1     10    0     *
HKY85: rate matrix Q
From↓To →   A       C       G       T
A           *       πC      κπG     πT
C           πA      *       πG      κπT
G           κπA     πC      *       πT
T           πA      κπC     πG      *
κ: transition/transversion parameter (=1 to obtain F81)
GTR: rate matrix Q
From↓To →   A         C         G         T
A           *         rACπC     πG        rATπT
C           rACπA     *         rCGπG     rCTπT
G           πA        rCGπC     *         rGTπT
T           rATπA     rCTπC     rGTπG     *
Most general time-reversible model in this class: 6 substitution rates (in the matrix above the A↔G rate is fixed at 1 as the reference, leaving 5 free rate parameters).
Closed-form distance expressions either don’t exist (GTR) or are cumbersome (HKY85). Distances can always be estimated numerically.
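To make the HKY85 parameterization concrete, the sketch below assembles the rate matrix from base frequencies and κ, filling the diagonal so that each row sums to 0. The function name and data layout are assumptions of this illustration, and the matrix is left unnormalized (it is not rescaled to one expected substitution per unit time).

```python
BASES = "ACGT"
# Transitions are purine<->purine (A, G) and pyrimidine<->pyrimidine (C, T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(freqs: dict, kappa: float):
    """Unnormalized HKY85 rate matrix; rows and columns are ordered A, C, G, T."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            rate = freqs[y]                 # rate is proportional to the target frequency
            if (x, y) in TRANSITIONS:
                rate *= kappa               # transitions are scaled by kappa
            q[i][j] = rate
        q[i][i] = -sum(q[i])                # diagonal makes the row sum to 0
    return q


hiv1 = {"A": 0.39, "C": 0.17, "G": 0.20, "T": 0.24}
for row in hky85_rate_matrix(hiv1, kappa=4.0):
    print([round(x, 3) for x in row])
```

Setting kappa to 1 removes the transition/transversion distinction and recovers the F81 matrix.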
ESTIMATING HIV-1 TREES WITH NJ UNDER DIFFERENT DISTANCES
[Figure: four neighbor-joining trees, one per distance measure (panels JC69, HKY85, F81, GTR), built from the same eight HIV-1 sequences: B_US_90_WEAU160_ACC_U21135, B_US_86_JRFL_ACC_U63632, B_US_83_RF_ACC_M17451, B_FR_83_HXB2_ACC_K03455, D_UG_94_94UG114_ACC_U88824, D_CD_84_84ZR085_ACC_U88822, D_CD_83_NDK_ACC_M27323, D_CD_83_ELI_ACC_K03454. The tree lengths marked on the scale axes are 0.06994, 0.06875, 0.07009 and 0.06956.]
MAXIMUM LIKELIHOOD SEQUENCE ANALYSIS
[Figure: the three ingredients of the analysis. (1) Alignment of homologous sequences (column = site): 60 sites from seven primate sequences (sapiens, chimpanzee, bonobo, gorilla, orangutan, Sumatran, gibbon). (2) Phylogeny. (3) Substitution Models.]
“All models are wrong, but some models are useful” Box (1976)
MAXIMUM LIKELIHOOD MODELS (PLEASE SEE HTTP://WWW.HYPHY.ORG/DOCS/MAXIMUMLIKELIHOOD.PDF FOR DETAILS)
Define the probability of a point substitution along a branch at a given site:
$$Q^i_{x,y}(t; \theta) = \Pr\nolimits_{\theta}\{x \text{ is replaced with } y \text{ in time } t : x, y \in \mathcal{C}\}$$
Continuous time Markov chains.
Typically, the models are stationary and time-reversible.
[Figure: tree with leaves A, C, T, T, T, branches b1 to b8, and every internal node labeled T.]
If ancestral states were known:
$$L(D_s; \mathcal{T}, \theta) = Q^1_{T,A}(t_1; \theta)\, Q^8_{T,T}(t_8; \theta)\, Q^2_{T,C}(t_2; \theta)\, Q^7_{T,T}(t_7; \theta)\, Q^3_{T,T}(t_3; \theta)\, Q^6_{T,T}(t_6; \theta)\, Q^4_{T,T}(t_4; \theta)\, Q^5_{T,T}(t_5; \theta)$$
COMPUTING LIKELIHOOD
Ancestral states are almost always unknown, so we must sum over all possible internal node character assignments.
Computations can be done efficiently, in O(C^2 N) time, as opposed to O(C^{2N}) “brute force” time, using Felsenstein’s pruning algorithm (1981), which takes advantage of the conditional independence of evolution along tree branches.
[Figure: the same tree with leaves A, C, T, T, T and branches b1 to b8, now with unknown internal node states c6, c7, c8, c9.]
$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in \mathcal{C}} \sum_{c_8 \in \mathcal{C}} \sum_{c_7 \in \mathcal{C}} \sum_{c_6 \in \mathcal{C}} \pi(c_9)\, Q^1_{c_9,A}(t_1; \theta)\, Q^8_{c_9,c_8}(t_8; \theta)\, Q^2_{c_8,C}(t_2; \theta)\, Q^7_{c_8,c_7}(t_7; \theta)\, Q^3_{c_7,T}(t_3; \theta)\, Q^6_{c_7,c_6}(t_6; \theta)\, Q^4_{c_6,T}(t_4; \theta)\, Q^5_{c_6,T}(t_5; \theta)$$
Excerpt from “Maximum Likelihood Framework for Genetic Sequence Analysis”:
[Figure 2: tree with leaves A, C, T, T, T, branches b1 to b8, and internal nodes c6, c7, c8, c9.]
Figure 2. Example of a phylogenetic tree with unknown ancestral states. Ds refers to the s-th column in a multiple sequence alignment (ACTTT in this case).
Clearly, it is unreasonable to demand that ancestral sequences be known. Most often, all that can be observed are leaf sequences, which correspond to modern day organisms. Therefore, we need to be able to evaluate the likelihood of the data knowing only leaf characters. To do so, we compute the sum over all possible character assignments to internal nodes of the tree. Using Figure 2 as a reference, such an evaluation would proceed as follows:
$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in \mathcal{C}} \sum_{c_8 \in \mathcal{C}} \sum_{c_7 \in \mathcal{C}} \sum_{c_6 \in \mathcal{C}} \pi(c_9)\, Q^1_{c_9,A}(t_1; \theta)\, Q^8_{c_9,c_8}(t_8; \theta)\, Q^2_{c_8,C}(t_2; \theta) \times Q^7_{c_8,c_7}(t_7; \theta)\, Q^3_{c_7,T}(t_3; \theta)\, Q^6_{c_7,c_6}(t_6; \theta)\, Q^4_{c_6,T}(t_4; \theta)\, Q^5_{c_6,T}(t_5; \theta), \tag{2}$$
where π(c) denotes the probability of observing character c ∈ C at the root of the tree. While this calculation is straightforward, it is clearly not computationally feasible, because for a tree on N sequences, there will be |C|^{N−2} terms in the sum. However, recalling that transition probabilities along a branch are independent of other branches, it is possible to rearrange the sum in a computationally efficient manner.
2. Recursive nature of the likelihood function.
Upon closer examination, Eq. (2) can be rewritten in a more computationally efficient way by grouping the terms according to their hierarchical arrangement in the tree:
$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in \mathcal{C}} \pi(c_9)\, Q^1_{c_9,A}(t_1; \theta) \sum_{c_8 \in \mathcal{C}} Q^8_{c_9,c_8}(t_8; \theta)\, Q^2_{c_8,C}(t_2; \theta) \times \sum_{c_7 \in \mathcal{C}} \left[ Q^7_{c_8,c_7}(t_7; \theta)\, Q^3_{c_7,T}(t_3; \theta) \sum_{c_6 \in \mathcal{C}} \left[ Q^6_{c_7,c_6}(t_6; \theta)\, Q^4_{c_6,T}(t_4; \theta)\, Q^5_{c_6,T}(t_5; \theta) \right] \right]$$
The sum, as just written, can be evaluated with O(|C|^2 N) operations, which is eminently feasible. This observation was first made by Felsenstein in [Felsenstein, 1981], and he referred to it as the pruning algorithm.
Slide annotations on the nested sum: the innermost sum (over c6) only depends on c7; the sum over c7 only depends on c8; the sum over c8 only depends on c9.
FELSENSTEIN’S PRUNING ALGORITHM
Idea: for each node n in the tree, maintain a C-dimensional vector Ln (C is the number of characters), whose i-th element records the probability of the data in the subtree rooted at n, given that the character at node n is i.
For leaves, Ln is easy to compute: Ln[i] = 1 if n is labeled with character i, and Ln[i] = 0 otherwise.
For interior nodes, Ln[i] is computed by iterating over all children of n: for each child m, sum over every state q the probability of changing from i to q along the branch to m, weighted by Lm[q]; then take the product of these sums over all children.
At the root node r, compute the likelihood of the site by summing Lr[i] × π(i) over all characters i, where π(i) is the (supplied) distribution of characters at the root.
Felsenstein’s Pruning Algorithm
Data: A tree T, with leaves labeled with integers from 0 to C − 1, root r, transition probability function for each branch (identified uniquely by the node m where the branch terminates) Pr(x → y|m); π(i): probability of observing character i at the root.
Result: The likelihood score of the tree
n ← the first node in the post-order traversal of T starting at r;
C ← the dimension of the alphabet;
while 1 do
    if n is a leaf then                         // leaf initialization
        Ln ← zero-vector(C);
        Ln[label n] ← 1;
    else                                        // interior node calculation
        Ln ← vector of ones(C);
        for p from 0 to C−1 do
            for node m is a child of n do
                s ← 0;
                for q from 0 to C−1 do
                    s ← s + Pr(p → q|m) × Lm[q];
                end
                Ln[p] ← Ln[p] × s;              // one factor per child, so inside the child loop
            end
        end
    end
    if n = r then
        exit loop;
    end
    n ← the next node in the post-order traversal of T;
end
L ← 0;                                          // root clause
for p from 0 to C−1 do
    L ← L + Lr[p] × π(p);                       // accumulate over root states
end
return L;
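The algorithm can be sketched in a few lines of Python. This version uses recursion instead of an explicit post-order loop and plugs in JC69 transition probabilities for every branch. The tree encoding (a leaf is a character, an interior node is a list of (child, branch length) pairs), the function names, and the branch length of 0.1 are all assumptions made for illustration. The brute-force function sums over all assignments to c6..c9 for the five-leaf example tree, as in Eq. (2), and serves only as a check.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def jc69_prob(x, y, t):
    """JC69 transition probability (off-diagonal rate 0.25, as in the slides' Q)."""
    e = math.exp(-t)
    return 0.25 * (1.0 + 3.0 * e) if x == y else 0.25 * (1.0 - e)


def conditional_likelihoods(node):
    """Return L_n: probability of the leaves below `node` for each state at `node`."""
    if isinstance(node, str):                       # leaf initialization
        return [1.0 if c == node else 0.0 for c in ALPHABET]
    result = [1.0] * len(ALPHABET)                  # interior node calculation
    for child, t in node:
        child_l = conditional_likelihoods(child)
        for p, x in enumerate(ALPHABET):
            s = sum(jc69_prob(x, y, t) * child_l[q] for q, y in enumerate(ALPHABET))
            result[p] *= s                          # one factor per child
    return result


def site_likelihood(root, root_freqs=(0.25, 0.25, 0.25, 0.25)):
    """Root clause: weight the root vector by the root state distribution."""
    return sum(pi * l for pi, l in zip(root_freqs, conditional_likelihoods(root)))


def brute_force(t):
    """Direct evaluation of the four nested sums for the example tree."""
    total = 0.0
    for c9, c8, c7, c6 in product(ALPHABET, repeat=4):
        total += (0.25 * jc69_prob(c9, "A", t) * jc69_prob(c9, c8, t)
                  * jc69_prob(c8, "C", t) * jc69_prob(c8, c7, t)
                  * jc69_prob(c7, "T", t) * jc69_prob(c7, c6, t)
                  * jc69_prob(c6, "T", t) ** 2)
    return total


T_LEN = 0.1                                          # hypothetical branch length
c6 = [("T", T_LEN), ("T", T_LEN)]
c7 = [("T", T_LEN), (c6, T_LEN)]
c8 = [("C", T_LEN), (c7, T_LEN)]
c9 = [("A", T_LEN), (c8, T_LEN)]
print(site_likelihood(c9), brute_force(T_LEN))
```

The two printed values agree up to floating-point rounding; the recursive version visits each branch once, while the brute-force version enumerates all 256 internal-state assignments.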
ML TREE COMPARISON
Partial mitochondrial DNA (5 taxa). Exhaustive tree search. GTR.
[Figure: the 15 candidate trees on Chimpanzee, Gorilla, Orangutan, Gibbon and Human, each drawn with its branch lengths and labeled with its log-likelihood score.]
Log-likelihood scores of the 15 trees: -2697.85, -2663.56, -2697.56, -2703.24, -2701.86, -2703.46, -2699.97, -2666.24, -2700.57, -2702.2, -2700.64, -2700.75, -2689.87, -2659.56, -2689.98
SELECTING THE BEST TREE
The tree with the best log-likelihood score can be reported as the true tree.
What if there are other trees whose scores are not too much worse?
How does one assess significance?
How does one compare trees?
[Figure: three of the candidate trees with their log-likelihood scores. Best: -2659.56; 2nd best: -2663.56; Worst: -2703.46.]
COMPARING TREES
Most widely used tree distance measures are based on the concept of splits:
Each interior branch partitions all leaves into two disjoint sets.
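The Robinson-Foulds (RF) distance between two trees on N leaves is the number of splits found in one tree but not in the other. (The widely used symmetric-difference form counts such splits in both trees and is therefore twice this number for binary trees.)
For binary unrooted trees, it can range from 0 to N-3.
[Figure: example tree on Chimpanzee, Gorilla, Gibbon, Orangutan and Human.]
Splits of the example tree:
Split 1:  O,Gib | H,C,Gor
Split 2:  H,O,Gib | C,Gor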
[Figure: the three trees being compared, each on Orangutan, Gibbon, Gorilla, Chimpanzee and Human.]
Tree 1
Split 1:  Gor,O,Gib | C,H
Split 2:  O,Gib | H,C,Gor
Tree 2
Split 1:  H,O,Gib | C,Gor
Split 2:  O,Gib | H,C,Gor
Tree 3
Split 1:  H,Gib | C,Gor,O
Split 2:  C,O | H,Gib,Gor
Robinson-Foulds Distance Matrix
      T2    T3
T1    1     2
T2    -     2
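The distance matrix can be reproduced from the split tables with a short sketch. It follows the convention used in these slides (count the splits of one tree that are missing from the other); the function names and the canonical-side rule are assumptions of this illustration. Each split is entered as one of its two sides, and both sides are mapped to the same canonical form before comparison.

```python
TAXA = frozenset({"H", "C", "Gor", "O", "Gib"})


def canonical(side, taxa=TAXA):
    """Represent a split by the side that does not contain a fixed reference taxon."""
    side = frozenset(side)
    reference = min(taxa)               # any fixed taxon works; min() is deterministic
    return taxa - side if reference in side else side


def rf_distance(splits_a, splits_b, taxa=TAXA):
    """Number of splits in tree A that are absent from tree B (slide convention)."""
    a = {canonical(s, taxa) for s in splits_a}
    b = {canonical(s, taxa) for s in splits_b}
    return len(a - b)


tree1 = [{"C", "H"}, {"O", "Gib"}]
tree2 = [{"C", "Gor"}, {"O", "Gib"}]
tree3 = [{"H", "Gib"}, {"C", "O"}]

print(rf_distance(tree1, tree2), rf_distance(tree1, tree3), rf_distance(tree2, tree3))
```

For the symmetric-difference form, replace `len(a - b)` with `len(a ^ b)`; for two binary trees on the same taxa this doubles every entry.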
RF CALCULATION ALGORITHMS
Pairwise Robinson-Foulds distance calculations can be performed exactly in O(nk^2) time on k n-leaf trees.
Probabilistic algorithms with a tunable error parameter (ε) can run very quickly in practice:
[Figure: benchmark results for 2048 trees x 2048 taxa.]
CONFIDENCE ASSESSMENT VIA BOOTSTRAP
Biological inference is performed on limited length, noisy data
Sampling variance: the error in our ability to infer a quantity (e.g. the tree topology) based on limited data
Extreme example: outlier effects.
Consider trying to approximate the mean net worth in the US from a sample of 10 tax returns.
Assume that one of the 10 is Bill Gates; hence we may have a data matrix that looks like the one below.
The mean estimated from the table is ~$5B.
We could resample the database of tax returns to get many tables and average over them.
In biology, frequently all we have is one sample (an alignment); hence it’s impossible to obtain another set.
Return    Net worth
Bill G    $50B
1         $100K
2         $20K
3         $120K
4         $30K
5         $200K
6         $10K
7         $25K
8         $40K
9         $5K
BOOTSTRAP
We can approximate the true underlying distribution that generated our observed sample by resampling it with replacement:
Draw a value from the sample at random; replace it in the sample.
Repeat N times (N is the size of the sample).
Back to the Bill Gates example:
1000 bootstrap replicates indicate that the standard deviation of our estimate is about the same as the mean: $5B!
In other words, we have very little confidence in the obtained mean (~34% of cases have mean < 200K, for example).
[Figure: histogram of the bootstrap means; horizontal axis Value (5e+09 to 2e+10), vertical axis Weight (0 to 0.3).]
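The Bill Gates example can be sketched with the standard library alone. The net-worth values are the ten entries from the table in the slides; the function name, the seed and the use of 1000 replicates as the default are illustrative assumptions, so the printed numbers will vary slightly with the seed.

```python
import random
import statistics

# Ten tax returns from the slide: one $50B outlier and nine ordinary values.
net_worth = [50e9, 100e3, 20e3, 120e3, 30e3, 200e3, 10e3, 25e3, 40e3, 5e3]


def bootstrap_means(sample, replicates=1000, rng=None):
    """Resample with replacement (same size as the sample) and return each replicate's mean."""
    rng = rng or random.Random(0)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(replicates)]


means = bootstrap_means(net_worth, replicates=1000, rng=random.Random(2011))
print("mean of bootstrap means:", statistics.fmean(means))
print("standard deviation:     ", statistics.stdev(means))
print("fraction below $200K:   ", sum(m < 200e3 for m in means) / len(means))
```

A replicate mean falls below $200K essentially only when the $50B return is never drawn, which happens with probability 0.9^10 ≈ 0.35, consistent with the fraction quoted on the slide.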
PHYLOGENETIC BOOTSTRAP
Infer a phylogeny using your favorite method
Generate bootstrap replicates: in this case our samples are alignment columns
Repeat the inference procedure
For each split in the inferred phylogeny, tabulate the number of replicates in which that split is recovered
Branches with high bootstrap support values are those that have strong signal in the alignment
Joe Felsenstein is credited with introducing the phylogenetic bootstrap
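The resampling and tabulation steps can be sketched as below; the tree inference step is left out because it depends on the chosen method. The alignment layout (a dictionary from sequence name to aligned string), the function names, and the representation of a split as a frozenset of taxon names are assumptions of this illustration, and the three-sequence alignment is toy data.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Resample alignment columns with replacement; the replicate has the same length."""
    rng = rng or random.Random(0)
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # Draw column indices once so that every sequence uses the same resampled columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def split_support(reference_splits, replicate_split_sets):
    """Fraction of replicate trees (each given as a set of splits) containing each split."""
    total = len(replicate_split_sets)
    return {split: sum(split in rep for rep in replicate_split_sets) / total
            for split in reference_splits}


toy = {"a": "ACGTAC", "b": "ACGTTC", "c": "AGGTAC"}   # toy data, not from the slides
print(bootstrap_alignment(toy, random.Random(3)))
```

In a full analysis each replicate alignment would be passed to the same inference procedure used for the original tree, and the resulting split sets fed to `split_support`.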
EXAMPLE MTDNA IN PRIMATES
[Figure: NJ tree of Human, Chimpanzee, Gorilla, Orangutan and Gibbon (scale bar 0.1) with bootstrap support on its two internal branches: 1 (1000/1000 replicates had this split) and 0.722 (722/1000 replicates had this split).]
Used NJ to reconstruct a tree
Drew 1000 alignment samples using the bootstrap
Inferred 1000 trees: one from each resampled alignment
If a given split from the inferred tree was found in a replicate, then its count was incremented by 1.
The accepted number for good support is 0.7 or greater, but 1.0 is sometimes desired.
This alignment is too short (or not informative enough) to unequivocally support the Gorilla-Chimp-Human branching order.
This echoes the observation that the 2nd best ML tree changed that order.
ESTABLISHING HIV TRANSMISSION CHAINS
One of the major applications of molecular phylogenetics in HIV research is to attempt to identify putative transmission events (i.e. A may have infected B).
This is obviously useful for epidemiological purposes: to estimate transmission networks, transmission cluster sizes etc
A pop-sci application: court cases to prove HIV transmission.
Generally the problem is quite difficult.
[Figure: large phylogenetic tree of HIV sequences; tips are labeled with sample identifiers such as 050106508-1, 050105328-1 and 050105328-2.]
ESTABLISHING TRANSMISSION?
It is nearly impossible to use phylogenetic trees alone to definitively prove transmission.
What may a transmission event look like in a tree?
HIV has a high mutation rate, so viruses from two ‘unrelated’ individuals will be quite divergent (upwards of 20% in some genomic regions)
Direct transmissions tend to have lower pairwise distances than unrelated cases (this depends on many factors, most importantly, sampling time relative to transmission)
David Acer, a Florida dentist, became infected with HIV in late 1986 (Kaposi’s sarcoma) and developed AIDS in 1987. He continued to practice general dentistry for the following two years and performed invasive procedures on patients.
In 1990, the case of a woman in her 20s with AIDS but no identifiable risk factors, who had been a patient of Acer’s (patient A), prompted the CDC to investigate his practice.
Based on epidemiological evidence and sketchy molecular data, the CDC concluded that Acer might have infected patient A.
The dentist then requested that his former patients be tested for HIV.
Out of 1100 tested persons, two patients (B and C) were found to be HIV positive.
An additional HIV+ person (D - a former patient of Acer) was identified by using the AIDS case registry and two more patients (E,F) contacted the CDC to report that they were HIV infected.
CAVEAT
HIV evolves a lot in a single patient, so it’s a moving target, and diverse at any given time point.
Consider the pattern of HIV-1 envelope evolution over time proposed by Shankarappa et al.
MOLECULAR DATA
Multiple clones from a short region of the envelope gene (C2-V3 region) were sequenced from each patient (and the dentist)
30 local controls (i.e. epidemiologically unrelated cases) were also included
PHYLOGENETIC ANALYSIS: PARSIMONY
Looks fairly convincing?
This is inferred from 279 sites (146 informative). How reliable is the tree?
But the malpractice insurer (CIGNA) hired their own experts to argue that the analysis is flawed because the local controls are not representative.
Other types of evidence (e.g. molecular signature, known sexual contacts etc) were used to bolster the case.
The prosecution argued that on August 4, 1994 a Lafayette, LA gastroenterologist made a mixture of blood or blood-products from two patients under the doctor’s care, one infected with HIV-1 and the other with hepatitis C, and infected his former girlfriend by intramuscular injection.
Molecular/epidemiological data were used to support the case
Other risk factors for the victim were examined (sexual transmission, IDU, occupational hazards – the victim was a nurse) and determined unlikely.
In January 1995 the victim tested positive for HIV-1 and accused the physician of infecting her.
Law enforcement identified the potential source – a homosexual patient infected with HIV-1 in 1990.
The hypothesis of transmission was tested.
DUE DILIGENCE
This study used three different approaches: parsimony, maximum likelihood and Bayesian to build evolutionary trees from two different gene regions (RT and env)
28 local (Louisiana) HIV+ controls were used together with background sequences from GenBank.
Confidence values (bootstrap or posterior probability) were obtained.
The analysis cannot rule out with 100% confidence that intermediates were involved
It also does not establish directionality (although in this case there are other factors that do)
The fact that victim sequences cluster WITHIN the clade of source sequences makes a compelling case for transmission.