molecular evolution and phylogenetics an …cse/bimm/beng 181 may 27, 2011 sergei l kosakovsky pond...

SERGEI L KOSAKOVSKY POND [[email protected]] CSE/BIMM/BENG 181 MAY 27, 2011

MOLECULAR EVOLUTION AND PHYLOGENETICS:

AN INTRODUCTION TO PROBABILISTIC MODELS




COMPUTING DISTANCES BETWEEN SEQUENCES

Hamming or p-distance is the most obvious way to compute distances between two aligned homologous sequences

p-distances are very simple, but make many hidden assumptions, all of which are violated by biological data. Generally, they work reasonably well only for very closely related sequences.

In order to reconstruct trees from distance matrices, accurate estimation of large distances is necessary as well.

[Figure: a tree in which every branch has length 0.1, with sequences A and B at two of its tips; scale axis from 0.1 to 0.8.]

Each branch is short (0.1).

But the distance between sequences A and B is actually quite large: 0.9.




MULTIPLE HITS AND REVERSIONS

LOW DIVERGENCE

A T G A A A G C G A

A T G A G A G T G A

Substitutions = 2, p = 0.2

HIGH DIVERGENCE

A T G A A A G C G A

A G T A G A G T G A

Substitutions = 7, p = 0.4

[Figure annotations on the high-divergence panel: Reversion, Multiple hits.]
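The p-distance for the two sequence pairs above can be checked with a few lines of Python. This is a minimal sketch; the function name `p_distance` and the variable names are illustrative and do not come from the slides.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


top = "ATGAAAGCGA"          # first sequence in both panels
bottom_low = "ATGAGAGTGA"   # second sequence, low-divergence panel
bottom_high = "AGTAGAGTGA"  # second sequence, high-divergence panel

print(p_distance(top, bottom_low))   # 0.2
print(p_distance(top, bottom_high))  # 0.4
```

In the high-divergence panel only 4 differences are visible although 7 substitutions occurred, which is why p stays at 0.4.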




EFFECT ON DISTANCE ESTIMATES.

[Figure: p-distance estimate (Estimated) plotted against the true divergence (Correct); both axes run from 0 to 0.7.]

Simulated 100 replicates of 1000 nucleotide long sequences for various divergence levels (substitutions/site)

Plotted ‘true’ divergence vs that estimated by p-distance.

Even for divergence of 0.25 (1/4 sites have mutation on average), p distance already significantly underestimates the true level: 0.2125 (0.19-0.241 95% range)

Underestimation becomes progressively worse for larger divergence levels.
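The underestimation can be reproduced with a small simulation. The sketch below assumes that every base changes to each of the other three at the same rate (the slides do not say which model generated their replicates), and `simulate_p_distance` is a hypothetical helper name.

```python
import random

BASES = "ACGT"


def simulate_p_distance(divergence: float, n_sites: int, rng: random.Random) -> float:
    """Fraction of sites that differ after `divergence` expected substitutions per site."""
    differing = 0
    for _ in range(n_sites):
        start = rng.choice(BASES)
        current = start
        # Substitutions arrive as a Poisson process with rate 1 when time is
        # measured in expected substitutions per site (an assumption of this sketch).
        elapsed = rng.expovariate(1.0)
        while elapsed < divergence:
            current = rng.choice([b for b in BASES if b != current])
            elapsed += rng.expovariate(1.0)
        if current != start:
            differing += 1
    return differing / n_sites


rng = random.Random(2011)
for d in (0.1, 0.25, 0.5, 0.7):
    print(d, round(simulate_p_distance(d, 50_000, rng), 4))
```

Under this equal-rates assumption the printed p-distance falls increasingly short of the true divergence as the divergence grows.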




JUKES-CANTOR 1969

The idea is to model substitutions at a site using a Markov Process.

Very much like a Markov chain, except time is now continuous, instead of being measured in discrete steps.

X(t) defines the probability distribution that the observed quantity follows at time t≥0.

Markov property (memoryless process):

$$\Pr\{X(t) = x_0 \mid X(t_1) = x_1, \ldots, X(t_n) = x_n\} = \Pr\{X(t) = x_0 \mid X(t_1) = x_1\}, \qquad t > t_1 > \ldots > t_n \ge 0$$




MARKOV PROCESSES

To completely define a Markov process, we need to specify the transition probability function: given that the process is in state A at time u, what is the probability that it will be in state B at a later time, u+t?

Often written as a matrix, T(u,t):

$$T(u,t)_{AB} = \Pr\{X(u+t) = B \mid X(u) = A\}$$

If one further assumes that the process is homogeneous, i.e. T(u,t) does not depend on u, then

$$T(t)_{AB} = \Pr\{X(t) = B \mid X(0) = A\}$$




MARKOV PROCESSES (CONT’D)

In the homogeneous case, it is easier to define the process in terms of its rate matrix Q:

$$Q = \lim_{t \downarrow 0} \frac{T(t) - I}{t}$$

Given Q, it can be shown that for t≥0,

$$T(t) = \exp(Qt)$$

where the matrix exponential is defined by the standard Taylor series

$$\exp(Qt) = I + Qt + \frac{(Qt)^2}{2!} + \frac{(Qt)^3}{3!} + \ldots$$

There are abundant numerical algorithms that compute the matrix exponential in O(C^3) time, where C is the dimension of the rate matrix.
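The Taylor series can be evaluated directly for small matrices. The sketch below is a plain truncated series in pure Python, not one of the production algorithms mentioned above (libraries such as SciPy provide `scipy.linalg.expm` for real work). The test matrix is the equal-rates 4x4 matrix that the JC69 slides introduce later, and `expm_taylor` is a hypothetical name.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def expm_taylor(q, t, terms=30):
    """Approximate exp(Q t) by the first `terms` terms of its Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # I
    term = [row[:] for row in result]                                        # (Qt)^0 / 0!
    for k in range(1, terms):
        term = mat_mul(term, qt)
        term = [[x / k for x in row] for row in term]                        # (Qt)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Equal-rates matrix: 0.25 off the diagonal, rows summing to zero.
Q = [[-0.75 if i == j else 0.25 for j in range(4)] for i in range(4)]
T = expm_taylor(Q, 0.5)
print([round(x, 4) for x in T[0]])
print(round(0.25 * (1 + 3 * math.exp(-0.5)), 4))  # closed-form diagonal entry for comparison
```

A truncated series is adequate here because the entries of Qt are small; for large t or stiff matrices the series converges slowly and a scaling-and-squaring routine is the safer choice.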




HOW DOES THIS RELATE TO GENETIC DISTANCES?

Consider t as the evolutionary time for a mutational process that runs at a constant mutation rate r. Divergence is then obtained as d = r × t

The advantage of using a Markov process is that it automatically computes the probability of all possible paths from A to B over time t, whereas p-distance only considers the direct A to B path.

The optimal d can be inferred from the data!




JUKES CANTOR (’69) DISTANCE: JC69

The Markov process assumes that all four bases are equally probable and that nucleotides mutate to other nucleotides with equal rates.

Diagonal rates are defined by the requirement that the transition matrix forms a valid probability distribution in each row: for this to hold, each row in the rate matrix must sum to 0.

Rate matrix Q

From↓ To→    A       C       G       T
A          -0.75    0.25    0.25    0.25
C           0.25   -0.75    0.25    0.25
G           0.25    0.25   -0.75    0.25
T           0.25    0.25    0.25   -0.75

Transition matrix T(t) (rows and columns ordered A, C, G, T)

$$T(t) = \frac{1}{4}\begin{pmatrix} 1+3e^{-t} & 1-e^{-t} & 1-e^{-t} & 1-e^{-t} \\ 1-e^{-t} & 1+3e^{-t} & 1-e^{-t} & 1-e^{-t} \\ 1-e^{-t} & 1-e^{-t} & 1+3e^{-t} & 1-e^{-t} \\ 1-e^{-t} & 1-e^{-t} & 1-e^{-t} & 1+3e^{-t} \end{pmatrix}$$





T(0)

A   1       0       0       0
C   0       1       0       0
G   0       0       1       0
T   0       0       0       1

T(0.4)

A   0.753   0.082   0.082   0.082
C   0.082   0.753   0.082   0.082
G   0.082   0.082   0.753   0.082
T   0.082   0.082   0.082   0.753

T(2)

A   0.352   0.216   0.216   0.216
C   0.216   0.352   0.216   0.216
G   0.216   0.216   0.352   0.216
T   0.216   0.216   0.216   0.352

T(∞)

A   0.25    0.25    0.25    0.25
C   0.25    0.25    0.25    0.25
G   0.25    0.25    0.25    0.25
T   0.25    0.25    0.25    0.25




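ML FITTING JC69

The objective is to find the optimal t, given the data.

Use the principle of maximum likelihood to select the t that maximizes the probability of observing the alignment given the model.

A T G A A A G C G A

A G T A G A G T G A

Independent sites:

$$\Pr\{\text{data} \mid t\} = \Pr\{A \to A \mid t\}^4 \times \Pr\{T \to G \mid t\} \times \Pr\{G \to T \mid t\} \times \Pr\{A \to G \mid t\} \times \Pr\{G \to G \mid t\}^2 \times \Pr\{C \to T \mid t\}$$

Simplify...

$$\Pr\{\text{data} \mid t\} = \left[\frac{1}{4}\left(1 + 3e^{-t}\right)\right]^6 \left[\frac{1}{4}\left(1 - e^{-t}\right)\right]^4$$

[Figure: Prob(data|t) plotted against t from 0 to 1; the maximum is at t=0.76214.]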




ML FITTING (CONT’D)

For any pair of sequences with n sites, the JC69 likelihood function will only depend on the number of matches (m) and mismatches (n-m):

$$L(t) = \Pr\{\text{data} \mid t\} = \frac{1}{4^n}\left(1+3e^{-t}\right)^m \left(1-e^{-t}\right)^{n-m}$$

Easier to deal with sums than products: use the log-likelihood function:

$$\log L(t) = -n \log 4 + m \log\left(1+3e^{-t}\right) + (n-m)\log\left(1-e^{-t}\right)$$

To find the maximum solve (D is the p-distance):

$$\frac{d \log L(t)}{dt} = 0 \implies t = -\log\left(1 - \tfrac{4}{3}D\right)$$
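The closed-form maximizer can be compared against a direct numerical search over the log-likelihood. The sketch below uses a golden-section search, which is my choice for illustration (the slides do not name an optimizer); the function names are hypothetical.

```python
import math


def jc69_log_likelihood(t: float, n: int, m: int) -> float:
    """log L(t) for n aligned sites of which m match, as in the formula above."""
    e = math.exp(-t)
    return -n * math.log(4) + m * math.log(1 + 3 * e) + (n - m) * math.log(1 - e)


def maximize(f, lo: float, hi: float, tol: float = 1e-9) -> float:
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


seq_1 = "ATGAAAGCGA"
seq_2 = "AGTAGAGTGA"
n = len(seq_1)
m = sum(x == y for x, y in zip(seq_1, seq_2))          # 6 matches
t_numeric = maximize(lambda t: jc69_log_likelihood(t, n, m), 1e-6, 10.0)
t_closed = -math.log(1 - 4 / 3 * (n - m) / n)          # D = (n - m) / n
print(round(t_numeric, 5), round(t_closed, 5))
```

Both values should agree with the t = 0.76214 reported for this pair of sequences; the numerical route is the one that carries over to models without a closed-form solution.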




JC 69 DISTANCE ESTIMATE

One can show that for JC69, the distance estimator d (expected substitutions per site) is related to the time parameter t as d = 3/4 t:

$$d_{JC69} = -\frac{3}{4}\log\left(1 - \tfrac{4}{3}D\right)$$

Note that the distance is only defined when the p-distance D is below 0.75. Why does this make sense?

[Figure: JC69 distance estimate (Estimated) plotted against the true divergence (Correct); horizontal axis 0.1 to 0.7, vertical axis 0 to 0.8.]

JC69 correction works!




NEXT STEP: FELSENSTEIN 81

In biological sequences, base frequencies are not equal.

JC69 can become biased (overestimate distances, see below).

This is because there are necessarily more substitutions to frequent residues to maintain the frequencies.

HIV-1 frequencies

Base   Frequency
A      0.39
C      0.17
G      0.20
T      0.24

[Figure: estimated distance (Estimated) plotted against the true divergence (Correct); horizontal axis 0.1 to 0.7, vertical axis 0 to 1.]





F81: rate matrix Q

From↓ To→   A     C     G     T
A           *     πC    πG    πT
C           πA    *     πG    πT
G           πA    πC    *     πT
T           πA    πC    πG    *

Distance estimator

$$d_{F81} = -F \log\left(1 - D/F\right), \qquad F = 1 - \pi_A^2 - \pi_C^2 - \pi_G^2 - \pi_T^2$$
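A short numerical check of the F81 estimator, using the HIV-1 base frequencies listed earlier in the slides. The p-distance of 0.4 is an arbitrary illustrative input and `f81_distance` is a hypothetical name.

```python
import math


def f81_distance(p_dist: float, freqs: dict) -> float:
    """d = -F log(1 - D/F), with F = 1 minus the sum of squared base frequencies."""
    f = 1.0 - sum(x * x for x in freqs.values())
    if p_dist >= f:
        return math.inf  # the logarithm is undefined at or beyond saturation
    return -f * math.log(1.0 - p_dist / f)


hiv1 = {"A": 0.39, "C": 0.17, "G": 0.20, "T": 0.24}
equal = {b: 0.25 for b in "ACGT"}

print(f81_distance(0.4, hiv1))
print(f81_distance(0.4, equal))  # equal frequencies give F = 3/4, i.e. the JC69 formula
```

With equal frequencies F = 3/4 and the estimator reduces to the JC69 distance, so F81 can be read as JC69 with the saturation level adjusted to the observed base composition.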




NEXT STEP: DIFFERENT KINDS OF SUBSTITUTIONS

Nucleotides are split into two chemical groups

Adenine and Guanine (purines)

Cytosine and Thymine (pyrimidines)

Substitutions within group (e.g. A to/from G) are called transitions and are usually much more frequent than substitutions between groups: transversions.

[Figure: Adenine, Guanine, Cytosine, Thymine.]

HIV-1 pol example

From↓ To→   A     C     G     T
A           *     2     20    1
C           2     *     3     18
G           24    1     *     1
T           1     10    0     *





HKY85: rate matrix Q

From↓ To→   A      C      G      T
A           *      πC     κπG    πT
C           πA     *      πG     κπT
G           κπA    πC     *      πT
T           πA     κπC    πG     *

κ: transition/transversion parameter (=1 to obtain F81)

GTR: rate matrix Q

From↓ To→   A        C        G        T
A           *        rACπC    πG       rATπT
C           rACπA    *        rCGπG    rCTπT
G           πA       rCGπC    *        rGTπT
T           rATπA    rCTπC    rGTπG    *

Most general in class: 6 parameters now

Closed form expressions either don’t exist (GTR), or are cumbersome (HKY85). Can always estimate numerically.
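As a starting point for numerical work, the HKY85 rate matrix can be assembled from κ and the base frequencies. This is a minimal sketch: the value κ = 4 is an arbitrary illustration, the matrix is left unnormalized, and `hky85_rate_matrix` is a hypothetical name.

```python
BASES = "ACGT"
# Transitions are substitutions within purines (A, G) or within pyrimidines (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(kappa: float, freqs: dict) -> dict:
    """Unnormalized HKY85 rate matrix as a dict of rows; each row sums to zero."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                rate = freqs[y]
                if (x, y) in TRANSITIONS:
                    rate *= kappa
                q[x][y] = rate
        # The diagonal entry (the * in the table) makes the row sum to zero.
        q[x][x] = -sum(q[x].values())
    return q


hiv1 = {"A": 0.39, "C": 0.17, "G": 0.20, "T": 0.24}
Q = hky85_rate_matrix(4.0, hiv1)
for base in BASES:
    print(base, [round(Q[base][y], 3) for y in BASES])
```

Setting `kappa=1.0` reproduces the F81 matrix, and the resulting Q can be fed to any matrix-exponential routine to obtain transition probabilities for numerical distance estimation.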




ESTIMATING HIV-1 TREES WITH NJ UNDER DIFFERENT DISTANCES

[Figure: four neighbor-joining trees, one per distance (panels JC69, HKY85, F81, GTR), for the sequences B_US_90_WEAU160_ACC_U21135, B_US_86_JRFL_ACC_U63632, D_UG_94_94UG114_ACC_U88824, D_CD_84_84ZR085_ACC_U88822, D_CD_83_NDK_ACC_M27323, D_CD_83_ELI_ACC_K03454, B_US_83_RF_ACC_M17451 and B_FR_83_HXB2_ACC_K03455; scale axes run from 0.01 to about 0.07.]




MAXIMUM LIKELIHOOD SEQUENCE ANALYSIS

[Figure: three linked panels. "Alignment of homologous sequences (column = site)": 60 aligned sites from seven primate sequences labeled sapiens, chimpanzee, bonobo, gorilla, orangutan, Sumatran and gibbon. "Phylogeny". "Substitution Models".]

“All models are wrong, but some models are useful” Box (1976)




MAXIMUM LIKELIHOOD MODELS (please see http://www.hyphy.org/docs/maximumlikelihood.pdf for details)

Define the probability of a point substitution along a branch at a given site:

$$Q^{i}_{x,y}(t;\theta) = \Pr\nolimits_{\theta}\{x \text{ is replaced with } y \text{ in time } t : x, y \in C\}$$

Continuous time Markov chains.

Typically, the models are stationary and time-reversible.

[Figure: tree with leaves A, C, T, T, T, branches b1 to b8, and every ancestral state set to T.]

If ancestral states were known:

$$L(D_s; \mathcal{T}, \theta) = Q^{1}_{T,A}(t_1;\theta)\, Q^{8}_{T,T}(t_8;\theta)\, Q^{2}_{T,C}(t_2;\theta)\, Q^{7}_{T,T}(t_7;\theta)\, Q^{3}_{T,T}(t_3;\theta)\, Q^{6}_{T,T}(t_6;\theta)\, Q^{4}_{T,T}(t_4;\theta)\, Q^{5}_{T,T}(t_5;\theta)$$


COMPUTING LIKELIHOOD

Ancestral states are almost always unknown - must sum over all possible internal node character assignments.

Computations can be done efficiently, in O(C^2 N) time, as opposed to “brute force” time that grows exponentially with N, using Felsenstein’s pruning algorithm (1981), which takes advantage of conditional independence of evolution along tree branches.

[Figure: tree with leaves A, C, T, T, T, branches b1 to b8, and unknown ancestral states c6, c7, c8 and c9 (root).]

$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in C} \sum_{c_8 \in C} \sum_{c_7 \in C} \sum_{c_6 \in C} \pi(c_9)\, Q^{1}_{c_9,A}(t_1;\theta)\, Q^{8}_{c_9,c_8}(t_8;\theta)\, Q^{2}_{c_8,C}(t_2;\theta)\, Q^{7}_{c_8,c_7}(t_7;\theta)\, Q^{3}_{c_7,T}(t_3;\theta)\, Q^{6}_{c_7,c_6}(t_6;\theta)\, Q^{4}_{c_6,T}(t_4;\theta)\, Q^{5}_{c_6,T}(t_5;\theta)$$




MAXIMUM LIKELIHOOD FRAMEWORK FOR GENETIC SEQUENCE ANALYSIS

Figure 2. Example of a phylogenetic tree with unknown ancestral states. Ds refers to the s-th column in a multiple sequence alignment (ACTTT in this case).

Clearly, it is unreasonable to demand that ancestral sequences be known. Most often, all that can be observed are leaf sequences, which correspond to modern day organisms. Therefore, we need to be able to evaluate the likelihood of the data knowing only leaf characters. To do so, we compute the sum over all possible character assignments to internal nodes of the tree. Using Figure 2 as a reference, such an evaluation would proceed as follows:

$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in C} \sum_{c_8 \in C} \sum_{c_7 \in C} \sum_{c_6 \in C} \pi(c_9)\, Q^{1}_{c_9,A}(t_1;\theta)\, Q^{8}_{c_9,c_8}(t_8;\theta)\, Q^{2}_{c_8,C}(t_2;\theta) \times Q^{7}_{c_8,c_7}(t_7;\theta)\, Q^{3}_{c_7,T}(t_3;\theta)\, Q^{6}_{c_7,c_6}(t_6;\theta)\, Q^{4}_{c_6,T}(t_4;\theta)\, Q^{5}_{c_6,T}(t_5;\theta), \qquad (2)$$

where π(c) denotes the probability of observing character c ∈ C at the root of the tree. While this calculation is straightforward, it is clearly not computationally feasible, because for a tree on N sequences, there will be $|C|^{N-2}$ terms in the sum. However, recalling that transition probabilities along a branch are independent of other branches, it is possible to rearrange the sum in a computationally efficient manner.

2. Recursive nature of the likelihood function.

Upon closer examination, Eq. (2) can be rewritten in a more computationally efficient way by grouping the terms according to their hierarchical arrangement in the tree:

$$L(D_s; \mathcal{T}, \theta) = \sum_{c_9 \in C} \pi(c_9)\, Q^{1}_{c_9,A}(t_1;\theta) \sum_{c_8 \in C} Q^{8}_{c_9,c_8}(t_8;\theta)\, Q^{2}_{c_8,C}(t_2;\theta) \times \sum_{c_7 \in C} \left( Q^{7}_{c_8,c_7}(t_7;\theta)\, Q^{3}_{c_7,T}(t_3;\theta) \sum_{c_6 \in C} \left( Q^{6}_{c_7,c_6}(t_6;\theta)\, Q^{4}_{c_6,T}(t_4;\theta)\, Q^{5}_{c_6,T}(t_5;\theta) \right) \right)$$

The sum, as just written, can be evaluated with $O(|C|^2 N)$ operations, which is eminently feasible. This observation was first made by Felsenstein in [Felsenstein, 1981], and he referred to it as the pruning algorithm.


[Slide annotations on the nested sum: the sum over c6 only depends on c7, the sum over c7 only depends on c8, and the sum over c8 only depends on c9.]




FELSENSTEIN’S PRUNING ALGORITHM

Idea: for each node n in the tree, maintain a C (number of characters) - dimensional vector Ln, whose i-th element records the probability of the subtree rooted at n, given that the character at node n is i.

For leaves, Ln is easy to compute. Ln [i] = 1 if n is labeled with character i, and Ln [i] = 0, otherwise

For interior nodes, Ln [i] is computed by iterating over all children of n, and computing the cumulative probability of changing from i to any other state at child m (this uses Lm ), and then taking the product over all children

At the root node, r, compute the likelihood of the site, by summing over all characters Lr [i] x π (i), where π (i) is the (supplied) distribution of characters at the root.




Felsenstein’s Pruning Algorithm

Data: A tree T, with leaves labeled with integers from 0 to C − 1, root r, a transition probability function for each branch (identified uniquely by the node m where the branch terminates) Pr(x → y|m); π(i): probability of observing character i at the root.

Result: The likelihood score of the tree

n ← the first node in the post-order traversal of T starting at r;
C ← the dimension of the alphabet;
while true do
    if n is a leaf then                          (leaf initialization)
        Ln ← zero vector(C);
        Ln[label n] ← 1;
    else                                         (interior node calculation)
        Ln ← vector of ones(C);
        for p from 0 to C−1 do
            for each child m of n do
                s ← 0;
                for q from 0 to C−1 do
                    s ← s + Pr(p → q|m) × Lm[q];
                end
                Ln[p] ← Ln[p] × s;
            end
        end
    end
    if n = r then
        exit loop;
    end
    n ← the next node in the post-order traversal of T;
end
L ← 0;                                           (root clause)
for p from 0 to C−1 do
    L ← L + Lr[p] × π(p);
end
return L;
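The algorithm can be sketched in Python for the five-leaf example tree (leaves A, C, T, T, T; interior nodes c6 to c9). The sketch assumes JC69 transition probabilities and a branch length of 0.1 on every branch, neither of which is given in the slides, and it uses recursion in place of an explicit post-order loop. A brute-force sum over all 4^4 interior assignments is included as a cross-check.

```python
import math
from itertools import product

BASES = "ACGT"
T_LEN = 0.1  # assumed branch length, identical on all eight branches


def jc69(x: str, y: str, t: float) -> float:
    """JC69 transition probability Pr(x -> y | t)."""
    e = math.exp(-t)
    return 0.25 * (1 + 3 * e) if x == y else 0.25 * (1 - e)


# An interior node is a list of (child, branch length) pairs; a leaf is a base.
c6 = [("T", T_LEN), ("T", T_LEN)]
c7 = [("T", T_LEN), (c6, T_LEN)]
c8 = [("C", T_LEN), (c7, T_LEN)]
c9 = [("A", T_LEN), (c8, T_LEN)]  # root


def conditional(node) -> dict:
    """L_n[i]: probability of the leaves below `node` given state i at `node`."""
    if isinstance(node, str):  # leaf initialization
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:      # interior node calculation
        below = conditional(child)
        for p in BASES:
            result[p] *= sum(jc69(p, q, t) * below[q] for q in BASES)
    return result


def site_likelihood(root, pi: dict) -> float:
    """Root clause: weight the root's conditional likelihoods by pi."""
    cond = conditional(root)
    return sum(pi[b] * cond[b] for b in BASES)


def brute_force(pi: dict, t: float = T_LEN) -> float:
    """Sum over every assignment of states to c9, c8, c7 and c6."""
    total = 0.0
    for s9, s8, s7, s6 in product(BASES, repeat=4):
        total += (pi[s9] * jc69(s9, "A", t) * jc69(s9, s8, t) * jc69(s8, "C", t)
                  * jc69(s8, s7, t) * jc69(s7, "T", t) * jc69(s7, s6, t)
                  * jc69(s6, "T", t) * jc69(s6, "T", t))
    return total


uniform = {b: 0.25 for b in BASES}
print(site_likelihood(c9, uniform), brute_force(uniform))
```

The two numbers agree to floating-point precision, but the pruning version visits each branch once with C^2 work per branch, while the brute-force loop grows by a factor of C with every added interior node.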




ML TREE COMPARISON

Partial mitochondrial DNA (5 taxa). Exhaustive tree search. GTR.

[Figure: all 15 tree topologies on Human, Chimpanzee, Gorilla, Orangutan and Gibbon, each drawn with its maximized log-likelihood. The 15 log-likelihood values are -2697.85, -2663.56, -2697.56, -2703.24, -2701.86, -2703.46, -2699.97, -2666.24, -2700.57, -2702.2, -2700.64, -2700.75, -2689.87, -2659.56 and -2689.98.]




SELECTING THE BEST TREE.

The tree with the best log-likelihood score can be reported as the true tree.

What if there are other trees whose scores are not too much worse?

How does one assess significance?

How does one compare trees?

[Figure: three of the trees on Human, Chimpanzee, Gorilla, Orangutan and Gibbon. Best: -2659.56. 2nd best: -2663.56. Worst: -2703.46.]




COMPARING TREES

Most widely used tree distance measures are based on the concept of splits:

Each interior branch partitions all leaves into two disjoint sets.

The Robinson-Foulds (RF) distance between trees on N leaves is the number of different splits between the trees.

For binary unrooted trees, it can range from 0 to N-3.

[Figure: a tree on Human (H), Chimpanzee (C), Gorilla (Gor), Orangutan (O) and Gibbon (Gib) with the splits defined by its two interior branches.]

Split 1: O,Gib | H,C,Gor

Split 2: H,O,Gib | C,Gor




[Figure: three trees on Human (H), Chimpanzee (C), Gorilla (Gor), Orangutan (O) and Gibbon (Gib), each with the two splits defined by its interior branches.]

Tree     Split 1              Split 2
Tree 1   Gor,O,Gib | C,H      O,Gib | H,C,Gor
Tree 2   H,O,Gib | C,Gor      O,Gib | H,C,Gor
Tree 3   H,Gib | C,Gor,O      C,O | H,Gib,Gor

Robinson-Foulds Distance Matrix

      T2   T3
T1    1    2
T2    -    2
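The distance matrix above can be reproduced by extracting splits from each tree and counting the ones that are not shared. In this sketch trees are written as nested tuples of leaf names, which is my own encoding of the three topologies, and the distance follows the convention used here (splits of one tree missing from the other); many references instead report the symmetric difference, which is twice this value for binary trees.

```python
def leaf_set(tree) -> frozenset:
    """All leaf names below a node; a leaf is a string, an interior node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree) -> set:
    """Non-trivial splits (both sides have at least two leaves) of an unrooted tree."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))  # unordered pair of sides
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b) -> int:
    """Number of splits of tree_a that are absent from tree_b."""
    return len(splits(tree_a) - splits(tree_b))


tree_1 = (("C", "H"), "Gor", ("O", "Gib"))
tree_2 = (("C", "Gor"), "H", ("O", "Gib"))
tree_3 = (("H", "Gib"), "Gor", ("C", "O"))

print(rf_distance(tree_1, tree_2))  # 1
print(rf_distance(tree_1, tree_3))  # 2
print(rf_distance(tree_2, tree_3))  # 2
```

Storing each split as an unordered pair of leaf sets makes the comparison independent of where the tree happens to be rooted in the tuple encoding.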




RF CALCULATION ALGORITHMS

Pairwise Robinson-Foulds distance calculations can be performed exactly in O(nk^2) time on k n-leaf trees.

Probabilistic algorithms with a tunable error parameter (ε) can run very quickly in practice:

[Figure: 2048 trees x 2048 taxa.]




CONFIDENCE ASSESSMENT VIA BOOTSTRAP

Biological inference is performed on limited length, noisy data

Sampling variance: the error in our ability to infer a quantity (e.g. the tree topology) based on limited data

Extreme example: outlier effects.

Consider trying to approximate the mean net worth in the US from a sample of 10 tax returns.

Assume that one of 10 is Bill Gates; hence we may have a data matrix that looks like the one below:

The mean estimated from the table is ~$5B

We could resample the database of tax returns to get many tables and average over them

In biology, frequently all we have is one sample (an alignment); hence it’s impossible to obtain another set

Bill G $50B

1 $100K

2 $20K

3 $120K

4 $30K

5 $200K

6 $10K

7 $25K

8 $40K

9 $5K




BOOTSTRAP

We can approximate the true underlying distribution that generated our observed sample by resampling it with replacement:

Draw a value from a sample at random; replace it in the sample

Repeat N times (N - size of the sample).

Back to the Bill Gates example

1000 bootstrap replicates indicate that the standard deviation of our estimate is about the same as the mean: $5B!

In other words, we have very little confidence in the obtained mean (~34% of cases have mean < 200K for example).
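The Bill Gates example is easy to rerun. The sketch below resamples the ten net-worth values from the table with replacement; the seed and the helper name `bootstrap_means` are arbitrary choices, so individual numbers will vary slightly from run to run with other seeds.

```python
import random
import statistics

# Net worth in dollars for the ten tax returns in the table.
net_worth = [50e9, 100e3, 20e3, 120e3, 30e3, 200e3, 10e3, 25e3, 40e3, 5e3]


def bootstrap_means(sample, n_replicates: int, rng: random.Random) -> list:
    """Mean of each bootstrap replicate (len(sample) draws with replacement)."""
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_replicates)]


rng = random.Random(181)
means = bootstrap_means(net_worth, 1000, rng)

print("sample mean:", statistics.fmean(net_worth))
print("bootstrap SD of the mean:", statistics.stdev(means))
print("fraction of replicates with mean < 200K:",
      sum(m < 200e3 for m in means) / len(means))
```

A replicate mean falls below 200K essentially only when the outlier is never drawn, which happens with probability 0.9^10 ≈ 0.35, in line with the ~34% quoted above.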

[Figure: histogram of the bootstrap means; horizontal axis Value (5e+09 to 2e+10), vertical axis Weight (0 to 0.3).]




PHYLOGENETIC BOOTSTRAP

Infer a phylogeny using your favorite method

Generate bootstrap replicates: in this case our samples are alignment columns

Repeat the inference procedure

Tabulate the number of phylogenetic splits that are recovered in the replicates

Branches with high bootstrap support values are those that have strong signal in the alignment

Joe Felsenstein is credited with introducing the phylogenetic bootstrap
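The resampling step of the procedure above can be sketched as follows. The three-sequence alignment is a toy input invented for illustration (the taxon labels are attached to arbitrary sequences), and `bootstrap_alignment` is a hypothetical helper; tree inference on each replicate is left to whatever method was used for the original tree.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement; `alignment` maps name -> sequence."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same length")
    # Draw n_sites column indices; the same index is used for every sequence,
    # so each resampled column is an intact column of the original alignment.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


toy = {
    "Human": "ATGAAAGCGA",
    "Chimpanzee": "ATGAGAGTGA",
    "Gorilla": "AGTAGAGTGA",
}
replicate = bootstrap_alignment(toy, random.Random(7))
for name, seq in replicate.items():
    print(name, seq)
```

Repeating this many times, inferring a tree from each replicate, and counting how often each split of the original tree reappears gives the bootstrap support values.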




EXAMPLE MTDNA IN PRIMATES

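[Figure: NJ tree of Human, Chimpanzee, Gorilla, Orangutan and Gibbon; the two interior branches carry bootstrap support values 0.722 and 1; scale bar 0.1.]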

Used NJ to reconstruct a tree

Drew 1000 alignment samples using the bootstrap

Inferred 1000 trees: one from each resampled alignment

If a given split from the inferred tree was found in a replicate, then its count was incremented by 1.

The accepted number for good support is 0.7 or greater, but 1.0 is sometimes desired.

This alignment is too short (or not informative enough) to unequivocally support the Gorilla-Chimp-Human branching order.

This echoes the observation that the 2nd best ML tree changed that order.

722/1000 replicates had this split

1000/1000 replicates had this split




ESTABLISHING HIV TRANSMISSION CHAINS

One of the major applications of molecular phylogenetics in HIV research is to attempt to identify putative transmission events (i.e. A may have infected B).

This is obviously useful for epidemiological purposes: to estimate transmission networks, transmission cluster sizes etc

A pop-sci application: court cases to prove HIV transmission.

Generally the problem is quite difficult.




[Figure: large phylogenetic tree of HIV sequences whose tips are labeled with anonymized sample identifiers such as 050106508-1; several individuals contribute more than one sequence (suffixes -1 and -2).]




ESTABLISHING TRANSMISSION?

It is nearly impossible to use phylogenetic trees alone to definitively prove transmission.

What may a transmission event look like in a tree?

HIV has a high mutation rate, so viruses from two ‘unrelated’ individuals will be quite divergent (upwards of 20% in some genomic regions)

Direct transmissions tend to have lower pairwise distances than unrelated cases (this depends on many factors, most importantly, sampling time relative to transmission)




David Acer, a Florida dentist, became infected with HIV in late 1986 (Kaposi’s sarcoma) and developed AIDS in 1987. He continued to practice general dentistry for the following two years and performed invasive procedures on patients.

In 1990, a young woman in her 20s with AIDS but no identifiable risk factors, and a patient of Acer’s (patient A), prompted the CDC to investigate his practice.

Based on epidemiological evidence and sketchy molecular data, the CDC concluded that Acer might have infected patient A.

The dentist then requested that his former patients be tested for HIV.

Out of 1100 tested persons, two patients (B and C) were found to be HIV positive.

An additional HIV+ person (D - a former patient of Acer) was identified by using the AIDS case registry and two more patients (E,F) contacted the CDC to report that they were HIV infected.




CAVEAT

HIV evolves a lot in a single patient, so it’s a moving target, and diverse at any given time point.

Consider the pattern of HIV-1 envelope evolution over time proposed by Shankarappa et al.




MOLECULAR DATA

Multiple clones from a short region of the envelope gene (C2-V3 region) were sequenced from each patient (and the dentist)

30 local controls (i.e. epidemiologically unrelated cases) were also included




PHYLOGENETIC ANALYSIS: PARSIMONY

Looks fairly convincing?

This is inferred from 279 sites (146 informative). How reliable is the tree?

But the malpractice insurer (CIGNA) hired their own experts to argue that the analysis is flawed because the local controls are not representative.

Other types of evidence (e.g. molecular signature, known sexual contacts etc) were used to bolster the case.




The prosecution argued that on August 4, 1994 a Lafayette, LA gastroenterologist made a mixture of blood or blood-products from two patients under the doctor’s care, one infected with HIV-1 and the other with hepatitis C, and infected his former girlfriend by intramuscular injection.

Molecular/epidemiological data were used to support the case

Other risk factors for the victim were examined (sexual transmission, IDU, occupational hazards – the victim was a nurse) and determined unlikely.

In January 1995 the victim tested positive for HIV-1 and accused the physician of infecting her.

Law enforcement identified the potential source – a homosexual patient infected with HIV-1 in 1990.

The hypothesis of transmission was tested.




DUE DILIGENCE

This study used three different approaches: parsimony, maximum likelihood and Bayesian to build evolutionary trees from two different gene regions (RT and env)

28 local (Louisiana) HIV+ controls were used together with background sequences from GenBank.

Confidence values (bootstrap or posterior probability) were obtained.




The analysis cannot rule out with 100% confidence that intermediates were involved

It also does not establish directionality (although in this case there are other factors that do)

The fact that victim sequences cluster WITHIN the clade of source sequences makes a compelling case for transmission.


