Additive Distances
Between DNA Sequences
MPI, June 2012
Additive evolutionary distance: the number of substitutions that occurred during the evolution of the sequences.
[Figure: example sites 1, 2 and 3 with their substitutions.]
Some substitutions are hidden due to overwriting. Therefore, the exact number of substitutions is usually larger than the number of observed changes.
Edge weight = expected number of substitutions per site
u: A A C A … G T C T T C G A G G C C C
v: A G C A … G C C T A T G C G A C C T
   0 1 0 0 … 0 2 0 0 1 1 0 1 2 1 0 0 1
Number of substitutions per site: 0.321
When the exact number of substitutions between any two
sequences is known, NJ (and any other algorithm which
reconstructs trees from the exact distances) returns the
correct evolutionary tree.
Interleaf distances: sum of edge weights
[Figure: tree with leaves u and v; legible edge weights include 0.5, 0.42 and 0.3.]
d(u,v) = 1.12
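The interleaf distance is a path sum over a weighted tree. A minimal sketch in Python, assuming a hypothetical tree whose node names and edge weights are made up for illustration (they are not the values of the figure above):

```python
def path_distance(tree, start, goal):
    """Sum of edge weights on the unique path between two nodes of a tree.

    `tree` maps each node to a dict {neighbour: edge weight}.
    """
    stack = [(start, None, 0.0)]  # (node, parent, distance from start)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, weight in tree[node].items():
            if neighbour != parent:  # never walk back along the same edge
                stack.append((neighbour, node, dist + weight))
    raise ValueError("goal is not reachable from start")


# Hypothetical tree: leaves u, v, w, z and internal nodes x, y.
EDGES = [("u", "x", 0.25), ("w", "x", 0.5), ("x", "y", 0.125),
         ("y", "v", 0.5), ("y", "z", 0.25)]
TREE = {}
for a, b, weight in EDGES:
    TREE.setdefault(a, {})[b] = weight
    TREE.setdefault(b, {})[a] = weight

print(path_distance(TREE, "u", "v"))  # 0.25 + 0.125 + 0.5
```

Distances computed this way are additive, which is the property NJ relies on when it is given exact distances.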
Estimating the number of substitutions from the observed substitutions requires a substitution model:
JC [Jukes Cantor 1969]
Kimura 2 Parameter (K2P) [Kimura 1980]
HKY [Hasegawa, Kishino and Yano 1985]
TN [Tamura and Nei 1993]
GTR: Generalised time-reversible [Tavaré 1986]
…and more…
Distance estimation in the Jukes Cantor model
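Jukes Cantor model: all substitutions are equally likely
[Figure: edge between u and v with weight t_uv.]
JC generic rate matrix (t is the expected number of substitutions per site):
$$
R_{uv} =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & -t & t/3 & t/3 & t/3 \\
G & t/3 & -t & t/3 & t/3 \\
C & t/3 & t/3 & -t & t/3 \\
T & t/3 & t/3 & t/3 & -t
\end{array}
$$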
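Substitution Matrix P
Rate matrix R (t is the expected number of substitutions per site):
$$
R =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & -t & t/3 & t/3 & t/3 \\
G & t/3 & -t & t/3 & t/3 \\
C & t/3 & t/3 & -t & t/3 \\
T & t/3 & t/3 & t/3 & -t
\end{array}
$$
Substitution matrix P, obtained from R (theory of Markov processes):
$$
P =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & 1-3p(t) & p(t) & p(t) & p(t) \\
G & p(t) & 1-3p(t) & p(t) & p(t) \\
C & p(t) & p(t) & 1-3p(t) & p(t) \\
T & p(t) & p(t) & p(t) & 1-3p(t)
\end{array}
$$
with substitution probability
$$
p(t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}t}\right)
$$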
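For a continuous-time Markov process, the substitution matrix is the matrix exponential of the rate matrix, P = e^R. The sketch below checks numerically that the closed form for p(t) agrees with this. It is an illustration only: the helper names (`mat_exp`, `jc_rate_matrix`, `jc_p`) and the value t = 0.3 are assumptions, and the exponential is computed by a plain truncated Taylor series.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=40):
    """Matrix exponential via a truncated Taylor series (fine for small entries)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # current term m^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[r + x for r, x in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def jc_rate_matrix(t):
    """JC rate matrix: -t on the diagonal, t/3 elsewhere."""
    return [[-t if i == j else t / 3 for j in range(4)] for i in range(4)]


def jc_p(t):
    """Closed-form JC substitution probability p(t)."""
    return 0.25 * (1.0 - math.exp(-4.0 * t / 3.0))


t = 0.3  # arbitrary illustrative value
P = mat_exp(jc_rate_matrix(t))
print(P[0][1], jc_p(t))          # off-diagonal entry vs. p(t)
print(P[0][0], 1 - 3 * jc_p(t))  # diagonal entry vs. 1 - 3p(t)
```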
JC distance estimation: first estimate the substitution matrix
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
An estimate of P_uv from the observed substitutions:
$$
\hat{P}_{uv} =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
G & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
C & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) \\
T & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t)
\end{array}
$$
$$
\hat{p}(t) = \frac{1}{3} \cdot \frac{\text{number of observed substitutions}}{\text{total number of sites}}
$$
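Estimate t from the estimate of p(t) by “reverse engineering”
Solve the formula for p(t):
$$
\hat{t} = -\frac{3}{4}\ln\left(1 - 4\hat{p}(t)\right)
$$
This takes the estimated substitution matrix to an estimated rate matrix:
$$
\hat{P}_{uv} =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
G & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) & \hat{p}(t) \\
C & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t) & \hat{p}(t) \\
T & \hat{p}(t) & \hat{p}(t) & \hat{p}(t) & 1-3\hat{p}(t)
\end{array}
\quad\longrightarrow\quad
\hat{R}_{uv} =
\begin{array}{c|cccc}
  & A & G & C & T \\ \hline
A & -\hat{t} & \hat{t}/3 & \hat{t}/3 & \hat{t}/3 \\
G & \hat{t}/3 & -\hat{t} & \hat{t}/3 & \hat{t}/3 \\
C & \hat{t}/3 & \hat{t}/3 & -\hat{t} & \hat{t}/3 \\
T & \hat{t}/3 & \hat{t}/3 & \hat{t}/3 & -\hat{t}
\end{array}
$$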
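The two estimation steps (count observed substitutions, then invert the formula for p(t)) can be sketched in a few lines of Python. The function name `jc_distance` and the two example sequences are hypothetical; returning infinity when the logarithm is undefined is a choice made here, not something the slides specify.

```python
import math


def jc_distance(seq_u, seq_v):
    """Jukes-Cantor estimate of the number of substitutions per site."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    observed = sum(a != b for a, b in zip(seq_u, seq_v))
    p_hat = observed / len(seq_u) / 3.0  # estimate of p(t)
    if 4.0 * p_hat >= 1.0:
        return math.inf  # saturated: the logarithm is undefined
    return -0.75 * math.log(1.0 - 4.0 * p_hat)


# Hypothetical aligned sequences: 20 sites, 3 observed differences.
u = "ACGTACGTACGTACGTACGT"
v = "ACGTACGAACGTCCGTACGG"
print(jc_distance(u, v))
```

The estimate is larger than the observed fraction of differing sites (3/20 here), which reflects the correction for hidden substitutions.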
Checking the effect of estimation errors in reconstructing quartets
Quartets Reconstruction = Finding the correct split
Quartets are trees with four leaves. They have three possible (fully resolved) topologies, called splits:
AB|CD    AC|BD    AD|BC
Distance methods resolve splits by the 4-point method.
The 4 points method
[Figure: quartet with split AB|CD; the separating (internal) edge has weight w_sep.]
The 4-point condition:
$$
d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)
$$
The 4-point condition for estimated distances:
$$
d_{K2P}(A,B) + d_{K2P}(C,D) < \min\left\{ d_{K2P}(A,C) + d_{K2P}(B,D),\; d_{K2P}(A,D) + d_{K2P}(B,C) \right\}
$$
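In practice the 4-point method compares the three pair sums and picks the split with the smallest one. A minimal sketch, assuming distances are stored in a dictionary keyed by unordered leaf pairs; the function name `four_point_split` and the example edge weights are hypothetical.

```python
def four_point_split(dist):
    """Return the split of leaves A, B, C, D with the smallest pair sum.

    `dist` maps frozenset({x, y}) to the (estimated) distance between x and y.
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    return min(sums, key=sums.get), sums


# Hypothetical additive distances: pendant edges for A, B, C, D and a
# separating edge of weight w_sep for the true split AB|CD.
pendant = {"A": 0.5, "B": 0.25, "C": 0.5, "D": 0.25}
w_sep = 0.125
dist = {}
for x in "ABCD":
    for y in "ABCD":
        if x < y:
            same_side = {x, y} in ({"A", "B"}, {"C", "D"})
            dist[frozenset((x, y))] = pendant[x] + pendant[y] + (0 if same_side else w_sep)

split, sums = four_point_split(dist)
print(split, sums)
```

With exact additive distances the two larger sums are equal and exceed the smallest by 2·w_sep; with estimated distances only the inequality is expected to hold.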
Evaluate the accuracy of reconstructing quartets using evolutionary distances
[Figure: rooted model quartet with leaves A, B, C and D. Edge lengths: t, t, 10t, 10t, 10t, 10t. Every edge carries the same rate matrix:]
$$
\begin{array}{c|cccc}
  & A & G & T & C \\ \hline
A & - & \alpha & \beta & \beta \\
G & \alpha & - & \beta & \beta \\
T & \beta & \beta & - & \alpha \\
C & \beta & \beta & \alpha & -
\end{array}
$$
t is “evolutionary time”
The diameter of the quartet is 22t
Phase A: simulate evolution
[Figure: sequences of the form CCCGGAGCTTCTG…ACAA at the nodes of the quartet with leaves A, B, C and D.]
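A sketch of Phase A in Python. Several things here are assumptions rather than the slides' method: the JC model is swapped in for the K2P matrices of the model quartet (to keep the example short), the root is placed in the middle of the internal edge, and the split is taken to be AB|CD. The names `evolve` and `simulate_quartet`, the sequence length and the seed are hypothetical.

```python
import math
import random

BASES = "ACGT"


def evolve(seq, t, rng):
    """Evolve `seq` along an edge with t expected substitutions per site (JC)."""
    p = 0.25 * (1.0 - math.exp(-4.0 * t / 3.0))  # probability of each specific change
    out = []
    for base in seq:
        if rng.random() < 3.0 * p:  # the site ends up as a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_quartet(t, n_sites, rng):
    """Leaf sequences of a quartet AB|CD (assumed layout): two internal
    half-edges of length t from the root, four pendant edges of length 10t."""
    root = "".join(rng.choice(BASES) for _ in range(n_sites))
    left = evolve(root, t, rng)
    right = evolve(root, t, rng)
    return {"A": evolve(left, 10 * t, rng), "B": evolve(left, 10 * t, rng),
            "C": evolve(right, 10 * t, rng), "D": evolve(right, 10 * t, rng)}


leaves = simulate_quartet(t=0.01, n_sites=500, rng=random.Random(2012))
for name, seq in leaves.items():
    print(name, seq[:40])
```

With this layout the path between the furthest leaves has length 10t + t + t + 10t = 22t, matching the stated diameter.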
Phase B: reconstruct the split by the 4p condition
[Figure: the simulated sequences at the leaves A, B, C and D.]
Compute distances between the sequences: $\hat{D}(i,j) = \hat{d}(s_i, s_j)$
Apply the 4p condition. Is the reconstruction correct?
Repeat this process 10,000 times and count the number of failures.
This test was applied to the model quartet with various diameters.
For each diameter, mark the fraction (percentage) of the simulations in which the reconstruction failed (next slide).
[Figure: repeated copies of the rooted model quartet with leaves A, B, C, D and edge lengths t and 10t.]
Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2
[Plot: performance of the K2P standard distance method in resolving quartets, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2. y-axis: fraction of failures out of 10000 experiments. Inset: template quartet with edge lengths t and 10t.]
Performance for larger diameters
[Plot: performance of the K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2. y-axis: fraction of failures out of 10000 simulations. Annotation: “site saturation”.]
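One way to see site saturation is to look at how the expected fraction of differing sites behaves as t grows. The sketch below uses the JC formula from the earlier slides (an assumption made for simplicity; the plot above is for K2P), where the expected fraction is 3p(t). The function name and the chosen values of t are hypothetical.

```python
import math


def expected_difference(t):
    """Expected fraction of differing sites under JC after t substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


for t in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(f"t = {t:3.1f}  expected fraction of differing sites = {expected_difference(t):.3f}")
```

The fraction flattens towards 0.75, so for large distances a small sampling error in the observed fraction translates into a large error in the estimated t.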
Repeat this experiment on the Hasegawa tree
• Assume the JC model.
• Reconstruct by the NJ algorithm (use any variant of NJ available in MATLAB)
[Figure: Hasegawa Tree]