Post on 05-Jan-2016

MAT 4830 Mathematical Modeling

4.5

Phylogenetic Distances I

http://myhome.spu.edu/lauw

Preview

Phylogenetic: of or relating to the evolutionary development of organisms

Estimate the total number of mutations (observed and hidden).

Example from 4.1

S0 : Ancestral sequence

S1 : Descendant of S0

S2 : Descendant of S1

S0 : ATGTCGCCTGATAATGCC

S1 : ATGCCGCTTGACAATGCC

S2 : ATGCCGCGTGATAATGCC

Observed mutations: 2

Actual mutations: 5 (some are hidden mutations)
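The two counts can be checked directly from the three sequences. The following is a minimal sketch, assuming that every site-wise difference between consecutive sequences (S0 to S1, S1 to S2) corresponds to exactly one mutation; the helper name `count_differences` is made up for this illustration.

```python
def count_differences(a, b):
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


S0 = "ATGTCGCCTGATAATGCC"
S1 = "ATGCCGCTTGACAATGCC"
S2 = "ATGCCGCGTGATAATGCC"

# Observed: compare only the ancestral and the final sequence.
observed = count_differences(S0, S2)

# Actual (under the assumption above): add up the changes along the lineage.
actual = count_differences(S0, S1) + count_differences(S1, S2)

print(observed, actual)  # 2 5
```

The gap between the two numbers comes from sites that changed more than once, or changed and then changed back, between S0 and S2.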

Distance of Two Sequences

We want to define the “distance” between two sequences.

It measures the average number of mutations per site that occurred, including the hidden ones.

S0 : ATGTCGCCTGATAATGCC

S : ATGCCGCGTGATAATGCC

Distance of Two Sequences

Let d(S0,S) be the distance between sequences S0 and S. What properties “should” it have?

1.

2.

3.

S0 : ATGTCGCCTGATAATGCC

S : ATGCCGCGTGATAATGCC

Jukes-Cantor Model

Assume α is small. Mutations per time step are “rare”.

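\[
M = \begin{pmatrix}
1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\
\alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\
\alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\
\alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha
\end{pmatrix},
\qquad
p_0 = \left( \tfrac14, \tfrac14, \tfrac14, \tfrac14 \right)^T
\]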
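As a small illustration of the model, the sketch below builds the Jukes-Cantor matrix M for a given α and applies one time step to the uniform base distribution p0. The value of α and the helper names `jc_matrix` and `mat_vec` are assumptions chosen for this example only.

```python
def jc_matrix(alpha):
    """4x4 Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]


def mat_vec(m, v):
    """Matrix-vector product for a 4x4 matrix and a length-4 vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


alpha = 0.01          # hypothetical small mutation rate
M = jc_matrix(alpha)
p0 = [0.25, 0.25, 0.25, 0.25]

# One time step: the uniform distribution is left unchanged by M.
p1 = mat_vec(M, p0)
column_sums = [sum(M[i][j] for i in range(4)) for j in range(4)]

print(p1)
print(column_sums)
```

Each column of M sums to 1, and p1 equals p0 up to rounding, so the uniform distribution is an equilibrium of the model.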

Jukes-Cantor Model

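q(t) = conditional probability that the base at time t is the same as the base at time 0

q(t) = fraction of sites with no observed mutations

q(t) is the common diagonal entry of M^t:

\[
M^t = \begin{pmatrix}
\frac14 + \frac34\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t \\
\frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 + \frac34\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t \\
\frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 + \frac34\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t \\
\frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 - \frac14\left(1-\frac43\alpha\right)^t & \frac14 + \frac34\left(1-\frac43\alpha\right)^t
\end{pmatrix}
\]

\[
q(t) = \frac14 + \frac34\left(1-\frac43\alpha\right)^t
\]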
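One way to see where the entries of M^t come from (a sketch that is not taken from the slides): let I be the 4×4 identity and let P be the 4×4 matrix whose entries are all 1/4. The symbols P and λ are notation introduced only for this note.

\[
M = \lambda I + (1-\lambda) P = \lambda (I - P) + P,
\qquad \lambda = 1 - \tfrac43\alpha .
\]

Since \(P^2 = P\), the matrices \(P\) and \(I - P\) satisfy \((I-P)^2 = I - P\) and \((I-P)P = 0\), so all cross terms vanish when the power is expanded:

\[
M^t = \lambda^t (I - P) + P .
\]

Reading off the entries gives \(\tfrac34\lambda^t + \tfrac14\) on the diagonal and \(-\tfrac14\lambda^t + \tfrac14\) off the diagonal, that is,

\[
q(t) = \frac14 + \frac34\left(1-\frac43\alpha\right)^t .
\]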

Jukes-Cantor Model

p(t) = 1 - q(t) = fraction of sites with observed mutations

\[
p(t) = 1 - q(t) = \frac34 - \frac34\left(1-\frac43\alpha\right)^t
\]
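The closed form for p(t) can be checked numerically against repeated multiplication by M. This is only a sketch: the values of α and t are arbitrary choices, and the helper names (`jc_matrix`, `mat_mul`, `mat_pow`, `p_closed_form`) are made up for the example.

```python
def jc_matrix(alpha):
    """4x4 Jukes-Cantor matrix: 1 - alpha on the diagonal, alpha/3 elsewhere."""
    return [[1 - alpha if i == j else alpha / 3 for j in range(4)]
            for i in range(4)]


def mat_mul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_pow(m, t):
    """m raised to the non-negative integer power t by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result


def p_closed_form(alpha, t):
    """p(t) = 3/4 - 3/4 (1 - 4*alpha/3)^t."""
    return 0.75 - 0.75 * (1 - 4 * alpha / 3) ** t


alpha, t = 0.02, 50                 # hypothetical values
Mt = mat_pow(jc_matrix(alpha), t)
q_numeric = Mt[0][0]                # diagonal entry of M^t
p_numeric = 1 - q_numeric

print(p_numeric, p_closed_form(alpha, t))
```

For large t the closed form approaches 3/4: after many time steps, two sequences under this model agree at only about a quarter of their sites, as two unrelated random sequences would.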

Jukes-Cantor Model

p can be estimated from the two sequences

Example from 4.1

S0 : ATGTCGCCTGATAATGCC

S1 : ATGCCGCTTGACAATGCC

S2 : ATGCCGCGTGATAATGCC

Observed mutations: 2

Fraction of sites with observed mutations:

\[
p = \frac{2}{18} \approx 0.11
\]

Jukes-Cantor Distance

Given p (and t), the J-C distance between two sequences S0 and S1 is defined as

\[
d_{JC}(S_0, S_1) = -\frac34 \ln\left(1 - \frac43 p\right)
\]

S0 : ATGTCGCCTGATAATGCC

S1 : ATGCCGCGTGATAATGCC
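The definition translates into a few lines of code. The sketch below estimates p as the fraction of differing sites and then applies the formula; the function names `observed_fraction` and `jc_distance` are hypothetical, and the guard for p ≥ 3/4 is an added assumption, since the logarithm is not defined there.

```python
import math


def observed_fraction(s0, s1):
    """Fraction of sites at which two equal-length sequences differ."""
    if len(s0) != len(s1):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(s0, s1) if a != b) / len(s0)


def jc_distance(s0, s1):
    """Jukes-Cantor distance -3/4 * ln(1 - 4p/3), with p estimated from the data."""
    p = observed_fraction(s0, s1)
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)


S0 = "ATGTCGCCTGATAATGCC"
S1 = "ATGCCGCGTGATAATGCC"

p = observed_fraction(S0, S1)
d = jc_distance(S0, S1)
print(round(p, 4), round(d, 4))
```

For these two sequences p = 2/18, and the resulting distance is slightly larger than p, reflecting the allowance for hidden mutations.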

Jukes-Cantor Distance

α : rate of base substitution (substitutions per site per time step)

t : number of time steps

αt : total number of substitutions in t time steps (substitutions per site)

\[
p = \frac34 - \frac34\left(1-\frac43\alpha\right)^t
\]

\[
t = \frac{\ln\left(1-\frac43 p\right)}{\ln\left(1-\frac43\alpha\right)}
\approx \frac{\ln\left(1-\frac43 p\right)}{-\frac43\alpha}
\quad \text{when } \alpha \text{ is small}
\]

\[
\alpha t \approx -\frac34 \ln\left(1-\frac43 p\right)
\]
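To get a feel for how good the small-α approximation is, the sketch below compares αt with the exact value of -3/4 ln(1 - 4p/3) when p is given by the model formula. The values of α and t are arbitrary, and `exact_jc` is a name made up for this example.

```python
import math


def exact_jc(alpha, t):
    """-3/4 * ln(1 - 4p/3) with p = 3/4 - 3/4 (1 - 4*alpha/3)^t."""
    p = 0.75 - 0.75 * (1 - 4 * alpha / 3) ** t
    return -0.75 * math.log(1 - 4 * p / 3)


t = 100                                   # hypothetical number of time steps
for alpha in (0.0001, 0.001, 0.01, 0.1):  # hypothetical mutation rates
    # Columns: alpha, alpha * t, and the Jukes-Cantor value computed from p.
    print(alpha, alpha * t, exact_jc(alpha, t))
```

The two columns agree closely for small α and drift apart as α grows, with the logarithmic expression always at least as large as αt.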

Example from 4.3

Suppose the 40-base ancestral and descendant DNA sequences are

S0 : ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT

S1 : ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC

Frequency table of base pairs (rows: base in S1; columns: base in S0):

S1 \ S0    A    G    C    T
A          7    0    1    1
G          1    9    2    0
C          0    2    7    2
T          1    0    1    6

\[
p = \frac{11}{40} = 0.275
\]

\[
d_{JC}(S_0, S_1) = -\frac34 \ln\left(1 - \frac43 \cdot \frac{11}{40}\right) \approx 0.3426
\]

0.275 observed substitutions per site.

0.3426 estimated substitutions per site.

11 observed substitutions.

13.7 estimated substitutions.
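The numbers in this example can be reproduced from the two sequences. The following sketch tallies the frequency table (rows indexed by the base in S1, columns by the base in S0, in the order A, G, C, T) and then computes p and the Jukes-Cantor distance; the function name `frequency_table` is hypothetical.

```python
import math

BASES = "AGCT"  # row and column order used in the table


def frequency_table(s0, s1):
    """table[i][j] = number of sites with BASES[i] in s1 and BASES[j] in s0."""
    table = [[0] * 4 for _ in range(4)]
    for b0, b1 in zip(s0, s1):
        table[BASES.index(b1)][BASES.index(b0)] += 1
    return table


S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"

table = frequency_table(S0, S1)
n = len(S0)

# Diagonal entries count unchanged sites; everything else is an observed change.
unchanged = sum(table[i][i] for i in range(4))
p = (n - unchanged) / n
d = -0.75 * math.log(1 - 4 * p / 3)

for base, row in zip(BASES, table):
    print(base, row)
print(p, round(d, 4), round(d * n, 1))  # 0.275 0.3426 13.7
```

Multiplying the per-site distance by the sequence length converts it into an estimated number of substitutions, which is how 13.7 is obtained from 0.3426.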

Performance of JC distance (Homework Problem 4)

Write a program to simulate the mutations of a sequence for t time steps using the Jukes-Cantor model with parameter α.

Count the number of base substitutions that occurred.

Compute the Jukes-Cantor distance between the initial and final sequences.

Compare the actual number of base substitutions with the estimate from the Jukes-Cantor distance.
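One possible shape for such a program is sketched below in Python rather than Maple. It assumes that sites evolve independently and that, in each time step, a site is replaced with probability α by one of the three other bases chosen uniformly. The sequence length, α, t, the seed, and all function names are arbitrary choices for illustration, not part of the assignment.

```python
import math
import random

BASES = "AGCT"


def simulate(seq, alpha, t, rng):
    """Evolve seq for t time steps; return (final sequence, substitution count)."""
    current = list(seq)
    substitutions = 0
    for _ in range(t):
        for i, base in enumerate(current):
            if rng.random() < alpha:
                # Substitute one of the three other bases, each equally likely.
                current[i] = rng.choice([b for b in BASES if b != base])
                substitutions += 1
    return "".join(current), substitutions


def jc_distance(s0, s1):
    """Jukes-Cantor distance; only defined when fewer than 3/4 of sites differ."""
    p = sum(1 for a, b in zip(s0, s1) if a != b) / len(s0)
    return -0.75 * math.log(1 - 4 * p / 3)


rng = random.Random(4830)            # fixed seed so the run is repeatable
n, alpha, t = 10_000, 0.002, 100     # hypothetical parameters
start = "".join(rng.choice(BASES) for _ in range(n))

final, actual = simulate(start, alpha, t, rng)
estimated = jc_distance(start, final) * n

print("actual substitutions:   ", actual)
print("estimated substitutions:", round(estimated, 1))
```

With these parameters αt = 0.2, so roughly 2000 substitutions are expected over the 10,000 sites. The two printed numbers should be of that size and close to each other, though they vary with the seed.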

Maple: Strings Handling II

Concatenating two strings

However, no “re-assignment”.

Classwork

Work on HW #1, 2
