
Dynamic Programming Algorithms and Sequence Alignment

Alignment of v = ATGTTAT and w = ATCGTAC:

v: A T - G T T A T -

w: A T C G T - A - C

5 matches, 2 insertions, 2 deletions

Outline

1. Change Problem

2. Manhattan Tourist Problem

3. Longest Paths in Graphs

4. Sequence Alignment

5. Edit Distance

The Change Problem

• Say we want to provide change totaling 97 cents.

• We could do this in a large number of ways, but the way that uses the fewest coins would be:

– Three quarters = 75 cents

– Two dimes = 20 cents

– Two pennies = 2 cents

• Question 1: How do we know that this uses the fewest coins?

• Question 2: Can we generalize to arbitrary denominations?

The Change Problem: Formal Statement

• Goal: Convert some amount of money M into given denominations, using the fewest possible number of coins.

• Input: An amount of money M, and an array of d denominations $c = (c_1, c_2, \dots, c_d)$, in decreasing order of value ($c_1 > c_2 > \dots > c_d$).

• Output: A list of d integers $i_1, i_2, \dots, i_d$ such that

$c_1 i_1 + c_2 i_2 + \dots + c_d i_d = M$

and $i_1 + i_2 + \dots + i_d$ is minimal.

The Change Problem: Another Example

• Given the denominations 1, 3, and 5, what is the minimum number of coins needed to make change for a given value?

• Only one coin is needed to make change for the values 1, 3, and 5.

• Two coins are needed to make change for the values 2, 4, 6, 8, and 10.

• Lastly, three coins are needed to make change for 7 and 9.

Value:           1  2  3  4  5  6  7  8  9  10
Min # of coins:  1  2  1  2  1  2  3  2  3  2

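The Change Problem: Recurrence

• This example expresses the following recurrence relation, where minNumCoins(M) denotes the minimum number of coins needed to make change for M and minNumCoins(0) = 0:

$\text{minNumCoins}(M) = \min \{ \text{minNumCoins}(M-1) + 1,\ \text{minNumCoins}(M-3) + 1,\ \text{minNumCoins}(M-5) + 1 \}$

• In general, given the denominations c: c1, c2, …, cd, the recurrence relation is:

$\text{minNumCoins}(M) = \min_{1 \le k \le d,\ c_k \le M} \{ \text{minNumCoins}(M - c_k) + 1 \}$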

The Change Problem: Pseudocode
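The pseudocode itself did not survive on this slide. As a stand-in, here is a minimal Python sketch of a brute-force recursive solution that follows the recurrence directly. The function name `recursive_change` and its signature are assumptions made for illustration, not the slide's own code.

```python
def recursive_change(money, coins):
    """Minimum number of coins needed to make `money` (brute-force recursion)."""
    if money == 0:
        return 0
    best = float("inf")  # stays infinite if no coin fits
    for coin in coins:
        if coin <= money:
            # Use this coin once, then solve the smaller problem from scratch.
            candidate = recursive_change(money - coin, coins) + 1
            best = min(best, candidate)
    return best


# Denominations 1, 3, 5 as in the earlier example table.
print(recursive_change(9, (5, 3, 1)))
```

Every call re-solves its subproblems from scratch, so the running time grows exponentially with the amount; keep `money` small when trying it out.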

The RecursiveChange Tree: Example

M = 77, c = (1, 3, 7)

[Figure: recursion tree of RecursiveChange. The root 77 calls 76, 74, and 70. These in turn call 75, 73, 69, then 73, 71, 67, then 69, 67, 63. The next level contains 27 calls, and the value 70 already appears in several different branches.]

RecursiveChange: Inefficiencies

• RecursiveChange recalculates the optimal coin combination for a given amount of money repeatedly.

• For M = 77, c = (1,3,7):

– The optimal coin combination for 70 cents is computed 9 times!

– The optimal coin combination for 50 cents is computed billions of times!

RecursiveChange: Improvement

• Save the result of each computation for all amounts from 0 to M.

– Look up an already computed value instead of recomputing it.

• Running time: O(M·d), where M is the amount of money and d is the number of denominations.

• This technique is called dynamic programming.

The Change Problem: Dynamic Programming
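The pseudocode for this slide is missing from the extracted text. The sketch below shows one way the bottom-up idea can be written in Python. The name `dp_change` and the choice to return the whole table are assumptions made for illustration.

```python
def dp_change(money, coins):
    """Return a list `best` where best[a] is the fewest coins that make amount a."""
    best = [0] + [float("inf")] * money
    for amount in range(1, money + 1):
        for coin in coins:
            # Reuse the stored answer for the smaller amount.
            if coin <= amount and best[amount - coin] + 1 < best[amount]:
                best[amount] = best[amount - coin] + 1
    return best


table = dp_change(9, (1, 3, 7))
print(table)      # [0, 1, 2, 1, 2, 3, 2, 1, 2, 3]
print(table[9])   # 3
```

The two nested loops make the O(M·d) running time visible: each of the M amounts is examined once per denomination.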

DPChange: Example

• For example, let us take c = (1,3,7), M = 9. The table is filled in from left to right, one amount at a time:

Amount:          0  1  2  3  4  5  6  7  8  9
Min # of coins:  0  1  2  1  2  3  2  1  2  3

• DPChange builds up from easier problem instances to the desired one, avoiding repetition.

Manhattan Tourist Problem

Additional Example: Manhattan Tourist Problem

• Imagine that you are a tourist in Manhattan, whose streets are represented by a grid.

• You are leaving town, and you want to see as many attractions (represented by *) as possible.

• Your time is limited: you only have time to travel east and south.

• What is the best path through town?

[Figure: street grid showing the Station, the Hotel, and attractions marked *.]

Manhattan Tourist Problem (MTP): Formulation

• Goal: Find the longest path in a weighted grid.

• Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink.”

• Output: A longest path in G from “source” to “sink.”

MTP Greedy Algorithm

• Our first try at solving the MTP will use a greedy algorithm.

• Main Idea: At each node (intersection), choose the edge (street) departing that node which has the greatest weight.

MTP Greedy Algorithm: Example

[Figure: weighted grid with vertices (i, j) for i, j = 0, …, 4, axes labeled "i coordinate" and "j coordinate", and the source and sink marked. Taking the heaviest outgoing edge at every vertex, the greedy path accumulates the scores 0, 3, 5, 9, 13, 15, 19, 20 and reaches the sink with a total of 23.]


MTP DP Algorithm: Example

[Figure: the same weighted grid, with the DP score of every vertex filled in step by step, starting from 0 at the source. The score at the sink is 34.]

MTP: DP Implementation

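• Two sets of edge weights are used: one for the N-to-S (vertical) edges and one for the W-to-E (horizontal) edges.

MTP: Running Time with Dynamic Programming

• The score $s_{i,j}$ for a point (i, j) is given by the recurrence:

$s_{i,j} = \max \{ s_{i-1,j} + \text{weight of the edge from } (i-1,j) \text{ to } (i,j),\ \ s_{i,j-1} + \text{weight of the edge from } (i,j-1) \text{ to } (i,j) \}$

• The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns).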
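A minimal Python sketch of this recurrence follows. It assumes the grid is supplied as two hypothetical weight matrices, `down` for N-to-S edges and `right` for W-to-E edges; these names are not from the slides. The small grid in the usage lines is made up for illustration and is not the grid from the example slides.

```python
def manhattan_tourist(down, right):
    """Length of a longest source-to-sink path in a weighted grid.

    down[i][j]  : weight of the edge (i, j) -> (i + 1, j)   (N to S)
    right[i][j] : weight of the edge (i, j) -> (i, j + 1)   (W to E)
    """
    n = len(down)        # number of N-to-S steps
    m = len(right[0])    # number of W-to-E steps
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # first column: only moves south
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):              # first row: only moves east
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Hypothetical 3 x 3 grid of vertices (2 steps south, 2 steps east).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [3, 2],
         [0, 7]]
print(manhattan_tourist(down, right))  # 17
```

Each vertex is visited once and looks at two incoming edges, which is where the O(nm) bound comes from.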

Longest Path in a Graph

Recursion for an Arbitrary Graph

• We would like to compute the score for point v in an arbitrary graph.

• Let Predecessors(v) be the set of vertices with edges leading into v. Then the recurrence is given by:

$s_v = \max_{u \in \text{Predecessors}(v)} \{ s_u + \text{weight of the edge from } u \text{ to } v \}$

• The running time for a graph with E edges is O(E), since each edge is evaluated once.

Recursion for an Arbitrary Graph: Problem

• Traversal: the order in which vertices are visited.

• By the time the vertex x is analyzed, the values $s_y$ for all its predecessors y should already be computed.

• If the graph has a cycle, we will get stuck going over the same cycle again and again.

• The Manhattan grid avoids this problem by restricting movement to the east and south directions.

Some Graph Theory Terminology

• Directed Acyclic Graph (DAG): A graph in which each edge is given an orientation, and which has no cycles.

– Edges in a DAG are drawn as directed arrows.

http://commons.wikimedia.org/wiki/File:Directed_acyclic_graph.svg

Some Graph Theory Terminology

• Topological Ordering: A labeling of the vertices of a DAG (from 1 to n, say) such that every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label.

• In other words, if the vertices are positioned on a line in increasing order, then all edges go from left to right.

• Theorem: Every DAG has a topological ordering.

• What this means: Every DAG has a source node (1) and a sink node (n).

Topological Ordering: Example

[Figure: a DAG whose vertices are labeled 1 through 7 in a topological order.]

Longest Path in a DAG: Formulation

• Goal: Find a longest path between two vertices in a weighted DAG.

• Input: A weighted DAG G with source and sink vertices.

• Output: A longest path in G from source to sink.

• Note: Now we know that we can apply a topological ordering to G, and then use dynamic programming to find the longest path in G.
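The two steps in the note above (order the vertices, then run the DP) can be sketched in Python as follows. The slides do not say how the topological ordering is obtained; this sketch uses Kahn's algorithm, and the function name, the edge-list input format, and the example graph are all assumptions made for illustration. It pushes each score forward along outgoing edges, which gives the same result as taking the maximum over predecessors.

```python
from collections import defaultdict, deque


def longest_path_length(num_vertices, edges, source, sink):
    """edges is a list of (u, v, weight) with vertices numbered 0..num_vertices-1."""
    successors = defaultdict(list)
    indegree = [0] * num_vertices
    for u, v, weight in edges:
        successors[u].append((v, weight))
        indegree[v] += 1

    # Kahn's algorithm: repeatedly remove a vertex with no incoming edges.
    order = []
    queue = deque(v for v in range(num_vertices) if indegree[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != num_vertices:
        raise ValueError("graph has a cycle, so no topological ordering exists")

    # DP over the ordering: all predecessors of a vertex are final before it is used.
    score = [float("-inf")] * num_vertices
    score[source] = 0
    for u in order:
        if score[u] == float("-inf"):
            continue  # u is not reachable from the source
        for v, weight in successors[u]:
            score[v] = max(score[v], score[u] + weight)
    return score[sink]


edges = [(0, 1, 3), (0, 2, 1), (1, 2, 1), (1, 3, 2), (2, 3, 7)]
print(longest_path_length(4, edges, source=0, sink=3))  # 11, via 0 -> 1 -> 2 -> 3
```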

Sequence Alignment

Back to Biology: Sequence Alignment

• Original problem: Assign a similarity score to two DNA sequences:

v: ATGTTAT

w: ATCGTAC

• Alignment matrix:

v: A T - G T T A T -

w: A T C G T - A - C

5 matches, 2 insertions, 2 deletions

Common Subsequence

• Given two sequences $v = v_1 v_2 \dots v_m$ and $w = w_1 w_2 \dots w_n$, a common subsequence of v and w is a sequence of positions in v, $1 \le i_1 < i_2 < \dots < i_t \le m$, and a sequence of positions in w, $1 \le j_1 < j_2 < \dots < j_t \le n$, such that the $i_k$-th letter of v is equal to the $j_k$-th letter of w for every k = 1, …, t.

• Example: v = ATGCCAT, w = TCGGGCTATC. Then take:

– i1 = 2, i2 = 3, i3 = 6, i4 = 7

– j1 = 1, j2 = 3, j3 = 8, j4 = 9

– This gives the common subsequence TGAT.

Longest Common Subsequence

• Given two sequences $v = v_1 v_2 \dots v_m$ and $w = w_1 w_2 \dots w_n$, the Longest Common Subsequence (LCS) of v and w is a sequence of positions in v, $1 \le i_1 < i_2 < \dots < i_T \le m$, and a sequence of positions in w, $1 \le j_1 < j_2 < \dots < j_T \le n$, such that the $i_k$-th letter of v is equal to the $j_k$-th letter of w for every k = 1, …, T, and T is maximal.

• Example: v = ATGCCAT, w = TCGGGCTATC.

– TGCAT is a longer common subsequence than TGAT.

• Problem: Find the LCS of two sequences.

Edit Graph for LCS Problem

• Assign one sequence to the rows, and one to the columns.

• Every diagonal edge represents a match of elements and has weight +1.

• In a path from source to sink, the diagonal edges represent a common subsequence.

• LCS Problem: Find a path with the maximum number of diagonal edges.

[Figure: edit graph with TGCATAC on the rows (i = 0, …, 7) and ATCTGATC on the columns (j = 0, …, 8). The highlighted path from source to sink spells the common subsequence TGAT.]

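Computing the LCS: Dynamic Programming

• Let vi = prefix of v of length i: v1 … vi

• and wj = prefix of w of length j: w1 … wj

• The length $s_{i,j}$ of LCS(vi, wj) is computed by:

$s_{i,j} = \max \{ s_{i-1,j},\ \ s_{i,j-1},\ \ s_{i-1,j-1} + 1 \text{ if the } i\text{-th letter of } v \text{ equals the } j\text{-th letter of } w \}$

with $s_{i,0} = s_{0,j} = 0$.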
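A minimal Python sketch of this table-filling step is shown below. The function name `lcs_length` is an assumption made for illustration; it returns only the length, not the subsequence itself.

```python
def lcs_length(v, w):
    """Length of a longest common subsequence of strings v and w."""
    # s[i][j] holds the LCS length of the prefixes v[:i] and w[:j].
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                # A diagonal edge exists only when the two letters match.
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s[len(v)][len(w)]


print(lcs_length("ATGTTAT", "ATCGTAC"))  # 5
```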

Edit Distance

Hamming Distance

• The Hamming distance $d_H(v, w)$ between two DNA sequences v and w of the same length is equal to the number of places in which the two sequences differ.

• Example: For the sequences below, $d_H(v, w) = 8$:

v: ATATATAT

w: TATATATA

• Yet these sequences are very similar!

• Hamming distance is therefore not an ideal similarity score, because it ignores insertions and deletions.
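For reference, the Hamming distance in this example can be checked with a few lines of Python. This is an illustrative sketch; the function name is an assumption.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length sequences differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs sequences of the same length")
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```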

Edit Distance

• The edit distance between two strings is the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other:

d(v, w) = minimum number of elementary operations needed to transform v into w

Edit Distance: Example 1

• Shift w one nucleotide to the right, and see that w is obtained from v by one insertion and one deletion:

v: ATATATAT-

w: -TATATATA

• Hence the edit distance is d(v, w) = 2.

• Note: In order to find this distance, we had to “fiddle” with the sequences. Hamming distance was easier to find.

• Transform TGCATAT into ATCCGAT.

• We can transform TGCATAT into ATCCGAT in 5 steps:

1. TGCATAT → TGCATA (delete last T)

2. TGCATA → TGCAT (delete last A)

3. TGCAT → ATGCAT (insert A at front)

4. ATGCAT → ATCCAT (substitute C for G)

5. ATCCAT → ATCCGAT (insert G before last A)

• Note: This only allows us to conclude that the edit distance is at most 5.
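To get the exact distance rather than an upper bound, the same table-filling idea used for the Change Problem applies. The sketch below is one standard way to compute the edit distance with insertions, deletions, and substitutions (often called Levenshtein distance). The function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions, and substitutions turning v into w."""
    m, n = len(v), len(w)
    # d[i][j] is the edit distance between the prefixes v[:i] and w[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all i letters
    for j in range(n + 1):
        d[0][j] = j                      # insert all j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = d[i - 1][j - 1] + (v[i - 1] != w[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1,    # insertion
                          substitution)       # match or substitution
    return d[m][n]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
print(edit_distance("TGCATAT", "ATCCGAT"))
```

The second call shows whether the 5-step transformation above is actually the shortest one.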


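Key Result

• Theorem: Given two sequences v and w of length m and n, the edit distance d(v,w), counting only insertions and deletions, is given by d(v,w) = m + n – 2·s(v,w), where s(v,w) is the length of the longest common subsequence of v and w.

– The s(v,w) letters of the LCS are kept, the other m – s(v,w) letters of v are deleted, and the other n – s(v,w) letters of w are inserted.

• Solving the LCS problem for v and w is therefore equivalent to finding this edit distance between them!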

Return to the Edit Graph

• Every alignment corresponds to a path from source to sink.

• Horizontal and vertical edges correspond to indels (deletions and insertions).

• Diagonal edges correspond to matches and mismatches.


Alignment as a Path in the Edit Graph: Example

• Find the LCS of ATCGTAC and ATGTTAT.

• Match: +1

• Mismatches and indels: 0

• Score (0,j) = 0 and Score (i,0) = 0, because a path along the top row or left column uses only indels.

• Score (1,1) has three possibilities: two indel edges, each arriving with score 0, or the diagonal match of A with A, which gives 0 + 1. Hence Score (1,1) = 1.

• Filling in the remaining entries in the same way gives the completed table:

      ε  A  T  C  G  T  A  C
  ε   0  0  0  0  0  0  0  0
  A   0  1  1  1  1  1  1  1
  T   0  1  2  2  2  2  2  2
  G   0  1  2  2  3  3  3  3
  T   0  1  2  2  3  4  4  4
  T   0  1  2  2  3  4  4  4
  A   0  1  2  2  3  4  5  5
  T   0  1  2  2  3  4  5  5

• Optimal alignment, found by tracing back from the bottom-right entry (LCS = ATGTA, length 5):

A T C G - T A - C

A T - G T T A T -

Dynamic Alignment: Pseudocode

Printing LCS: Backtracking
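The pseudocode for these two slides was not preserved in the text. The Python sketch below shows the general idea under stated assumptions: it fills the score table, records which edge produced each entry (the labels "diag", "up", and "left" are hypothetical names), and then walks back from the bottom-right entry. It uses an iterative walk rather than a recursive printing routine.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1      # diagonal (match) edge
                back[i][j] = "diag"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j] = s[i - 1][j]              # vertical (indel) edge
                back[i][j] = "up"
            else:
                s[i][j] = s[i][j - 1]              # horizontal (indel) edge
                back[i][j] = "left"

    # Backtrack from the sink, collecting the letter on every diagonal edge.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if back[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif back[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("ATGTTAT", "ATCGTAC"))
```

When several longest common subsequences exist, the tie-breaking order in the `elif` decides which one is returned.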

LCS: Runtime

• Filling in the n × m dynamic programming matrix takes O(nm) time: the pseudocode consists of one “for” loop nested inside another.

Global Alignment

From LCS to Alignment: Change the Scoring

• Simplest scoring schema: for some positive numbers μ and σ:

– Match premium: +1

– Mismatch penalty: –μ

– Indel penalty: –σ

• Alignment score = #matches – μ(#mismatches) – σ(#indels)

• The choice of μ and σ depends on how we wish to penalize mismatches and indels.

The Global Alignment Problem

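• Input: Strings v and w and a scoring schema

• Output: An alignment with maximum score

• Use DP to solve the Global Alignment Problem:

$s_{i,j} = \max \{ s_{i-1,j-1} + 1 \text{ if } v_i = w_j,\ \ s_{i-1,j-1} - \mu \text{ if } v_i \ne w_j,\ \ s_{i-1,j} - \sigma,\ \ s_{i,j-1} - \sigma \}$

μ: mismatch penalty, σ: indel penalty

• Align ATCGTAC and ATGTTAT with μ = 1, σ = 0.5.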

Needleman and Wunsch Algorithm

• Align ACTCG (columns) with ACAGTAG (rows).

• Gap Penalty = -1, Match Score = +1, Mismatch Score = 0

• The first row and column hold the accumulated gap penalties. Filling in the table row by row gives:

          A   C   T   C   G
      0  -1  -2  -3  -4  -5
  A  -1   1   0  -1  -2  -3
  C  -2   0   2   1   0  -1
  A  -3  -1   1   2   1   0
  G  -4  -2   0   1   2   2
  T  -5  -3  -1   1   1   2
  A  -6  -4  -2   0   1   1
  G  -7  -5  -3  -1   0   2

• The alignment score is the bottom-right entry, 2. An optimal alignment:

A C A G T A G

A C - - T C G
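The table-filling and traceback for this example can be sketched in Python as below. The function name and the keyword parameters `match`, `mismatch`, and `gap` are assumptions made for illustration; their defaults are the values used on this slide. When several alignments share the best score, the traceback order here decides which one is returned, so it may differ from the alignment shown on the slide.

```python
def global_alignment(v, w, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_v, aligned_w) for a best global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)

    # Traceback: at each cell, follow an edge that explains the stored score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + (
                match if v[i - 1] == w[j - 1] else mismatch):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            top.append(v[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_v, row_w = global_alignment("ACAGTAG", "ACTCG")
print(score)   # 2
print(row_v)
print(row_w)
```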


Scoring Matrices

Scoring Matrices: Example

• Align AGTCA and CGTTGG with the scoring matrix:

         A     G     T     C     -
  A      1  -0.8  -0.2  -2.3  -0.6
  G   -0.8     1  -1.1  -0.7  -1.5
  T   -0.2  -1.1     1  -0.5  -0.9
  C   -2.3  -0.7  -0.5     1    -1
  -   -0.6  -1.5  -0.9    -1   n/a

Sample Alignment:

A - G T C - A

- C G T T G G

Score: –0.6 – 1 + 1 + 1 – 0.5 – 1.5 – 0.8 = –2.4
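The score of the sample alignment can be recomputed column by column. The sketch below hard-codes the scoring matrix from this slide; the names `SYMBOLS`, `DELTA`, and `alignment_score` are assumptions made for illustration. It also assumes the second row of the sample alignment begins with a gap, which is what the first term of the score (–0.6, an A paired with a gap) implies.

```python
SYMBOLS = "AGTC-"
# DELTA[a][b] is the score for pairing SYMBOLS[a] with SYMBOLS[b].
DELTA = [
    [1.0, -0.8, -0.2, -2.3, -0.6],
    [-0.8, 1.0, -1.1, -0.7, -1.5],
    [-0.2, -1.1, 1.0, -0.5, -0.9],
    [-2.3, -0.7, -0.5, 1.0, -1.0],
    [-0.6, -1.5, -0.9, -1.0, None],   # a gap paired with a gap is not allowed
]


def alignment_score(top, bottom):
    """Sum the per-column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    total = 0.0
    for a, b in zip(top, bottom):
        score = DELTA[SYMBOLS.index(a)][SYMBOLS.index(b)]
        if score is None:
            raise ValueError("a column cannot pair a gap with a gap")
        total += score
    return total


print(round(alignment_score("A-GTC-A", "-CGTTGG"), 1))  # -2.4
```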

How Do We Make a Scoring Matrix?

Scoring matrices are created based on biological evidence.

Alignments can be thought of as two sequences that differ due to mutations.

Some of these mutations have little effect on the protein’s function, so some penalties, δ(vi , wj), will be less harsh than others.

Amino Acid Scoring Matrix

A R N K

A 5 -2 -1 -1

R -2 7 -1 3

N -1 -1 7 0

K -1 3 0 6

• R and K have a positive mismatch score.

• Both are positively charged amino acids, so this mismatch will not greatly change the function of the protein.

• Mismatch scores are positive for amino acid changes that tend to preserve the physicochemical properties of the original residue (identical polarity, similar behaviour).

Scoring Matrices: Amino Acid vs. DNA

Two commonly used amino acid substitution matrices:

1. PAM

2. BLOSUM

DNA substitution matrices:

• DNA is less conserved than protein sequences.

• It is therefore less effective to compare sequences at the nucleotide level.

PAM

PAM stands for Point Accepted Mutation.

1 PAM = PAM1 = 1% average change of all amino acid positions.

• Note: This doesn’t mean that after 100 PAMs of evolution, every residue will have changed:

– Some residues may have mutated several times.

– Some residues may have returned to their original state.

– Some residues may not have changed at all.

PAMX

$\text{PAM}_x = (\text{PAM}_1)^x$ (x iterations of PAM1)

– Example: $\text{PAM}_{250} = (\text{PAM}_1)^{250}$

PAM250 is a widely used scoring matrix:

         Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
          A   R   N   D   C   Q   E   G   H   I   L   K  ...
  Ala A  13   6   9   9   5   8   9  12   6   8   6   7  ...
  Arg R   3  17   4   3   2   5   3   2   6   3   2   9
  Asn N   4   4   6   7   2   5   6   4   6   3   2   5
  Asp D   5   4   8  11   1   7  10   5   6   3   2   5
  Cys C   2   1   1   1  52   1   1   2   2   2   1   1
  Gln Q   3   5   5   6   1  10   7   3   7   2   3   5
  ...
  Trp W   0   2   0   0   0   0   0   0   1   0   1   0
  Tyr Y   1   1   2   1   3   1   1   1   3   2   2   1
  Val V   7   4   4   4   4   4   4   4   5   4  15  10

BLOSUM

BLOSUM: Stands for Blocks Substitution Matrix

Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.

• BLOSUM62 was created using sequences sharing no more than 62% identity.

C S T P … F Y W

C 9 -1 -1 3 … -2 -2 -2

S -1 4 1 -1 … -2 -2 -3

T -1 1 4 1 … -2 -2 -3

P 3 -1 1 7 … -4 -3 -4

… … … … … … … … …

F -2 -2 -2 -4 … 6 3 1

Y -2 -2 -2 -3 … 3 7 2

W -2 -3 -3 -4 … 1 2 11

http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm

Local Alignment

Local vs. Global Alignment: Example

• Global Alignment:

--T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC

 | || | || | | | ||| || | | | | |||| |

AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C

• Local Alignment (a better alignment for finding the conserved segment):

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Why?

Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions.

Example: Homeobox genes (which regulate embryonic development) have short homeodomains that are highly conserved among species.

• Aligning the entire sequences (global alignment) may miss the homeodomains.

• Search instead for an alignment which has a positive score locally, that is, an alignment of substrings of the given sequences that has a positive score.

Local Alignment: Illustration

Global alignment

Compute a “mini” Global Alignment to get Local Alignment

The Local Alignment Problem

Goal: Find the best local alignment between two strings.

Input : Strings v and w as well as a scoring matrix δ

Output : Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings of v and w.

Local Alignment: How to Solve?

• The Global Alignment Problem finds the longest path between vertices (0,0) and (n,m) in the edit graph.

• The Local Alignment Problem finds the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.

• In an edit graph with negatively scored edges, a local alignment may score higher than the global alignment.

[Figure: edit graph showing a global alignment path and a local alignment path.]

The Problem with This Setup

• In a grid of size n × n there are ~$n^2$ vertices (i,j) that may serve as a source.

• For each such vertex, computing alignments from (i,j) to (i’,j’) takes $O(n^2)$ time.

• This gives an overall runtime of $O(n^4)$, which is a bit too slow… can we do better?


Local Alignment Solution: Free Rides

• Add “free” edges to the edit graph.

• The dashed edges represent the “free rides” from (0, 0) to every other node.

• Each “free ride” is assigned an edge weight of 0.

• If we start at (0, 0) instead of (i, j) and find the longest path to (i’, j’), we will obtain the local alignment.

Smith-Waterman Local Alignment Algorithm

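• The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment.

• The recurrence:

$s_{i,j} = \max \{ 0,\ \ s_{i-1,j-1} + \delta(v_i, w_j),\ \ s_{i-1,j} + \delta(v_i, -),\ \ s_{i,j-1} + \delta(-, w_j) \}$

– The 0 term corresponds to a free ride from (0, 0) directly to (i, j).

• Running time: $O(n^2)$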

Smith and Waterman Algorithm

A A C C T A T A G C T

0 0 0 0 0 0 0 0 0 0 0 0

G 0 0 0 0 0 0 0 0 0 1 0 0

C 0 0 0 1 1 0 0 0 0 0 2 1

G 0 0 0 2 0 0 0 0 0 1 0 1

A 0 1 1 1 0 0 1 0 1 0 0 0

T 0 0 0 0 0 1 0 2 1 0 0 1

A 0 0 1 3 0 0 2 0 3 2 1 0

T 0 0 0 3 0 0 1 3 2 2 1 2

A 0 0 0 3 0 0 2 2 4 3 2 1

AACCTATAGCT, GCGATATA
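A minimal Python sketch of the local alignment score computation is given below. The scoring values for this example table are not stated in the text, so the defaults here (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions, and the sketch is not claimed to reproduce the table entry by entry. It returns only the best score and keeps just two rows of the table in memory.

```python
def local_alignment_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substrings of v and w."""
    best = 0
    prev = [0] * (len(w) + 1)
    for i in range(1, len(v) + 1):
        cur = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            diag = prev[j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The 0 option is the "free ride" from (0, 0) straight to (i, j).
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("GCGATATA", "AACCTATAGCT"))
```

Recovering the aligned substrings as well would require keeping the full table and tracing back from the cell holding the maximum until a 0 entry is reached.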

