Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

Post on 19-Dec-2015


Page 1: Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)


Saving cells in DP

1. Find local alignments

2. Chain them: O(N log N), using L.I.S. (longest increasing subsequence)

3. Restricted DP


The Problem: Find a Chain of Local Alignments

Chaining (x, y) to (x’, y’) requires x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight


A similar problem: LCS

x = 15 3 24 16 20 4 24 3 11 18

y = 4 20 24 3 11 15 11 4 18 20

• Given two strings x and y, find the longest common subsequence

• Imagine a sparse scenario, where x and y have few matches


Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

w = w1, …, wm

• Find the longest increasing subsequence

s = s1, …, sk

s1 < s2 < … < sk


Sparse Dynamic Programming – L.I.S.

Let input be w: w1, …, wn

INITIALIZATION:
    L: 1-indexed array, L[1] ← w1
    B: 0-indexed array of backpointers; B[0] = 0, B[1] ← 1
    P: array used for traceback; P[1] ← 0
    // L[j]: smallest last element wi of j-long LIS seen so far
    // Convention: L[0] = −∞, and L[j] = +∞ for j beyond the longest LIS so far

ALGORITHM
for i = 2 to n {
    Find j such that L[j] < w[i] ≤ L[j+1]
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
}

That’s it!!!

• Running time?
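The loop above can be sketched in Python as follows. This is a minimal sketch rather than the slide's exact arrays: it assumes 0-indexed lists, uses the standard-library `bisect_left` for the "find j" step (a binary search, which is what gives O(n log n) overall), and the names `tails`, `tail_idx`, and `prev` are illustrative stand-ins for L, B, and P.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last value of an increasing run of length j + 1
    tail_idx = []          # tail_idx[j]: position in w of tails[j] (plays the role of B)
    prev = [-1] * len(w)   # prev[i]: position of the element before w[i] (plays the role of P)

    for i, value in enumerate(w):
        j = bisect_left(tails, value)      # first j with tails[j] >= value
        if j == len(tails):
            tails.append(value)            # value extends the longest run seen so far
            tail_idx.append(i)
        else:
            tails[j] = value               # value is a smaller tail for runs of length j + 1
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1

    # Trace back from the tail of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]


lis = longest_increasing_subsequence([15, 3, 24, 16, 20, 4, 24, 3, 11, 18])
```

Each element triggers one binary search over `tails`, so the whole pass costs O(n log n); the traceback adds O(n).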


Sparse LCS expressed as LIS

Create a sequence w

• Every matching point (i, j) is inserted into w as follows:

• For each column j, from smallest to largest, insert in w the points (i, j), in decreasing row i order

• The 11 example points are inserted in the order given

• a = (y, x), b = (y’, x’) can be chained iff

a is before b in w, and y < y’

[Figure: grid of matches between x = 15 3 24 16 20 4 24 3 11 18 (columns) and y = 4 20 24 3 11 15 11 4 18 20 (rows); the 11 matching points are numbered 1 to 11 in insertion order]

Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y
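The construction of w and the claim above can be sketched together in Python. This is a minimal sketch under stated assumptions: `sparse_lcs` and its helper names are hypothetical, positions are 1-based as on the slides, and the match list is built directly from the two sequences (every pair of equal symbols becomes a point). Because points within a column are inserted in decreasing row order, no two points of the same column can appear in one increasing subsequence, so an LIS on the row coordinate alone yields a common subsequence.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(x, y):
    """Return one longest common subsequence of x and y via LIS over match points."""
    rows_of = defaultdict(list)            # symbol -> rows i (1-based) where y holds it
    for i, symbol in enumerate(y, start=1):
        rows_of[symbol].append(i)

    # Build w: columns j in increasing order; within a column, rows i in decreasing order.
    w = [(i, j)
         for j, symbol in enumerate(x, start=1)
         for i in reversed(rows_of.get(symbol, []))]

    tail_rows, tail_idx = [], []           # LIS state, keyed on the row coordinate only
    prev = [-1] * len(w)
    for t, (i, _j) in enumerate(w):
        k = bisect_left(tail_rows, i)
        if k == len(tail_rows):
            tail_rows.append(i)
            tail_idx.append(t)
        else:
            tail_rows[k] = i
            tail_idx[k] = t
        prev[t] = tail_idx[k - 1] if k > 0 else -1

    # Trace back the chain of points and read the symbols off x.
    chain = []
    t = tail_idx[-1] if tail_idx else -1
    while t != -1:
        chain.append(w[t])
        t = prev[t]
    chain.reverse()
    return [x[j - 1] for _i, j in chain]


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
common = sparse_lcs(x, y)
```

The running time is O(M log M) for M match points, which beats the full quadratic table only when matches are sparse.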


Sparse Dynamic Programming for LIS

Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence:

s = 4, 24, 3, 11, 18


Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj: top to bottom

L is implemented as a balanced binary tree

[Figure: a rectangle with its y-coordinates h and l marked on the y-axis]

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• Suppose rectangle b ends above rectangle a (lb ≤ la). Then, to the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it

• In L, keep rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j)

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of i:

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)   (V(i) = w(i) if no such j exists)

2. When on the rightmost end of i:

a. k: rectangle in L, with largest lk ≤ li

b. If V(i) ≥ V(k), or no such k exists:

i. INSERT (li, V(i), i) in L

ii. REMOVE all other (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
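The sweep above can be sketched in Python as follows. This is a sketch under several assumptions that are not in the slides: each rectangle is a hypothetical tuple `(x_left, x_right, h, l, weight)` with h ≤ l and a positive weight; left ends are processed before right ends at equal x so that chaining stays strict; and a plain sorted Python list searched with `bisect` stands in for the balanced binary tree (lookups stay O(log N), but list insertion can cost O(N), so a real balanced tree is needed for the stated bound).

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight). Returns (best_score, chain of indices)."""
    if not rects:
        return 0, []

    events = []
    for idx, (x_left, x_right, _h, _l, _w) in enumerate(rects):
        events.append((x_left, 0, idx))    # kind 0: left end, sorts first on ties
        events.append((x_right, 1, idx))   # kind 1: right end
    events.sort()

    # The list L as three parallel lists: ls is strictly increasing, vs is non-decreasing.
    ls, vs, ids = [], [], []
    score = [0] * len(rects)               # V(i)
    pred = [None] * len(rects)             # previous rectangle in the best chain ending at i

    for _x, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h)         # entries before p have l_j < h_i
            if p > 0:
                score[i] = weight + vs[p - 1]
                pred[i] = ids[p - 1]
            else:
                score[i] = weight
        else:
            k = bisect_right(ls, l) - 1    # entry with the largest l_k <= l_i, if any
            if k >= 0 and vs[k] > score[i]:
                continue                   # i is dominated, do not insert it
            q = bisect_left(ls, l)
            end = q
            while end < len(ls) and vs[end] <= score[i]:
                end += 1                   # dominated entries form a consecutive run
            ls[q:end] = [l]                # remove the run and insert i in its place
            vs[q:end] = [score[i]]
            ids[q:end] = [i]

    best = max(range(len(rects)), key=score.__getitem__)
    chain = []
    j = best
    while j is not None:
        chain.append(j)
        j = pred[j]
    return score[best], chain[::-1]


# Hypothetical rectangles (not the slide's example): (x_left, x_right, h, l, weight)
example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 1, 2, 4), (6, 9, 9, 10, 2)]
best_score, best_chain = chain_rectangles(example)
```

In this hypothetical input, rectangle 2 overlaps rectangle 0 in x and ends lower with a smaller score, so it is never inserted into the list; the best chain uses rectangles 0, 1, and 4.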


Example

[Figure: five rectangles in the x-y plane, labeled "rectangle: weight" as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; x-axis marks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree


Multiple Sequence Alignments


The Global Alignment problem

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: sequences labeled x, y, z]


Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments


Scoring Function

• Ideally: Find alignment that maximizes probability that sequences evolved from common ancestor, according to some phylogenetic model

• More on phylogenetic models later

[Figure: phylogenetic tree over sequences x, y, z, w, v; one node is marked “?”]

Scoring Function

• A comprehensive model would have too many parameters and be too inefficient to optimize

• Possible simplifications

Ignore phylogenetic tree

Statistically independent columns:

$S(m) = \sum_i S(m_i)$

m: multiple alignment, mi are columns


Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG


Sum Of Pairs (cont’d)

• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments

$S(m) = \sum_{k<l} s(m_k, m_l)$

s(mk, ml): score of induced alignment (k,l)
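The sum-of-pairs definition can be illustrated with a short Python sketch. The function names and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration only; the slides do not fix a pairwise scoring scheme. The optional `weights` matrix is likewise hypothetical and stands for per-pair weights.

```python
from itertools import combinations


def induced_pairwise(row_a, row_b):
    """Project a multiple alignment onto two rows, dropping columns where both have a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == '-' and b == '-')]
    return ''.join(a for a, _ in kept), ''.join(b for _, b in kept)


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Column-by-column score of a pairwise alignment (assumed scoring values)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == '-' or b == '-':
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows, weights=None):
    """Sum the scores of all induced pairwise alignments; weights[k][l] is optional."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        a, b = induced_pairwise(rows[k], rows[l])
        w_kl = 1 if weights is None else weights[k][l]
        total += w_kl * pair_score(a, b)
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the example
sop = sum_of_pairs(rows)
```

Dropping the all-gap columns before scoring matters: otherwise a pair would be penalized for gaps that exist only because of a third sequence.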


Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

[Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck]

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$

$w_{kl}$: weight decreasing with distance

Consensus

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find optimal consensus string m* to maximize

$S(m) = \sum_i s(m^*, m_i)$

s(mk, ml): score of pairwise alignment (k,l)
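As a small illustration of the consensus objective, the sketch below assumes a deliberately simple column-wise score: +1 when the consensus symbol equals the sequence symbol and -1 otherwise, with the gap treated as an ordinary symbol. Under that assumption (which is not from the slides) the columns are independent, so the best consensus can be chosen one column at a time. The name `column_consensus` is hypothetical, and ties are broken by sorted symbol order.

```python
def column_consensus(rows):
    """Pick, per column, the symbol maximizing the summed +1/-1 score against all rows."""
    def total(candidate, column):
        return sum(1 if candidate == symbol else -1 for symbol in column)

    consensus = []
    for column in zip(*rows):
        # Only symbols present in the column can win under this scoring.
        candidates = sorted(set(column))
        consensus.append(max(candidates, key=lambda c: total(c, column)))
    return ''.join(consensus)


consensus = column_consensus(["ACG-", "ACGT", "TCG-"])
```

With a richer substitution matrix or affine gap costs the per-column choice is no longer a simple majority vote, and with gap costs that span columns the columns stop being independent.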


A Profile Representation

• Given a multiple alignment M = m1…mn

Replace each column mi with profile entry pi

• Frequency of each letter in the alphabet

• # gaps

• Optional: # gap openings, extensions, closings

Alignment:

      -   A   G   G   C   T   A   T   C   A   C   C   T   G
      T   A   G   -   C   T   A   C   C   A   -   -   -   G
      C   A   G   -   C   T   A   C   C   A   -   -   -   G
      C   A   G   -   C   T   A   T   C   A   C   -   G   G
      C   A   G   -   C   T   A   T   C   G   C   -   G   G

Profile (rows A, C, G, T: letter frequencies; rows O, E, C below them: gap openings, extensions, closings):

A         1                   1          .8
C    .6               1          .4   1      .6  .2
G             1  .2                      .2          .4   1
T    .2                   1      .6                  .2
O    .2          .8                          .4  .4
E                                                .4
C    .2          .8                              .4  .2
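The letter-frequency part of a profile can be computed directly from the alignment columns. The sketch below is a minimal illustration, assuming the alignment is given as equal-length strings with `-` for gaps; the function name `profile` is hypothetical, and the optional gap opening, extension, and closing counts are not computed here.

```python
from collections import Counter


def profile(rows):
    """Return one dict per column mapping each symbol (including '-') to its frequency."""
    n = len(rows)
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({symbol: count / n for symbol, count in counts.items()})
    return columns


alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
p = profile(alignment)
```

Each entry `p[c]` corresponds to one profile column; for instance, the first column holds C in three of the five rows, T in one, and a gap in one.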