real-world protein aligners
CS262 Lecture 9, Win07, Batzoglou
Real-world protein aligners
• MUSCLE: high throughput; one of the best in accuracy
• ProbCons: high accuracy; reasonable speed
MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences
• $D_{\text{DRAFT}}(x, y)$ defined in terms of the number of common k-mers (k ≈ 3); $O(N^2 L \log L)$ time
2. Build tree $T_{\text{DRAFT}}$ based on those distances, with UPGMA
3. Progressive alignment over $T_{\text{DRAFT}}$, resulting in multiple alignment $M_{\text{DRAFT}}$
• Only perform alignment steps for the parts of the tree that have changed
4. Measure new Kimura-based distances $D(x, y)$ based on $M_{\text{DRAFT}}$
5. Build tree $T$ based on $D$
6. Progressive alignment over $T$, to build $M$
7. Iterative refinement; for many rounds, do:
• Tree partitioning: split $M$ on one branch and realign the two resulting profiles
• If the new alignment $M'$ has a better sum-of-pairs score than the previous one, accept it
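Step 1 can be illustrated with a small k-mer distance. This is only a sketch: the slide says the draft distance is "defined in terms of" the number of common k-mers but does not give the formula, so the normalization below (shared k-mers divided by the k-mer count of the shorter sequence) and the names `kmer_counts`, `kmer_distance`, and `distance_matrix` are assumptions, not MUSCLE's actual definition.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """Assumed draft distance: 1 - (shared k-mers / k-mers in the shorter sequence)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())        # multiset intersection of the two k-mer bags
    possible = min(len(x), len(y)) - k + 1  # number of k-mers in the shorter sequence
    if possible <= 0:                       # a sequence shorter than k shares nothing
        return 1.0
    return 1.0 - shared / possible

def distance_matrix(seqs, k=3):
    """All pairwise distances: N(N-1)/2 calls to kmer_distance."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d[a][b] = d[b][a] = kmer_distance(seqs[a], seqs[b], k)
    return d
```

No alignment is computed at this stage, which is what makes the draft distances cheap; the resulting matrix would then feed a UPGMA implementation in step 2.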
PROBCONS at a glance
1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM
2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency
• $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_{k} M_{xz}(i, k)\, M_{zy}(k, j)$; in matrix form, $M'_{xy} = \mathrm{Avg}_z(M_{xz} M_{zy})$
3. Compute, for every pair x, y, the maximum expected accuracy alignment
• $A_{xy}$: the alignment $A$ that maximizes $\sum_{\text{aligned } (i, j) \in A} M'_{xy}(i, j)$
• Define $E(x, y) = \sum_{\text{aligned } (i, j) \in A_{xy}} M'_{xy}(i, j)$
4. Build tree T with hierarchical clustering using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(.,.)
6. Iterative refinement; for many rounds, do:
• Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
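The consistency re-estimation in step 2 is an average of matrix products, which a few lines of plain Python can make concrete. This is a sketch under stated assumptions: the dictionary layout (`post`, `lengths`) and all function names are hypothetical, the average runs over every sequence z including x and y themselves, and a sequence's posterior matrix against itself is taken to be the identity.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def posterior(post, lengths, a, b):
    """Look up M_ab. Assumes M_aa is the identity and M_ba is the transpose of M_ab."""
    if a == b:
        return identity(lengths[a])
    return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

def consistency_transform(post, lengths, x, y):
    """M'_xy = (1/N) * sum over all sequences z of M_xz * M_zy."""
    names = list(lengths)
    lx, ly = lengths[x], lengths[y]
    total = [[0.0] * ly for _ in range(lx)]
    for z in names:
        prod = matmul(posterior(post, lengths, x, z),
                      posterior(post, lengths, z, y))
        for i in range(lx):
            for j in range(ly):
                total[i][j] += prod[i][j]
    return [[v / len(names) for v in row] for row in total]
```

With three length-1 sequences where x and y each align to z with probability 1 but to each other with probability 0.5, the transform raises the x-y posterior from 0.5 to 2/3: the third sequence supplies evidence that the pairwise HMM alone did not.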
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
ABC, a nice Stanford tool for browsing alignments: http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable
http://probcons.stanford.edu/ PROBCONS – most accurate
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Motivation
• Genomic sequences are very long:
Human genome = $3 \times 10^9$ bases long; mouse genome = $2.7 \times 10^9$ bases long
• Aligning genomic regions is useful for revealing common gene structure
It is useful to compare regions > 1,000,000-long
The UCSC Browser
• http://genome.ucsc.edu/cgi-bin/hgGateway
Main Idea
Genomic regions of interest contain islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others
Saving cells in DP
1. Find local alignments
2. Chain them: O(N log N), via longest increasing subsequence (L.I.S.)
3. Restricted DP
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
Chaining (x, y) → (x', y') requires x < x' and y < y'
Each local alignment has a weight
FIND the chain with highest total weight
Quadratic Time Solution
• Build a Directed Acyclic Graph (DAG):
Nodes: local alignments [(xa, xb) (ya, yb)] and their scores
Directed edges: pairs of local alignments that can be chained
• There is an edge ((xa, xb, ya, yb), (xc, xd, yc, yd)) iff
xa < xb < xc < xd and ya < yb < yc < yd
Each local alignment is a node vi with alignment score si
Quadratic Time Solution
Initialization: Find each node va s.t. there is no edge (u, va); set the score V(a) to be sa
Iteration: For each vi, the optimal path ending in vi has total score:
$V(i) = \max_{j:\ (v_j, v_i)\ \text{is an edge}} \big( \mathrm{weight}(v_j, v_i) + V(j) \big)$
Termination: Optimal global chain: $j = \arg\max_j V(j)$; trace the chain back from vj
Worst case time: quadratic
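A minimal sketch of this quadratic chaining DP follows. It makes two assumptions not stated on the slide: each local alignment is a tuple `(xa, xb, ya, yb, score)`, and the weight of an edge (vj, vi) is simply the score of the destination alignment vi. The function name `chain_quadratic` is hypothetical. The DAG is never built explicitly; the edge condition is tested on the fly.

```python
def chain_quadratic(alns):
    """Best chain of local alignments by an O(N^2) DP over the implicit DAG.

    alns: list of (xa, xb, ya, yb, score) with xa <= xb and ya <= yb.
    Returns (best total score, indices of the chained alignments in order).
    """
    n = len(alns)
    if n == 0:
        return 0, []
    # Sorting by start x is a topological order: a predecessor j of i has xb_j < xa_i.
    order = sorted(range(n), key=lambda i: alns[i][0])
    V = [0] * n          # V[i]: best score of a chain ending in alignment i
    back = [None] * n    # back[i]: predecessor of i on that chain
    for pos, i in enumerate(order):
        xa, _, ya, _, s = alns[i]
        V[i] = s                                   # chain consisting of i alone
        for j in order[:pos]:
            # Edge (j, i) exists iff j ends strictly before i starts, in both x and y.
            if alns[j][1] < xa and alns[j][3] < ya and V[j] + s > V[i]:
                V[i], back[i] = V[j] + s, j
    end = max(range(n), key=V.__getitem__)         # termination: argmax V
    chain, node = [], end
    while node is not None:                        # trace back
        chain.append(node)
        node = back[node]
    return V[end], chain[::-1]
```

For example, with `alns = [(0, 10, 0, 10, 5), (12, 20, 15, 25, 4), (11, 30, 2, 8, 10), (35, 40, 30, 38, 3)]`, the chain 0, 1, 3 scores 12 but the chain 2, 3 scores 13, so the heaviest chain is not the longest one.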
Sparse Dynamic Programming
Back to the LCS problem:
• Given two sequences x = x1, …, xm
y = y1, …, yn
• Find the longest common subsequence; quadratic solution with DP
• How about when “hits” xi = yj are sparse?
Sparse Dynamic Programming
[Figure: hit matrix for x = 15, 3, 24, 16, 20, 4, 24, 3, 11, 18 (columns) and y = 4, 20, 24, 3, 11, 15, 11, 4, 18, 20 (rows)]
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find the longest subsequence s = s1, …, sk such that s1 < s2 < … < sk
Sparse Dynamic Programming – L.I.S.
Let input be w: w1, …, wn
INITIALIZATION:
L: array of last LIS elements; L[0] = -inf, L[1] = w1, L[2…n] = +inf
B: array holding the indices of those elements; B[0] = 0, B[1] = 1
P: array of backpointers
// L[j]: smallest jth element wi of a j-long LIS seen so far
ALGORITHM
for i = 2 to n {
    Find j such that L[j - 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j - 1]
}
That’s it!!!
• Running time?
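The L, B, and P arrays above translate almost directly into Python. The sketch below is 0-indexed and grows L on demand instead of padding it with +inf; the function name is hypothetical. Because L stays sorted, the "find j" step is a binary search (`bisect_left` from the standard library), which answers the running-time question: O(n log n) overall.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w in O(n log n)."""
    tails = []              # tails[j]: smallest last value of an increasing run of length j+1 (the slide's L)
    tail_idx = []           # tail_idx[j]: index in w of tails[j] (the slide's B)
    prev = [None] * len(w)  # prev[i]: index of the element before w[i] in its run (the slide's P)
    for i, value in enumerate(w):
        j = bisect_left(tails, value)   # first j with tails[j] >= value
        if j == len(tails):             # value extends the longest run seen so far
            tails.append(value)
            tail_idx.append(i)
        else:                           # value is a smaller tail for runs of length j+1
            tails[j] = value
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else None
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:                # follow backpointers from the end of the longest run
        out.append(w[i])
        i = prev[i]
    return out[::-1]
```

Using `bisect_left` rather than `bisect_right` is what makes the result strictly increasing: an element equal to an existing tail replaces it instead of extending the run.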
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j) is inserted into w as follows:
• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff
a is before b in w, and y < y’
[Figure: hit matrix of x (columns) and y (rows), with the 11 matching points numbered 1 to 11 in insertion order]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L = [L1] [L2] [L3] [L4] [L5] …
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence:
s = 4, 24, 3, 11, 18
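The reduction can be sketched end to end: collect the hits column by column in decreasing row order, then run L.I.S. on the row coordinate. The function name `sparse_lcs` is hypothetical and indices are 0-based, unlike the 1-based points on the slides. The hits are computed directly from x and y, so the sketch also picks up the match on the value 15 (x1 = y6), which does not appear in the slide's list of 11 points; the resulting subsequence is the same.

```python
from bisect import bisect_left
from collections import defaultdict

def sparse_lcs(x, y):
    """Longest common subsequence of x and y via L.I.S. over the matching points."""
    rows_of = defaultdict(list)         # symbol -> increasing list of rows (positions in y)
    for i, sym in enumerate(y):
        rows_of[sym].append(i)
    # Build w: for each column j of x, its hits (row, column) in DECREASING row order,
    # so a strictly increasing run of rows can never use two hits from one column.
    w = [(i, j) for j, sym in enumerate(x) for i in reversed(rows_of.get(sym, []))]
    tails, tail_idx = [], []            # tails[k]: smallest final row of a chain of k+1 hits
    prev = [None] * len(w)
    for t, (row, _) in enumerate(w):
        k = bisect_left(tails, row)     # strict increase in the row coordinate
        if k == len(tails):
            tails.append(row)
            tail_idx.append(t)
        else:
            tails[k] = row
            tail_idx[k] = t
        prev[t] = tail_idx[k - 1] if k > 0 else None
    chain = []
    t = tail_idx[-1] if tail_idx else None
    while t is not None:                # trace the chain of hits back to its start
        chain.append(w[t])
        t = prev[t]
    return [x[j] for _, j in reversed(chain)]

x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(sparse_lcs(x, y))   # [4, 24, 3, 11, 18]
```

The running time is O(h log h) for h hits, plus the time to enumerate them, which is the gain over the quadratic DP when hits are sparse.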
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (North) to largest (South) value
L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining
• In L, keep rectangles j sorted by increasing lj-coordinate, which also keeps them sorted by increasing V(j) score
[Figure: rectangles a and b with scores V(a) and V(b)]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
[Figure: rectangles i, j, and k]
Is k ever removed?
Example
Rectangle weights: a: 5, b: 6, c: 3, d: 4, e: 2
[Figure: rectangles a to e in the x-y plane; y-axis ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]
1. When on the leftmost end of rectangle i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. k: rectangle in L, with largest lk ≤ li
b. If V(i) > V(k):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li
V: a = 5, b = 11, c = 8, d = 12, e = 13
L, as triplets (li, V(i), i): (5, 5, a), (11, 8, c), (9, 11, b), (15, 12, d), (16, 13, e)
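The sweep can be sketched as follows, run on the example's weights and y-coordinates. Two things are assumptions: the x-coordinates, which did not survive in the slide text and were chosen here so that the sweep reproduces the V values 5, 11, 8, 12, 13; and the use of a sorted Python list with `bisect` in place of the balanced binary tree, which keeps the code short but makes each insertion or deletion O(N) rather than O(log N). The function name and tuple layout are hypothetical.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """Sparse-DP rectangle chaining by a sweep over x.

    rects: non-empty dict name -> (x_left, x_right, h, l, weight), with h <= l.
    Returns (best score, names of the chained rectangles in order).
    """
    events = []
    for name, (x1, x2, _, _, _) in rects.items():
        events.append((x1, 0, name))   # kind 0: left end (sorted before right ends at equal x)
        events.append((x2, 1, name))   # kind 1: right end
    events.sort()
    L = []                             # triplets (l, V, name): l increasing, V increasing
    V, prev = {}, {}
    for _, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            pos = bisect_left(L, (h,))             # entries before pos have l < h
            if pos > 0:
                _, vj, j = L[pos - 1]              # j: largest l_j < h_i
                V[i], prev[i] = w + vj, j
            else:
                V[i], prev[i] = w, None
        else:
            pos = bisect_right(L, (l, float("inf")))   # entries before pos have l <= l_i
            vk = L[pos - 1][1] if pos > 0 else float("-inf")
            if V[i] > vk:
                start = bisect_left(L, (l,))       # first entry with l >= l_i
                end = start
                while end < len(L) and L[end][1] <= V[i]:
                    end += 1                       # dominated entries are consecutive
                L[start:end] = [(l, V[i], i)]      # REMOVE them and INSERT i
    best = max(V, key=V.get)
    chain, node = [], best
    while node is not None:
        chain.append(node)
        node = prev[node]
    return V[best], chain[::-1]

# Weights and y-coordinates from the example; the x-coordinates are assumed.
example = {
    "a": (1, 3, 2, 5, 5),
    "b": (5, 8, 6, 9, 6),
    "c": (4, 6, 10, 11, 3),
    "d": (7, 10, 12, 15, 4),
    "e": (9, 12, 14, 16, 2),
}
print(chain_rectangles(example))   # (13, ['a', 'b', 'e'])
```

In this run, c enters L as (11, 8, c) and is removed when b finishes with V(b) = 11 and lb = 9, since from then on anything chainable to c is chainable to b at a higher score.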
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Examples
Human Genome Browser
ABC