CS262 Lecture 15, Win06, Batzoglou
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Main Idea
Genomic regions of interest contain islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees:
MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others
Saving cells in DP
1. Find local alignments
2. Chain: O(N log N) via L.I.S.
3. Restricted DP
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
Chaining (x, y) before (x’, y’) requires
x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Sparse Dynamic Programming – L.I.S.
Let the input be w: w1,…, wn

INITIALIZATION:
  L: 1-indexed array; L[1] ← w1
     // L[j]: smallest last element wi of a j-long increasing subsequence seen so far
  B: 0-indexed array of backpointers; B[0] = 0
  P: array used for traceback

ALGORITHM:
  for i = 2 to n {
    find j such that L[j] < w[i] ≤ L[j+1]
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
  }

That’s it!!!
• Running time? O(n log n): L is kept sorted, so each “find j” step is a binary search
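The pseudocode above can be sketched in runnable Python. This is an illustrative implementation (the function and variable names are mine, not from the lecture); the sorted `tails` list plays the role of L, so the “find j” step becomes a binary search:

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """O(n log n) LIS: tails[j] is the smallest last element of a
    (j+1)-long increasing subsequence seen so far.  tails stays sorted,
    so the search for j is a single bisect.  Backpointers (prev)
    recover one optimal subsequence."""
    tails, tidx = [], []            # values, and their indices into w
    prev = [-1] * len(w)
    for i, v in enumerate(w):
        j = bisect_left(tails, v)   # first position with tails[j] >= v
        if j == len(tails):
            tails.append(v); tidx.append(i)
        else:
            tails[j] = v; tidx[j] = i
        prev[i] = tidx[j - 1] if j > 0 else -1
    # Trace back from the end of the longest subsequence found.
    out, i = [], tidx[-1] if tidx else -1
    while i != -1:
        out.append(w[i]); i = prev[i]
    return out[::-1]
```

Each of the n iterations costs one O(log n) bisect, matching the slide’s O(N log N) claim.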
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point (i, j) is inserted into w as follows:
  for each column j = 1…m, insert into w the points (i, j) in decreasing row-i order
• The 11 example points are inserted in the order given
• a = (y, x), b = (y’, x’) can be chained iff a comes before b in w and y < y’
[Figure: dot-plot with x = 15 3 24 16 20 4 24 3 11 18 on one axis and y on the other; the 11 matching points are labeled in the order they are inserted into w]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (y, x) < (y’, x’) if y < y’
Claim: An increasing subsequence of w is a common subsequence of x and y
Sparse Dynamic Programming for LIS
Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L evolves as follows:
 1. (4,2)
 2. (3,3)
 3. (3,3) (10,5)
 4. (2,5) (10,5)
 5. (2,5) (8,6)
 6. (1,6) (8,6)
 7. (1,6) (3,7)
 8. (1,6) (3,7) (4,8)
 9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence:
s = 4, 24, 3, 11, 18
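The whole reduction can be sketched in Python. This is my own illustrative helper, not lecture code; the y sequence below is reconstructed from the 11 matching points, with a placeholder value 0 at position 6, where the slides show no match:

```python
from bisect import bisect_left

def lcs_via_lis(x, y):
    """LCS of two sequences via the LIS reduction: collect matching
    points (i, j) with y[i] == x[j], ordered by column j left to right
    and, within a column, by decreasing row i (so two matches in the
    same column can never both appear in a strictly increasing run of
    rows).  A strictly increasing subsequence of rows is then a common
    subsequence of x and y."""
    points = []
    for j, xv in enumerate(x):
        rows = [i for i, yv in enumerate(y) if yv == xv]
        for i in reversed(rows):
            points.append((i, j))
    tails = []                      # smallest last row of a (k+1)-long run
    tidx = []                       # index into points of that last element
    prev = [-1] * len(points)
    for p, (i, _) in enumerate(points):
        k = bisect_left(tails, i)
        if k == len(tails):
            tails.append(i); tidx.append(p)
        else:
            tails[k] = i; tidx[k] = p
        prev[p] = tidx[k - 1] if k > 0 else -1
    # Trace back to recover the common subsequence itself.
    out, p = [], tidx[-1] if tidx else -1
    while p != -1:
        out.append(x[points[p][1]]); p = prev[p]
    return out[::-1]

x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 0, 11, 4, 18, 20]   # 0 is a placeholder, not from the slide
```

On the slide’s example this recovers s = 4, 24, 3, 11, 18.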
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj: smallest (top) to largest (bottom) value
L is implemented as a balanced binary tree
[Diagram: a rectangle spanning y-coordinates h (top) to l (bottom)]
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is also chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless”: remove it
• In L, keep rectangles j sorted by increasing lj-coordinate; they are then also sorted by increasing V(j) score
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of rectangle i:
   a. j: rectangle in L with largest lj < hi
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L with largest lk ≤ li
   b. If V(i) > V(k):
      i.  INSERT (li, V(i), i) into L
      ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) and lj ≥ li
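The sweep can be sketched in runnable Python. This is my own illustrative code (names and the rectangle tuple layout are assumptions); a sorted Python list with bisect stands in for the balanced binary tree, and distinct l-coordinates are assumed for simplicity:

```python
from bisect import bisect_left

def chain_rectangles(rects):
    """Sweep-line chaining of weighted rectangles (local alignments).
    Each rectangle is (x1, x2, h, l, w): its x-span, top y-coordinate h,
    bottom y-coordinate l, and weight w.  Rectangle j may precede i
    iff x2_j < x1_i and l_j < h_i.  The parallel lists ls/vs play the
    role of L; both stay strictly increasing, so one bisect answers
    'largest l_j < h_i'.  Assumes distinct l values."""
    events = []
    for i, (x1, x2, h, l, w) in enumerate(rects):
        events.append((x1, 0, i))    # 0: leftmost end, compute V(i)
        events.append((x2, 1, i))    # 1: rightmost end, maybe insert
    events.sort()                    # left edges before right edges at ties
    V = [0.0] * len(rects)
    ls, vs = [], []                  # l-coordinates and V scores
    for _, kind, i in events:
        x1, x2, h, l, w = rects[i]
        if kind == 0:                # V(i) = w(i) + best V(j) with l_j < h_i
            k = bisect_left(ls, h)
            V[i] = w + (vs[k - 1] if k > 0 else 0.0)
        else:                        # insert (l, V(i)) unless dominated
            k = bisect_left(ls, l)
            if k > 0 and vs[k - 1] >= V[i]:
                continue             # a rectangle ending no lower already scores >= V(i)
            while k < len(ls) and vs[k] <= V[i]:
                del ls[k]; del vs[k] # remove now-useless rectangles
            ls.insert(k, l); vs.insert(k, V[i])
    return max(V) if V else 0.0
```

With a balanced tree instead of list insertion/deletion, each of the 2N events costs O(log N), giving the O(N log N) total analyzed on the next slide.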
Example
[Figure: rectangles 1–5 with weights 5, 6, 3, 4, 2, swept left to right across their x-coordinates]
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes O(log N)
• Inserting into L takes O(log N)
• All deletions are consecutive, so O(log N) per deletion
• Each element is deleted at most once: O(N log N) for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Examples
Human Genome Browser
Whole-genome alignment Rat—Mouse—Human
Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
Protein Classification
PDB Growth
[Chart: number of new PDB structures per year]
Protein classification
• The number of protein sequences grows exponentially
• The number of solved structures grows exponentially
• The number of new folds identified is very small (and close to constant)
• Protein classification can:
  generate an overview of structure types
  detect similarities (evolutionary relationships) between protein sequences
Morten Nielsen, CBS, BioCentrum, DTU
SCOP release 1.67
Class                            # folds   # superfamilies   # families
All alpha proteins                  202        342              550
All beta proteins                   141        280              529
Alpha and beta proteins (a/b)       130        213              593
Alpha and beta proteins (a+b)       260        386              650
Multi-domain proteins                40         40               55
Membrane & cell surface              42         82               91
Small proteins                       72        104              162
Total                               887       1447             2630
Protein structure classification
[Diagram: classification hierarchy: protein world → protein fold → protein superfamily → protein family]
Structure Classification Databases
• SCOP Manual classification (A. Murzin) scop.berkeley.edu
• CATH Semi-manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath
• FSSP Automatic classification (L. Holm)
www.ebi.ac.uk/dali/fssp/fssp.html
Protein Classification
• Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?
Methods
• BLAST / PsiBLAST
• Profile HMMs
• Supervised Machine Learning methods
[Diagram: hierarchy of proteins, families, superfamilies, and folds, with a new protein to be placed]
PSI-BLAST
Given a sequence query x, and database D
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct a position-specific matrix M
• Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
Profile M
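Step 3’s sequence weighting can be illustrated with the position-based scheme of Henikoff & Henikoff (1994). This sketch is my own (function name and input format are assumptions), not PSI-BLAST’s actual code:

```python
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff & Henikoff 1994).
    At each alignment column, a residue shared by s sequences among r
    distinct residues contributes 1/(r*s) to every sequence carrying
    it, so groups of near-identical sequences end up sharing weight.
    Input: a list of equal-length aligned strings."""
    n = len(alignment)
    w = [0.0] * n
    for col in zip(*alignment):       # iterate over columns
        counts = Counter(col)
        r = len(counts)               # number of distinct residues here
        for i, a in enumerate(col):
            w[i] += 1.0 / (r * counts[a])
    total = sum(w)
    return [wi / total for wi in w]   # normalize to sum to 1
```

For example, in the toy alignment ["AA", "AA", "AC"] the third sequence is the odd one out and receives the largest weight.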
Profiles & Profile HMMs
• PSI-BLAST builds a profile
• Profile HMMs: more elaborate versions of a profile; intuitively, a profile that models gaps
Profile HMMs
• Each M state has a position-specific pre-computed substitution table
• Each I state has position-specific gap penalties (and in principle can have its own emission distribution)
• Each D state also has position-specific gap penalties; in principle, D→D transitions can also be customized per position
[Diagram: profile HMM for protein family F: match states M1…Mm, insert states I0…Im, delete states D1…Dm, plus BEGIN and END]
Profile HMMs
transition between match states: αM(i)M(i+1)
transitions between match and insert states: αM(i)I(i), αI(i)M(i+1)
transition within an insert state: αI(i)I(i)
transitions between match and delete states: αM(i)D(i+1), αD(i)M(i+1)
transition between delete states: αD(i)D(i+1)
emission of amino acid b at a state S: εS(b)
Profile HMMs
• transition probabilities ~ frequency of the transition in the alignment
• emission probabilities ~ frequency of the emission in the alignment
• pseudocounts are usually introduced
αkl = Akl / Σl’ Akl’
εk(b) = Ek(b) / Σb’ Ek(b’)
where Akl counts transitions from state k to state l in the alignment, and Ek(b) counts emissions of b at state k (pseudocounts included in the counts)
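A minimal sketch of turning counts into probabilities with pseudocounts (illustrative only; the uniform add-one pseudocount is one common choice, not necessarily the lecture’s):

```python
def estimate_probs(counts, pseudocount=1.0):
    """Turn raw transition (or emission) counts from an alignment into
    probabilities: a_kl = A_kl / sum_l' A_kl', with a pseudocount added
    to every count so that unseen events keep nonzero probability.
    counts: dict mapping an outcome label to its observed count."""
    adjusted = {k: v + pseudocount for k, v in counts.items()}
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}
```

For instance, transition counts {M→M: 8, M→I: 1, M→D: 0} become 9/12, 2/12, 1/12, so the never-observed M→D transition still has nonzero probability.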
Alignment of a protein to a profile HMM
To align sequence x1…xn to a profile HMM:
We will find the most likely alignment with the Viterbi DP algorithm
• Define:
  VjM(i): score of the best alignment of x1…xi to the HMM, ending in xi being emitted from Mj
  VjI(i): score of the best alignment of x1…xi to the HMM, ending in xi being emitted from Ij
  VjD(i): score of the best alignment of x1…xi to the HMM, ending in Dj (xi is the last character emitted before Dj)
• Denote by qa the frequency of amino acid a in a ‘random’ protein
• You can fill-in the details! (or, read Durbin et al. Chapter 5)
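For reference, the standard log-odds recurrences from Durbin et al., Chapter 5, written with the transition probabilities α and emissions ε defined above (qa is the background frequency):

```latex
V^M_j(i) = \log\frac{\varepsilon_{M_j}(x_i)}{q_{x_i}} +
  \max\begin{cases}
    V^M_{j-1}(i-1) + \log \alpha_{M_{j-1}M_j}\\
    V^I_{j-1}(i-1) + \log \alpha_{I_{j-1}M_j}\\
    V^D_{j-1}(i-1) + \log \alpha_{D_{j-1}M_j}
  \end{cases}
\qquad
V^I_j(i) = \log\frac{\varepsilon_{I_j}(x_i)}{q_{x_i}} +
  \max\begin{cases}
    V^M_{j}(i-1) + \log \alpha_{M_{j}I_j}\\
    V^I_{j}(i-1) + \log \alpha_{I_{j}I_j}\\
    V^D_{j}(i-1) + \log \alpha_{D_{j}I_j}
  \end{cases}
\qquad
V^D_j(i) =
  \max\begin{cases}
    V^M_{j-1}(i) + \log \alpha_{M_{j-1}D_j}\\
    V^I_{j-1}(i) + \log \alpha_{I_{j-1}D_j}\\
    V^D_{j-1}(i) + \log \alpha_{D_{j-1}D_j}
  \end{cases}
```

If insert states emit with the background distribution (εIj = q), the log-odds term in the insert recurrence vanishes, a common simplification.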
How to build a profile HMM
Resources on the web
• HMMer – a free profile HMM software http://hmmer.wustl.edu/
• SAM – another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html
• PFAM – database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/Software/Pfam/
• SCOP – a structural classification of proteins http://scop.berkeley.edu/data/scop.b.html
Classification with Profile HMMs
[Diagram: the new protein is scored against a profile HMM built for each family, superfamily, or fold in the hierarchy]
Classification with Profile HMMs
• How generative models work
• Training examples: sequences known to be members of the family (positives)
• Parameters are tuned with a priori knowledge
• The model assigns a probability to any given protein sequence
• Idea: sequences from the family (hopefully) yield a higher probability than sequences outside the family
• Log-likelihood ratio as score
L(X) = log [P(X|H1) P(H1)] / [P(X|H0) P(H0)] = log [P(H1|X) P(X)] / [P(H0|X) P(X)] = log P(H1|X) / P(H0|X)
Generative Models
Discriminative Models -- SVM
• Decision rule: classify x as red if vTx > 0
• If x1 … xn are the training examples, sign(Σi αi xiTx) “decides” where x falls
• Train the αi to achieve the best margin
• Maximizing the margin subject to |v| ≤ 1 is equivalent to fixing a margin of 1 and keeping |v| small
k-mer based SVMs for protein classification
• For a given word size k and mismatch tolerance l, define
  K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches
• Define the normalized kernel K’(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))
• An SVM can be learned by supplying this kernel function

Example, with k = 3 and l = 1:
  X = A B A C A R D I
  Y = A B R A D A B I
  K(X, Y) = 4
  K’(X, Y) = 4 / sqrt(7 · 7) = 4/7
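One reading of this count that reproduces the example’s numbers (my own illustrative code, not a standard library; unordered pairs {u, v} of k-mer types within Hamming distance l are each counted once):

```python
from math import sqrt

def kmer_kernel(x, y, k=3, l=1):
    """Count unordered pairs {u, v}, u a k-mer of x and v a k-mer of y,
    with Hamming distance <= l.  Each pair of distinct k-mer types is
    counted once (frozenset), and a shared k-mer counts as one pair."""
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    pairs = set()
    for u in kmers(x):
        for v in kmers(y):
            if sum(a != b for a, b in zip(u, v)) <= l:
                pairs.add(frozenset((u, v)))
    return len(pairs)

def normalized_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))."""
    return kmer_kernel(x, y, k, l) / sqrt(
        kmer_kernel(x, x, k, l) * kmer_kernel(y, y, k, l))
```

On the slide’s strings this gives K(X, Y) = 4, K(X, X) = K(Y, Y) = 7, and K’(X, Y) = 4/7, matching the example.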
SVMs will find a few support vectors
After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X
Benchmarks