CS262 Lecture 9, Win07, Batzoglou: Rapid Global Alignments. How to align genomic sequences in (more or less) linear time

Post on 19-Dec-2015


CS262 Lecture 9, Win07, Batzoglou

Rapid Global Alignments

How to align genomic sequences in (more or less) linear time


Saving cells in DP

1. Find local alignments

2. Chain: O(N log N) L.I.S.

3. Restricted DP


Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)


The Problem: Find a Chain of Local Alignments

Chaining (x, y) → (x’, y’) requires

x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight
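Before turning to sparse dynamic programming, it can help to see the problem solved by a plain quadratic DP. The sketch below is only a baseline under simplifying assumptions: each local alignment is treated as a single point (x, y) with a weight, and the name `best_chain` and the tuple layout are illustrative, not from the lecture.

```python
def best_chain(alignments):
    """Quadratic-time chaining baseline.

    alignments: list of (x, y, weight) tuples (hypothetical layout).
    Alignment k may precede alignment i only if x_k < x_i and y_k < y_i.
    Returns (best total weight, list of indices in chain order).
    """
    if not alignments:
        return 0, []
    # Process alignments left to right so every predecessor is scored first.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][:2])
    best = {}  # best[i]: highest weight of a chain ending at alignment i
    pred = {}  # pred[i]: previous alignment in that chain, or None
    for pos, i in enumerate(order):
        xi, yi, wi = alignments[i]
        best[i], pred[i] = wi, None
        for k in order[:pos]:
            xk, yk, _ = alignments[k]
            if xk < xi and yk < yi and best[k] + wi > best[i]:
                best[i], pred[i] = best[k] + wi, k
    # Backtrack from the best chain end.
    end = max(best, key=best.get)
    chain = []
    node = end
    while node is not None:
        chain.append(node)
        node = pred[node]
    return best[end], chain[::-1]
```

This costs O(N²) for N alignments, which is the cost the O(N log N) methods on the following slides avoid.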


Sparse Dynamic Programming

[Figure: grid of match points between x = 15 3 24 16 20 4 24 3 11 18 (columns) and y = 4 20 24 3 11 15 11 4 18 20 (rows)]

• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead


Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

x = x1, …, xm

• Find a subsequence

s = s1, …, sk

s1 < s2 < … < sk


Sparse Dynamic Programming – L.I.S.

Let input be w: w1, …, wn

INITIALIZATION:
  L: last LIS elt. array; L[0] = -inf, L[1] = w1, L[2…n] = +inf
  B: array holding LIS elts; B[0] = 0, B[1] = 1
  P: array of backpointers
  // L[j]: smallest jth element wi of j-long LIS seen so far

ALGORITHM
  for i = 2 to n {
    Find j such that L[j – 1] < w[i] ≤ L[j]
    L[j] ← w[i]
    B[j] ← i
    P[i] ← B[j – 1]
  }

That’s it!!!

• Running time?
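Because L stays sorted, the "find j" step can be a binary search, which gives O(n log n) overall. The following is a minimal Python sketch of the pseudocode, using 0-based indices; the function name and the variable names `tails`, `tail_idx`, and `prev` (standing in for L, B, and P) are illustrative choices, not from the lecture.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []             # tails[j]: smallest last element of any increasing run of length j + 1 (L)
    tail_idx = []          # tail_idx[j]: position in w of tails[j] (B)
    prev = [-1] * len(w)   # prev[i]: position of the element before w[i] in its run (P)
    for i, value in enumerate(w):
        # Binary search for the first j with tails[j] >= value.
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)
            tail_idx.append(i)
        else:
            tails[j] = value
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else -1
    # Follow backpointers from the end of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        result.append(w[i])
        i = prev[i]
    return result[::-1]
```

Each of the n iterations does one O(log n) search and O(1) updates, which answers the running-time question on the slide.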


Sparse LCS expressed as LIS

Create a sequence w

• Every matching point (i, j), is inserted into w as follows:

• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order

• The 11 example points are inserted in the order given

• a = (y, x), b = (y’, x’) can be chained iff

a is before b in w, and y < y’

[Figure: match grid for x = 15 3 24 16 20 4 24 3 11 18 (columns) and y = 4 20 24 3 11 15 11 4 18 20 (rows); the 11 matching points are numbered in their insertion order]


Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y


Sparse Dynamic Programming for LIS

Example:

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L = [L1] [L2] [L3] [L4] [L5] …

1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence:

s = 4, 24, 3, 11, 18
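The reduction can be checked end to end with a short script. The sketch below takes the 11 match points listed for w as (y, x) pairs, sorts them column by column with rows decreasing, and runs a strict LIS on the row coordinate. The function name `sparse_lcs` and its signature are assumptions made for illustration.

```python
from bisect import bisect_left

def sparse_lcs(x, matches):
    """matches: 1-based (row in y, column in x) pairs where the symbols agree.

    Returns (w, s): the ordered point sequence w and one longest common
    subsequence s read off the chosen points.
    """
    # Column by column; within a column, decreasing row, so that two points
    # in the same column can never both appear in an increasing run of rows.
    w = sorted(matches, key=lambda p: (p[1], -p[0]))
    tails, tail_idx, prev = [], [], [-1] * len(w)
    for t, (row, _col) in enumerate(w):
        j = bisect_left(tails, row)          # strict increase in the row coordinate
        if j == len(tails):
            tails.append(row)
            tail_idx.append(t)
        else:
            tails[j] = row
            tail_idx[j] = t
        prev[t] = tail_idx[j - 1] if j > 0 else -1
    chain = []
    t = tail_idx[-1] if tail_idx else -1
    while t != -1:
        chain.append(w[t])
        t = prev[t]
    chain.reverse()
    return w, [x[col - 1] for _row, col in chain]

x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
points = [(1, 6), (2, 5), (3, 3), (3, 7), (4, 2), (4, 8),
          (5, 9), (7, 9), (8, 6), (9, 10), (10, 5)]
w, s = sparse_lcs(x, points)
print(w)
print(s)
```

With these inputs the sorted sequence reproduces w as listed, and the chain (1,6) (3,7) (4,8) (5,9) (9,10) yields s = 4, 24, 3, 11, 18, matching step 11 of the trace.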


Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj: smallest (North) to largest (South) value

L is implemented as a balanced binary tree

[Figure: a rectangle on the y-axis, with h marking its top (North) edge and l its bottom (South) edge]


Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” for subsequent chaining

• In L, keep rectangles j sorted by increasing lj-coordinate; after removing the useless ones, L is also sorted by increasing V(j) score

[Figure: rectangles a and b with scores V(a) and V(b)]


Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of rectangle i:

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

a. k: rectangle in L, with largest lk ≤ li

b. If V(i) > V(k):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lj, V(j), j) with V(j) ≤ V(i) & lj ≥ li

[Figure: rectangles i, j, and k]

Is k ever removed?
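The sweep can be sketched in Python as follows. This is an illustration under stated assumptions rather than the lecture's implementation: rectangles are given as hypothetical tuples (x_left, x_right, h, l, weight) with h ≤ l, left-end events are processed before right-end events at the same x so that chaining requires strict separation, and plain sorted lists with `bisect` stand in for the balanced binary tree (so inserts and deletes cost O(N) in the worst case here, not O(log N)).

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """Return (best chain score, rectangle indices in chain order)."""
    n = len(rects)
    if n == 0:
        return 0, []
    events = []
    for i, (x_left, x_right, _h, _l, _w) in enumerate(rects):
        events.append((x_left, 0, i))    # 0 = leftmost end, sorts first on ties
        events.append((x_right, 1, i))   # 1 = rightmost end
    events.sort()
    ls, vs, ids = [], [], []   # the list L: l-coordinates, scores, rectangle ids
    V = [0] * n
    pred = [-1] * n
    for _x, kind, i in events:
        _xl, _xr, h, l, weight = rects[i]
        if kind == 0:
            # Step 1: j is the rectangle in L with the largest l_j < h_i.
            pos = bisect_left(ls, h)
            V[i] = weight + (vs[pos - 1] if pos > 0 else 0)
            pred[i] = ids[pos - 1] if pos > 0 else -1
        else:
            # Step 2: k is the rectangle in L with the largest l_k <= l_i.
            k = bisect_right(ls, l) - 1
            if k < 0 or V[i] > vs[k]:
                p = bisect_left(ls, l)
                q = p
                # Entries with l_j >= l_i and V(j) <= V(i) are consecutive.
                while q < len(ls) and vs[q] <= V[i]:
                    q += 1
                del ls[p:q], vs[p:q], ids[p:q]
                ls.insert(p, l)
                vs.insert(p, V[i])
                ids.insert(p, i)
    end = max(range(n), key=V.__getitem__)
    chain = []
    while end != -1:
        chain.append(end)
        end = pred[end]
    return max(V), chain[::-1]
```

Because L is kept increasing in both l and V, the best predecessor in step 1 is simply the last entry with l below h, and the entries removed in step 2 form one contiguous run.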


Example

[Figure: five rectangles in the x-y plane with weights a: 5, b: 6, c: 3, d: 4, e: 2, processed with the two sweep steps (score at the leftmost end, insert and prune at the rightmost end). Resulting scores: V(a) = 5, V(b) = 11, V(c) = 8, V(d) = 12, V(e) = 13. The list L holds triplets (li, V(i), i).]


Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree


Examples

Human Genome Browser

ABC


Gene Recognition


Gene structure

[Figure: gene with exon1, intron1, exon2, intron2, exon3; arrows labeled transcription, splicing, translation]

exon = protein-coding

intron = non-coding

Codon: a triplet of nucleotides that is converted to one amino acid

Where are the genes?

Needles in a Haystack


Gene Finding

• Classes of gene predictors

  – Ab initio
    • Only look at the genomic DNA of target genome

  – De novo
    • Target genome + aligned informant genome(s)

  – EST/cDNA-based & combined approaches
    • Use aligned ESTs or cDNAs + any other kind of evidence

[Figure: gene with five EXON blocks; below it, a 12-species alignment of one exon (uppercase) with flanking intron sequence (lowercase)]

Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg


Signals for Gene Finding

1. Regular gene structure

2. Exon/intron lengths

3. Codon composition

4. Motifs at the boundaries of exons, introns, etc.: start codon, stop codon, splice sites

5. Patterns of conservation

6. Sequenced mRNAs

7. (PCR for verification)


Next Exon: Frame 0

Next Exon: Frame 1


Exon and Intron Lengths


Nucleotide Composition

• Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
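To make the codon table concrete, here is a small Python sketch that loads it into a dictionary and translates a DNA string in a chosen reading frame. The names `CODON_TO_AA` and `translate` are illustrative. The three stop codons (TAA, TAG, TGA) are not in the table above; they are added from the standard genetic code and shown as `*`.

```python
# One line per amino acid: single-letter code followed by its codons,
# copied from the table.
CODON_TABLE_TEXT = """
I ATT ATC ATA
L CTT CTC CTA CTG TTA TTG
V GTT GTC GTA GTG
F TTT TTC
M ATG
C TGT TGC
A GCT GCC GCA GCG
G GGT GGC GGA GGG
P CCT CCC CCA CCG
T ACT ACC ACA ACG
S TCT TCC TCA TCG AGT AGC
Y TAT TAC
W TGG
Q CAA CAG
N AAT AAC
H CAT CAC
E GAA GAG
D GAT GAC
K AAA AAG
R CGT CGC CGA CGG AGA AGG
"""

CODON_TO_AA = {}
for line in CODON_TABLE_TEXT.split("\n"):
    fields = line.split()
    if fields:
        for codon in fields[1:]:
            CODON_TO_AA[codon] = fields[0]

# Stop codons of the standard genetic code (assumption: not on the slide).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def translate(dna, frame=0):
    """Translate dna in reading frame 0, 1, or 2; stops are shown as '*'."""
    dna = dna.upper()
    protein = []
    for start in range(frame, len(dna) - 2, 3):
        codon = dna[start:start + 3]
        if codon in STOP_CODONS:
            protein.append("*")
        else:
            protein.append(CODON_TO_AA.get(codon, "?"))  # '?' for non-ACGT input
    return "".join(protein)
```

For example, `translate("ATGGCCTGA")` reads ATG, GCC, TGA. Since several codons map to one amino acid, coding sequence constrains base composition differently at each codon position, which is the signal gene finders exploit.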


[Figure: sequence with highlighted signal motifs: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc]


Splice Sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)


HMMs for Gene Recognition

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

[Figure: the sequence above annotated with intergene, exon, and intron regions; HMM states shown: Intergene State, First Exon State, Intron State]


Duration HMMs for Gene Recognition

[Figure: DNA sequence segmented into Exon1, Exon2, Exon3; an exon has duration d]

$\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$

$P_{\mathrm{EXON\_DUR}}(d) \prod_{i = j+2} P_{\mathrm{EXON}((i - j + 2) \% 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$

$P_{5'\mathrm{SS}}(x_{i-3} \ldots x_{i+4})$

$P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$


Genscan

• Burge, 1997

• First competitive HMM-based gene finder, huge accuracy jump

• Only gene finder at the time to predict partial genes and genes in both strands

Features

– Duration HMM

– Four different parameter sets
  • Very low, low, med, high GC-content


Using Comparative Information

• Hox cluster is an example where everything is conserved


Patterns of Conservation

             Mutations   Gaps      Frameshifts
Genes        30%         1.3%      0.14%
Intergenic   58%         14%       10.2%
Separation   2-fold      10-fold   75-fold
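The separation row can be checked directly, reading the percentages as genes 30%, 1.3%, 0.14% and intergenic 58%, 14%, 10.2% for mutations, gaps, and frameshifts respectively (this pairing is inferred from the quoted fold values):

$$\frac{58}{30} \approx 1.9, \qquad \frac{14}{1.3} \approx 10.8, \qquad \frac{10.2}{0.14} \approx 72.9$$

These round to the quoted 2-fold, 10-fold, and 75-fold, so frameshifting gaps separate genes from intergenic sequence far more sharply than point mutations do.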


Comparison-based Gene Finders

• Rosetta, 2000

• CEM, 2000
  – First methods to apply comparative genomics (human-mouse) to improve gene prediction

• Twinscan, 2001
  – First HMM for comparative gene prediction in two genomes

• SLAM, 2002
  – Generalized pair-HMM for simultaneous alignment and gene prediction in two genomes

• NSCAN, 2006
  – Best method to-date based on a phylo-HMM for multiple genome gene prediction


Twinscan

1. Align the two sequences (eg. from human and mouse)

2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )

New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }

Emission distributions ek(b) estimated from real genes from human/mouse

eI(x|) < eE(x|): matches favored in exons

eI(x-) > eE(x-): gaps (and mismatches) favored in introns

Example

H:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
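Step 2 is easy to sketch in code. The function below takes two aligned, equal-length strings and emits one symbol of the 12-letter alphabet per target base. The name `conservation_symbols` is illustrative, and skipping alignment columns where the target itself has a gap is an assumption of this sketch.

```python
def conservation_symbols(target, informant):
    """Return one symbol per target base: base + '|' (match), ':' (mismatch), or '-' (gap)."""
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            continue              # no target base in this column (assumption: skip it)
        if q == "-":
            mark = "-"            # informant has a gap opposite this base
        elif t == q:
            mark = "|"            # match
        else:
            mark = ":"            # mismatch
        symbols.append(t + mark)
    return symbols

human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
symbols = conservation_symbols(human, mouse)
print("".join(s[1] for s in symbols))
```

On the example pair, the second character of each symbol spells out the alignment line shown on the slide. The resulting symbol list is what the Viterbi pass in step 3 would consume.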


SLAM – Generalized Pair HMM

[Figure: aligned exon pair with lengths d and e]

Exon GPHMM

1. Choose exon lengths (d, e).

2. Generate alignment of length d + e.


NSCAN—Multiple Species Gene Prediction

• GENSCAN

Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA

• TWINSCAN

Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation  |||:||:||:|||||:||||||||......
sequence

• N-SCAN

Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1    GGTCAGC___CCAAGAACGTGTAG......
Informant2    GATCAGC___CCAAGAACGTGTAG......
Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
...

Target sequence: $P(T_i \mid T_{i-1}, \ldots, T_{i-o})$

Informant sequences (vector): $P(\mathbf{I}_i \mid T_i, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$

Joint prediction (use phylo-HMM): $P(T_i, \mathbf{I}_i \mid T_{i-1}, \ldots, T_{i-o}, \mathbf{I}_{i-1}, \ldots, \mathbf{I}_{i-o})$


NSCAN—Multiple Species Gene Prediction

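[Figure: phylogenetic tree rooted at X, with ancestral nodes X, Y, Z and leaves C, H, M, R: X has children C and Y; Y has children H and Z; Z has children M and R]

$$P(H, C, M, R, X, Y, Z) = P(X)\,P(C \mid X)\,P(Y \mid X)\,P(H \mid Y)\,P(Z \mid Y)\,P(M \mid Z)\,P(R \mid Z)$$

[Figure: the same tree rerooted at H]

$$P(H, C, M, R, X, Y, Z) = P(H)\,P(Y \mid H)\,P(X \mid Y)\,P(Z \mid Y)\,P(C \mid X)\,P(M \mid Z)\,P(R \mid Z)$$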


Performance Comparison

GENSCAN: Generalized HMM; models human sequence

TWINSCAN: Generalized HMM; models human/mouse alignments

N-SCAN: Phylo-HMM; models multiple sequence evolution

[Figure: accuracy chart; labels: NSCAN human/mouse > Human/multiple informants]


CONTRAST

• 2-level architecture

• No Phylo-HMM that models alignments

[Figure: the 12-species alignment (Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken) with SVM classifiers applied to it; variables X and Y; interval bounds a, b]


CONTRAST - Features

• $\log P(y \mid x) \sim w^T F(x, y)$

• $F(x, y) = \sum_i f(y_{i-1}, y_i, i, x)$

• $f(y_{i-1}, y_i, i, x)$:

  1{y_{i-1} = INTRON, y_i = EXON_FRAME_1}

  1{y_{i-1} = EXON_FRAME_1, x_{human,i-2}, …, x_{human,i+3} = ACCGGT}

  1{y_{i-1} = EXON_FRAME_1, x_{human,i-1}, …, x_{dog,i+1} = ACC, AGC}

  (1-c) 1{a < SVM_DONOR(i) < b}

  (optional) 1{EXON_FRAME_1, EST_EVIDENCE}
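The score w^T F(x, y) is a weighted sum of indicator features over adjacent label pairs. The sketch below shows the bookkeeping only; the two features, their names, and the weights are hypothetical examples in the style of the list above, not CONTRAST's actual feature set, and the result is the unnormalized score rather than log P(y | x) itself.

```python
def score_labeling(x, y, features, weights):
    """Return w^T F(x, y), where F(x, y) = sum over i of f(y[i-1], y[i], i, x)."""
    total = 0.0
    for i in range(1, len(y)):
        for name, f in features.items():
            total += weights.get(name, 0.0) * f(y[i - 1], y[i], i, x)
    return total

# Hypothetical indicator features (1.0 when the condition holds, else 0.0).
features = {
    # Label transition only.
    "intron_to_exon": lambda prev, cur, i, x: float(
        prev == "INTRON" and cur == "EXON_FRAME_1"),
    # Label transition combined with the two bases just before position i.
    "intron_to_exon_after_AG": lambda prev, cur, i, x: float(
        prev == "INTRON" and cur == "EXON_FRAME_1" and x[i - 2:i] == "AG"),
}
weights = {"intron_to_exon": 1.5, "intron_to_exon_after_AG": 2.0}  # made-up values

x = "TTAGAC"
y = ["INTRON"] * 4 + ["EXON_FRAME_1"] * 2
score = score_labeling(x, y, features, weights)
print(score)
```

Because every feature looks only at y_{i-1}, y_i, the position i, and the observed sequence x, the best labeling can still be found by dynamic programming over positions.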


CONTRAST – SVM accuracies

• Accuracy increases as we add informants

• Diminishing returns after ~5 informants

[Figure: SVM sensitivity (SN) and specificity (SP)]


CONTRAST - Decoding

Viterbi Decoding:

maximize P(y | x)

Maximum Expected Boundary Accuracy Decoding:

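maximize $\sum_{i,B} 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ \mathrm{Accuracy}(y_{i-1}, y_i, B \mid x)$

$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = P(y_{i-1}, y_i \text{ is } B \mid x) - \bigl(1 - P(y_{i-1}, y_i \text{ is } B \mid x)\bigr)$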
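Writing $p = P(y_{i-1}, y_i \text{ is } B \mid x)$ for the posterior probability of the boundary, the accuracy term simplifies as a short worked step:

$$\mathrm{Accuracy}(y_{i-1}, y_i, B \mid x) = p - (1 - p) = 2p - 1$$

So a predicted boundary contributes a positive amount to the objective exactly when $p > 1/2$, and a negative amount otherwise. For instance, $p = 0.8$ gives $+0.6$, while $p = 0.3$ gives $-0.4$.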


CONTRAST - Training

Maximum Conditional Likelihood Training:

maximize $L(w) = P_w(y \mid x)$

Maximum Expected Boundary Accuracy Training:

$\mathrm{ExpectedBoundaryAccuracy}(w) = \sum_i \mathrm{Accuracy}_i$

$\mathrm{Accuracy}_i = \sum_B 1\{y_{i-1}, y_i \text{ is exon boundary } B\}\ P_w(y_{i-1}, y_i \text{ is } B \mid x) - \sum_{B' \neq B} P_w(y_{i-1}, y_i \text{ is exon boundary } B' \mid x)$


Performance Comparison

[Figure: performance charts for De Novo and EST-assisted prediction]

Species: Human, Macaque, Mouse, Rat, Rabbit, Dog, Cow, Armadillo, Elephant, Tenrec, Opossum, Chicken
