
Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science

Heuristic methods for sequence alignment

BLAST, BLAT and more


Motivation

• Align a sequence to a large collection of sequences

• Similarity implies related history; related history suggests shared structure/function

• Most sequences in the collection are unrelated

• Specificity, sensitivity, performance


A note on the role of bioinformatics (BI)

• Most biological knowledge is obtained by hard, slow, expensive experiments.

• Successful BI strategies involve knowledge transfer.
– If X is A, and Y≈X, then Y is A.

• Nature reuses winning strategies.

• Nature abuses winning strategies.


Problem definition

• Given a query sequence and a collection of sequences, find which are similar.

• Use statistical estimates!
– For a database search, a cutoff is enough.
– For a pair-wise comparison, a probability is required.


The Naïve solution

• For each target sequence

– Align to query using dynamic programming

– Report alignment if above cutoff

Too slow!!!


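The Naïve solution (cont.)

[Figure: a short query (ACGATCATCATGCA) compared in turn against the database sequences seq1, seq2, seq3 (ACGATCATCATGCTATGACTGCATAGCTAGACTCGCATAGCATCGACT)]

For an EST against dbEST: $500 \times 500 \times 6 \times 10^{6} = 1.5 \times 10^{12}$ cells, $O(10^{15})$ calculations. At $10^{9}$ calculations/second, this will take $10^{6}$ seconds ≈ 11 days.

For human against mouse: $3 \times 10^{9} \times 3 \times 10^{9} \approx 10^{19}$ cells, $O(10^{28})$ calculations. At $10^{9}$ calculations/second, this will take $10^{19}$ seconds ≈ $10^{11}$ years.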
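The estimates above can be rechecked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the operation counts ($10^{15}$ and $10^{28}$) are taken from the slide as given rather than derived.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def dp_cells(query_len, target_len, n_targets=1):
    """Number of dynamic-programming cells for query vs. n_targets targets."""
    return query_len * target_len * n_targets


def runtime_seconds(n_calculations, rate=1e9):
    """Time needed at `rate` calculations per second (1e9 on the slide)."""
    return n_calculations / rate


# EST (~500 bases) against 6 million ESTs of ~500 bases each.
est_cells = dp_cells(500, 500, 6_000_000)
est_days = runtime_seconds(1e15) / SECONDS_PER_DAY

# Human genome against mouse genome, 3 billion bases each.
genome_cells = dp_cells(3_000_000_000, 3_000_000_000)
genome_years = runtime_seconds(1e28) / SECONDS_PER_YEAR

print(f"EST vs dbEST: {est_cells:.1e} cells, about {est_days:.1f} days")
print(f"human vs mouse: {genome_cells:.1e} cells, about {genome_years:.1e} years")
```

Both results land in the same order of magnitude as the slide's figures, which is all the argument needs: exhaustive dynamic programming is out of reach at this scale.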


BLAT

• 3 billion bases of genome

• 40 thousand sequenced mRNAs

• 3 million sequenced ESTs of ~500 bases each

• 15 million mouse reads of ~500 bases


Perfect Matches Serve as Seeds

• Computers can look for exact matches very quickly.

• Finding inexact matches is slow.

• Inexact matches should contain some short exact matches.

• Inexact matches should contain multiple even shorter exact matches.

Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip


ggagaatagggcatgctctgaggtctgctggaacccatcc
1  12 123456 12 12 12345   1 12  12345 1
gtggattagggcttgttccgaggttatcgggttcccatac

ttcttgtctcgctccagggcaccgtgcaggaaatcccggg
2345 1 123 1 12 1234567  1234567  1234
ttctgggctctcgcccgggcacctagcaggaatacccgat

acacctcctcattctcatccagccactggatgacgaaggg
  123  123456  123 123  1  12345 12345
ggacccgctcattaccatacagtaaacggatggcgaagac

Distribution of identical matches:
length: 1 2 3 4 5 6 7 8
number: 5 5 4 1 5 2 2 0

Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
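The run-length distribution on this slide can be recomputed directly from the two sequences. The sketch below assumes the three rows are one continuous alignment, so a run may continue across a row break; the helper name `match_run_lengths` is made up for illustration.

```python
from collections import Counter
from itertools import groupby

# The two aligned sequences from the slide, concatenated across the three rows.
SEQ_A = ("ggagaatagggcatgctctgaggtctgctggaacccatcc"
         "ttcttgtctcgctccagggcaccgtgcaggaaatcccggg"
         "acacctcctcattctcatccagccactggatgacgaaggg")
SEQ_B = ("gtggattagggcttgttccgaggttatcgggttcccatac"
         "ttctgggctctcgcccgggcacctagcaggaatacccgat"
         "ggacccgctcattaccatacagtaaacggatggcgaagac")


def match_run_lengths(a, b):
    """Lengths of maximal runs of identical, aligned positions."""
    flags = [x == y for x, y in zip(a, b)]
    return [len(list(run)) for is_match, run in groupby(flags) if is_match]


distribution = Counter(match_run_lengths(SEQ_A, SEQ_B))
for length in range(1, 9):
    print(length, distribution.get(length, 0))
```

No run reaches length 8, which is the point of the slide: a seed has to be short enough that related sequences still contain it.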


BLAT overview

1. Use an index to find regions in genome homologous to query.

2. Do a detailed alignment between query and homologous regions.

3. Use dynamic programming to stitch the detailed alignments of the regions together into a detailed alignment of the whole.

Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip


Indexing

• Build an index which contains the positions of each K-mer in the genome. K is typically between 9 and 13 for cDNA alignments.

• Step through each K-mer in cDNA chunk and look it up in index.

• Get list of ‘hits’ - positions in cDNA and in genome that match for K bases.

• Cluster hits to find homologous regions.

Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip


Specificity vs. sensitivity of indexes

K: the K-mer size
M: match ratio in homologous regions (98% for cDNA/genomic of same species, 89% for human/mouse proteins)
H: length of homologous region (50-200 for human exons)
G: database size
Q: the size of the query sequence
A: alphabet size
P: the probability that at least one non-overlapping K-mer will match
F: the number of unspecific matched K-mers

$P_1 = M^K$

$T = \lfloor H/K \rfloor$

$P = 1 - (1 - P_1)^T = 1 - (1 - M^K)^T$

$F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$

Kent (2002), Genome Research 12(4):656-664
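The two formulas are easy to evaluate numerically. The following is a minimal sketch; the function names are illustrative, and the example values (H = 100, G = 3 billion, Q = 500, A = 4) are chosen only to show the trade-off between sensitivity and specificity.

```python
def p_single_kmer(m, k):
    """P1 = M^K: probability that one specific K-mer matches perfectly."""
    return m ** k


def p_at_least_one(m, k, h):
    """P = 1 - (1 - M^K)^T with T = floor(H/K) non-overlapping K-mers."""
    t = h // k
    return 1.0 - (1.0 - p_single_kmer(m, k)) ** t


def expected_false_hits(q, g, k, a):
    """F = (Q - K + 1) * (G/K) * (1/A)^K: K-mer matches expected by chance."""
    return (q - k + 1) * (g / k) * (1.0 / a) ** k


# Sensitivity falls and specificity improves as K grows.
for k in (7, 11, 14):
    p = p_at_least_one(0.81, k, 100)
    f = expected_false_hits(500, 3e9, k, 4)
    print(f"K={k:2d}  P={p:.3f}  F={f:.3g}")
```

Raising K from 7 to 14 cuts the chance hits by several orders of magnitude, but at 81% identity it also drops the detection probability from about 0.97 to about 0.31.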


Sensitivity and Specificity of Single Perfect Nucleotide K-mer Matches as a Search Criterion

(A)
Identity  K=7    8      9      10     11     12     13     14
81%       0.974  0.915  0.833  0.726  0.607  0.486  0.373  0.314
83%       0.988  0.953  0.897  0.815  0.711  0.595  0.478  0.415
85%       0.996  0.978  0.945  0.888  0.808  0.707  0.594  0.532
87%       0.999  0.992  0.975  0.942  0.888  0.811  0.714  0.659
89%       1.000  0.998  0.991  0.976  0.946  0.897  0.824  0.782
91%       1.000  1.000  0.998  0.993  0.981  0.956  0.912  0.886
93%       1.000  1.000  1.000  0.999  0.995  0.987  0.968  0.957
95%       1.000  1.000  1.000  1.000  0.999  0.998  0.994  0.991
97%       1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.999

(B)
K  7        8        9       10      11     12    13    14
F  1.3e+07  2.9e+06  635783  143051  32512  7451  1719  399

(A) Columns are for K sizes of 7-14. Rows represent various percentage identities between the homologous sequences. The table entries show the fraction of homologies detected as calculated from equation 3, assuming a homologous region of 100 bases. The larger the value of K, the fewer homologies are detected. (B) K represents the size of the perfect match. F shows how many perfect matches of this size are expected to occur by chance according to equation 4 in a genome of 3 billion bases using a query of 500 bases.

Kent (2002), Genome Research 12(4):656-664


Other Indexing Techniques

• Allowing one mismatch in a seed is also quite effective.

• A pair of 11-mers is the seed for BLAT cDNA alignments.

• 7 out of 8 amino acid seed used for translated BLAT alignments.


Specificity of Indexing Techniques for Mouse/Human Alignments

Seed type    P (homology is found)  Number of false positives
-------------------------------------------------------------
N 1 8-mer    0.988                  2,861,023
N 2 7-mers   0.997                  596,046
N 11 of 12   0.995                  275,671
P 1 5-mer    0.996                  31,125
P 2 4-mers   0.995                  122
P 7 of 8     0.998                  374

(N = nucleotide seed, P = protein seed.) The table assumes that a homologous region is 100 bases long and is 86% identical at the base level and 89% identical at the amino acid level.


Genome: cacaattatcacgaccgc
3-mers: cac aat tat cac gac cgc
Index:  aat 3      gac 12
        cac 0,9    tat 6
        cgc 15

Query:  aattctcac
3-mers: aat att ttc tct ctc tca cac
        0   1   2   3   4   5   6

Hits (K-mer; query position, genome position; diagonal):
        aat 0,3   3
        cac 6,0  -6
        cac 6,9   3

With modifications, from Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
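The toy example above can be reproduced in a few lines. This is a sketch of the idea, not BLAT's actual data structure: the genome is indexed on non-overlapping K-mers, the query is scanned with overlapping K-mers, and each hit records its diagonal (genome position minus query position). The function names are illustrative.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each non-overlapping K-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Look up every overlapping K-mer of the query in the index.

    Returns (kmer, query_pos, genome_pos, diagonal) tuples.
    """
    hits = []
    for q_pos in range(len(query) - k + 1):
        kmer = query[q_pos:q_pos + k]
        for g_pos in index.get(kmer, []):
            hits.append((kmer, q_pos, g_pos, g_pos - q_pos))
    return hits


index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
for hit in hits:
    print(hit)
```

Indexing only non-overlapping genome K-mers keeps the index roughly K times smaller, while stepping through every query K-mer ensures that a match is still found whatever its offset.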


“Clumping” hits

Unsorted list of hits:

K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate the diagonal (dbase_position - query_position)
• Sort on dbase position
• Clump hits within W of each other and on diagonals not more than GAP apart

• Sorting takes N*log(N) calculations, which can take a lot of time for very long lists of hits (e.g. against a genome)
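A simple version of the sort-and-clump steps might look like the sketch below. It is a greedy simplification for illustration, not BLAT's implementation: hits are (query_pos, dbase_pos) pairs, and the values used for W and GAP in the example are hypothetical.

```python
def clump_hits(hits, w, gap):
    """Greedily clump (query_pos, dbase_pos) hits.

    A hit joins the first clump whose last hit lies within `w` in database
    position and whose diagonal differs by at most `gap`; otherwise it
    starts a new clump.
    """
    clumps = []
    for q, d in sorted(hits, key=lambda hit: hit[1]):  # sort on dbase position
        diagonal = d - q
        for clump in clumps:
            last_q, last_d = clump[-1]
            if d - last_d <= w and abs(diagonal - (last_d - last_q)) <= gap:
                clump.append((q, d))
                break
        else:
            clumps.append([(q, d)])
    return clumps


# The three hits from the slide, with hypothetical W=10 and GAP=2.
clumps = clump_hits([(0, 3), (6, 0), (6, 9)], w=10, gap=2)
print(clumps)
```

The two hits on diagonal 3 end up in one clump, while the hit on diagonal -6 stays alone even though it is close in database position.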


Clumping

[Figure: hits plotted by target coordinates against query coordinates; hits within W of each other and on diagonals closer than gap_limit form a clump. Modified from Kent (2002), Genome Research 12(4):656-664]


“Clumping” hits

Unsorted list of hits:

K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate the diagonal (dbase_position - query_position)
• Drop hits into 64K buckets on DB position
• Sort on diagonal
• Proto-clump if diagonals are within GAP
• Sort on dbase position
• Clump hits within W of each other and on diagonals not more than GAP apart


Buckets

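[Figure: a target of length L divided into buckets of size b, with the window W marked. Sorting all N hits together costs N*log(N); sorting within buckets costs N*log(n), where n = N*b/L. Marginal hits near a bucket boundary are tossed over.]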


“Clumping” hits

Sorted list of hits:

K-mer  match  diagonal
CAC    6,0    -6
AAT    0,3    3
CAC    6,9    3

• Calculate the diagonal (dbase_position - query_position)
• Drop hits into 64K buckets on DB position (toss-over marginal hits)
• Sort on diagonal
• Proto-clump if diagonals are within GAP
• Sort on dbase position
• Clump hits within W of each other and on diagonals not more than GAP apart
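The bucketing and proto-clumping steps can be sketched as follows. Several things here are assumptions: "64K buckets" is read as buckets of 65,536 database positions each, the toss-over of marginal hits is left out for brevity, and the function name and GAP value are illustrative.

```python
from collections import defaultdict

BUCKET_SIZE = 64 * 1024  # assumption: 65,536 database positions per bucket


def proto_clumps(hits, gap, bucket_size=BUCKET_SIZE):
    """Bucket (query_pos, dbase_pos) hits, then proto-clump each bucket.

    Within a bucket, hits are sorted on diagonal and consecutive hits whose
    diagonals differ by at most `gap` share a proto-clump. Toss-over of hits
    near bucket boundaries is NOT handled in this sketch.
    """
    buckets = defaultdict(list)
    for q, d in hits:
        buckets[d // bucket_size].append((q, d))

    result = []
    for key in sorted(buckets):
        by_diag = sorted(buckets[key], key=lambda hit: hit[1] - hit[0])
        current = [by_diag[0]]
        for prev, hit in zip(by_diag, by_diag[1:]):
            if (hit[1] - hit[0]) - (prev[1] - prev[0]) <= gap:
                current.append(hit)
            else:
                result.append(current)
                current = [hit]
        result.append(current)
    return result


groups = proto_clumps([(0, 3), (6, 0), (6, 9), (10, 70000)], gap=2)
print(groups)
```

Each bucket is sorted independently, so no single sort has to handle the full hit list; each proto-clump would then be sorted on database position and split wherever neighbouring hits are more than W apart.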


Eliminate small clumps

[Figure: target coordinates against query coordinates; a clump within window W that does not contain enough hits is marked. Modified from Kent (2002), Genome Research 12(4):656-664]


Eliminate small clumps

[Figure: target coordinates against query coordinates. Modified from Kent (2002), Genome Research 12(4):656-664]


Clumping clumps

[Figure: target coordinates against query coordinates; clump spacings are labeled "< 300/100 (DNA/prot)" and "> 300/100 (DNA/prot)", and the region is marked "Extra 500". Modified from Kent (2002), Genome Research 12(4):656-664]


Alignment (cDNA)

• Start from scratch with the regions defined with K-mers

• Index on smaller K-mers, but extend each K-mer until it becomes specific

• Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments

• Recurse on gaps with smaller K until the gaps or the hits are eliminated

• Additional tricks (Ns, mismatches, introns)


Alignment (cDNA)

[Figure: target coordinates against query coordinates. Modified from Kent (2002), Genome Research 12(4):656-664]


Alignment (cDNA)

[Figure: target coordinates against query coordinates. Modified from Kent (2002), Genome Research 12(4):656-664]


Alignment (protein)

• Extend hits into ungapped HSPs with +2/-1 scoring scheme.

• Create a graph of all possible HSP merges

• Use dynamic programming to traverse the graph.


HSPs

[Figure: target coordinates against query coordinates. Modified from Kent (2002), Genome Research 12(4):656-664]


HSPs

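[Figure: target coordinates against query coordinates, with an ungapped extension scored position by position. Modified from Kent (2002), Genome Research 12(4):656-664]

Position score:  +2  -1  +2  +2  -1  -1  -1  -1
Running score:   +2  +1  +3  +5  +4  +3  +2  +1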


HSPs

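[Figure: target coordinates against query coordinates, showing only the first four positions of the extension. Modified from Kent (2002), Genome Research 12(4):656-664]

Position score:  +2  -1  +2  +2
Running score:   +2  +1  +3  +5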
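The running scores shown on these slides correspond to a +2/-1 ungapped extension that keeps the prefix with the highest score. The sketch below illustrates that idea with made-up sequences chosen to reproduce the same score pattern. It scans to the end of the shorter sequence instead of stopping at a drop-off, which is a simplification.

```python
def extend_right(query, target, q_start, t_start, match=2, mismatch=-1):
    """Ungapped rightward extension with +2/-1 scoring.

    Returns (best_score, best_length, running_scores), where best_length is
    the number of positions kept to reach the maximum running score.
    """
    best_score, best_len, score = 0, 0, 0
    running = []
    n = min(len(query) - q_start, len(target) - t_start)
    for i in range(n):
        same = query[q_start + i] == target[t_start + i]
        score += match if same else mismatch
        running.append(score)
        if score > best_score:
            best_score, best_len = score, i + 1
    return best_score, best_len, running


# Hypothetical sequences: match, mismatch, match, match, then four mismatches.
best, length, running = extend_right("ACGTAAAA", "ATGTCCCC", 0, 0)
print(running)       # running score after each position
print(best, length)  # the extension is trimmed back to its best prefix
```

Trimming back to the maximum is what turns the eight scored positions of the first figure into the four positions kept in the second.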


Detailed Alignments

• Smith-Waterman technique based on dynamic programming.

• “Banded” Smith-Waterman: faster, but doesn’t tolerate long inserts

• Simple extensions that don’t allow inserts.
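To make the trade-off of the banded variant concrete, here is a minimal score-only sketch. The scoring values (match +2, mismatch -1, linear gap penalty) and the function name are assumptions for illustration. Only cells with |i - j| <= band are computed, so the cost per row is proportional to the band width rather than to the target length.

```python
def banded_sw_score(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band.

    Cells outside the band are treated as score 0, which is safe for local
    alignment because any cell may start a new alignment at 0.
    """
    best = 0
    prev = {}  # row i-1: column index -> score
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(1, i - band), min(len(b), i + band) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,
                         prev.get(j - 1, 0) + s,   # diagonal: (mis)match
                         prev.get(j, 0) + gap,     # gap in b
                         cur.get(j - 1, 0) + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return best


# A 6-base insert in the target: a narrow band cannot bridge it.
narrow = banded_sw_score("AAAACCCC", "AAAATTTTTTCCCC", band=2, gap=-1)
wide = banded_sw_score("AAAACCCC", "AAAATTTTTTCCCC", band=8, gap=-1)
print(narrow, wide)
```

With the narrow band only the AAAA block is aligned; the wide band can cross the insert and join both blocks, at the price of more cells.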


Stitching Together Alignments


Repeats Complicate Things


Dynamic Programming Saves the Day


Assignment

• You are trying to compare Arabidopsis mRNAs to the genome (a high-quality genome, so fewer errors). Based on the BLAT article, how will you set the following parameters relative to their (human) defaults?
– The length and number of K-mers
– W
– The drop-off score


The BLAST algorithm

“The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words”.

Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402


Finding a “hit”

• Word: k amino acids/nucleotides

• Build a table of all words in the database

• Convert query -> words table
– For each word, derive all words with similarity score above T
– Strong statistical background for T selection

• Build a word/sequence table


Choosing a word-score cutoff

For a scoring matrix with $P_i$ and $S_{ij}$:

$S'\ \text{(bits)} = (\lambda S - \ln K)/\ln 2$

For ungapped alignment, $\lambda$ and $K$ can be calculated from $P_i$ and $S_{ij}$.

$E = N/2^{S'}$ (with $N = nm$)

$S' = \log_2(N/E)$

For a protein of length 250, compared with 50 million residues, $S'$ should be 38 bits to ensure $E < 0.05$.
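The 38-bit figure follows directly from $S' = \log_2(N/E)$. A small sketch for checking it, with illustrative function names:

```python
import math


def required_bit_score(query_len, db_len, e_value):
    """S' = log2(N / E) with search space N = query_len * db_len."""
    return math.log2(query_len * db_len / e_value)


def expected_hits(bit_score, query_len, db_len):
    """E = N / 2^S': chance alignments expected at this bit score or higher."""
    return query_len * db_len / 2 ** bit_score


bits = required_bit_score(250, 50_000_000, 0.05)
print(f"S' = {bits:.2f} bits, rounded up to {math.ceil(bits)}")
print(f"E at 38 bits = {expected_hits(38, 250, 50_000_000):.3f}")
```

Rounding up to a whole number of bits gives 38, and at 38 bits the expected number of chance hits is just under 0.05.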


Building the similar words list

• Original word: FDI

AAA -12
AAC -10
CDI 14
…
FDA 16
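Enumerating the similar words for a query word can be sketched as below. Everything numeric here is hypothetical: the five-letter alphabet and the +6/-2 scoring function are toy stand-ins for a real substitution matrix, so the scores do not match the ones on the slide.

```python
from itertools import product

ALPHABET = "ACDFI"  # hypothetical reduced alphabet, to keep the enumeration tiny


def toy_score(x, y):
    """Hypothetical scores (+6 identical, -2 otherwise), not a real matrix."""
    return 6 if x == y else -2


def neighborhood(word, threshold, alphabet=ALPHABET, score=toy_score):
    """All words of the same length scoring at least `threshold` against `word`."""
    result = {}
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result["".join(candidate)] = total
    return result


neighbors = neighborhood("FDI", threshold=10)
for w, s in sorted(neighbors.items(), key=lambda item: -item[1]):
    print(w, s)
```

With this toy scoring, a threshold of 10 keeps the word itself plus every word with exactly one substitution; lowering T enlarges the list and raises sensitivity at the cost of more hits to examine.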


Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T

• Merge alignments in the same phase within distance A

• Choose significant HSPs


Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402

The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13.
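The two-hit rule described in this caption can be sketched as a grouping of hits by diagonal. This is a simplification for illustration: it only compares consecutive hits on each diagonal, and the word length of 3 is an assumption; the window of 40 comes from the caption.

```python
from collections import defaultdict


def two_hit_triggers(hits, window=40, word_len=3):
    """Find pairs of non-overlapping hits on one diagonal within `window`.

    hits: iterable of (query_pos, db_pos). Returns (diagonal, first_query_pos,
    second_query_pos) for each pair that would trigger an extension.
    """
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)

    triggers = []
    for diag, q_positions in by_diag.items():
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if word_len <= second - first <= window:
                triggers.append((diag, first, second))
    return triggers


# Hypothetical hits: three on diagonal 5, one on diagonal 47.
triggers = two_hit_triggers([(0, 5), (10, 15), (100, 105), (3, 50)])
print(triggers)
```

Of the four hits, only the pair that lies 10 positions apart on diagonal 5 triggers an extension, which is how the heuristic can afford a lower T.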


Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T

• Merge alignments in the same phase within distance A

• Choose significant HSPs: using the equation above, set a threshold so that 1 in 50 of the sequences will be passed to the next stage by chance.


Combining HSPs with dynamic programming

• Banded Smith-Waterman

• Basic principle: significant alignments cannot occur too far from the diagonal

• How far depends on the score


[Figure from Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402]


[Figure from Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402]


The BLAST algorithm

• Find High Scoring Pairs (HSPs)
– Eliminate most of the sequences
– Allow no gaps
– Identical->similar words by enumeration

• Find nearby double hits on the same diagonal
– Increase sensitivity; allow gaps in the initial HSP

• Combine HSPs with fast dynamic programming
– Get semi-optimal alignment


Summary

• The main heuristics
– Highly similar words
– Lower stringency, require double hit
– Extend until score drops
– Fast banded Smith-Waterman

• BLAST vs. Dynamic Programming


FASTA: overview

• First fast algorithm for database searching

• Lost popularity to BLAST in the mid-90s
– No web implementation
– Slower
– Misses complex similarities

• Further drop with the introduction of BLAST2


Algorithm overview

1. Search for non-gapped alignments based on identical words.

2. Merge alignments allowing gaps.

3. Pick best N hits, and realign using SW.

Discard 99.9999%


Additional reading

• BLAST articles
– Altschul et al. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research 25(17):3389–3402
– Altschul et al. (1994), “Issues in searching molecular sequence databases”, Nature Genetics 6(2):119-129

• FASTA article
– Pearson (2000), “Flexible sequence similarity searching with the FASTA3 program package”, Methods Mol Biol 132:185-219


Statistics

• The expected number of HSPs with a score of S or larger, when aligning two sufficiently large random sequences of lengths m and n, is given by

$E = K m n e^{-\lambda S}$

• Normalizing S by $S' = (\lambda S - \ln K)/\ln 2$, you get

$E = m n 2^{-S'}$

• And for a database with a total length N

$E \approx m N 2^{-S'}$
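The step from the raw-score form to the bit-score form is a direct substitution. Spelled out, using only the symbols defined on this slide:

```latex
\begin{aligned}
S' &= \frac{\lambda S - \ln K}{\ln 2}
  \;\Longrightarrow\; \lambda S = S' \ln 2 + \ln K \\
e^{-\lambda S} &= e^{-S' \ln 2}\, e^{-\ln K} = \frac{2^{-S'}}{K} \\
E &= K m n\, e^{-\lambda S} = K m n \cdot \frac{2^{-S'}}{K} = m n\, 2^{-S'}
\end{aligned}
```

The matrix-dependent constants $\lambda$ and $K$ cancel, which is why bit scores from different scoring systems can be compared directly.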


Testing the results

[Figure: Rank/S' plot; axes labeled Rank and S', both on logarithmic scales (tick marks 1 to 1000 and 1 to 10000)]



Testing the results

[Figure: Rank/S' plot; axes labeled Rank and S', both on logarithmic scales (tick marks 1 to 1000 and 1 to 10000); legend entries: mays, misc]
