
1

Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science

Heuristic methods for sequence alignment

BLAST, BLAT and more

2


Motivation

• Align a sequence to a large collection of sequences

• Similarity implies related history; related history suggests shared structure/function

• Most sequences in the collection are unrelated

• Specificity, sensitivity, performance

3


A note on the role of BI

• Most biological knowledge is obtained by hard, slow, expensive experiments.

• Successful BI strategies involve knowledge transfer.

– If X is A, and Y≈X, then Y is A.

• Nature reuses winning strategies.

• Nature abuses winning strategies.

4


Problem definition

• Given a query sequence and a collection of sequences, find which are similar.

• Use statistical estimates!

– For a database search, a cutoff is enough.

– For a pair-wise comparison, a probability is required.

5


The Naïve solution

• For each target sequence

– Align to query using dynamic programming

– Report alignment if above cutoff

Too slow!!!

6


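The Naïve solution (cont.)

[Figure: the query aligned in turn against database sequences seq1, seq2, seq3. Sequences shown: ACGATCATCATGCA and ACGATCATCATGCTATGACTGCATAGCTAGACTCGCATAGCATCGACT]

For an EST against dbEST: $500 \times 500 \times 6 \times 10^{6} = 1.5 \times 10^{12}$ cells, $O(10^{15})$ calculations. At $10^{9}$ calculations/second, this will take $10^{6}$ seconds = 11 days.

For human against mouse: $3 \times 10^{9} \times 3 \times 10^{9} \approx 10^{19}$ cells, $O(10^{28})$ calculations. At $10^{9}$ calculations/second, this will take $10^{19}$ seconds $\approx 10^{11}$ years.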
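The arithmetic behind the EST estimate can be checked with a short sketch. The function names are hypothetical, and the figure of 10^15 calculations is taken directly from the slide rather than derived from the cell count.

```python
SECONDS_PER_DAY = 86_400

def dp_cells(query_len, target_len, n_targets=1):
    """Number of dynamic programming cells filled when the query is
    aligned against every target sequence in turn."""
    return query_len * target_len * n_targets

def runtime_days(n_calculations, calcs_per_second=1e9):
    """Wall-clock days needed for the given number of calculations."""
    return n_calculations / calcs_per_second / SECONDS_PER_DAY

# EST (~500 bases) against 6 million database sequences of ~500 bases.
est_cells = dp_cells(500, 500, n_targets=6_000_000)

# The slide's O(10^15) calculations at 10^9 calculations/second.
est_days = runtime_days(1e15)

print(f"{est_cells:.1e} cells, about {est_days:.1f} days")
```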

7


BLAT

• 3 billion bases of genome

• 40 thousand sequenced mRNAs

• 3 million sequenced ESTs of ~500 bases each

• 15 million mouse reads of ~500 bases.

8


Perfect Matches Serve as Seeds

• Computers can look for exact matches very quickly.

• Finding inexact matches is slow.

• Inexact matches should contain some short exact matches.

• Inexact matches should contain multiple even shorter exact matches.

Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip

9


seq 1  ggagaatagggcatgctctgaggtctgctggaacccatcc
runs   1  12 123456 12 12 12345   1 12  12345 1
seq 2  gtggattagggcttgttccgaggttatcgggttcccatac

seq 1  ttcttgtctcgctccagggcaccgtgcaggaaatcccggg
runs   2345 1 123 1 12 1234567  1234567  1234
seq 2  ttctgggctctcgcccgggcacctagcaggaatacccgat

seq 1  acacctcctcattctcatccagccactggatgacgaaggg
runs     123  123456  123 123  1  12345 12345
seq 2  ggacccgctcattaccatacagtaaacggatggcgaagac

Distribution of identical matches:

length: 1 2 3 4 5 6 7 8
number: 5 5 4 1 5 2 2 0
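The distribution above can be reproduced with a minimal sketch that counts runs of consecutive identical positions. It assumes the three rows form one continuous alignment, which is what the run counters on the slide suggest (the second row starts at "2345"). The function and variable names are illustrative only.

```python
from collections import Counter
from itertools import groupby

# The three rows of the slide, concatenated into one alignment.
TOP = ("ggagaatagggcatgctctgaggtctgctggaacccatcc"
       "ttcttgtctcgctccagggcaccgtgcaggaaatcccggg"
       "acacctcctcattctcatccagccactggatgacgaaggg")
BOTTOM = ("gtggattagggcttgttccgaggttatcgggttcccatac"
          "ttctgggctctcgcccgggcacctagcaggaatacccgat"
          "ggacccgctcattaccatacagtaaacggatggcgaagac")

def match_run_lengths(a, b):
    """Count runs of consecutive identical positions, keyed by run length."""
    flags = [x == y for x, y in zip(a, b)]
    return Counter(
        sum(1 for _ in run)
        for is_match, run in groupby(flags)
        if is_match
    )

runs = match_run_lengths(TOP, BOTTOM)
for length in range(1, 9):
    print(length, runs.get(length, 0))
```

Long runs are rare even in clearly related sequences, which is why the seed length has to stay short.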


10


BLAT overview

1. Use an index to find regions in genome homologous to query.

2. Do a detailed alignment between query and homologous regions.

3. Use dynamic programming to stitch the detailed alignments of the individual regions into a detailed alignment of the whole query.


11


Indexing

• Build an index which contains the positions of each K-mer in the genome. K is typically between 9 and 13 for cDNA alignments.

• Step through each K-mer in cDNA chunk and look it up in index.

• Get list of ‘hits’ - positions in cDNA and in genome that match for K bases.

• Cluster hits to find homologous regions.


12


Specificity vs. sensitivity of indexes

K: the K-mer size
M: match ratio in homologous regions (98% for cDNA/genomic of the same species, 89% for human/mouse proteins)
H: length of the homologous region (50-200 for human exons)
G: database size
Q: size of the query sequence
A: alphabet size
P: probability that at least one non-overlapping K-mer will match
F: number of nonspecific matched K-mers

With $P_1$ the probability that one given K-mer matches perfectly and $T$ the number of non-overlapping K-mers in the homologous region:

$$P_1 = M^K$$

$$T = \lfloor H/K \rfloor$$

$$P = 1 - (1 - P_1)^T = 1 - (1 - M^K)^T$$

$$F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$$

Kent (2002), Genome Research 12(4):656-664
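The two formulas are easy to evaluate directly. The sketch below uses hypothetical function names, assumes a nucleotide alphabet (A = 4) by default, and checks two sensitivity values against the table from Kent (2002) for a homologous region of 100 bases.

```python
from math import floor

def seed_sensitivity(M, K, H):
    """P: probability that at least one of the T non-overlapping
    K-mers in a homologous region of length H matches perfectly."""
    p1 = M ** K          # one given K-mer matches perfectly
    T = floor(H / K)     # non-overlapping K-mers in the region
    return 1 - (1 - p1) ** T

def expected_false_hits(K, Q, G, A=4):
    """F: number of K-mer matches expected by chance alone."""
    return (Q - K + 1) * (G / K) * (1 / A) ** K

# 81% identity with K = 7, and 89% identity with K = 11.
print(round(seed_sensitivity(0.81, 7, 100), 3))
print(round(seed_sensitivity(0.89, 11, 100), 3))

# 500-base query against a 3-billion-base genome, K = 7.
print(f"{expected_false_hits(7, 500, 3e9):.1e}")
```

Raising K reduces F by roughly a factor of four per base, while P falls much more slowly at high identity. That trade-off is what makes K in the 9-13 range workable for cDNA.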

13

Sensitivity and Specificity of Single Perfect Nucleotide K-mer Matches as a Search Criterion

(A)

Identity  K=7    K=8    K=9    K=10   K=11   K=12   K=13   K=14
81%       0.974  0.915  0.833  0.726  0.607  0.486  0.373  0.314
83%       0.988  0.953  0.897  0.815  0.711  0.595  0.478  0.415
85%       0.996  0.978  0.945  0.888  0.808  0.707  0.594  0.532
87%       0.999  0.992  0.975  0.942  0.888  0.811  0.714  0.659
89%       1.000  0.998  0.991  0.976  0.946  0.897  0.824  0.782
91%       1.000  1.000  0.998  0.993  0.981  0.956  0.912  0.886
93%       1.000  1.000  1.000  0.999  0.995  0.987  0.968  0.957
95%       1.000  1.000  1.000  1.000  0.999  0.998  0.994  0.991
97%       1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.999

(B)

K  7        8        9       10      11     12    13    14
F  1.3e+07  2.9e+06  635783  143051  32512  7451  1719  399

(A) Columns are for K sizes of 7-14. Rows represent various percentage identities between the homologous sequences. The table entries show the fraction of homologies detected as calculated from equation 3 assuming a homologous region of 100 bases. The larger the value of K, the fewer homologies are detected. (B) K represents the size of the perfect match. F shows how many perfect matches of this size are expected to occur by chance according to equation 4 in a genome of 3 billion bases using a query of 500 bases.

Kent (2002), Genome Research 12(4):656-664

14


Other Indexing Techniques

• Allowing one mismatch in a seed is also quite effective.

• A pair of 11-mers is the seed for BLAT cDNA alignments.

• 7 out of 8 amino acid seed used for translated BLAT alignments.

15


Specificity of Indexing Techniques for Mouse/Human Alignments

Seed type (N = nucleotide, P = protein)   P(homology is found)   Number of false positives
N  1 8-mer                                0.988                  2,861,023
N  2 7-mers                               0.997                  596,046
N  11 of 12                               0.995                  275,671
P  1 5-mer                                0.996                  31,125
P  2 4-mers                               0.995                  122
P  7 of 8                                 0.998                  374

The table assumes that a homologous region is 100 bases long and is 86% identical at the base level and 89% identical at the amino acid level.

16


Genome: cacaattatcacgaccgc
3-mers: cac aat tat cac gac cgc
Index:  aat 3
        cac 0,9
        cgc 15
        gac 12
        tat 6

Query:     aattctcac
3-mers:    aat att ttc tct ctc tca cac
Positions: 0   1   2   3   4   5   6

Hits (K-mer, query position, genome position, diagonal):
aat 0,3  3
cac 6,0  -6
cac 6,9  3

With modifications, from Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
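The worked example above can be reproduced with a small sketch: non-overlapping K-mers of the genome go into the index, and every overlapping K-mer of the query is looked up. The function names `build_index` and `find_hits` are illustrative, not taken from the BLAT source.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping K-mer of the genome to its positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k):
    """Look up every overlapping K-mer of the query in the index.

    Returns (kmer, query_pos, genome_pos, diagonal) tuples, where
    diagonal = genome_pos - query_pos.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        kmer = query[qpos:qpos + k]
        for gpos in index.get(kmer, ()):
            hits.append((kmer, qpos, gpos, gpos - qpos))
    return hits

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
for hit in hits:
    print(hit)
```

Indexing only non-overlapping genome K-mers keeps the index a factor of K smaller. Stepping through every query K-mer still guarantees that a perfect match of sufficient length is seen.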

17


“Clumping” hits

Unsorted list of hits:

K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)

• Sort on dbase position

• Clump hits within W from each other and on diagonals not more than GAP away

• Sorting: N*log(N) calculations, which can take a lot of time for very long lists of hits (e.g. genome)
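The sort-then-clump steps can be sketched as follows. This is a simplified greedy version written for illustration, not BLAT's implementation: a hit joins the first existing clump whose most recent hit is within `w` in database position and within `gap` in diagonal. The parameter names are assumptions.

```python
def clump_hits(hits, w, gap):
    """Group (query_pos, dbase_pos) hits into clumps.

    Hits are sorted on database position. A hit extends a clump when
    it lies within w of the clump's last hit and their diagonals
    differ by at most gap; otherwise it starts a new clump.
    """
    clumps = []
    for qpos, dpos in sorted(hits, key=lambda h: h[1]):
        diag = dpos - qpos
        for clump in clumps:
            last_q, last_d = clump[-1]
            if dpos - last_d <= w and abs(diag - (last_d - last_q)) <= gap:
                clump.append((qpos, dpos))
                break
        else:
            clumps.append([(qpos, dpos)])
    return clumps

# The three hits from the slide: AAT 0,3 / CAC 6,0 / CAC 6,9
clumps = clump_hits([(0, 3), (6, 0), (6, 9)], w=10, gap=2)
print(clumps)
```

The two hits on diagonal 3 end up in one clump, while the hit on diagonal -6 stays alone and would be dropped by a minimum-hits filter.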

18


Clumping

[Figure: hits plotted on target coordinates vs. query coordinates; a clump spans a window W, with its diagonals less than gap_limit apart. Modified from Kent (2002), Genome Research 12(4):656-664]

19


“Clumping” hits

Unsorted list of hits:

K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)

• Drop hits into 64K buckets on DB position

• Sort on diagonal

• Proto-clump if diagonals are within GAP

• Sort on dbase position

• Clump hits within W from each other and on diagonals not more than GAP away


20


Buckets

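[Figure: a database of length L divided into buckets of width b, with a window W and "toss-over" of hits between neighbouring buckets. Sorting all N hits at once costs N*log(N); sorting within buckets costs N*log(n), where n = N*b/L.]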

21


“Clumping” hits

Sorted list of hits:

K-mer  match  diagonal
CAC    6,0    -6
AAT    0,3    3
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)

• Drop hits into 64K buckets on DB position (toss-over marginal hits)

• Sort on diagonal

• Proto-clump if diagonals are within GAP

• Sort on dbase position

• Clump hits within W from each other and on diagonals not more than GAP away


22


Eliminate small clumps


[Figure: hits plotted against query coordinates; a clump within window W is marked "Not enough hits in the clump".]

23


Eliminate small clumps


[Figure: continuation of the previous plot; only the query-coordinates axis label survives.]

24


Clumping clumps


[Figure: plot against query coordinates. Labels: "< 300/100 (DNA/prot)", "> 300/100 (DNA/prot)", "Extra 500".]

25


Alignment (cDNA)

• Start from scratch with regions defined with K-mers

• Index on smaller K-mers, but extend each K-mer until it becomes specific

• Extend in both directions without mismatches or gaps, and merge overlapping or contiguous alignments

• Recurse on gaps with smaller K until gaps or hits are eliminated

• Additional tricks (Ns, mismatches, introns)

26


Alignment (cDNA)


[Figure: plot against query coordinates; no other labels survive.]

27


Alignment (cDNA)


[Figure: plot against query coordinates; no other labels survive.]

28


Alignment (protein)

• Extend hits into ungapped HSPs with +2/-1 scoring scheme.

• Create a graph of all possible HSP merges

• Use dynamic programming to traverse the graph.

29


HSPs


[Figure: HSPs plotted against query coordinates; no other labels survive.]

30


HSPs


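[Figure: an HSP extended along a diagonal, plotted against query coordinates, with the score at each step.]

step score:     +2  -1  +2  +2  -1  -1  -1  -1
running score:  +2  +1  +3  +5  +4  +3  +2  +1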

31


HSPs


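[Figure: the same HSP, plotted against query coordinates, kept only up to its maximum running score.]

step score:     +2  -1  +2  +2
running score:  +2  +1  +3  +5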
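The running scores on these slides follow from an ungapped extension with the +2/-1 scheme, trimmed back to the position of the maximum score. The sketch below illustrates that idea. The `dropoff` stopping rule, its value, and the two example sequences are assumptions chosen to reproduce the match/mismatch pattern on the slide.

```python
def extend_right(query, target, q_start, t_start,
                 match=2, mismatch=-1, dropoff=4):
    """Extend a hit to the right without gaps.

    Returns (best_score, best_length): the highest running score seen
    and the extension length at which it occurred. The extension stops
    once the running score has fallen `dropoff` below the best score.
    """
    score = best_score = best_len = 0
    i = 0
    while q_start + i < len(query) and t_start + i < len(target):
        same = query[q_start + i] == target[t_start + i]
        score += match if same else mismatch
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score >= dropoff:
            break
    return best_score, best_len

# Pattern: match, mismatch, match, match, then four mismatches,
# giving running scores 2, 1, 3, 5, 4, 3, 2, 1.
best = extend_right("ACGTAAAA", "AGGTCCCC", 0, 0)
print(best)
```

Trimming to the maximum is what turns the eight-step extension into the four-step HSP with score +5.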

32


Detailed Alignments

• Smith-Waterman technique based on dynamic programming.

• “Banded” Smith-Waterman: faster but doesn’t tolerate long inserts

• Simple extensions that don’t allow inserts.

33


Stitching Together Alignments

34


Repeats Complicate Things

35


Dynamic Programming Saves the Day

36


Assignment

• You are trying to compare Arabidopsis mRNAs to the genome (a high-quality genome with fewer errors). Based on the BLAT article, how will you set the following parameters relative to their (human) defaults?

– The length and number of K-mers

– W

– The drop-off score

37


The BLAST algorithm

“The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words”.

Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402

38


Finding a “hit”

• Word: k amino acids/nucleotides

• Build a table of all words in the database

• Convert query -> words table

– For each word, derive all words with similarity score above T

– Strong statistical background for T selection

• Build a word/sequence table

39


Choosing a word-score cutoff

For a scoring matrix with $P_i$ and $S_{ij}$:

$$S' \text{ (bits)} = \frac{\lambda S - \ln K}{\ln 2}$$

For ungapped alignment, $\lambda$ and $K$ can be calculated from $P_i$ and $S_{ij}$.

$$E = \frac{N}{2^{S'}} \quad (N = nm)$$

$$S' = \log_2(N/E)$$

For a protein of length 250, compared with 50 million residues, $S'$ should be 38 bits to ensure $E < 0.05$.
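The 38-bit figure follows directly from S' = log2(N/E). A minimal check, with a hypothetical function name:

```python
from math import ceil, log2

def required_bit_score(m, n, e_value):
    """Bit score S' at which the expected number of chance HSPs in a
    search space of size N = m * n equals e_value."""
    return log2(m * n / e_value)

# Protein of length 250 against 50 million residues, E < 0.05.
bits = required_bit_score(250, 50_000_000, 0.05)
print(f"{bits:.2f} bits, rounded up to {ceil(bits)}")
```

Because the score is logarithmic in the search space, doubling the database size raises the required score by only one bit.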

40


Building the similar words list

• Original word: FDI

AAA  -12
AAC  -10
CDI  14
…
FDA  16
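Enumerating the similar words amounts to scoring every possible word against the original and keeping those at or above T. The sketch below uses a toy match/mismatch score on a four-letter alphabet so that it stays short. BLAST itself scores amino acid words with a substitution matrix, so the alphabet, the `toy_score` function, and the threshold here are illustrative assumptions only.

```python
from itertools import product

def neighborhood(word, alphabet, score, threshold):
    """Return {candidate: score} for every word of the same length
    whose pairwise score against `word` is at least `threshold`."""
    found = {}
    for letters in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, letters))
        if total >= threshold:
            found["".join(letters)] = total
    return found

def toy_score(a, b):
    """Stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4

# Keeps the word itself plus its nine single-mismatch neighbours.
words = neighborhood("ACG", "ACGT", toy_score, threshold=6)
print(sorted(words.items(), key=lambda kv: -kv[1]))
```

With a 20-letter alphabet and words of length 3 there are only 8,000 candidates, so exhaustive enumeration per query word is cheap.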

41


Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T

• Merge alignments in the same phase within distance A

• Choose significant HSPs

42



The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13.
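The two-hit rule described in this caption can be sketched as a grouping of word hits by diagonal. The function name, the hit representation, and the default word length of 3 are assumptions made for illustration.

```python
from collections import defaultdict

def two_hit_triggers(hits, max_distance, word_len=3):
    """Find pairs of non-overlapping word hits on the same diagonal.

    hits: iterable of (query_pos, db_pos).
    Returns (diagonal, first_query_pos, second_query_pos) for each pair
    of neighbouring hits on one diagonal that do not overlap and lie
    within max_distance of each other.
    """
    by_diagonal = defaultdict(list)
    for qpos, dpos in hits:
        by_diagonal[dpos - qpos].append(qpos)

    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if word_len <= second - first <= max_distance:
                triggers.append((diagonal, first, second))
    return triggers

hits = [(10, 15), (30, 35), (12, 40), (100, 105), (200, 50)]
triggers = two_hit_triggers(hits, max_distance=40)
print(triggers)
```

Only the pair at query positions 10 and 30 on diagonal 5 triggers an extension. The hit at position 100 is on the same diagonal but too far away, and the other hits are alone on their diagonals.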

43


Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T

• Merge alignments in the same phase within distance A

• Choose significant HSPs: using the equation above, set a threshold so that 1 in 50 of the sequences will be passed to the next stage by chance.

44


Combining HSPs with dynamic programming

• Banded Smith-Waterman

• Basic principle: significant alignments cannot occur too far from the diagonal

• How far depends on the score
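A minimal sketch of the banded idea: the usual Smith-Waterman recurrence is evaluated only for cells with |i - j| <= band, and cells outside the band are treated as score 0. The linear gap penalty, the scoring values, and the function name are assumptions. Real implementations centre the band on the seed's diagonal and use affine gap costs.

```python
def banded_smith_waterman(a, b, band, match=2, mismatch=-1, gap=-1):
    """Best local alignment score using only cells with |i - j| <= band."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(1, i - band), min(len(b), i + band) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                          # start a new local alignment
                prev.get(j - 1, 0) + sub,   # diagonal: match or mismatch
                prev.get(j, 0) + gap,       # gap in b
                cur.get(j - 1, 0) + gap,    # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best

query = "ACGTACGT"
target = "ACGTTTTTTACGT"   # the query with a 5-base insert in the middle

narrow = banded_smith_waterman(query, target, band=1)
wide = banded_smith_waterman(query, target, band=6)
print(narrow, wide)
```

With band=1 the alignment cannot bridge the insert and only the first ACGT block is scored. With band=6 both blocks are joined across the gap. This is the sense in which the banded version is faster but does not tolerate long inserts.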

45



46

Altschul et al., 1997, Nucleic Acids Res. 25(17):3389-402

47


The BLAST algorithm

• Find High Scoring Pairs (HSPs)

– Eliminate most of the sequences

– Allow no gaps

– Identical->similar words by enumeration

• Find nearby double hits on the same diagonal

– Increase sensitivity; allow gaps in the initial HSP

• Combine HSPs with fast dynamic programming

– Get semi-optimal alignment

48


Summary

• The main heuristics

– Highly similar words

– Lower stringency, require double hit

– Extend until score drops

– Fast banded Smith-Waterman

• BLAST vs. dynamic programming

49


FASTA: overview

• First fast algorithm for database searching

• Lost popularity to BLAST in the mid-90s

– No web implementation

– Slower

– Misses complex similarities

• Further drop with the introduction of BLAST2

50


Algorithm overview

1. Search for non-gapped alignments based on identical words. (Discard 99.9999%)

2. Merge alignments allowing gaps.

3. Pick best N hits, and realign using SW.

51


Additional reading

• BLAST articles

– Altschul et al. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research 25(17):3389–3402

– Altschul et al. (1994), “Issues in searching molecular sequence databases”, Nature Genetics 6(2):119-129

• FASTA article

– Pearson (2000), “Flexible sequence similarity searching with the FASTA3 program package”, Methods Mol Biol 132:185-219

52


Statistics

• The expected number of HSPs with a score of S or larger, when aligning two sufficiently large random sequences of lengths m and n, is given by

$$E = K m n \, e^{-\lambda S}$$

• Normalizing S by $S' = (\lambda S - \ln K)/\ln 2$, you get

$$E = m n \, 2^{-S'}$$

• And for a database with a total length N

$$E \approx m N \, 2^{-S'}$$
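The raw-score and bit-score forms of E are the same quantity, which a short numerical check makes concrete. The values of `lam` and `K` below are placeholders chosen only to exercise the formulas. In practice they depend on the scoring matrix and gap penalties.

```python
from math import exp, isclose, log

def bit_score(raw, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - log(K)) / log(2)

def evalue_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * exp(-lam * raw)

def evalue_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2 ** (-bits)

# Placeholder statistical parameters (assumed, for illustration).
lam, K = 0.267, 0.041
m, n = 250, 50_000_000
raw = 80

e_raw = evalue_from_raw(raw, m, n, lam, K)
e_bits = evalue_from_bits(bit_score(raw, lam, K), m, n)
consistent = isclose(e_raw, e_bits, rel_tol=1e-9)
print(e_raw, e_bits, consistent)
```

The bit score absorbs λ and K, so scores obtained with different scoring systems can be compared on one scale.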

53


Testing the results

[Figure: Rank/S' plot; logarithmic axes labelled Rank and S'.]

54


Testing the results

[Figure: Rank/S' plot; logarithmic axes labelled Rank and S'.]

55


Testing the results

[Figure: Rank/S' plot; logarithmic axes labelled Rank and S'; series labelled "mays" and "misc".]

56


Testing the results

[Figure: Rank/S' plot; logarithmic axes labelled Rank and S'.]
