Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science
Heuristic methods for sequence alignment
BLAST, BLAT and more
Motivation
• Align a sequence to a large collection of sequences
• Similarity implies related history; related history suggests shared structure/function
• Most sequences in the collection are unrelated
• Specificity, sensitivity, performance
A note on the role of BI
• Most biological knowledge is obtained by hard, slow, expensive experiments.
• Successful BI strategies involve knowledge transfer.
– If X is A, and Y≈X, then Y is A.
• Nature reuses winning strategies.
• Nature abuses winning strategies.
Problem definition
• Given a query sequence and a collection of sequences, find which are similar.
• Use statistical estimates!
– For a database search, a cutoff is enough.
– For a pairwise comparison, a probability is required.
The Naïve solution
• For each target sequence
– Align to query using dynamic programming
– Report alignment if above cutoff
Too slow!!!
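The Naïve solution (cont)
[Figure: the query ACGATCATCATGCA compared against the database sequences seq1, seq2, seq3 (ACGATCATCATGCTATGACTGCATAGCTAGACTCGCATAGCATCGACT).]
For an EST against dbEST: $500 \times 500 \times 6 \times 10^{6} = 1.5 \times 10^{12}$ cells, $O(10^{15})$ calculations. At $10^{9}$ calculations/second, this will take $10^{6}$ seconds ≈ 11 days.
For human against mouse: $3 \times 10^{9} \times 3 \times 10^{9} \approx 10^{19}$ cells, $O(10^{28})$ calculations. At $10^{9}$ calculations/second, this will take $10^{19}$ seconds, on the order of $10^{11}$ years.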
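The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the operation counts ($10^{15}$ and $10^{28}$) are taken from the slide as given rather than derived.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def dp_cells(query_len, target_len, n_targets=1):
    """Cells in the dynamic-programming matrices for one query vs. n_targets sequences."""
    return query_len * target_len * n_targets


def run_time_seconds(calculations, rate=1e9):
    """Wall-clock seconds at `rate` calculations per second (1e9 on the slide)."""
    return calculations / rate


# EST (~500 bases) against ~6 million ESTs of ~500 bases each
est_cells = dp_cells(500, 500, 6_000_000)
est_days = run_time_seconds(1e15) / SECONDS_PER_DAY

# Human genome against mouse genome, ~3 billion bases each
genome_cells = dp_cells(3_000_000_000, 3_000_000_000)
genome_years = run_time_seconds(1e28) / SECONDS_PER_YEAR

print(f"EST vs dbEST: {est_cells:.1e} cells, about {est_days:.1f} days")
print(f"human vs mouse: {genome_cells:.1e} cells, about {genome_years:.1e} years")
```

Either way, the conclusion of the slide stands: full dynamic programming against a whole database or genome is far too slow.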
BLAT
• 3 billion bases of genome
• 40 thousand sequenced mRNAs
• 3 million sequenced ESTs of ~500 bases each
• 15 million mouse reads of ~500 bases
Perfect Matches Serve as Seeds
• Computers can look for exact matches very quickly.
• Finding inexact matches is slow.
• Inexact matches should contain some short exact matches.
• Inexact matches should contain multiple even shorter exact matches.
Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
Two homologous sequences; the digits between them number the positions inside each run of identical bases (runs continue across line breaks):

ggagaatagggcatgctctgaggtctgctggaacccatcc
1  12 123456 12 12 12345   1 12  12345 1
gtggattagggcttgttccgaggttatcgggttcccatac

ttcttgtctcgctccagggcaccgtgcaggaaatcccggg
2345 1 123 1 12 1234567  1234567  1234
ttctgggctctcgcccgggcacctagcaggaatacccgat

acacctcctcattctcatccagccactggatgacgaaggg
  123  123456  123 123  1  12345 12345
ggacccgctcattaccatacagtaaacggatggcgaagac

Distribution of identical matches:
length: 1 2 3 4 5 6 7 8
number: 5 5 4 1 5 2 2 0
Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
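The distribution of exact-match run lengths can be recomputed from the two sequences on this slide. A minimal sketch, assuming the three displayed lines are consecutive pieces of one ungapped alignment (so a run may continue across a line break); the function name is illustrative only.

```python
from collections import Counter
from itertools import groupby

# The three lines of each sequence from the slide, joined into one string.
SEQ_A = (
    "ggagaatagggcatgctctgaggtctgctggaacccatcc"
    "ttcttgtctcgctccagggcaccgtgcaggaaatcccggg"
    "acacctcctcattctcatccagccactggatgacgaaggg"
)
SEQ_B = (
    "gtggattagggcttgttccgaggttatcgggttcccatac"
    "ttctgggctctcgcccgggcacctagcaggaatacccgat"
    "ggacccgctcattaccatacagtaaacggatggcgaagac"
)


def match_run_lengths(a, b):
    """Count runs of consecutive identical positions, keyed by run length."""
    flags = [x == y for x, y in zip(a, b)]
    return Counter(
        len(list(group)) for is_match, group in groupby(flags) if is_match
    )


distribution = match_run_lengths(SEQ_A, SEQ_B)
for length in range(1, 9):
    print(length, distribution.get(length, 0))
```

Short runs are common and long runs are rare, which is why the seed length K trades sensitivity against specificity.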
BLAT overview
1. Use an index to find regions in genome homologous to query.
2. Do a detailed alignment between query and homologous regions.
3. Use dynamic programming to stitch the detailed alignments of the regions into a detailed alignment of the whole query.
Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
Indexing
• Build an index which contains positions of each K-mer in genome. K is typically between 9 and 13 for cDNA alignments.
• Step through each K-mer in cDNA chunk and look it up in index.
• Get list of ‘hits’: positions in cDNA and in genome that match for K bases.
• Cluster hits to find homologous regions.
Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
Specificity vs. sensitivity of indexes
K: the K-mer size
M: match ratio in homologous regions (98% for cDNA/genomic of same species, 89% for human/mouse proteins)
H: length of homologous region (50-200 for human exons)
G: database size
Q: the size of the query sequence
A: alphabet size
P: the probability that at least one non-overlapping K-mer will match
F: the number of nonspecific matched K-mers

$P_1 = M^K$
$T = \lfloor H/K \rfloor$
$P = 1 - (1 - P_1)^T = 1 - (1 - M^K)^T$
$F = (Q - K + 1) \cdot (G/K) \cdot (1/A)^K$
Kent (2002), Genome Research 12(4):656-664
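The two formulas are easy to evaluate directly. The sketch below uses hypothetical function names and the slide's symbols as parameters. The values of P agree with the published table (for example 0.974 for M = 81%, K = 7, H = 100). The published F values appear to be reproduced more closely when Q is used in place of Q-K+1 (500 × 3e8 / 4^10 ≈ 143051), so expect small differences from the formula as written.

```python
def sensitivity(M, K, H):
    """P: probability that at least one of the T non-overlapping K-mers matches."""
    T = H // K          # number of non-overlapping K-mers in the homologous region
    p1 = M ** K         # probability that one specific K-mer matches perfectly
    return 1 - (1 - p1) ** T


def false_hits(K, Q, G, A):
    """F: expected number of K-mer matches that occur by chance."""
    return (Q - K + 1) * (G / K) * (1 / A) ** K


# 100-base homologous region, 500-base query, 3-billion-base genome, DNA alphabet
for K in range(7, 15):
    print(K, round(sensitivity(0.81, K, 100), 3), round(false_hits(K, 500, 3e9, 4)))
```

Raising K sharply reduces chance matches but also lowers the probability of finding a true homology, especially at lower identity.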
Sensitivity and Specificity of Single Perfect Nucleotide K-mer Matches as a Search Criterion

(A)
Identity  K=7    K=8    K=9    K=10   K=11   K=12   K=13   K=14
81%       0.974  0.915  0.833  0.726  0.607  0.486  0.373  0.314
83%       0.988  0.953  0.897  0.815  0.711  0.595  0.478  0.415
85%       0.996  0.978  0.945  0.888  0.808  0.707  0.594  0.532
87%       0.999  0.992  0.975  0.942  0.888  0.811  0.714  0.659
89%       1.000  0.998  0.991  0.976  0.946  0.897  0.824  0.782
91%       1.000  1.000  0.998  0.993  0.981  0.956  0.912  0.886
93%       1.000  1.000  1.000  0.999  0.995  0.987  0.968  0.957
95%       1.000  1.000  1.000  1.000  0.999  0.998  0.994  0.991
97%       1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.999

(B)
K   7        8        9       10      11     12    13    14
F   1.3e+07  2.9e+06  635783  143051  32512  7451  1719  399

(A) Columns are for K sizes of 7-14. Rows represent various percentage identities between the homologous sequences. The table entries show the fraction of homologies detected as calculated from equation 3 assuming a homologous region of 100 bases. The larger the value of K, the fewer homologies are detected.
(B) K represents the size of the perfect match. F shows how many perfect matches of this size are expected to occur by chance according to equation 4 in a genome of 3 billion bases using a query of 500 bases.
Kent (2002), Genome Research 12(4):656-664
Other Indexing Techniques
• Allowing one mismatch in a seed is also quite effective.
• A pair of 11-mers is the seed for BLAT cDNA alignments.
• A 7-out-of-8 amino acid seed is used for translated BLAT alignments.
Specificity of Indexing Techniques for Mouse/Human Alignments

Seed type      P (homology is found)   Number of false positives
N 1 8-mer      0.988                   2,861,023
N 2 7-mers     0.997                   596,046
N 11 of 12     0.995                   275,671
P 1 5-mer      0.996                   31,125
P 2 4-mers     0.995                   122
P 7 of 8       0.998                   374

(N = nucleotide seed, P = protein seed.)
The table above assumes that a homologous region is 100 bases long and is 86% identical at the base level and 89% identical at the amino acid level.
Genome: cacaattatcacgaccgc
3-mers: cac aat tat cac gac cgc
Index:
aat 3
cac 0,9
cgc 15
gac 12
tat 6

query: aattctcac
3-mers:    aat att ttc tct ctc tca cac
positions: 0   1   2   3   4   5   6

hits (3-mer; query position, genome position; diagonal):
aat 0,3 3
cac 6,0 -6
cac 6,9 3
With modifications, from Kent, 2001 http://www.soe.ucsc.edu/~kent/presentations/JobTalk/jobTalk.zip
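The example above can be reproduced with a short sketch. It assumes, as the slide shows, that the genome is indexed with non-overlapping K-mers while the query is scanned with overlapping K-mers; `build_index` and `find_hits` are illustrative names, not BLAT functions.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each non-overlapping K-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(query, index, k):
    """Look up every overlapping K-mer of the query.

    Returns (query_position, genome_position, diagonal) tuples, where
    diagonal = genome_position - query_position.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits.append((q, t, t - q))
    return hits


index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
for q, t, diagonal in hits:
    print("aattctcac"[q:q + 3], f"{q},{t}", diagonal)
```

Indexing only non-overlapping K-mers keeps the index K times smaller, at the price of stepping through every query position.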
“Clumping” hits
Unsorted list of hits:
K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)
• Sort on dbase position
• Clump hits within W from each other and on diagonals not more than GAP away
• Sorting: N*log(N) calculations, which can take lots of time for very long lists of hits (e.g. genome)
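The clumping steps can be sketched as follows. This is a simplified greedy version, not BLAT's implementation: a hit joins the first existing clump whose most recent hit is within `window` (W) in database position and within `gap` (GAP) in diagonal. The parameter values in the example are arbitrary.

```python
def clump_hits(hits, window, gap):
    """Group (query_pos, db_pos) hits into clumps.

    Each hit is annotated with its diagonal (db_pos - query_pos), the list is
    sorted on database position, and a hit is added to a clump when it is
    within `window` of that clump's last hit and their diagonals differ by at
    most `gap`.
    """
    annotated = sorted(((q, t, t - q) for q, t in hits), key=lambda h: h[1])
    clumps = []
    for hit in annotated:
        for clump in clumps:
            last = clump[-1]
            if hit[1] - last[1] <= window and abs(hit[2] - last[2]) <= gap:
                clump.append(hit)
                break
        else:
            clumps.append([hit])  # no compatible clump: start a new one
    return clumps


clumps = clump_hits([(0, 3), (6, 0), (6, 9)], window=10, gap=2)
for clump in clumps:
    print(clump)
```

With the three hits from the example, the two hits on diagonal 3 form one clump and the hit on diagonal -6 stays alone.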
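Clumping
[Figure: hits plotted by target coordinates (horizontal axis) and query coordinates (vertical axis); a clump spans a window W and its diagonals differ by less than gap_limit.]
Modified from Kent (2002), Genome Research 12(4):656-664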
“Clumping” hits
Unsorted list of hits:
K-mer  match  diagonal
AAT    0,3    3
CAC    6,0    -6
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)
• Drop hits into 64K buckets on DB position
• Sort on diagonal
• Proto-clump if diagonals are within GAP
• Sort on dbase position
• Clump hits within W from each other and on diagonals not more than GAP away
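Buckets
[Figure: a database of length L divided into buckets of width b, with window W marked. Sorting the whole hit list costs N*log(N); sorting within buckets costs N*log(n), where n = N*b/L. Hits near a bucket boundary are marked "toss-over".]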
“Clumping” hits
Sorted list of hits:
K-mer  match  diagonal
CAC    6,0    -6
AAT    0,3    3
CAC    6,9    3

• Calculate diagonal (dbase_position - query_position)
• Drop hits into 64K buckets on DB position (toss-over marginal hits)
• Sort on diagonal
• Proto-clump if diagonals are within GAP
• Sort on dbase position
• Clump hits within W from each other and on diagonals not more than GAP away
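The bucket step can be sketched as below. The idea is that dropping hits into fixed-width buckets is linear in N, and each bucket then holds only about n = N*b/L hits to sort. This sketch does not implement the toss-over of marginal hits near bucket boundaries; `bucket_hits` is a hypothetical name.

```python
from collections import defaultdict

BUCKET_SIZE = 64 * 1024  # 64K database positions per bucket, as on the slide


def bucket_hits(hits, bucket_size=BUCKET_SIZE):
    """Drop (query_pos, db_pos) hits into buckets on database position,
    then sort each bucket on diagonal (db_pos - query_pos)."""
    buckets = defaultdict(list)
    for q, t in hits:
        buckets[t // bucket_size].append((q, t, t - q))
    for bucket in buckets.values():
        bucket.sort(key=lambda hit: hit[2])  # small sort: about n hits per bucket
    return dict(buckets)


buckets = bucket_hits([(0, 3), (6, 0), (6, 9), (10, 70000)])
for number, bucket in sorted(buckets.items()):
    print(number, bucket)
```

The first bucket comes out in the same order as the sorted list on the slide (diagonal -6, then the two hits on diagonal 3).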
Eliminate small clumps
[Figure: hits plotted by target coordinates (horizontal axis) and query coordinates (vertical axis), with window W marked; a clump with not enough hits is eliminated.]
Modified from Kent (2002), Genome Research 12(4):656-664
Clumping clumps
[Figure: clumps plotted by target coordinates (horizontal axis) and query coordinates (vertical axis). Labels: "< 300/100 (DNA/prot)", "> 300/100 (DNA/prot)", "Extra 500".]
Modified from Kent (2002), Genome Research 12(4):656-664
Alignment (cDNA)
• Start from scratch with regions defined with K-mers
• Index on smaller K-mers, but extend each K-mer until it becomes specific
• Extend in both directions without mismatches or gaps and merge overlapping or contiguous alignments
• Recurse on gaps with smaller K until gap or hits are eliminated
• Additional tricks (Ns, mismatches, introns)
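The "extend without mismatches or gaps, then merge" step can be sketched as follows. Both functions are illustrative assumptions rather than BLAT code: a block is a (query_start, target_start, length) triple, and two blocks are merged only when they lie on the same diagonal and overlap or touch.

```python
def extend_exact(query, target, q, t, k):
    """Extend a K-mer hit at (q, t) in both directions while bases are identical."""
    start_q, start_t = q, t
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = q + k, t + k
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q


def merge_blocks(blocks):
    """Merge overlapping or contiguous blocks that share a diagonal."""
    merged = []
    for q, t, n in sorted(blocks, key=lambda b: (b[1] - b[0], b[0])):
        if merged:
            pq, pt, pn = merged[-1]
            if pt - pq == t - q and q <= pq + pn:
                merged[-1] = (pq, pt, max(pn, q + n - pq))
                continue
        merged.append((q, t, n))
    return merged


QUERY, GENOME = "aattctcac", "cacaattatcacgaccgc"
blocks = [extend_exact(QUERY, GENOME, q, t, 3) for q, t in [(0, 3), (6, 0), (6, 9)]]
print(merge_blocks(blocks))
```

On the small example the two hits on diagonal 3 each grow to four bases but stay separate, because a mismatch at query position 4 lies between them.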
Alignment (cDNA)
[Figures: aligned blocks plotted by target coordinates (horizontal axis) and query coordinates (vertical axis).]
Modified from Kent (2002), Genome Research 12(4):656-664
Alignment (protein)
• Extend hits into ungapped HSPs with +2/-1 scoring scheme.
• Create a graph of all possible HSP merges
• Use dynamic programming to traverse the graph.
HSPs
[Figures: hits extended into HSPs, plotted by target coordinates (horizontal axis) and query coordinates (vertical axis).]
Modified from Kent (2002), Genome Research 12(4):656-664

Scores along one extension (step score, running total):
+2 +2
-1 +1
+2 +3
+2 +5
-1 +4
-1 +3
-1 +2
-1 +1

The same extension cut back to its highest running total:
+2 +2
-1 +1
+2 +3
+2 +5
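The score table above can be reproduced with a small ungapped-extension sketch using the +2/-1 scheme. The `dropoff` parameter and its value are assumptions for illustration (the slides mention a drop-off score but give no number), and the two strings are made up so that their match pattern is match, mismatch, match, match, then four mismatches.

```python
def extend_right(query, target, q, t, match=2, mismatch=-1, dropoff=4):
    """Extend to the right from (q, t) without gaps.

    Returns (best_score, best_length): the highest running total and the
    extension length at which it was reached. Extension stops once the
    running total has fallen `dropoff` or more below the best seen so far.
    """
    score = best = 0
    length = best_length = 0
    while q + length < len(query) and t + length < len(target):
        score += match if query[q + length] == target[t + length] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score >= dropoff:
            break
    return best, best_length


# Running totals: 2, 1, 3, 5, 4, 3, 2, 1, so the best prefix has score 5, length 4.
best_score, best_length = extend_right("ACGTAAAA", "ATGTCCCC", 0, 0)
print(best_score, best_length)
```

Keeping only the prefix with the highest running total is what trims the extension back from eight positions to four.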
Detailed Alignments
• Smith-Waterman technique based on dynamic programming.
• “Banded” Smith-Waterman: faster but doesn’t tolerate long inserts
• Simple extensions that don’t allow inserts.
Stitching Together Alignments
Repeats Complicate Things
Dynamic Programming Saves the Day
Assignment
• You are trying to compare Arabidopsis mRNAs to the genome (high quality genome, fewer errors). Based on the BLAT article, how will you set the following parameters relative to their (human) defaults?
– The length and number of K-mers
– W
– The drop-off score
The BLAST algorithm
“The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words”.
Altschul et al., 1997 Nucl. Acids Res. 25(17):3389-402
Finding a “hit”
• Word: k amino acids/nucleotides
• Build a table of all words in the database
• Convert query -> words table
– For each word, derive all words with similarity score above T
– Strong statistical background for T selection
• Build a word/sequence table
Choosing a word-score cutoff
For a scoring matrix with background frequencies $P_i$ and substitution scores $S_{ij}$:
$S'\ \text{(bits)} = (\lambda S - \ln K)/\ln 2$
For ungapped alignment, $\lambda$ and $K$ can be calculated from $P_i$ and $S_{ij}$.
$E = N/2^{S'}$, where $N = nm$
$S' = \log_2(N/E)$
For a protein of length 250, compared with 50 million residues, $S'$ should be 38 bits to ensure $E < 0.05$.
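The worked number on this slide follows directly from $S' = \log_2(N/E)$. A minimal sketch with illustrative function names; the values of λ and K are not given on the slide, so `bit_score` is shown only as the formula.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expected_hits(bits, n, m):
    """E = N / 2**S', with search space N = n * m."""
    return (n * m) / 2 ** bits


def required_bits(n, m, e_value):
    """S' = log2(N / E): bit score needed to reach a given E."""
    return math.log2((n * m) / e_value)


# Query protein of length 250 against 50 million database residues, E = 0.05
needed = required_bits(250, 50_000_000, 0.05)
print(f"{needed:.2f} bits")  # just under 38 bits
```

Rounding up gives the 38 bits quoted on the slide.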
Building the similar words list
• Original word: FDI
AAA -12
AAC -10
CDI 14
…
FDA 16
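Enumerating the similar words can be sketched as below. Everything numeric here is a toy assumption: a four-letter alphabet and a hypothetical scoring rule (identity +5, the pair A/I +1, anything else -3), not a real substitution matrix, so the scores differ from those on the slide. The point is only the enumerate-score-filter pattern with threshold T.

```python
from itertools import product

ALPHABET = "ADFI"                 # toy alphabet (assumption)
SIMILAR = {frozenset("AI"): 1}    # hypothetical "similar pair" score


def pair_score(a, b):
    """Toy substitution score: +5 identity, +1 for A/I, -3 otherwise."""
    if a == b:
        return 5
    return SIMILAR.get(frozenset((a, b)), -3)


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    result = {}
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            result["".join(candidate)] = score
    return result


neighbors = neighborhood("FDI", threshold=11)
print(neighbors)
```

With the full 20-letter alphabet the enumeration covers 8000 three-letter words per query word, which is still cheap.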
Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T
• Merge alignments in the same phase within distance A
• Choose significant HSPs
Altschul et al., 1997 Nucl. Acids Res. 25(17):3389-402
The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13.
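The two-hit rule described in the caption can be sketched as follows. This is a simplified illustration, not NCBI BLAST code: hits are (query_pos, db_pos) pairs, and an extension is triggered when two non-overlapping hits lie on the same diagonal within `max_dist` of each other (40 in the caption). The hit coordinates in the example are made up.

```python
from collections import defaultdict


def two_hit_triggers(hits, word_len, max_dist):
    """Return (diagonal, first_query_pos, second_query_pos) for each pair of
    non-overlapping hits on one diagonal that are within max_dist."""
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(q)
    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        last = positions[0]
        for pos in positions[1:]:
            if pos - last < word_len:
                continue              # overlaps the previous hit: ignore it
            if pos - last <= max_dist:
                triggers.append((diagonal, last, pos))
            last = pos
    return sorted(triggers)


example_hits = [(0, 5), (10, 15), (12, 17), (100, 105), (3, 50)]
triggers = two_hit_triggers(example_hits, word_len=3, max_dist=40)
print(triggers)
```

Only one of the five hits leads to an extension, which is the saving the two-hit heuristic is after.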
Extending words to HSPs

FSFLKDSAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVV--LDGKDG
|  | :   |: :||: || :||          | : || |  ||   |
FGDLSNPGAVMGNPKVKAHGKKV----------LHSFGEGVKHLDNLKG

• Find additional aligned words with a lower T
• Merge alignments in the same phase within distance A
• Choose significant HSPs: using the E-value equation, set a threshold so that 1 in 50 of the sequences will be passed to the next stage by chance.
Combining HSPs with dynamic programming

• Banded Smith-Waterman
• Basic principle: significant alignments cannot occur too far from the diagonal
• How far depends on the score
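A banded Smith-Waterman can be sketched in a few lines. This is a score-only illustration with a fixed band around the main diagonal and a linear gap penalty; the scoring values are arbitrary assumptions, and BLAST's own gapped extension adapts the explored region to the score rather than using a fixed band.

```python
def banded_sw(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |i - j| <= band.

    Cells outside the band are treated as score 0, i.e. an alignment may
    start at the band edge but cannot pass through cells outside the band.
    """
    best = 0
    prev = {}                                  # previous row: column -> score
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev.get(j - 1, 0) + s
            up = prev.get(j, 0) + gap
            left = cur.get(j - 1, 0) + gap
            cur[j] = max(0, diagonal, up, left)
            best = max(best, cur[j])
        prev = cur
    return best


print(banded_sw("ACGT", "ACGT", band=1))   # identical sequences
print(banded_sw("ACGT", "TACG", band=0))   # shifted by one, band too narrow
print(banded_sw("ACGT", "TACG", band=1))   # same pair, band wide enough
```

The cost drops from len(a) × len(b) cells to roughly len(a) × (2 × band + 1), which is why a long insert that leaves the band is missed.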
[Figures from Altschul et al., 1997 Nucl. Acids Res. 25(17):3389-402]
The BLAST algorithm

• Find High Scoring Pairs (HSPs)
– Eliminate most of the sequences
– Allow no gaps
– Identical->similar words by enumeration
• Find nearby double hits on the same diagonal
– Increase sensitivity; allow gaps in the initial HSP
• Combine HSPs with fast dynamic programming
– Get semi-optimal alignment
Summary

• The main heuristics
– Highly similar words
– Lower stringency, require double hit
– Extend until score drops
– Fast banded Smith-Waterman
• BLAST vs. dynamic programming
![Page 49: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/49.jpg)
FASTA: overview
• First fast algorithm for database searching
• Lost popularity to BLAST in the mid-90s
– No web implementation
– Slower
– Misses complex similarities
• Further drop with the introduction of BLAST2
![Page 50: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/50.jpg)
Algorithm overview
1. Search for non-gapped alignments based on identical words.
2. Merge alignments, allowing gaps.
3. Pick the best N hits, and realign using Smith-Waterman (SW).
Discard 99.9999%
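Step 1 can be sketched as a lookup of identical words followed by a count of hits per diagonal. This is a simplified illustration, not the FASTA program itself: the word size `k`, the function names and the choice to rank diagonals by raw hit count are assumptions.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map every length-k word in `seq` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count identical-word hits on each diagonal (subject_pos - query_pos)."""
    index = ktuple_index(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return counts


def best_diagonals(query, subject, k=2, top=3):
    """Return the `top` diagonals with the most identical-word hits."""
    return diagonal_hits(query, subject, k).most_common(top)
```

Diagonals with many hits mark candidate non-gapped alignments; the later steps would merge nearby diagonals with gaps and realign only the best-scoring candidates.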
![Page 51: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/51.jpg)
Additional reading
• BLAST articles
– Altschul et al. (1997), “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research 25(17):3389-3402
– Altschul et al. (1994), “Issues in searching molecular sequence databases”, Nature Genetics 6(2):119-129
• FASTA article
– Pearson (2000), “Flexible sequence similarity searching with the FASTA3 program package”, Methods Mol Biol 132:185-219
![Page 52: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/52.jpg)
Statistics
• The expected number of HSPs with a score of S or larger, when aligning two sufficiently large random sequences of lengths m and n, is given by
$$E = K m n \, e^{-\lambda S}$$
• Normalizing S by $S' = (\lambda S - \ln K)/\ln 2$, you get
$$E = m n \, 2^{-S'}$$
• And for a database with a total length N
$$E \approx m N \, 2^{-S'}$$
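The formulas above can be checked numerically with a short sketch. The function names are hypothetical, and the values of λ, K, S, m and N in the usage note are illustrative assumptions, not figures from the slides.

```python
import math


def raw_evalue(S, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S), using the raw score S."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def bit_evalue(S_bits, m, n):
    """E = m * n * 2^(-S'), using the normalized score S'."""
    return m * n * 2.0 ** (-S_bits)
```

For instance, with assumed values `lam=0.3176`, `K=0.134`, `S=100`, a query of length `m=250` and a database of total length `N=1e6`, `raw_evalue(100, 250, 1e6, 0.3176, 0.134)` and `bit_evalue(bit_score(100, 0.3176, 0.134), 250, 1e6)` agree up to floating-point rounding, because $2^{-S'} = K e^{-\lambda S}$.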
![Page 53: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/53.jpg)
Testing the results
[Figure: Rank/S' plot; log-scale axes labeled Rank and S']
![Page 54: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/54.jpg)
Testing the results
[Figure: Rank/S' plot; log-scale axes labeled Rank and S']
![Page 55: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/55.jpg)
Testing the results
[Figure: Rank/S' plot; log-scale axes labeled Rank and S'; legend: mays, misc]
![Page 56: Heuristic methods for sequence alignment1 Eitan Rubin & Shmuel Pietrokovski, Advanced Topic in Bioinformatics 2003, Weizmann Institute of Science Heuristic methods for sequence alignment](https://reader031.vdocument.in/reader031/viewer/2022011812/5e2ae01653c53a1f3b74d81b/html5/thumbnails/56.jpg)
Testing the results
[Figure: Rank/S' plot; log-scale axes labeled Rank and S']