![Page 1: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/1.jpg)
BLAST: Basic Local Alignment Search Tool
Altschul et al., J. Mol. Biol., 1990.
CS 466
Saurabh Sinha
![Page 2: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/2.jpg)
Motivation
• Sequence homology to a known protein suggests the function of a newly sequenced protein
• The bioinformatics task is to find homologous sequences in a database of sequences
• Databases of sequences growing fast
![Page 3: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/3.jpg)
Alignment
• A natural approach to check whether the “query sequence” is homologous to a sequence in the database is to compute the alignment score of the two sequences
• Alignment score counts gaps (insertions, deletions) and replacements
• Minimizing the evolutionary distance
![Page 4: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/4.jpg)
Alignment
• Global alignment: optimize the overall similarity of the two sequences
• Local alignment: find only relatively conserved subsequences
• Local similarity measures preferred for database searches
  – Distantly related proteins may only share isolated regions of similarity
![Page 5: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/5.jpg)
Alignment
• Dynamic programming is the standard approach to sequence alignment
• The algorithm is quadratic: its running time grows with the product of the lengths of the two sequences
• Not practical for searches against a very large database of sequences (e.g., a whole genome)
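To make the quadratic cost concrete, here is a minimal sketch of a local alignment score computed by dynamic programming (Smith-Waterman style). The match and mismatch values are the DNA scores quoted in these slides; the linear gap penalty of -6 is an assumption chosen only for illustration.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b (gap=-6 is an illustrative assumption)."""
    prev = [0] * (len(b) + 1)
    best = 0
    # Two nested loops over both sequences: O(len(a) * len(b)) time.
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is visited once per query/database pair, which is what makes this approach too slow when the database is very large.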
![Page 6: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/6.jpg)
Scoring alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)
• Amino acid sequences: “PAM” matrix
  – Consider amino acid sequence alignments for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances
• DNA sequences: match = +5, mismatch = -4
![Page 7: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/7.jpg)
BLAST: the MSP
• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues
• Maximal segment pair (MSP): highest scoring pair of identical length segments from the two sequences
• The similarity score of an MSP is called the MSP score
• BLAST heuristically aims to find this
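As a reference point for what BLAST approximates, the exact MSP score can be computed by brute force: every ungapped segment pair lies on one diagonal of the comparison, so it suffices to find the best-scoring contiguous run on each diagonal. The sketch below assumes the DNA scores from the previous slide and treats the empty segment as scoring 0; the function name is illustrative.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score by scanning every diagonal (quadratic time)."""
    best = 0
    # d = j - i identifies a diagonal of the comparison.
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        j = i + d
        run = 0
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            run = max(0, run + s)   # best segment ending at this pair
            best = max(best, run)
            i += 1
            j += 1
    return best
```

For example, `msp_score("ACGTACGT", "ACGAACGT")` spans the single mismatch, because seven matches and one mismatch score 7 * 5 - 4 = 31, more than either flanking run alone.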
![Page 8: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/8.jpg)
Locally maximal segment pair
• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair
• A segment pair (segments of identical length) is locally maximal if its score cannot be improved by extending or shortening in either direction
• BLAST attempts to find all locally maximal segment pairs above some score cutoff.
![Page 9: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/9.jpg)
Rapid approximation of MSP score
• Goal is to report those database sequences that have an MSP score above some threshold S.
• Statistics tells us the highest threshold S at which “chance similarities” are likely to appear
• Tractability to statistical analysis is one of the attractive features of the MSP score
![Page 10: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/10.jpg)
Rapid approximation of MSP score
• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.
• Main strategy: seek only segment pairs (one from database, one from query) that contain a word pair with score >= T
• Intuition: if the sequence pair has to score above S, its best-matched word (of some predetermined small length) must score at least T
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
![Page 11: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/11.jpg)
Implementation
1. Compile a list of high scoring words
2. Scan database for hits to this word list
3. Extend hits
![Page 12: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/12.jpg)
Step 1: Compiling list of words from query sequence
• For proteins: list of all w-length words that score at least T when compared to some word in the query sequence
• Question: Does every word in the query sequence make it to the list?
• For DNA: list of all w-length words in the query sequence, often with w=12
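The protein-style word list can be sketched by exhaustive enumeration. To keep the example small, it assumes a four-letter alphabet with the +5/-4 scores instead of a 20 x 20 PAM matrix; `word_score` and `neighborhood_words` are illustrative names, and enumerating all |alphabet|^w candidates is only practical for small w.

```python
from itertools import product

def word_score(u, v, match=5, mismatch=-4):
    """Ungapped score of two equal-length words (toy scoring, an assumption)."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))

def neighborhood_words(query, w, T, alphabet="ACGT"):
    """Map each w-length word scoring >= T against some query word
    to the list of query positions where it does so."""
    table = {}
    for letters in product(alphabet, repeat=w):
        cand = "".join(letters)
        for pos in range(len(query) - w + 1):
            if word_score(cand, query[pos:pos + w]) >= T:
                table.setdefault(cand, []).append(pos)
    return table

words = neighborhood_words("ACGTA", w=3, T=6)
```

With w=3 and T=6, each query word contributes itself plus its nine one-mismatch neighbors. The same code also bears on the question in the slide: a query word enters the list only if its score against itself is at least T, so with T=16 (above the maximum self-score of 15 here) the list is empty.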
![Page 13: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/13.jpg)
Step 2: Scanning the database for hits
• Find exact matches to list words
• Can be done in linear time
  – two methods (next slides)
• Each word in the list points to all occurrences of the word in the word list from the previous step
![Page 14: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/14.jpg)
Scanning the database for hits
• Method 1: Let w=4, so 20^4 possible words
• Each integer in 0 … 20^4 - 1 is an index for an array
• Each array element points to the list of all occurrences of that word in the query
• Not all 20^4 elements of the array are populated
  – only the ones in the word list from the previous step
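Method 1 can be sketched as a direct-address table: each 4-letter word is read as a base-20 number and used as an array index. The ordering of the amino acid letters in `AMINO` is an arbitrary assumption, all names are illustrative, and non-standard residue codes (such as X) are not handled.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard residues, order assumed
CODE = {aa: k for k, aa in enumerate(AMINO)}

def word_index(word):
    """Interpret a word as a base-20 integer in 0 .. 20**len(word) - 1."""
    idx = 0
    for aa in word:
        idx = idx * len(AMINO) + CODE[aa]
    return idx

def build_lookup(word_list, w=4):
    """word_list maps each list word to its query positions (from Step 1)."""
    table = [None] * (len(AMINO) ** w)    # most entries stay unpopulated
    for word, positions in word_list.items():
        table[word_index(word)] = positions
    return table

def scan(database_seq, table, w=4):
    """Return (query_pos, database_pos) for every exact match to a list word."""
    hits = []
    for j in range(len(database_seq) - w + 1):
        positions = table[word_index(database_seq[j:j + w])]
        if positions:
            hits.extend((i, j) for i in positions)
    return hits

table = build_lookup({"ACDE": [0], "CDEF": [1]})
hits = scan("WWACDEFWW", table)
```

This version recomputes the index for each window; updating it incrementally as the window slides would make each step constant time, which is what gives the linear-time scan.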
![Page 15: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/15.jpg)
Scanning the database for hits
• Method 2: use a “deterministic finite automaton” or “finite state machine”.
• Similar to the keyword trees seen in the course.
• Build the finite state machine out of all words in the word list from the previous step
![Page 16: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/16.jpg)
Step 3: Extending hits
• Once a word pair with score >= T has been found, extend it in each direction.
• Extend to see whether a score >= S can be obtained
• During extension, score may go up, and then down, and then up again
• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)
• One implementation allows gaps during extension
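A minimal sketch of ungapped hit extension with a drop-off rule follows. It assumes the +5/-4 DNA scores; the `drop` parameter (how far the running score may fall below the best score so far) and its default of 10 are illustrative assumptions, not values from the slides.

```python
def extend_hit(query, subject, qi, sj, w, match=5, mismatch=-4, drop=10):
    """Extend the word hit query[qi:qi+w] / subject[sj:sj+w] without gaps.
    Returns (best_score, query_start, query_end) of the extended segment."""
    def s(x, y):
        return match if x == y else mismatch

    best = sum(s(query[qi + k], subject[sj + k]) for k in range(w))

    # Extend to the right, remembering the best-scoring extension length.
    run, k, best_r = best, 0, 0
    while qi + w + k < len(query) and sj + w + k < len(subject):
        run += s(query[qi + w + k], subject[sj + w + k])
        k += 1
        if run > best:
            best, best_r = run, k
        elif best - run > drop:
            break                      # fell too far below the best score

    # Extend to the left, starting from the best right extension.
    run, k, best_l = best, 0, 0
    while qi - 1 - k >= 0 and sj - 1 - k >= 0:
        run += s(query[qi - 1 - k], subject[sj - 1 - k])
        k += 1
        if run > best:
            best, best_l = run, k
        elif best - run > drop:
            break

    return best, qi - best_l, qi + w + best_r

result = extend_hit("GGACGTACGG", "TTACGTACTT", qi=4, sj=4, w=2)
```

Here the two-letter seed `GT` grows to the shared segment `ACGTAC` (query positions 2 to 8, end exclusive) with score 30; the caller would then compare that score against S.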
![Page 17: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/17.jpg)
BLAST: approximating the MSP
• BLAST may not find all segment pairs above threshold S
• Trying to approximate the MSP
• Bounds on the error: not hard bounds, but statistical bounds
  – “Highly likely” to find the MSP
![Page 18: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/18.jpg)
Statistics
• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)
• Suppose this observed MSP scores S.
• What are the chances that the MSP score for two unrelated sequences would be >= S?
• If the chances are very low, then we can be confident that the two sequences are not unrelated
![Page 19: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/19.jpg)
Statistics
• Given two random sequences of lengths m and n
• Probability that they will produce an MSP score of >= x ?
![Page 20: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/20.jpg)
Statistics
• Number of separate SPs with score >= x is Poisson distributed with mean $y(x) = Kmn\,e^{-\lambda x}$, where
• λ is the positive solution of $\sum_{i,j} p_i\, p_j\, e^{\lambda s(i,j)} = 1$
• K is a constant
• s(i,j) is the scoring matrix, $p_i$ is the frequency of i in random sequences
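The equation for λ has no closed form in general, but it is easy to solve numerically. The sketch below uses bisection and assumes uniform base frequencies with the +5/-4 DNA scores; `solve_lambda` is an illustrative name. A positive solution exists only when the expected score per aligned pair is negative and at least one score is positive, which the code checks.

```python
import math

def solve_lambda(freqs, score, tol=1e-12):
    """Positive root of sum_{i,j} p_i p_j exp(lam * s(i,j)) = 1, by bisection."""
    letters = list(freqs)
    pairs = [(freqs[a] * freqs[b], score(a, b)) for a in letters for b in letters]
    if sum(p * s for p, s in pairs) >= 0 or max(s for _, s in pairs) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:          # grow until f is positive
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:          # shrink until f is negative (f < 0 just above 0)
        lo /= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

dna_freqs = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(dna_freqs, lambda a, b: 5 if a == b else -4)
```

Under these assumptions the equation reduces to 0.25 e^{5λ} + 0.75 e^{-4λ} = 1, and the root lies a little above 0.19.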
![Page 21: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/21.jpg)
Statistics
• Poisson distribution with mean y: $\Pr(k) = \dfrac{e^{-y}\, y^{k}}{k!}$
• $\Pr(\#\text{SPs} \ge \alpha) = 1 - \Pr(\#\text{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \dfrac{y^{i}}{i!}$
![Page 22: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/22.jpg)
Statistics
• For α=1, $\Pr(\#\text{SPs} \ge 1) = 1 - e^{-y(x)}$
• Choose S such that $1 - e^{-y(S)}$ is small
• Suppose the probability of having at least 1 SP with score >= S is 0.001.
• This seems reasonably small
• However, if you test 10000 random sequences, you expect 10 to cross the threshold
• Therefore, require “E-value” to be small.
• That is, expected number of random sequence pairs with score >= S should be small.
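The quantities on these slides can be computed directly. In the sketch below, the function names are illustrative, and any values passed for K and λ are placeholders the caller must supply; the slides give no numeric value for K.

```python
import math

def expected_segment_pairs(x, m, n, lam, K):
    """y(x) = K * m * n * exp(-lam * x): mean number of chance SPs scoring >= x."""
    return K * m * n * math.exp(-lam * x)

def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) for a Poisson count with mean y."""
    below = sum(math.exp(-y) * y ** i / math.factorial(i) for i in range(alpha))
    return 1.0 - below

def expected_chance_hits(p_single, num_sequences):
    """Expected number of random sequences crossing the threshold."""
    return p_single * num_sequences

# The slide's example: a per-sequence probability of 0.001 over 10000 sequences.
chance_hits = expected_chance_hits(0.001, 10000)
```

For α=1 the sum has a single term, so `prob_at_least(1, y)` reduces to 1 - e^{-y}. The last function shows why a small per-comparison probability is not enough on its own: the expected count scales with the number of database sequences searched.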
![Page 23: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/23.jpg)
More statistics
• We just saw how to choose threshold S
• How to choose T ?
• BLAST is trying to find segment pairs (SPs) scoring above S
• If an SP scores S, what is the probability that it will have a w-word match of score T or more?
• We want this probability to be high
![Page 24: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/24.jpg)
More statistics: Choosing T
• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring T or more?
• Want this q to be low
• Obtained from simulations
• Found to decrease exponentially as S increases
![Page 25: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/25.jpg)
BLAST is the universally used bioinformatics tool
![Page 26: BLAST: Basic Local Alignment Search Tool …veda.cs.uiuc.edu/courses/fa08/cs466/lectures/Lecture17.pdfBLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS](https://reader036.vdocument.in/reader036/viewer/2022062506/5f0334147e708231d4080c76/html5/thumbnails/26.jpg)
http://flybase.org/blast/