sequence alignment algorithms02710/lectures/seqalign2015.pdf · 50 fasta algorithm • the method:...
TRANSCRIPT
![Page 1: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/1.jpg)
Sequence Alignment Algorithms
DEKM book
Notes from Dr. Bino John
and Dr. Takis Benos
1
![Page 2: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/2.jpg)
To Do
• Global alignment• Local alignment• Gaps
– Affine Gaps– Algorithm (blackboard)
• Statistical Significance– Notes (blackboard)
• Read up on database searches– BLAST– FASTA– CS tricks: suffix tree, …
• PSSMs and Multiple Sequence Alignments
2
![Page 3: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/3.jpg)
Why compare sequences?
• Given a new sequence, infer its function based on similarity to another sequence
• Find important molecular regions – conserved across species
3
![Page 4: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/4.jpg)
Why compare sequences? Do more..
• Determine the evolutionary constraints at work
• Find mutations in a population or family of genes
• Find similar looking sequence in a database
• Find secondary/tertiary structure of a sequence of interest – molecular modeling using a template (homology modeling)
4
![Page 5: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/5.jpg)
Sequence alignment
• Are two sequences related?– Align sequences or parts of them
– Decide if alignment is by chance or evolutionarily linked?
• Issues:– What sorts of alignments to consider?
– How to score an alignment and hence rank?
– Algorithm to find good alignments
– Evaluate the significance of the alignment
5
![Page 6: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/6.jpg)
6
![Page 7: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/7.jpg)
7
Dynamic Programming
We apply dynamic programming when:
• There is only a polynomial number of subproblems– Align x1…xi to y1…yj
• Original problem is one of the subproblems– Align x1…xM to y1…yN
• Each subproblem is easily solved from smaller subproblems
![Page 8: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/8.jpg)
8
Global alignment
Mij
ST
i-1
i
j-1 j
Mi,j = MAX
Mi-1, j-1 + Score(Si,Tj )
Mi,j-1 + γ
Mi-1,j + γ
DNA matrix
PAM
BLOSUM
Gap penalty
Needleman & Wunsch, 1970
![Page 9: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/9.jpg)
9
![Page 10: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/10.jpg)
10
![Page 11: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/11.jpg)
11
![Page 12: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/12.jpg)
12
![Page 13: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/13.jpg)
13
Alignment: adding scores (cntd)
Score(match) = 1Score(mismatch) = 0Score(gap) = 0
![Page 14: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/14.jpg)
14
Alignment: adding scores
![Page 15: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/15.jpg)
15
Alignment: adding scores (cntd)
(Seq #1) A
|
(Seq #2) A Alignment:
![Page 16: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/16.jpg)
16
Alignment: adding scores (cntd)
(Seq #1) T A
|
(Seq #2) - A Alignment:
![Page 17: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/17.jpg)
17
Alignment: adding scores (cntd)
(Seq #1) G A A T T C A G T T A
| | | | | |
(Seq #2) G G A - T C - G - - A Alignment:
6 matches, 1 mism., 4 gaps
![Page 18: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/18.jpg)
18
![Page 19: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/19.jpg)
19
![Page 20: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/20.jpg)
20
Local alignment
Given two sequences, S and T, find two subsequences, s and t, whose alignment has the highest “score” amongst all subsequence pairs.
Question: Why do we need local alignment, if we have the global one?
![Page 21: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/21.jpg)
21
Local alignment: an example
EGR4_HUMAN KA [FACPVESCVRSFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTSHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]
EGR4_RAT KA [FACPVESCVRTFARSDELNRHLRIH] TGHKP [FQCRICLRNFSRSDHLTTHVRTH] TGEKP [FACDV--CGRRFARSDEKKRHSKVH]
EGR3_HUMAN RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR3_RAT RP [HACPAEGCDRRFSRSDELTRHLRIH] TGHKP [FQCRICMRSFSRSDHLTTHIRTH] TGEKP [FACEF--CGRKFARSDERKRHAKIH]
EGR1_HUMAN RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_MOUSE RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_RAT RP [YACPVESCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDI--CGRKFARSDERKRHTKIH]
EGR1_BRARE RP [YACPVETCDRRFSRSDELTRHIRIH] TGQKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACEI--CGRKFARSDERKRHTKIH]
EGR2_RAT RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_XENLA RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_MOUSE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_HUMAN RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDY--CGRKFARSDERKRHTKIH]
EGR2_BRARE RP [YPCPAEGCDRRFSRSDELTRHIRIH] TGHKP [FQCRICMRNFSRSDHLTTHIRTH] TGEKP [FACDF--CGRKFARSDERKRHTKIH]
MIG1_KLULA -- [-------------------------] ---RP [YVCPICQRGFHRLEHQTRHIRTH] TGERP [HACDFPGCSKRFSRSDELTRHRRIH]
MIG1_KLUMA -- [-------------------------] ---RP [YMCPICHRGFHRLEHQTRHIRTH] TGERP [HACDFPGCAKRFSRSDELTRHRRIH]
MIG1_YEAST -- [-------------------------] ---RP [HACPICHRAFHRLEHQTRHMRIH] TGEKP [HACDFPGCVKRFSRSDELTRHRRIH]
MIG2_YEAST -- [-------------------------] ---RP [FRCDTCHRGFHRLEHKKRHLRTH] TGEKP [HHCAFPGCGKSFSRSDELKRHMRTH]
[ ] :* [. * * * * * :* . *:* *] ***:* [. * * : *:**** .** : *]
![Page 22: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/22.jpg)
22
Local alignment (cntd)
Mij
ST
i-1
i
j-1 j
Mi,j = MAXMi-1, j-1 + Score(Si,Tj )
Mi,j-1 + γ
Mi-1,j + γ
DNA matrix
PAM
BLOSUM
Gap penalty
0
Smith & Waterman, 1981 Similarity Scoring Expected value:negative for random alignmentspositive for highly similar sequences
![Page 23: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/23.jpg)
23
The Smith-Waterman Algorithm
1. Initialization
F(0,0) = F(0,j) = F(i,0) = 0
2. Iteration
for i=1,…,M
for j=1,…,N
- calculate optimal F(i,j)
- store Ptr(i,j)
3. Termination
• Find the end of the best alignment with FOPT = max{i,j} F(i,j) and trace back OR
• Find all alignments with F(i,j) > threshold and trace back
![Page 24: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/24.jpg)
24
Local vs. global alignment
![Page 25: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/25.jpg)
25
Local vs. global alignment (cntd)
![Page 26: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/26.jpg)
26
Local alignment (cntd)
Characteristics of local alignments:
• The alignment can start/end at any point in the matrix.
• No negative scores in the alignment.
• The mean value of the scoring matrix (e.g. PAM, BLOSUM) should be negative, but there should be positive scores in the scoring matrix.
![Page 27: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/27.jpg)
27
Scoring the gaps more accurately
• A naive model
Gap penalty is linear to the gap length
Nature “prefers” to place gaps where other gaps exist
• Convex gap penalty function
γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)
Time O(N2M) Space O(NM)
(assume N>M)
γ(n)
γ(n)
![Page 28: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/28.jpg)
28
Scoring gaps: affine gaps
• Affine gaps: a compromise between linear and convex gap penalties
γ(n) = -d - e * (n-1)
d: gap initiation penalty
e: gap extension penaltyd
e
γ(n)
![Page 29: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/29.jpg)
29
![Page 30: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/30.jpg)
30
![Page 31: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/31.jpg)
31
![Page 32: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/32.jpg)
32
![Page 33: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/33.jpg)
33
![Page 34: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/34.jpg)
34
![Page 35: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/35.jpg)
35
![Page 36: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/36.jpg)
36
![Page 37: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/37.jpg)
37
![Page 38: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/38.jpg)
38
Database searches
![Page 39: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/39.jpg)
39
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids
![Page 40: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/40.jpg)
40
DNA and protein databases
• EMBL/GenBank/DDBJ database of nucleic acids (cntd)
![Page 41: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/41.jpg)
41
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins
![Page 42: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/42.jpg)
42
DNA and protein databases
• SWISS-PROT & TrEMBL database of proteins
![Page 43: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/43.jpg)
43
Database searches
• Database searching consists of many pairwise alignments combined in one search.
• It helps determining the function and the evolutionary relationships
• Heuristic algorithms are used instead of DP. Why?
• Size of SWISS-PROT + TrEMBL (Rel. 9.5):
3.9M entries or 1,276M residues.
• Exact algorithms are O(NM) fast.
• Heuristic methods can look at a small fraction of the searching space that will include all (or most) of the high scoring pairs.
![Page 44: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/44.jpg)
44
BLAST algorithm
• Basic Local Alignment Search Tool - The method:
• For each “word” (of fixed-length) in the query sequence, make a list of all neighbouring “words” that score above some threshold.
• Scan the database for these words.
• Perform (ungapped) “hit extension” until score < threshold.
• Stop at maximum scoring extension.
![Page 45: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/45.jpg)
45
BLAST algorithm (cntd)
• An example:
CPICHRAFHRLEHQTRHMRIHTGEKPHACQuery:
HMR
HHR
HIR
HAR
…
18
-2+13
+1+13
-1+13
…
BLOSUM62
HMR
selection
HMR
HIR
.
.
.
![Page 46: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/46.jpg)
46
BLAST algorithm (cntd)
• An example:
CPICHRAFHRLEHQTRHMRIHTGEKPHAC
H+R
Query:
CPLCDKAFHRLEHQTRHIRTHTGEKPHACSbjct:
![Page 47: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/47.jpg)
47
BLAST algorithm (cntd)
• An example:
CPICHRAFHRLEHQTRHMRIHTGEKPHAC
H+R
Query:
CPLCDKAFHRLEHQTRHIRTHTGEKPHACSbjct:
CP+C +AFHRLEHQTR HTGEKPHAC
![Page 48: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/48.jpg)
48
BLAST algorithm (cntd)
• The idea: a high scoring match alignment is very likely to contain a short stretch of very high scoring matches.
• Word length: 3 (proteins) and 11 (DNA).
• HSSP: multiple HSSPs can be reported for each database entry.
• Gapped alignments: more recent BLAST versions perform gapped alignments.
![Page 49: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/49.jpg)
49
BLAST flavours
Query:
Database:
DNA Protein
DNA Protein
BL
AS
TN
BL
AS
TP
BLASTX
TBLASTN
TBLASTX: DNA Query to DNA Database via translation
![Page 50: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/50.jpg)
50
FASTA algorithm
• The method:
• For each pair of sequences (query, subject), identify all identical “word” matches of (fixed) length.
• Look for diagonals with many mutually supporting “word” matches.
• The best diagonals are used to extend the word matches to find the maximal scoring (ungapped) regions.
• Join ungapped regions, using gap costs.
• Align the two (sub)regions using full dynamic programming techniques.
![Page 51: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/51.jpg)
51
FASTA algorithm (cntd)
![Page 52: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/52.jpg)
52
FASTA algorithm (cntd)
• The idea: a high scoring match alignment is very likely to contain a short stretch of identities.
• Word length: 2 (proteins) and 4-6 (DNA).
• HSSP: usually one (extended) gapped alignment is presented.
![Page 53: Sequence Alignment Algorithms02710/Lectures/SeqAlign2015.pdf · 50 FASTA algorithm • The method: • For each pair of sequences (query, subject), identify all identical word matches](https://reader035.vdocument.in/reader035/viewer/2022062505/5eddd1adad6a402d6669057d/html5/thumbnails/53.jpg)
53
FASTA flavours
Query:
Database:
DNA Protein
DNA Protein
FA
ST
A3
FA
ST
A3
FASTX3
TFASTA3