local and global alignment, database searching with blast
Post on 11-Jan-2016
Sequence Database Searching
Local and Global Alignment, Database Searching With Blast
Original Presentations:
Hugues Sicotte
National Center for Biotechnology Information
sicotte@ncbi.nlm.nih.gov
Adaptation:
Alan Durham
University of São Paulo
alan@ime.usp.br
Sequence Database Searching
Alignment definition and types:
Alignment: each base is used at most once.
Global alignment: all bases are aligned with another base or with a gap (written “-” or sometimes “.”).
Local alignments: do not need to align all the bases in all sequences.
Example: align BILLGATESLIKESCHEESE and GRATEDCHEESE
G-ATESLIKESCHEESE    or    G-ATES & CHEESE
GRATED-----CHEESE          GRATED & CHEESE
Sequence Database Searching
C O M P A R A T I V E A N A L Y S I S
GATTATACCA
GATTA---CA
Insertions and deletions (‘indels’) are represented by gaps in alignments
gap of length 3
Sequence Database Searching
Score and Statistics
G-ATESLIKESCHEESE AND/OR G-ATES & CHEESE
GRATED-----CHEESE GRATED & CHEESE
Percent Identity. Can be misleading.
Score: A simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment.
identical bases : “match” score
mismatching bases: “mismatch” score
gaps (optional): “gap opening” penalty for starting a gap, plus a “gap extension” penalty for each gap symbol.
Example: match = +1, mismatch = -1, gap opening = -5, gap extension = -1
Score = 10*(+1) + 1*(-1) + (-5-1) + (-5+5*(-1)) = -7
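The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the two aligned strings have equal length and that "-" marks a gap; the function name score_alignment is hypothetical, and the defaults are the example values from the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped strings column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open   # charged once when a gap starts
            score += gap_extend     # charged for every gap symbol
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(score_alignment("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```

As a simplification, a gap in one sequence immediately followed by a gap in the other is treated as a single gap run.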
Sequence Database Searching
S C O R I N G S Y S T E M S
Which alignment is “better”?
Alignment with 3 mismatches, 1 gap:
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Alignment with 0 mismatches, 5 gaps:
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Sequence Database Searching
S C O R I N G S Y S T E M S
High penalty for “opening” a gap (e.g. G = 5)
Lower penalty for “extending” a gap (e.g. L = 1)
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
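The two penalties can be checked with a short sketch. It assumes gaps are written as "-" and that only the gapped sequence of each alignment needs to be inspected; gap_penalty is a hypothetical helper name, and G and L default to the example values above.

```python
import re


def gap_penalty(aligned, G=5, L=1):
    """Total penalty: G for each gap opened plus L for each gap symbol."""
    runs = re.findall(r"-+", aligned)  # each run of dashes is one gap
    return G * len(runs) + L * sum(len(run) for run in runs)


one_gap = "GCTACTAGTT------CGCTTAGC"
five_gaps = "GCTACTAG-T-T--CGC-T-TAGC"
print(gap_penalty(one_gap))    # 11
print(gap_penalty(five_gaps))  # 31
```

Both alignments contain six gap symbols, so the difference comes entirely from the number of gap openings.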
Sequence Database Searching
L O C A L S I M I L A R I T Y
Figure 7.3: Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT)
F12: F2 E F1 E K Catalytic
PLAT: F1 E K K Catalytic
F1, F2: Fibronectin repeats; E: EGF similarity domain; K: Kringle domain; Catalytic: Serine protease activity
Mix-and-match protein modules confound alignment algorithms
Sequence Database Searching
L O C A L S I M I L A R I T Y
Figure 7.3 (repeated), highlighting modules in reverse order
Sequence Database Searching
L O C A L S I M I L A R I T Y
Figure 7.3 (repeated), highlighting repeated modules
Sequence Database Searching
D O T P L O T S
Figure 7.4
Dot-plot. Fitch: Biochem. Genet. (1969) 3, 99-108
    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
Sequence Database Searching
D O T P L O T S
Figure 7.4b
Dot-plot. Fitch: Biochem. Genet. (1969) 3, 99-108
We can also score a sliding window instead of one position at a time. For example, a window of 3 nucleotides, scoring 1 for identical triplets and 0 for all other combinations, yields:
      CGT GTA TAC ACC CCG CGT
ACG    0   0   0   0   0   0
CGT    1   0   0   0   0   1
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
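A dot plot for the small ACGT versus CGTACCGT example can be sketched as follows. This is only an illustration: dot_plot is a hypothetical function name, and exact identity within the window is assumed as the scoring rule (window=1 gives the single-position plot, window=3 the triplet plot).

```python
def dot_plot(vertical, horizontal, window=1):
    """Return a 0/1 matrix: 1 where the two windows are identical."""
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    return [
        [1 if vertical[i:i + window] == horizontal[j:j + window] else 0
         for j in range(cols)]
        for i in range(rows)
    ]


for row in dot_plot("ACGT", "CGTACCGT"):
    print(*row)
print()
for row in dot_plot("ACGT", "CGTACCGT", window=3):
    print(*row)
```

A larger window removes isolated single-letter matches and leaves only the dots that belong to longer diagonal runs.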
Sequence Database Searching
D O T P L O T S
Figure 7.4: dot plot of Coagulation Factor XII (F12) against Tissue Plasminogen Activator (PLAT)
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
Sequence Database Searching
D O T P L O T S
Figure 7.4: dot plot of Coagulation Factor XII (F12) against Tissue Plasminogen Activator (PLAT), with the protein modules (F1, F2, E, K, Catalytic) labeled along both axes
Plot dots for high similarity within a short window
Adjacent dots merge to form diagonal segments
Sequence Database Searching
D O T P L O T S
Figure 7.4: dot plot of Coagulation Factor XII (F12) against Tissue Plasminogen Activator (PLAT), with the protein modules (F1, F2, E, K, Catalytic) labeled along both axes
Repeated domains show a characteristic pattern
Sequence Database Searching
P A T H G R A P H S
Figure 7.5: path graph for the EGF similarity domains of urokinase plasminogen activator (PLAU, residues 90-137) and tissue plasminogen activator (PLAT, residues 23-72)
PLAU 90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT 23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE 72
Dot plots suggest paths through the alignment space
Path graphs are more explicit representations
Each path is a unique alignment
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
Dynamic Programming Example
Construct an optimal alignment of these two sequences:
G A T A C T A
G A T T A C C A
Using these scoring rules:
Match: +1
Mismatch: -1
Gap: -1
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
Arrange the sequence residues along a two-dimensional lattice
Vertices of the lattice fall between letters
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
The goal is to find the optimal path from one corner of the lattice to the opposite corner
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
Each path corresponds to a unique alignment
Which one is optimal?
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
The score for a path is the sum of its incremental edge scores
A aligned with A: Match = +1
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
The score for a path is the sum of its incremental edge scores
A aligned with T: Mismatch = -1
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
The score for a path is the sum of its incremental edge scores
T aligned with NULL: Gap = -1
NULL aligned with T: Gap = -1
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with the scores of the first few vertices filled in]
Incrementally extend the path
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with more vertex scores filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with more vertex scores filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with more vertex scores filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with more vertex scores filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with most vertex scores filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with every vertex score filled in]
Incrementally extend the path
Remember the best sub-path leading to each point on the lattice
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other, with every vertex score filled in]
Trace-back to get optimal path and alignment
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G
[Lattice figure: GATACTA along one axis, GATTACCA along the other]
Print out the alignment
GA-TACTA
GATTACCA
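The lattice-filling and trace-back steps of this example can be sketched in Python. This is a minimal sketch under the slide's scoring rules (match +1, mismatch -1, gap -1); the function name needleman_wunsch and the tie-breaking order in the trace-back (diagonal first) are assumptions, so other equally optimal alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score of any path aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # a[i-1] aligned with NULL
                          F[i][j - 1] + gap)     # NULL aligned with b[j-1]
    # Trace back from the last vertex to the first one.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(score)   # 4
print(top)     # GA-TACTA
print(bottom)  # GATTACCA
```

Each cell only looks at its three already-computed neighbours, which is why remembering the best sub-path to every vertex is enough to recover the optimal path.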
Sequence Database Searching
Two different types of Alignment
Global alignment methods:
Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453): Problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to build a matrix of path scores, where each entry is the score of the best path leading to that point. Trace the optimal path from one end of the two sequences to the other.
Local alignment methods:
Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest-scoring points in the path graph.
FASTP (Lipman & Pearson (1985), Science 227, 1435-1441)
BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): Don’t report all overlapping paths; only attempt to find paths where there are high-scoring words. This speeds up the alignments considerably.
Sequence Database Searching
G L O B A L & L O C A L S I M I L A R I T Y
Implementations of dynamic programming for global and local similarities
Optimal global alignment
Needleman & Wunsch (1970)
Sequences align essentially from end to end
Optimal local alignment
Smith & Waterman (1981)
Sequences align only in small, isolated regions
Sequence Database Searching
Local and Global Alignments: computing the matrix
global alignment: as shown in previous slides
local alignment: change the computation so that no cell ever holds a negative value
– when the value of a cell would be negative, set it to zero (this means starting another path)
– the best local alignment comes from the entry in the matrix with the maximum value
semi-global alignment:
– good in assembling
– ignore gaps at the end and at the beginning of sequences
– to ignore gaps at the beginning of the alignment: zeroes in first column and first row
– to ignore gaps at the end of the alignment: pick the maximum value of the last row and last column
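The three variants differ only in how the matrix is initialised, whether negative cells are reset to zero, and where the final score is read. The sketch below illustrates this with the simple match/mismatch/gap scores of the earlier example; best_score and its mode strings are hypothetical names, and only the score is returned (no trace-back).

```python
def best_score(a, b, mode="global", match=1, mismatch=-1, gap=-1):
    """Alignment score for mode 'global', 'local' or 'semiglobal'."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if mode == "global":
        # Leading gaps are penalised only in the global case.
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            # Local alignment never keeps a negative value.
            F[i][j] = max(cell, 0) if mode == "local" else cell
    if mode == "local":
        return max(max(row) for row in F)          # best cell anywhere
    if mode == "semiglobal":
        last_row = max(F[n])
        last_col = max(row[m] for row in F)
        return max(last_row, last_col)             # trailing gaps are free
    return F[n][m]


print(best_score("GATACTA", "GATTACCA", "global"))     # 4
print(best_score("GATACT", "GATTACCA", "semiglobal"))  # 3
print(best_score("GATACT", "GATTACCA", "local"))       # 4
```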
Sequence Database Searching
D Y N A M I C P R O G R A M M I N G: semi-global alignment
G A T A C T
G A T T A C C A
(we eliminated the last symbol from one of the sequences)
•fill first row and first column with zeroes
•choose the best score from scores in last row and column
•in this problem, 2 solutions
[Lattice figure: GATACT along one axis, GATTACCA along the other, with zeroes in the first row and first column and the remaining vertex scores filled in]
Sequence Database Searching
Match/mismatch scores and Statistics
•Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.
•Use a scoring system that does not heavily penalize mutations to a similar amino acid.
•PAM matrices: Point Accepted Mutation. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others don’t, so using the PAM100 matrix doesn’t mean that the sequences compared were 100% mutated.
•BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.
Sequence Database Searching
Alignment methods
[Dot plot figure: query sequence on the horizontal axis, subject sequence on the vertical axis]
Sequence Alignment representation using a dot plot.
For a query of N letters against a subject sequence of M letters, it requires MxN comparisons.
Sequence Database Searching
H A S H I N G M E T H O D S
Hashing is a common method for accelerating database searches
query sequence: MLIIKRDELVISWASHERE
all overlapping words of size 3: MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
Compile “dictionary” of words from the query sequence. Put each word in a look-up table that points to the original position in the sequence. Thus given one word, you can know if it is in the query in a single operation.
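The dictionary of query words can be sketched with an ordinary Python dict. This is an illustration only: build_word_table is a hypothetical name, and a hash table is assumed here in place of whatever look-up structure a real search program uses.

```python
from collections import defaultdict


def build_word_table(query, w=3):
    """Map every overlapping word of length w to its start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - w + 1):
        table[query[pos:pos + w]].append(pos)
    return table


table = build_word_table("MLIIKRDELVISWASHERE")
print(len(table))      # 17 distinct words
print(table["KRD"])    # [4]  (0-based position in the query)
print("QQQ" in table)  # False: a single look-up answers the question
```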
Sequence Database Searching
Index lookup
Each word is assigned a unique integer.
E.g. for a word of 3 letters made up of an alphabet of 20 letters.
1. Assign a code to each letter Code(l) (0 to 19)
2. For a word of 3 letters L1 L2 L3 the code is
index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
3. Have an array, indexed by this code, with a list of the positions that have that word.
[Figure: array entries pointing to the position(s) of each word in the query sequence]
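The integer code can be sketched as a base-20 number. The letter ordering in ALPHABET below is an assumption made for illustration (any fixed assignment of codes 0 to 19 works), and word_index and build_position_array are hypothetical names.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acids
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_index(word):
    """Base-20 code: Code(L1)*20^2 + Code(L2)*20 + Code(L3) for 3 letters."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index


def build_position_array(query, w=3):
    """Array of 20^w lists; entry k holds the query positions of word k."""
    positions = [[] for _ in range(len(ALPHABET) ** w)]
    for pos in range(len(query) - w + 1):
        positions[word_index(query[pos:pos + w])].append(pos)
    return positions


positions = build_position_array("MLIIKRDELVISWASHERE")
print(word_index("AAA"), word_index("YYY"))  # 0 7999
print(positions[word_index("KRD")])          # [4]
```

An array indexed this way trades memory (8000 entries for 3-letter protein words) for a look-up that needs no hashing or string comparison.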
Sequence Database Searching
H A S H I N G M E T H O D S
Building the dictionary for the query sequence requires (N-2) operations.
query sequence: MLIIKRDELVISWASHERE
all overlapping words of size 3: MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
The database contains (M-2) words, and it takes only one operation to see if the word was in the query.
Sequence Database Searching
H A S H I N G M E T H O D S
[Dot plot figure: query sequence on the horizontal axis, subject sequence on the vertical axis]
Scan the subject, looking up words in the dictionary
Use word hits to determine where to search for alignments
This locates the regions of the dynamic programming matrix worth filling in (N-2)+(M-2) operations instead of MxN.
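Scanning the subject against the query dictionary can be sketched as below. The subject string is made up for the illustration, word_hits is a hypothetical name, and each hit is reported with its diagonal (subject position minus query position), since hits on the same diagonal are the ones that line up in an ungapped alignment.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos, diagonal) for every shared word."""
    table = defaultdict(list)
    for q in range(len(query) - w + 1):       # (N - w + 1) operations
        table[query[q:q + w]].append(q)
    hits = []
    for s in range(len(subject) - w + 1):     # (M - w + 1) look-ups
        for q in table.get(subject[s:s + w], ()):
            hits.append((q, s, s - q))
    return hits


# Hypothetical subject sharing the stretch IKRDELV with the query.
for hit in word_hits("MLIIKRDELVISWASHERE", "AAIKRDELVAA"):
    print(hit)
```

In this example all five hits fall on diagonal -1, which marks the one region worth examining with a full alignment.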
Sequence Database Searching
Blast: extending good hits
BLAST pre-processes the target sequence set
builds lists of hits for each possible word
– 3-tuples for proteins: 20^3 = 8000 different words
– for each word, find which ones have a “good match”
• threshold of 13 in the old version, 11 in the new version
for the “good ones”, get the list of sequences in the database that have it
BLAST (old): extend the match both ways while the score is increasing
BLAST2 (new):
– when two words are found in the same “diagonal” within a “short” distance, extend an ungapped alignment.
– continue extension like old BLAST
get local alignments with score greater than the cutoff score
perform SW (Smith-Waterman) on best candidates
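The "extend both ways" step can be sketched as an ungapped extension from a word hit. This is a simplified illustration, not BLAST's actual code: it assumes plain match/mismatch scores instead of a substitution matrix, and it stops extending once the running score has fallen more than an assumed drop value below the best score seen. extend_hit, drop and the subject string are all hypothetical.

```python
def extend_hit(query, subject, q, s, w=3, match=1, mismatch=-1, drop=3):
    """Extend the word hit query[q:q+w] / subject[s:s+w] without gaps.

    Returns (score, query_start, query_end) of the best ungapped segment.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def grow(qi, si, step):
        # Walk away from the seed, remembering the best score reached.
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += pair_score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop:
                break  # score has dropped too far: stop extending
            qi += step
            si += step
        return best, best_len

    seed = sum(pair_score(query[q + k], subject[s + k]) for k in range(w))
    right, right_len = grow(q + w, s + w, +1)
    left, left_len = grow(q - 1, s - 1, -1)
    return seed + left + right, q - left_len, q + w + right_len


score, start, end = extend_hit("MLIIKRDELVISWASHERE", "AAIKRDELVAA", q=4, s=3)
print(score, start, end)  # 7 3 10
```

Here the seed KRD grows into the segment IKRDELV, i.e. query[3:10].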
Sequence Database Searching
H A S H I N G M E T H O D S
[Dot plot figure: query sequence on the horizontal axis, database sequence on the vertical axis]
Scan the database, looking up words in the dictionary
Use word hits to determine where to search for alignments
BLAST extends from word hits
Sequence Database Searching
H A S H I N G M E T H O D S
[Dot plot figure: query sequence on the horizontal axis, database sequence on the vertical axis]
Scan the database, looking up words in the dictionary
Use word hits to determine where to search for alignments
BLAST2 extends pairs in same diagonal first
Sequence Database Searching
Database Search Space
[Figure: query sequence on the horizontal axis, concatenated database sequence on the vertical axis]
The simplest database search can be viewed as one large dynamic programming problem, with all the database sequences concatenated one after another.
Sequence Database Searching
Database Search Space
[Figure: query sequence on the horizontal axis, concatenated database sequence on the vertical axis]
Which alignment is more significant?
Sequence Database Searching
Database Search Space
[Figure: query sequence on the horizontal axis, concatenated database sequence on the vertical axis]
Score can be used to judge alignments, but the absolute value of a score is a function of the scoring parameters.
Match=+1, Mismatch=-1, Gap_open=5, gap_extend=1
yields the same alignments as
Match=+10, Mismatch=-10, Gap_open=50, gap_extend=10
Scores are useful for relative ranking.
Sequence Database Searching
Database Search Space
[Figure: query sequence on the horizontal axis, concatenated database sequence on the vertical axis]
To judge the relevance of an alignment, we need to judge whether the match is significant.
E-value = Expect(S) is a function of the score, database size and composition, and query size.
It is the number of alignments with scores >= S expected if the query were a random sequence, given the database size and composition.
An Expect of 0.0 means a very good match that is unlikely to be random.
Sequence Database Searching
Aligning sequences in databases: evaluating significance
we can align a sequence with any other one
we want good alignments that are statistically significant
when searching databases, statistical relevance needs to be computed too
E value: number of hits a random sequence of the same size would get in a database of the same size
Sequence Database Searching
E-value
“Hits” can be sorted according to their E-value or their score.
The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.
E-value: number of alignments with a score >= S that you expect to find if the database were a collection of random letters.
e.g. a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find an insignificant number of alignments that could be due to chance.
E-values of less than 1e-6 (1 × 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.
Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…)
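The slides do not give a formula for the E-value. One commonly used form, from Karlin-Altschul statistics, is E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The sketch below assumes that form; K and lambda depend on the scoring system, and the default values used here are illustrative assumptions only.

```python
import math


def expect_value(score, query_len, db_len, K=0.134, lam=0.318):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are assumed, illustrative parameters; real values depend on
    the substitution matrix and gap penalties in use.
    """
    return K * query_len * db_len * math.exp(-lam * score)


# Higher scores are exponentially less likely to arise by chance,
# and a larger database gives proportionally more chance hits.
for s in (20, 40, 60, 80):
    print(s, expect_value(s, query_len=150, db_len=30_000_000))
```

This makes the dependence on score, query length and database size described above explicit: doubling the database doubles E, while each extra unit of score divides it by a constant factor.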
Sequence Database Searching
D A T A B A S E S E A R C H I N G
The “hit list” gives titles and scores for matched sequences
> fasta myquery swissprot -ktup 2
The best scores are: initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3
Sequence Database Searching
D A T A B A S E S E A R C H I N G
Detailed alignments are shown farther down in the output
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap
10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10 20 30 40 50 60
60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70 80 90 100 110 120
120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130 140 150 160 170 180
>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap
10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :
Sequence Database Searching
Database Search Space
[Figure: query sequence on the horizontal axis, concatenated database sequence on the vertical axis]
Some matches are not meaningful because they occur VERY often in the database:
e.g. nucleotide AAA (from polyA)
Biological repeated elements (retroposons, ALU)
Low-complexity repeated patterns (CAGCAG, QQQ, KKK, …)
These elements should be FILTERED or MASKED to avoid generating false ‘hits’. It is ‘OK’ to align through them if they are near meaningful diagonal ‘hits’.
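Masking can be illustrated with a deliberately simple filter that replaces runs of one repeated letter (QQQ, KKK, AAA) with a mask character. This is a toy stand-in for real low-complexity filters, which the slides do not describe: mask_runs, min_len and mask_char are hypothetical, and repeats with a longer period such as CAGCAG are not detected by this sketch.

```python
import re


def mask_runs(seq, min_len=3, mask_char="X"):
    """Replace runs of at least min_len identical letters with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_runs("MKTQQQQLV"))                  # MKTXXXXLV
print(mask_runs("ACGTAAAAAC", mask_char="N"))  # ACGTNNNNNC
```

Words containing the mask character no longer match the dictionary, so the masked regions stop seeding hits while the rest of the sequence is searched as usual.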
Sequence Database Searching
H A S H I N G M E T H O D S
[Dot plot figure: query sequence on the horizontal axis, subject sequence on the vertical axis]
Scan the database, looking up words in the dictionary
Use word hits to determine where to search for alignments
FASTA searches in a band