TRANSCRIPT
Database Similarity Search
Sequences that are similar probably have the same function
Why do we care to align sequences?
new sequence
?
Sequence Database
≈ Similar function
Discover Function of a new sequence
Searching Databases for similar sequences
Naïve solution: Use an exact algorithm to compare each sequence in the database to the query.
Is this reasonable?
How much time will it take to calculate?
Complexity for genomes
• The human genome contains 3 × 10^9 base pairs.
– Searching an mRNA against the human genome requires ~10^12 cells of the alignment matrix.
– Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
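The cell count above can be checked with a few lines of arithmetic. This is a rough sketch only: the mRNA length (1,000 bases) and the throughput of 10^9 cells per second are assumptions chosen for illustration, not figures from the slides.

```python
def dp_cells(query_len: int, target_len: int) -> int:
    """Number of cells in a full dynamic-programming alignment matrix."""
    return query_len * target_len


GENOME_LEN = 3 * 10**9       # human genome size quoted on the slide
MRNA_LEN = 1_000             # assumed typical mRNA length (illustrative)
CELLS_PER_SECOND = 10**9     # assumed throughput (illustrative)

cells = dp_cells(MRNA_LEN, GENOME_LEN)
seconds = cells / CELLS_PER_SECOND
print(f"{cells:.1e} cells, about {seconds / 60:.0f} minutes per query")
```

Under these assumptions a single query fills on the order of 10^12 cells, which is why repeating an exact search millions of times is impractical.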
So what can we do?
Searching databases
Solution: Use a heuristic (approximate) algorithm
Heuristic strategy
Reduce the search space
Remove regions that are not useful for meaningful alignments
Perform efficient search strategies
Preprocess the database into a new data structure to enable fast access
What sequences to remove?
• AAAAAAAAAAA
• ATATATATATATA
• Transposable elements
53% of the genome is repetitive DNA
Low complexity sequences (JUNK???)
Low Complexity Sequences
What's wrong with them?
• Not informative
• Produce artificially high-scoring alignments
So what do we do?
We apply low-complexity masking to the database and the query sequence.
Mask:
TCGATCGTATATATACGGGGGGTA
TCGATCGNNNNNNNNCNNNNNNTA
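The masking step can be illustrated with a toy filter. This is a minimal sketch, not the filter any real BLAST implementation uses: it assumes "low complexity" means a 1- or 2-letter unit repeated at least four times in a row, and the function name and threshold are hypothetical.

```python
import re

# A 1- or 2-letter unit followed by at least three more copies of itself.
# Assumption for illustration only; real filters use statistical measures.
LOW_COMPLEXITY = re.compile(r"([ACGT]{1,2}?)\1{3,}")


def mask_low_complexity(seq: str) -> str:
    """Replace simple repeats with N, keeping the sequence length unchanged."""
    return LOW_COMPLEXITY.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

On the slide's example this masks the TATATATA and GGGGGG runs and leaves the rest untouched, so the masked positions can no longer seed spurious matches.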
Heuristic strategy
• Remove low-complexity regions that are not useful for meaningful alignments
• Perform efficient search strategies
Preprocess the database into a new data structure to enable fast access
BLAST Basic Local Alignment Search Tool
• General idea - a good alignment contains subsequences of high identity (local alignment):
ACGCCCGGGAGCGC
CTGGGCGTATAGCCC
– First, identify (as efficiently as possible) short, almost exact matches.
– Next, extend them to longer regions of similarity.
– Finally, optimize the alignment using an exact algorithm.
Altschul et al., 1990
DNA/RNA vs protein alphabet
DNA(4)
A T G C
RNA(4)
A U G C
Protein (20)
ACDEFGHIKLMNPQRSTVWY
Mismatches: DNA: A→T = A→G …; RNA: A→U = A→G …; Protein: A→G >> A→W …
WHY is it different?
The 20 Amino Acids
A
W
G
Scoring system for amino acid mismatches
BLAST (Protein Sequence Example)
First, identify (most efficiently) short almost exact matches between the query sequence and the database.
Query sequence …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA
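Splitting the query into overlapping words is a simple sliding window. A minimal sketch (the function name is hypothetical):

```python
def words(seq: str, k: int = 3) -> list[str]:
    """Return all overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(words("FSGTWYA"))
```

A sequence of length L yields L − k + 1 words, so the 7-residue fragment above gives five words of length 3.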
BLAST
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
…
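The extra words listed above each differ from a query word at one position. A toy way to generate such "almost exact" words is sketched below, under a loud assumption: here every single-residue substitution counts as a neighbor, whereas a scoring-matrix-based approach would keep only substitutions that score above some threshold. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_mismatch_neighbors(word: str) -> set[str]:
    """Return the word itself plus every word that differs at exactly one position.

    Assumption for illustration: all substitutions are treated as equally
    acceptable, which ignores the amino acid scoring system.
    """
    neighbors = {word}
    for i, original in enumerate(word):
        for aa in AMINO_ACIDS:
            if aa != original:
                neighbors.add(word[:i] + aa + word[i + 1:])
    return neighbors


fsg = one_mismatch_neighbors("FSG")
print(len(fsg), "YSG" in fsg, "FTG" in fsg)
```

For a word of length 3 this gives 1 + 3 × 19 = 58 candidates, which already shows why the word list grows quickly and why a score threshold is useful for pruning it.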
Preprocessing of the database
Seq 1 FSGTWYA: FSG, SGT, GTW, TWY, WYA
Seq 2 FDRTSYV: FDR, DRT, RTS, TSY, SYV
Seq 3 SWRTYVA: SWR, WRT, RTY, TYV, YVA
…
Seq 3546
Seq 102
Seq 1 BAG OF WORDS
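One way to picture the preprocessed database is as a lookup table from each word to the places where it occurs. The sketch below assumes a plain in-memory dictionary; the function name and the (sequence name, position) layout are illustrative choices, not a description of how any real tool stores its index.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)
print(index["GTW"])
```

The index is built once, so each later query only pays for dictionary lookups rather than a scan of every database sequence.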
BLAST
Query sequence …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA…
DATABASE
FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS….
SEQ N INVIEIAFDGTWTCATTNAMHEWASNINETEEN
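Finding where query words hit a database sequence can be sketched as follows. For brevity this version looks for exact shared words only (no neighborhood words), and the function name is hypothetical.

```python
def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query position, subject position) pairs where a k-letter word is shared."""
    # Index the subject's words once.
    subject_words: dict[str, list[int]] = {}
    for pos in range(len(subject) - k + 1):
        subject_words.setdefault(subject[pos:pos + k], []).append(pos)

    # Look up each query word in that index.
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in subject_words.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


print(find_seeds("FSGTWYA", "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"))
```

For the fragment FSGTWYA the only shared word is GTW, found at query offset 2 and database offset 9 (both counted from 0).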
BLAST
2. Extend word pairs as much as possible, i.e., as long as the total score increases
High-scoring Segment Pairs (HSPs)
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
Q= query sequence, D= sequence in database
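The extension of a word pair into a high-scoring segment pair can be sketched as below. Several things here are assumptions for illustration: identity scoring (+2 for a match, −1 for a mismatch) stands in for an amino acid scoring matrix, and extension stops once the running score falls more than `x_drop` below the best score seen, which is one common way to make "as long as the total score increases" tolerant of isolated mismatches. The function name is hypothetical.

```python
def extend_seed(query, subject, q_pos, s_pos, k, match=2, mismatch=-1, x_drop=2):
    """Extend an ungapped word hit in both directions; return (query part, subject part, score)."""
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair_score(query[q_pos + i], subject[s_pos + i]) for i in range(k))

    # Extend to the right, remembering the longest extension that improved the score.
    running, right, i = best, 0, 0
    while q_pos + k + i < len(query) and s_pos + k + i < len(subject):
        running += pair_score(query[q_pos + k + i], subject[s_pos + k + i])
        i += 1
        if running > best:
            best, right = running, i
        elif best - running > x_drop:
            break

    # Extend to the left in the same way, starting from the best score so far.
    running, left, j = best, 0, 0
    while q_pos - j - 1 >= 0 and s_pos - j - 1 >= 0:
        running += pair_score(query[q_pos - j - 1], subject[s_pos - j - 1])
        j += 1
        if running > best:
            best, left = running, j
        elif best - running > x_drop:
            break

    length = left + k + right
    q_start, s_start = q_pos - left, s_pos - left
    return query[q_start:q_start + length], subject[s_start:s_start + length], best


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
print(extend_seed(Q, D, q_pos=12, s_pos=9, k=3))
```

Starting from the GTW hit (query offset 12, database offset 9), this toy scoring extends the segment leftwards to pair INIHFSGTW with IEIAFDGTW. With a real scoring matrix the boundaries would generally differ.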
3. Finally, optimize the alignment using an exact algorithm.
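The "exact algorithm" in this last step is a dynamic-programming local alignment. A compact Smith-Waterman scorer is sketched below as one example of such an algorithm; the identity scoring and the linear gap cost are assumptions chosen to keep the code short, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b with a linear gap cost."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Because the heuristic steps have already narrowed the search to a short region, this quadratic-time computation only has to run on a small matrix.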
Running BLAST to predict a function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKGIGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQFGSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPFGCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKKLAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTALPGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score:
•The score is a measure of the similarity of the query to the sequence shown.
How do we know if the score is significant?
-Statistical significance
-Biological significance
The expectation value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
page 105
How to interpret a BLAST search:
For each blast score we can calculate an expectation value (E-value)
BLAST E-value:
E = K · m · n · e^(−λs)
– Increases linearly with the length of the query sequence
– Increases linearly with the length of the database
– Decreases exponentially with the score of the alignment
– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies
– m = length of query; n = length of database; s = score
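The three dependencies listed above can be checked numerically, assuming the usual form E = K · m · n · e^(−λs). In the sketch below the values of K and λ are placeholders chosen only for illustration; they are not taken from any real scoring system.

```python
import math


def e_value(score: float, query_len: int, db_len: int, K: float, lam: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


K, LAM = 0.1, 0.3  # placeholder parameters (assumption)

base = e_value(50, query_len=100, db_len=1_000_000, K=K, lam=LAM)
double_db = e_value(50, query_len=100, db_len=2_000_000, K=K, lam=LAM)
higher_score = e_value(60, query_len=100, db_len=1_000_000, K=K, lam=LAM)

print(base, double_db / base, higher_score / base)
```

Doubling the database doubles E, while raising the score by 10 multiplies E by e^(−10λ). This is why the same score is less convincing in a larger database.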
What is a good E-value? (Rule of thumb)
• E-values of less than 0.00001 show that sequences are almost always related.
• Greater E-values can represent functional relationships as well.
• Sometimes a real (biological) match has an E-value > 1.
• Sometimes a similar E-value occurs for a short exact match and a long, less exact match.
Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Sometimes corrections to the model are needed to infer biological significance
Gap Scores
• Standard solution: affine gap model
w_x = g + r(x − 1)
w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length
– One-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to the algorithm
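The affine formula is easy to evaluate directly. In this sketch the penalty is returned as a positive cost, and the values g = 10 and r = 1 are illustrative assumptions rather than recommended settings.

```python
def affine_gap_penalty(x: int, g: int, r: int) -> int:
    """Total penalty w_x = g + r(x - 1) for a gap of length x (x >= 1)."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


# One long gap (e.g. the 27-base lowercase stretch in the DNA example) versus
# 27 separate single-base gaps, using illustrative penalties g=10, r=1.
one_long_gap = affine_gap_penalty(27, g=10, r=1)
many_short_gaps = 27 * affine_gap_penalty(1, g=10, r=1)
print(one_long_gap, many_short_gaps)
```

A single long gap costs far less than many scattered ones, which matches the biology of the example: an intron is one insertion event, not 27 independent ones.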
Gapped BLAST
4. Connect several HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.
BLAST is a family of programs
Query: DNA Protein
Database: DNA Protein