ncbi molecular biology resourceshpc.ilri.cgiar.org/beca/training/imbb/lectures/ncbi_2013... ·...
TRANSCRIPT
-
NCBI Molecular Biology Resources
May 16, 2012
-
BLAST
Sequence VAST
Structure Entrez
Text
Searching the NCBI Databases
-
http://www.ncbi.nlm.nih.gov/blast
-
Why do we need sequence similarity searching?
• To identify and annotate sequences with… − incomplete (or no) annotations (GenBank) − incorrect annotations
• To assemble genomes • To explore evolutionary relationships by… − finding homologous molecules − developing phylogenetic trees
NOTE: Similar sequences may NOT have similar function! NOTE: Homologous molecules may have similar functions.
Searching with Sequences
-
Why Search Databases?
• To find out if a new DNA sequence is already fully or partially present in the databanks.
• To find homologous proteins to a putative coding ORF that might share similar 3D structure.
• to identify homology (“relatedness”) between a query and entries in a database
-
3000 Myr
1000 Myr
540 Myr
Common ancestry allows us to infer similar function
Alzheimer’s Disease
Ataxia telangiectasia
Colon cancer
Pancreatic carcinoma
Yeast Bacteria Worm Fly Human
Molecular Evolution
MLH1 MutL
-
Some Terminology
-
Searching Sequence Databases • Two sequences are homologous when
they share a common ancestry. This ancestry is reflected in strong sequence similarity.
• Computationally, threshold limits for sequence similarity can be defined by : – length of the stretch of similar sequence – percentage of identity between the
sequence – statistical measurements, like E-value, P-
value, Bit-score, etc.
-
Similarity and Homology
• Similarity can be expressed as a percentage. It does not imply any reasons for the observed sameness.
• Homology is an evolutionary term used to describe relationship via descent from a common ancestor.
• Homologous things are often similar, but not always (whale flipper human arm)
• Homology is NEVER expressed as a percentage
-
Orthologs vs Paralogs • Homologs can be separated into
two classes: orthologs and paralogs.
• Orthologs are homologous genes that perform the same function in different species.
• Paralogs are homologous genes within a species that may perform different functions.
-
Similarity and Homology • Sequence homology can be reliably inferred
from statistically significant similarity over a majority of the sequence length.
• Non-homology CANNOT be inferred from non-similarity because non-similar things can still share a common ancestor.
• Homologous proteins share common structures, but not necessarily common sequence or function (e.g. FtsZ tubulin)
• Remember: pair of sequences either is or isn't homologous. There is no such thing as “64% homologous"
-
Searching sequence databases
• When we search a sequence database, we are usually looking for related sequences.
• Unfortunately, the algorithms that we have for searching databases, do not search for homology, they search for similarity.
• When similarity is found, we must determine if this similarity is a result of homology or it if comes from another source.
-
Substitutions, Insertions, Deletions
• Mutation: one of – switch from one nucleotide to another – insertion – deletion
• Substitution: a switch in nucleotides which spreads throughout most of a species.
• Substitutions, insertions and deletions passed along two independent lines of descent cause a divergence of the two sequences from the original (and from each other):
ccctaggtccca!
cgggtatccaa!cggtatgcca!
-
Example
• For the previous example cggtatgcca→ cgggtatccaa , ccctaggtccca, the two
descendent sequences align as follows c g g g t a - - t - c c a a c c c - t a g g t c c c - a • “-” (indel) represents an insertion or
deletion.
-
Pairwise Sequence Alignments
• Purpose: • identification of sequences with significant similarity to (a)
sequence(s) in a sequence-repository • identification of all homologous sequences the repository • identification of domains with sequence similarity
• Terminology • Global alignment • Local alignment
-
Terminology: Global Alignment
• Finds the optimal alignment over the
entire length of the two compared sequences
• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous molecules
-
Terminology: Local Alignment
• short regions of similarity between a pair of sequences.
• compared sequences can receive high local similarity scores, without the need to have high levels of similarity over their entire length
• useful when looking for domains within proteins or looking for regions of genomic DNA that contain coding exons
-
Algorithms
• Needleman-Wunsch – Exhaustive global alignment – most rigorous method when aligning conserved
sequences of similar length (no exon shuffling, insertion/deletion etc)
• Smith-Waterman – Exhaustive local alignment – alignment does not have to extend along the full
length of the sequences – In contrast to N-W alignments initiating at all
possible positions of the sequence-space will be considered
– Can be very slow
-
Basic Local Alignment Search Tool
• Calculates similarity for biological sequences • Finds best local alignments • A Heuristic approach based on the
Smith-Waterman algorithm − Searches for matching “words” rather than individual
residues − Uses statistical theory to determine if a match might
have occurred by chance
-
Local vs. Global Alignment
Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84 +A + + + DL F K D+L I+ T+ W+ GR G IP+NYV + + +++ PW+ Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125 Human: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151 GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194 Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220 L+ HY +ADGLC L P Y W ++ + ++L++ IG G+FG+V G + N VA Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264 Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289 VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332 Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353 L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401 Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423 PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471 Human: 424 LDAAMRPSFLQLREQLEHI 443 D RP+F L+ +LE + Worm: 472 SDPDKRPTFETLQWKLEDL 492
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG M S .. AA SG. . .A ... . worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA 1 20 40 60
440 450 human REQLEHI--------KTHELHL . .:: . : ... worm QWKLEDLFNLDSSEYKEASINF 500
Align program (Lipman and Pearson) -a global alignment protocol-
BLASTp: protein-protein comparison -a local alignment protocol-
-
Nucleotide Words GTACTGGACATGGACCCTACAGGAA Query:
Word Size = 11 GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT ...........
Make a lookup table of words
Minimum word size = 7 blastn default = 11 megablast default = 28
-
An alignment that BLAST can’t find
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Here there are no words longer than 6…
...for nucleotides there must be an exact match of at least 7.
-
Protein Words GTQITVEDLFYNIATRRKALKN Query:
Word Size = 3
Neighborhood Words LTV, MTV, ISV, LSV, etc.
GTQ TQI QIT ITV TVE VED EDL DLF ...
Make a lookup table of words
Word Size can be 2 or 3 (default = 3)
-
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match • Protein BLAST requires two neighboring matches within 40 residues
GTQITVEDLFYNI SEI YYN
ATCGCCATGCTTAATTGGGCTT CATGCTTAATT
neighborhood words
exact word match one match
two matches
-
Some Flavors of BLAST ucleotide rotein N
NN
N
N
N
P
P
blastx
tblastn
tblastx
P P P P P P
P P P P P P P P P P P P
P P P P P P Query Database Program
blastp
blastn
P P N N
PP
-
BLAST Selection Matrix
-
Nucleotide vs. Protein BLAST
aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc H.sapiens: N R V T V V L G A Q W G D E G + + V + V L G Q W G D E G A.thaliana: S Q V S G V L G C Q W G D E G agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
Comparing ADSS from H. sapiens and A. thaliana
BLASTp finds three matching words BLASTn finds no match, because there are no 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
-
Choosing The Right BLAST Flavor for Proteins
What you Want to Do? The Right BLAST Flavor Find out something about the function of the protein
Use blastp to compare your protein with other proteins contained in the databases.
Discover new genes encoding similar proteins
Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames
Claverie & Notredame 2003
-
Choosing the Right BLAST Flavor for DNA
Questions Answer Am I interested in non coding DNA?
Yes, Use blastn. Rem: blastn is only for closely related DNA sequences (more than 70% identical)
Do I want to discover new proteins?
Yes, Use tblastx
Do I want to discover proteins encoded in my query DNA sequences?
Yes, Use blastx
Am I unsure of the quality of my DNA?
Yes, Use blastx. Especially if you suspsect your DNA sequence codes for a protein, but may contain sequencing errors.
Claverie & Notredame 2003
-
Choosing The Right BLAST Flavor for DNA Sequences
Usage Query Database Program Find very similar DNA sequence
DNA DNA blastn
Protein discovery and ESTs
Translated DNA
Translated DNA
tblastx
Analysis of query DNA sequence
Translated DNA
Protein blastx
Claverie & Notredame 2003
-
Some WWW-BLAST Databases
• nr (nt) – Traditional gb divisions – NM_ and XM_ RefSeqs
• est • htgs • gss • wgs • chromosome
– NC_ RefSeqs
• env_nt • month
• nr (non-redundant sequences) – GenBank CDS translations – NP_ RefSeqs – PIR, Swiss-Prot, PRF – PDB (sequences from structures)
• swissprot
• pat - patents
• pdb - sequences with 3D structures
• env_nr - environmental samples
• month - sequences updated within the past 30 days
Nucleotide Protein
-
Scoring Systems - Nucleotides
A G C T A +1 –3 –3 -3 G –3 +1 –3 -3 C –3 –3 +1 -3 T –3 –3 –3 +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| raw score = 19-9 = 10 CACGTAGCAAGCTTG-GTGTCA
-
Scoring Systems - Proteins Position Independent Matrices
PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved
blocks • Each matrix derived separately from blocks with a defined
percent identity cutoff • BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs) PSI- and RPS-BLAST
-
PAM250-Matrix A R N D C Q E G H I L K M F P S T W Y V
• A 2 • R -2 6 • N 0 0 2 • D 0 -1 2 4 • C -2 -4 -4 -5 12 • Q 0 1 1 2 -5 4 • E 0 -1 1 3 -5 2 4 • G 1 -3 0 1 -3 -1 0 5 • H -1 2 2 1 -3 3 1 -2 6 • I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 • L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 • K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 • M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 • F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 • P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6 • S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 • T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 • W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 • Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 • V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.
-
BLOSUM62 Substitution Matrix
Common amino acids have low weights
A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
-
Lower BLOSUM series means more divergence
Higher PAM series means more divergence
better for finding local alignments
better for finding global alignments and remote homologs
based on groups of related sequences counted as one
based on minimum replacement or maximum parsimony
Built from vast amout of data
Built from small amout of data
Built from local alignments
Built from global alignments
BLOSUM PAM
Matrix differences
-
Matrices - Rules of thumb
Need different levels of sensitivity ? – Close relationships (Low PAM number
(PAM 1) or high Blosum number, eg. 80) – Distant relationships (High PAM (e.g. PAM
250), low Blosum (BLOSUM 45)
-
Local Alignment Statistics High scores of local alignments between two random
Sequences follow the Extreme Value Distribution.
Score
Alig
nmen
ts
(applies to ungapped alignments)
E = Kmne-λS E = mn2-S’
K = scale for search space λ = scale for scoring system S’ = bitscore = (λS - lnK)/ln2
Expect Value E = number of database hits you expect to find by chance
size of database
your score
expected number of random hits
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
-
Protein BLAST page
>sorting nexin 18 [Homo sapiens] MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS
Choose your database
Accession, GI, or sequence
-
Advanced BLAST Options
Matrix Selection (protein) • PAM30 -- most stringent • BLOSUM45 -- least stringent
Example Entrez Queries green plants[Organism] srcdb_refseq[Properties] biomol_mrna[Properties] proteins_all[Filter] NOT mammalia[Organism]
Other Advanced -e 10000 expect value -v 2000 descriptions -b 2000 alignments
Limit by taxon Mus musculus[Organism] Mammalia[Organism] Viridiplantae[Organism] Expect Value Cut-off
-
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67) Length = 414 Score = 40.2 bits (92), Expect = 0.013 Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%) Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418 S++S SSS+S SS + + ++S + + S S S+ + E K Sbjct: 29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
Filtered Unfiltered
Low Complexity Filtering
-
Homology Search: Sorting Nexin 18 >Notch homolog 3 [Homo sapiens] MGPGARGRRRRRRPMSPPPPPPPVRALPLLLLLAGPGAAAPPCLDGSPCANGGRCTQLPSREAACLCPPGWVGERCQLEDPCHSGPCAGRGVCQSSVVAGTARFSCRCPRGFRGPDCSLPDPCLSSPCAHGARCSVGPDGRFLCSCPPGYQGRSCRSDVDEC
>sorting nexin 18 [Homo sapiens] MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASVQVIRAPEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPAS
-
BLAST Formatting Page
Results of a Conserved Domain Search
-
>gi|20140146|sp|Q96RF0|SNXI_HUMAN Sorting nexin 18 Length = 628 Score = 1048 bits (2710), Expect = 0.0 Identities = 528/628 (84%), Positives = 528/628 (84%) Query: 1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR Sbjct: 1 MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60 Query: 61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120 NVPPGGFE STFQ Sbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120 . . .
Low Complexity Filter
low complexity sequence
Query: 61 apepgpagdggpgaparyaNVPPGGFEplpvappasfkpppdafqallqpqqapppSTFQ 120 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ Sbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ 120
-
BLAST Formatting Page
-
BLAST Output Overview • Graphic Display: Shows you where your
query is similar to other sequences. • Hit List: Name of sequences similar to
your query ranked by similarity • Alignments: Every alignment between
your query and the reported hits • Parameters: List of the various
parameters used for the search
-
BLAST Output: Graphic Overview
SH3 PX
Results of the CD Search:
Red bar = most similar sequence Pink = almost as similar Green – even less similar Blue/Black – worse scores
mouse over, click for active links
-
BLAST Output: Descriptions
Links to Sequence Records
4 X 10-68
Expect Value Cut-off default = 10
L
G
U
S
LocusLink
UniGene
GEO
Structure
Bit scores < 50 unreliable
link to entrez
-
A Little on Interpretation • How similar must sequences be in order
to be considered homologous? • More than 25% of the amino acids
present are identical for proteins and more than 70% of the nucleotides present are identical for DNA. Above these limits, you can be sure that two proteins have same structure and same common ancestor.
• Rem: only > 100 aa or nt in length
-
A Little on Interpretation: E-value • Determine how much you can trust your
conclusion on homology. • E-value = Expectation Values • Allow for comparing pairwise alignment with
different similarities and different length. Advantage over Percent Identity (not discussed).
• Definition: Number of times your database match may have occurred by chance. Match unlikely to occur by chance is a good match. The lowest E-values (as close to 0 as possible) are the best. Thus, most significant, since we know we can trust them enough to infer homology
• If you want to be certain of homology your E-values must be below 10-4 or (0.0001).
-
Results From “nr” (protein)
non-redundant set
-
BLAST Output: Pairwise Alignments
>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-
containing protein 1) (SDP1 protein) Length = 595
Score = 255 bits (652), Expect = 4e-68
Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)
Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
SS+++ LN+F F K G E ++L A K +K+ +++G YGP W F C +
Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254
Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
DP K +K G+KSYI Y+L PT+T V+ RYKHFDWLY RL KF I +P LP+KQ
Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314
Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
TGRFEE+FI R + L WM M HPV+++ +VFQ FL + DEK WK GKRKAE+
Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
VH V+ VN
-
BLAST Output: Alignments >gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1) Length = 756 Score = 233 bits (593), Expect = 8e-62 Identities = 117/131 (89%), Positives = 117/131 (89%) Query: 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335 Query: 61 GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120 GSNSSRMYFTQTLLPGLAGPSGEMVK DKVYAHQMVRTDSREQKLDA Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395 Query: 121 FLQPLSKPLSS 131 FLQPLSKPLSS Sbjct: 396 FLQPLSKPLSS 406
low complexity sequence filtered
-
Genomic BLAST
• The BLAST homepage links to the Genome BLAST pages provide customized nucleotide and protein databases for each genome.
• If a Map Viewer is available, the BLAST hits can be viewed on the maps.
-
Identify an your sequence with BLAST
BG743989
Human EST from a Natural Killer Cell Culture:
DEFINITION: 602722761F1 NIH_MGC_106 Homo sapiens cDNA clone IMAGE:4849239 5', mRNA sequence.
-
BLAST Tips • It is faster and more accurate to BLAST
proteins (blastp) rather than nucleotides. • If in doubt use blastp. • When possible restrict to the subset of
the database you are interested in. • Look around for the database you need or
create your own custom BLAST database. BUT HOW???
• When is the best time to use the BLAST server?