NCBI FieldGuide: A Field Guide, Part 2. August 30, 2005, University of Colorado Health Sciences Center
TRANSCRIPT

A Field Guide, Part 2
August 30, 2005 University of Colorado Health Sciences Center

Part 2
Entrez: text searching
• a GenBank record
• preview/index
BLAST: sequence searching
• pre-computed searches
• algorithms
• what’s new?
VAST: structure searching
Example: mapping oligos to a genome

GenBank Records
Header
Feature Table
Sequence
The Flatfile Format

A Typical GenBank Record
LOCUS       NM_019570   4279 bp   mRNA   linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
DEFINITION = Title

GenBank Record: Feature Table

GenBank Record: Feature Table, cont.
GenPept identifier

GenBank Record: Sequence
skip

Indexing for Nucleotide UID 59958365
Field                 Abbreviation  Indexed Terms
[primary accession]   [accn]        NM_001012399
[title]                             Bos taurus hemochromatosis (hfe), mRNA.
[organism]            [orgn]        Bos taurus
[sequence length]                   1168
[modification date]   [mdat]        2005/02/19
[properties]          [prop]        biomol mrna; gbdiv mam; srcdb refseq

Global Entrez Search: HFE
HFE

Entrez Nucleotide: HFE
137 records
Not HFE [Title]

Smarter Query
hfe[title] AND human[orgn]
42 records
Curated HFE splice variants (11 total)

hfe[title] AND human[orgn] (cont.)
Primary data

Preview/Index

Preview/Index

Preview/Index: Properties, srcdb
srcdb
Properties

Preview/Index: Properties, srcdb
… AND srcdb refseq[Properties]

Preview/Index: Properties, srcdb
… AND srcdb ddbj/embl/genbank[Properties]

Database Queries
#1 hfe 137
#2 hfe[title] AND human[orgn] 42
#3 #2 AND srcdb refseq[prop] 11
#4 #2 AND srcdb ddbj/embl/genbank[prop] 31
#5 #4 AND gbdiv pri[prop] 29
#6 #4 AND gbdiv est[prop] 2
Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]
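The query history above narrows one search by AND-ing further field-qualified terms onto it. A minimal sketch of composing the same queries in code follows. It assumes NCBI's E-utilities `esearch` endpoint and the `nuccore` database name, neither of which appears in the slides; `and_terms` and `esearch_url` are hypothetical helper names, and the URL is only built, never requested.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the E-utilities esearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def and_terms(*terms):
    """Join field-qualified Entrez terms with AND, parenthesizing each one."""
    return " AND ".join(f"({t})" for t in terms)

def esearch_url(term, db="nuccore"):
    """Build (but do not send) an esearch URL for the given query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

# Queries #2, #3 and #5 from the slide, built up step by step.
q2 = and_terms("hfe[title]", "human[orgn]")
q3 = and_terms(q2, "srcdb refseq[prop]")
q5 = and_terms(q2, "srcdb ddbj/embl/genbank[prop]", "gbdiv pri[prop]")

for q in (q2, q3, q5):
    url = esearch_url(q)
    # Decode the URL again to show the term survives encoding unchanged.
    print(parse_qs(urlparse(url).query)["term"][0])
```

Parenthesizing every term is a deliberate choice here: it keeps the Boolean grouping explicit when an earlier query is reused inside a later one.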

Molecule Queries
#1 hfe 116
#2 hfe[title] AND human[orgn] 42
#3 #2 AND biomol mrna[prop] 29
#4 #2 AND biomol genomic[prop] 13
Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]

More Queries…
Fields are database-specific
Entrez Nucleotide
Reviewed RefSeqs with transcript variants:
srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
Topoisomerase genes from Archaea:
topoisomerase[gene name] AND archaea[organism]
Entrez Gene
Genes on human chromosome 2 with OMIM links:
2[chromosome] AND human[organism] AND "gene omim"[filter]
Membrane proteins linked to cancer:
"integral to plasma membrane"[gene ontology] AND cancer[dis]

Other Entrez Databases
UniSTS: markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
UniGene: rat clusters that have at least one mRNA
rat[organism] NOT 0[mrna count]
Structure: structures of bacterial kinases with resolutions below 2 Å
bacteria[organism] AND kinase AND 000.00:002.00[resolution]
SNP: uniquely mapped microsatellites on human chr2
(microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]

Basic Local Alignment Search Tool

BLAST Web Searches, 2005
200,000

Precomputed BLAST Services
Nucleotide or protein: Related Sequences
BLAST link: BLink
Transcript clusters: UniGene
Protein homologs: HomoloGene

Link to Related Sequences

Related Sequences
Most similar
Least similar

BLink (BLAST Link)

BLink Output
Best hits
3D structures
CDD-Search

Global vs Local Alignment
[Figure: Seq 1 and Seq 2 drawn twice, once as a global alignment and once as a local alignment]

Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)
Global
Seq1: 1 W--HEREISWALTERNOW 16
        W  HERE
Seq2: 1 HEWASHEREBUTNOWISHERE 21
Local
Seq1:  1 W--HERE  5
         W  HERE
Seq2:  3 WASHERE  9
Seq1:  1 W--HERE  5
         W  HERE
Seq2: 15 WISHERE 21
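The difference between the two modes can be sketched with textbook dynamic programming: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The scoring values below (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, not the scores behind the slide, and the function name `align_score` is hypothetical. BLAST itself uses a heuristic rather than this exhaustive search.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score by dynamic programming.

    local=False: Needleman-Wunsch (both sequences aligned end to end).
    local=True:  Smith-Waterman (best-scoring pair of substrings).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align two residues
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global score:", align_score(seq1, seq2, local=False))
print("local score: ", align_score(seq1, seq2, local=True))
```

With these assumed scores the best local alignment is the shared word HERE, because extending it in either direction costs more than it gains.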

The Flavors of BLAST
• Standard BLAST
– nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
– traditional “contiguous” word hit
• Megablast
– optimized for large batch searches
– can use discontiguous words
• PSI-BLAST
– constructs PSSMs automatically; uses as query
– very sensitive protein search
• RPS BLAST
– searches a database of PSSMs
– tool for conserved domain searches
“contiguous”
discontiguous

Why Is BLAST So Popular?
Fast: heuristic approach based on Smith-Waterman
Local alignments
Statistical significance: Expect value
Versatile:
– blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
– www, standalone, and network clients

How BLAST Works
• Make lookup table of “words” for query
• Scan database for hits
• Ungapped extensions of hits (initial HSPs)
• Gapped extensions (no traceback)
• Gapped extensions (traceback; alignment details)

Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
. . .
Make a lookup table of words (11-mers)
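The lookup-table step can be sketched as a dictionary from each 11-letter word of the query to the offsets where it starts, followed by a scan of the subject for exact word hits. This is an illustrative sketch only: `word_table` and `scan` are hypothetical names, the subject sequence is made up, and real BLAST uses a far more compact table.

```python
from collections import defaultdict

def word_table(query, w=11):
    """Map every length-w word in the query to its 0-based start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return dict(table)

def scan(subject, table, w=11):
    """Yield (query_offset, subject_offset) for every exact word hit."""
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            yield i, j

query = "GTACTGGACATGGACCCTACAGGAA"   # the query from the slide
table = word_table(query)
for word in list(table)[:8]:
    print(word)

# Hypothetical subject: a 13-letter piece of the query flanked by A's.
subject = "AAAA" + query[2:15] + "AAAA"
print(list(scan(subject, table)))
```

Each hit found by `scan` is a seed; the ungapped and gapped extension steps listed on the "How BLAST Works" slide start from such seeds.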

Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Neighborhood Words
LTV, MTV, ISV, LSV, etc.
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Make a lookup table of words
Word size = 3 (default)
Word size can only be 2 or 3
[ -f 11 = blastp default ]

Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa
GTQITVEDLFYNI
SEI YYN
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT
neighborhood words
one exact match
two matches
[ -A 40 = blastp default ]

BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words (neighborhood score threshold T (-f) = 11):
YLS 15, YLT 12, YVS 12, YIT 10, etc. …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …
High-scoring pair (HSP):
Query 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with traceback, final HSP:
Query 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
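Neighborhood words can be sketched by scoring every possible 3-letter word against a query word with BLOSUM62 and keeping those at or above the threshold T = 11. The matrix values below are the lower triangle of the BLOSUM62 table shown in this deck (without the X row). The function names are hypothetical, and the exhaustive enumeration of all 8,000 words is a simplification for illustration, not how BLAST builds its table.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"
# Lower triangle of BLOSUM62, one row per residue in AMINO order.
LOWER = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""
ROWS = [[int(v) for v in line.split()] for line in LOWER.strip().splitlines()]

# Expand the triangle into a symmetric lookup keyed by residue pair.
BLOSUM62 = {}
for i, row in enumerate(ROWS):
    for j, value in enumerate(row):
        BLOSUM62[AMINO[i], AMINO[j]] = value
        BLOSUM62[AMINO[j], AMINO[i]] = value

def word_score(a, b):
    """Sum of pairwise BLOSUM62 scores for two words of equal length."""
    return sum(BLOSUM62[x, y] for x, y in zip(a, b))

def neighborhood(word, threshold=11):
    """All words of the same length scoring >= threshold against `word`."""
    hits = {}
    for letters in product(AMINO, repeat=len(word)):
        candidate = "".join(letters)
        s = word_score(word, candidate)
        if s >= threshold:
            hits[candidate] = s
    return hits

neighbors = neighborhood("YLS")
for w in ("YLS", "YLT", "YVS"):
    print(w, neighbors[w])
print("YIT scores", word_score("YLS", "YIT"), "and is below the threshold")
```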

Scoring Systems - Nucleotides
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||    raw score = 19 - 9 = 10
CACGTAGCAAGCTTG-GTGTCA
[ -r 1 -q -3 ]
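The raw-score arithmetic on this slide can be reproduced in a few lines. The sketch below follows the slide's own simplification, in which the single gap column is charged the same -3 as a mismatch; a real gapped search charges gaps separately. `raw_score` is a hypothetical name.

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Score two aligned rows column by column.

    Identical letters earn `reward`; every other column, including a
    column with a gap character, costs `penalty` (the slide's simplification).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        total += reward if x == y and x != "-" else penalty
    return total

top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 19 matches, 3 non-matching columns
```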

Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST

BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
D/F: negative for less likely substitutions
Y/F: positive for more likely substitutions

Position-Specific Score Matrix
DAF-1
Serine/Threonine protein kinases catalytic loop
PSSM scores (figure labels): 1, 7, 4, 5, 4
NC
BI
Fie
ldG
uid
e
A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3
Position-Specific Score Matrix
catalytic loop

Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments).
[Plot: number of alignments versus score (S); the tail at or beyond your score is the expected number of random hits]
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bit score = $(\lambda S - \ln K)/\ln 2$
Expect Value: E = number of database hits you expect to find by chance with score ≥ S
More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
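The two forms of the Expect value are the same quantity, which a short numerical check makes concrete. In the standard Karlin-Altschul form the raw-score formula carries the scale factor λ for the scoring system: E = K·m·n·exp(-λS), with bit score S' = (λS - ln K)/ln 2. The values of λ, K, m and n below are illustrative assumptions (roughly those quoted for ungapped BLOSUM62 and a made-up search space), not numbers from the slides.

```python
import math

def bit_score(raw, lam, K):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw - math.log(K)) / math.log(2)

def expect_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw)

def expect_from_bits(bits, m, n):
    """E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)

# Assumed, illustrative parameters (not taken from the slides).
lam, K = 0.318, 0.134
m, n = 250, 1_000_000_000      # query length, database length
raw = 120

bits = bit_score(raw, lam, K)
print(f"bit score  = {bits:.1f}")
print(f"E (raw)    = {expect_from_raw(raw, m, n, lam, K):.3g}")
print(f"E (bits)   = {expect_from_bits(bits, m, n):.3g}")
```

Bit scores fold λ and K into the score itself, which is why they can be compared across searches that used different scoring systems.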

Gapped Alignments
Gapping provides more biologically realistic alignments
Gapped BLAST parameters are simulated for each scoring matrix
Affine gap costs = -(a + bk)
a = gap open penalty, b = gap extend penalty, k = gap length
A gap of length 1 receives the score -(a + b)
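The affine gap formula is small enough to state directly in code. In this sketch the default values a = 11 and b = 1 are an assumption (the usual protein-search settings with BLOSUM62), not something the slide states, and `gap_score` is a hypothetical name.

```python
def gap_score(k, a=11, b=1):
    """Score of one gap of length k under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap has length 1 or more")
    return -(a + b * k)

for k in (1, 2, 5):
    print(k, gap_score(k))

# One gap of length 4 is cheaper than four separate gaps of length 1,
# which is the point of separating "open" from "extend".
print(gap_score(4), "vs", 4 * gap_score(1))
```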

An Alignment BLAST Cannot Make
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.
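The stated reason can be checked by measuring the longest run of identical columns in the alignment. The sketch below does that for the first 60 columns shown on the slide. It only walks the one diagonal displayed there, whereas a real word search looks for shared words at every offset, and `longest_identical_run` is a hypothetical name.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical columns."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

# First 60 columns of the alignment on the slide.
top = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
bot = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"

run = longest_identical_run(top, bot)
print("longest exact run:", run)
print("seedable with word size 7:", run >= 7)
```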

An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Other BLAST Algorithms
• Megablast
• Discontiguous Megablast
• PSI-BLAST
• PHI-BLAST

Megablast: NCBI’s Genome Annotator
• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive

MegaBLAST & Word Size
Trade-off: sensitivity vs speed
Too fast for you?
WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
blastp       3         2

Discontiguous Megablast
• Uses discontiguous word matches
• Better for cross-species comparisons

Templates for Discontiguous Words
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
W = word size; # matches in template
t = template length
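How a template is applied can be sketched as follows: only the positions marked 1 must match, so two windows that differ at every 0 position still produce a hit. The two templates are copied from the list above; the example sequences are made up for illustration and the function names are hypothetical.

```python
TEMPLATES = {
    # (W, t, kind): template, copied from the slide
    (11, 16, "coding"): "1101101101101101",
    (12, 18, "coding"): "101101101101101101",
}

def template_word(seq, start, template):
    """Letters of seq under the template's '1' positions, from `start`."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(c for c, t in zip(window, template) if t == "1")

def discontiguous_hit(a, i, b, j, template):
    """True if a[i:] and b[j:] agree at every '1' position of the template."""
    return template_word(a, i, template) == template_word(b, j, template)

tmpl = TEMPLATES[11, 16, "coding"]
a = "ATGGCTAAAGTTCTGA"                      # made-up 16-mer
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
# b differs from a at every '0' position (every third base here).
b = "".join(swap[c] if t == "0" else c for c, t in zip(a, tmpl))

print(a)
print(b)
print("template hit:", discontiguous_hit(a, 0, b, 0, tmpl))
```

In the coding templates the 0 positions fall on every third base, so substitutions at a codon's third position do not prevent a hit.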

Discontiguous (Cross-species) MegaBLAST

Discontiguous Word Options

MegaBLAST vs Discontiguous MegaBLAST
NM_017460 Homo sapiens cytochrome P450, family 3, subfamily A, polypeptide 4 (CYP3A4), transcript variant 1, mRNA (2768 letters)
vs Drosophila

MegaBLAST vs Discontiguous MegaBLAST
MegaBLAST = “No significant similarity found.”
Discontiguous megaBLAST =

Another Example . . .
Query: NM_078651 Drosophila melanogaster CG18582-PA (mbt) mRNA (3244 bp)
/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2
Database: nr (nt), Mammalia[orgn]
MegaBLAST = “No significant similarity found.”
Discontiguous megaBLAST = numerous hits . . .

Ex: Discontiguous MegaBLAST

Ex: BLASTN

PSI-BLAST
Position-specific Iterated BLAST
Example: Confirming relationships of purine nucleotide metabolism proteins

PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
0.005 E value cutoff for PSSM

RESULTS: Initial BLASTP
Same results as protein-protein BLAST; different format

Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Tenth PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM

Reverse PSI-BLAST (RPS-BLAST)

Adenosine/AMP Deaminase Domain
AMP Deaminases
. . .

PHI-BLAST
>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
[GA]xxxxGK[ST]
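The pattern on this slide reads as: G or A, any four residues, then G, K, and finally S or T. A minimal sketch of locating it with a regular expression follows. It handles only the slide's shorthand, in which a lowercase x stands for any residue; full PROSITE-style syntax is not covered, and `pattern_to_regex` is a hypothetical helper. The fragment searched is a short piece of the CED-4 sequence above.

```python
import re

def pattern_to_regex(pattern):
    """Translate the slide's shorthand (x = any residue) into a regex."""
    return re.compile(pattern.replace("x", "."))

# A short fragment of the CED-4 sequence shown on the slide.
fragment = "FLFLHGRAGSGKSVIASQ"

regex = pattern_to_regex("[GA]xxxxGK[ST]")
hit = regex.search(fragment)
print(hit.group(), "at offset", hit.start())
```

PHI-BLAST uses such a pattern occurrence as the required seed for its alignments; the sketch above only covers the pattern-matching step.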

Genome BLAST

Genome BLAST via Map Viewer

Example Search Pathways: Hemochromatosis Gene
text search
“hemochromatosis”
HFE
OMIM
Gene
sequence search
nucleotide sequence
Genome BLAST
Map Viewer
SNP
Protein
Domains

Example: Human Genome BLAST
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC
Human EST

Human Genome BLAST: Results

Human Genome BLAST: MapViewer
Entrez Gene

What’s New?

BLAST Databases
Nucleotide
• refseq_rna = NM_*, XM_*
• refseq_genomic = NC_*, NG_*
• env_nt
– environmental sample[filter], e.g., 16S rRNA
Protein
• refseq = NP_*, XP_*
• env_nr

New Formatter
Select lower case Select red

New Formatter
• gray line = same database hit
• HSPs color-coded independently

BLAST Output: Alignments & Filter
low complexity sequence filtered

Advanced Options
Limit to Organism
all[filter] NOT ma
Example Entrez Queries
all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]
Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
-e 10000 -v 2000

Searching by Structure
Why search for similar structures?
• Find homologs with low sequence similarity
• Explore protein evolution: similar protein folds can support different functions
• Identify conserved core elements to model related proteins of unknown structure

Indexing into MMDB
MMDB: Molecular Modeling Data Base
Structure
• Import only experimentally determined structures
• Convert to ASN.1
• Verify sequences
• Add secondary structure
id 1 , name "helix 1" , type helix , location subgraph residues interval { { molecule-id 1 , from 49 , to 61 } } } ,
• Add chemical bonds
inter-residue-bonds { { atom-id-1 { molecule-id 1 , residue-id 1 , atom-id 1 } , atom-id-2 { molecule-id 1 , residue-id 2 , atom-id 9 } } ,
• Create “backbone” model (Cα, P only)
• Create single-conformer model

Structure Summary
Conserved Domains
3D Domain Neighbors
Structure Neighbors

3D Domains
[Figure: structure with 3D domains labeled 1, 2, 3, 4]

Conserved Domains
TyrKc
SH3
SH2

VAST: Alignment
For each protein chain, locate SSEs (secondary structure elements), represent SSEs as individual vectors, align the vectors.
[Figure: Human IL-4 with SSE vectors numbered 1 to 6; IL-4 & Leptin]

VAST
Structure neighbors
Taq DNA polymerase

VAST Results for the Chain
Table view

VAST
Vector Alignment Search Tool
3D Domain structure neighbors

VAST Results for Domain 1
Not found with Chain query!

Best way to convert PDB files to MMDB format for viewing with Cn3D!
submit file to PDB

Example: Mapping Oligos Onto a Genome
>forward
CCATGGCGACCCTGGAAAAGC
>reverse
CAGCAGCGGCTGTGCCTGCGG
?

Map Oligos Onto Genome
>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
-W 7 -e 1000
forward primer, reverse primer
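The combined query is simply the two primers joined by a run of N's, so that one search reports both primer hits and their spacing. A minimal sketch of building it follows; `oligo_query` is a hypothetical helper, and the spacer length of 10 matches the query shown on the slide.

```python
def oligo_query(forward, reverse, spacer=10):
    """Join two primers with a run of N's so one search maps both."""
    if spacer < 1:
        raise ValueError("spacer must contain at least one N")
    return forward + "N" * spacer + reverse

forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"

query = oligo_query(forward, reverse)
print(">primer_pair")   # hypothetical FASTA title
print(query)
```

Each primer is only 21 bases long, which is why the slide lowers the word size (-W 7) and raises the Expect threshold (-e 1000): short matches would otherwise be missed or filtered out.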

Genome BLAST Results

Primer Alignments
forward primer
reverse primer

MapViewer

MapViewer

Sequence View (sv)
forward
reverse

Service Addresses
• BLAST [email protected]
• General Help [email protected]
• Wayne Matten [email protected]