biological databases
![Page 1: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/1.jpg)
Biological Databases
What types of data are available?
What is a database?
What are Genbank and Entrez?
What does a typical entry look like?
How does one use the database?
BIO520 Bioinformatics Jim Lund
![Page 2: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/2.jpg)
NCBI Biological Databases
Central Dogma-o-centric
• Genomic DNA sequence
• mRNA/cDNA sequence
• Protein sequence
• Protein 3D structure
• Literature (Function)
![Page 3: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/3.jpg)
Biological Data
• Genomic DNA sequence (complete)
• mRNA/cDNA sequence
• Gene expression data (NEW)
– Microarrays, SAGE
– Expression catalogs
• Protein sequence
– Protein interaction/complex data (NEW)
• Protein 3D structure
• Literature (Function)
– Organism databases (NEW)
– Annotation and classification projects (NEW)
![Page 4: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/4.jpg)
What is a Biological Database?
An organized body of persistent data and associated computer software for updating, querying, and retrieving data records.
• Collection of records and files
• Organized for a particular purpose
• The database is separate from the interface and can have several interfaces.
– NCBI Protein can be searched by protein name or using BLAST (Basic Local Alignment Search Tool).
![Page 5: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/5.jpg)
Common database features
• Relational Databases
– Tables
– Relationships between tables
• Version Control
• Consistency enforcement
• Multiauthor/multiuser with security
![Page 6: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/6.jpg)
BIO 520 Student Database
Table: BIO520
Name   ID    Grade
Amy    123   A
Joe    456   B
Sue    789   C
Diagram labels: Table, Record (one row), Column, Value (one cell)
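The table, record, column, and value vocabulary can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the table and column names come from the slide, while the choice of SQLite and of ID as the primary key are assumptions made for illustration.

```python
import sqlite3

# In-memory database; table and column names mirror the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BIO520 (Name TEXT, ID INTEGER PRIMARY KEY, Grade TEXT)")
conn.executemany(
    "INSERT INTO BIO520 (Name, ID, Grade) VALUES (?, ?, ?)",
    [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
)

# A record is one row; a value is one cell of that row.
record = conn.execute(
    "SELECT Name, ID, Grade FROM BIO520 WHERE ID = ?", (456,)
).fetchone()
value = record[2]

# Column names, read back from the table definition.
columns = [row[1] for row in conn.execute("PRAGMA table_info(BIO520)")]

print(record, value, columns)
```

Querying by ID rather than by name reflects the usual design choice that a record is addressed by a unique identifier.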
![Page 7: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/7.jpg)
Genbank Entry
LOCUS       BC005255     495 bp    mRNA    linear   PRI 23-JUN-2006
DEFINITION  Homo sapiens insulin, mRNA (cDNA clone IMAGE:3950204), complete cds.
ACCESSION   BC005255
VERSION     BC005255.1  GI:13528923
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
FEATURES             Location/Qualifiers
     source          1..495
                     /organism="Homo sapiens"
     gene            1..495
                     /gene="INS"
                     /db_xref="GeneID:3630"
     CDS             60..392
                     /gene="INS"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                     YQLENYCN"
ORIGIN
        1 agccctccag gacaggctgc atcagaagag gccatcaagc agatcactgt ccttctgcca
…
      421 ccgcctcctg caccgagaga gatggaataa agcccttgaa ccaacaaaaa aaaaaaaaaa
      481 aaaaaaaaaa aaaaa
//
![Page 8: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/8.jpg)
The CORE: DDBJ, EMBL, and Genbank
![Page 9: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/9.jpg)
Genbank DNA Sequence Database
• Genbank/EMBL/DDBJ mirror & exchange sequence records.
• Primary vs. Secondary Databases
– nr (non-redundant database)
• Primary vs. secondary records
– Sequence vs. inferred property (coding region)
![Page 10: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/10.jpg)
Primary vs. Derivative Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
– Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
– Examples: Refseq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
![Page 11: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/11.jpg)
A Traditional GenBank Record
The Flatfile Format (formatted text)
Header:
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
            eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
Feature Table:
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
Sequence:
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
![Page 12: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/12.jpg)
Genbank Entry
LOCUS PCU30791 1234 bp mRNA PLN 31-MAY-1996
DEFINITION Pneumocystis carinii carinii form 6 guanine nucleotide binding protein alpha subunit (pcg1) mRNA, complete cds.
ACCESSION U30791
NID g1345098
VERSION U30791.1 GI:1345098
Unique ID
Version Control
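The VERSION line pairs the stable accession with a version number that changes when the sequence is updated, which is how the record supports both a unique ID and version control. A minimal sketch of splitting the two parts, using values quoted on the slides (the helper name `split_version` is hypothetical):

```python
def split_version(version_field):
    """Split an 'accession.version' string such as 'U30791.1'."""
    accession, _, version = version_field.partition(".")
    return accession, int(version)

# Values taken from the VERSION lines shown on the slides.
pcg1 = split_version("U30791.1")
afs1 = split_version("AY182241.2")   # second version of the apple record

print(pcg1, afs1)
```

Two identifiers refer to the same record when their accessions match. They refer to the same sequence only when the version numbers match as well.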
![Page 13: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/13.jpg)
Content-Taxonomy
SOURCE Pneumocystis carinii f. sp. carinii.
ORGANISM Pneumocystis carinii f. sp. carinii Eukaryota; Fungi; Ascomycota; Archiascomycetes; Pneumocystidaceae; Pneumocystis.
![Page 14: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/14.jpg)
Reference
REFERENCE 1 (bases 1 to 1234)
AUTHORS Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
TITLE Signal transduction in Pneumocystis carinii: characterization of the genes (pcg1) encoding the alpha subunit of the G protein (PCG1) of Pneumocystis carinii carinii and Pneumocystis carinii ratti
JOURNAL Infect. Immun. 64 (3), 691-701 (1996)
PUBMED 96186460
• Unique cross reference
• Can be >1 reference
![Page 15: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/15.jpg)
Features
FEATURES Location/Qualifiers
source 1..1234
/organism="Pneumocystis carinii f. sp. carinii"
/strain="Form 6"
/note="450 kb chromosome"
/db_xref="taxon:38081"
5'UTR 1..90
gene 91..1155
/gene="pcg1"
![Page 16: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/16.jpg)
CDS
CDS 91..1155
/gene="pcg1"
/note="G-protein alpha subunit"
/codon_start=1
/product="guanosine nucleotide binding protein alpha subunit"
/protein_id="AAC49295.1"
/db_xref="PID:g1345099"
/db_xref="GI:1345099"
/translation="MGCCFSATYNQDTLRSKEIESYLRQEQEHACHEAKILLLGAGES…
Slide callouts: Related info in another database; INFERRED
![Page 17: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/17.jpg)
DNA
BASE COUNT 421 a 171 c 195 g 447 t
ORIGIN
1 tgaattctaa attttatatt …
1201 … tattttttta tgctccagat aaaa
//
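The BASE COUNT line is a simple tally of the sequence that follows ORIGIN, so the four counts should sum to the length on the LOCUS line (1234 bp for this entry). A minimal sketch of the tally; only the first 20 bases shown on the slide are counted here, because the rest of the sequence is elided:

```python
from collections import Counter

def base_count(seq):
    """Count a, c, g and t in a nucleotide sequence (case-insensitive)."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}

# Counts quoted on the slide for U30791.
slide_counts = {"a": 421, "c": 171, "g": 195, "t": 447}
total = sum(slide_counts.values())

# First two 10-base blocks from the ORIGIN line, spaces removed.
fragment = "tgaattctaa attttatatt".replace(" ", "")
fragment_counts = base_count(fragment)

print(total, fragment_counts)
```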
![Page 18: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/18.jpg)
Genbank entries
• Combination of required (LOCUS, SOURCE) and optional fields.
– The entry is hierarchical; some fields contain subfields.
– Example: REFERENCE -> AUTHORS
• Some fields can appear multiple times (REFERENCE, /gene)
• Some fields are numerical, others are text. Some fields contain free text, others use a controlled vocabulary or a database ID.
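The hierarchy and repetition described above can be seen by parsing the header of a flatfile. The sketch below is not a full GenBank parser. It assumes the usual flatfile layout: the keyword occupies the first 12 columns, subkeywords are indented, and continuation lines have no keyword. The sample text is copied from the AY182241 record shown earlier; the point where the TITLE line wraps is an assumption.

```python
from collections import defaultdict

SAMPLE = """\
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
"""

def parse_header(text):
    """Return {keyword: [entry, ...]}; each entry has 'value' and 'sub' fields."""
    fields = defaultdict(list)
    entry = None    # most recent top-level entry
    target = None   # (dict, key) that a continuation line extends
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key and not line[0].isspace():      # top-level keyword at column 0
            entry = {"value": value, "sub": {}}
            fields[key].append(entry)
            target = (entry, "value")
        elif key and entry is not None:        # indented subkeyword
            entry["sub"][key] = value
            target = (entry["sub"], key)
        elif target is not None:               # continuation line
            container, name = target
            container[name] += " " + value
    return dict(fields)

record = parse_header(SAMPLE)
print(len(record["REFERENCE"]), record["REFERENCE"][0]["sub"]["AUTHORS"])
```

Each keyword maps to a list because fields such as REFERENCE can repeat; single-occurrence fields simply yield a list of length one.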
![Page 19: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/19.jpg)
Other Genbank output formats
• FASTA
– Simple, little annotation information
– Easy to use
– Common denominator format
• ASN.1
– Computer friendly, human unfriendly
• XML, INSDSeqXML, TinySeqXML
• Graph (graphical map of seq features)
…and more
![Page 20: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/20.jpg)
DNA Sequence Files: Common formats
• Genbank (used by VectorNTI)
• FASTA
• GCG
– Accelrys GCG (Genetics Computer Group) package
– formerly GCG Wisconsin Package
Many others!
![Page 21: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/21.jpg)
FASTA
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
One annotation line only!
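Because FASTA has a single annotation line followed by raw sequence, a reader for it is short. A minimal sketch using the record on this slide; the function name `read_fasta` is hypothetical, and the pipe-delimited header layout is read off the example rather than from a specification:

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

FASTA = """\
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
"""

records = list(read_fasta(FASTA))
header, seq = records[0]
# Header fields as they appear in this example: gi number, database tag,
# accession.version, locus name.
_, gi, db, accession, locus = header.split("|")

print(gi, db, accession, locus, len(seq))
```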
![Page 22: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/22.jpg)
Submitting sequences to Genbank
• Sequin
– Stand-alone sequence submission tool.
• BankIt
– Web-based sequence submission.
![Page 23: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/23.jpg)
Genbank is an ARCHIVE
• The literature and secondary databases are the knowledge sources.
• There are many additional NCBI annotation databases.
![Page 24: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/24.jpg)
NCBI annotation databases!
• Genbank -> RefSeq (Single sequence for each gene)
• Entrez Gene (Gene-based links to annotation sources).
• HomoloGene (Homologs)
• OMIM
• Conserved domains, 3D domains
• GEO (Gene expression datasets)
• DNA, protein, 3D structures
• Interaction data
• Links to other databases!
• NCBI Genomes
• NCBI Map viewer
![Page 25: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/25.jpg)
Finding and editing DNA files
• Find DNA: Entrez
• Downloading files
• Format Conversion
• Sequence viewing/editing
![Page 26: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/26.jpg)
Entrez
• Database searching/browsing
• Example: Pneumocystis G-proteins
– PCR a cDNA to express in E. coli
– Read about it and related genes
– Check similarity to related G-proteins
– View the 3D structure??
• http://www.ncbi.nlm.nih.gov/Entrez/
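Entrez searches can also be scripted. The sketch below only builds a query URL and does not contact NCBI. It assumes NCBI's E-utilities `esearch` endpoint, which these slides do not cover, and the search term is an illustrative guess at how the Pneumocystis G-protein example might be phrased.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned on the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database and a query term."""
    return EUTILS + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

# Illustrative query only; field tags and wording are assumptions.
url = esearch_url("protein", "Pneumocystis carinii[Organism] AND G protein")
print(url)
```

Changing `db` to another Entrez database name sends the same term to that database, which mirrors the slide's point that one query system fronts many data sources.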
![Page 27: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/27.jpg)
Entrez Neighbors-Protein
[Diagram: a Protein record and its Entrez neighbors. Nodes: Protein, Literature, DNA, 3D Structure. Link labels: BLASTP, citation, encoding.]
![Page 28: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/28.jpg)
Mapping the menagerie of biological databases
![Page 29: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/29.jpg)
Nucleic Acid Manipulations
• On the web:
– Baylor Human Genome Center (BCM) http://searchlauncher.bcm.tmc.edu/seq-util/seq-util.html
– European Bioinformatics Institute (EBI) http://www.ebi.ac.uk/Tools/misc.html
![Page 30: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/30.jpg)
DNA/Protein sequence format conversion
• Readseq
– Download program:
– http://iubio.bio.indiana.edu/soft/molbio/readseq
– Use online:
– http://www.ebi.ac.uk/cgi-bin/readseq.cgi
– http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html
Beware Information Loss!
![Page 31: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/31.jpg)
Reverse Complementing
5’-GAATCA-3’
5’-TGATTC-3’ (reverse complement)
NOT
5’-ACTAAG-3’ (reversed only)
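Reverse complementing takes two steps: complement each base, then reverse the order so the result again reads 5’ to 3’. A minimal sketch for unambiguous DNA letters only (IUPAC ambiguity codes are not handled):

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

result = reverse_complement("GAATCA")
reversed_only = "GAATCA"[::-1]   # the common mistake: reversing without complementing

print(result, reversed_only)
```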
![Page 32: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/32.jpg)
Sequence Statistics
• Nucleotide frequencies (di, tri…)
• UV Absorbance
• MW (molecular weight)
• Tm (melting temperature)
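Two of these statistics are easy to compute directly. The sketch below counts k-mer frequencies (k=2 for dinucleotides, k=3 for trinucleotides) and estimates Tm with the Wallace rule, 2 °C per A/T plus 4 °C per G/C. That rule is an assumption swapped in for illustration: it is a rough rule of thumb for short oligonucleotides, not the method used by any particular tool named on these slides.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(oligo):
    """Rule-of-thumb Tm in degrees C for a short oligo: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc

dinucleotides = kmer_frequencies("GAATCA", 2)
tm = wallace_tm("GAATCA")
print(dinucleotides, gc_fraction("GAATCA"), tm)
```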
![Page 33: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/33.jpg)
Restriction Map
• Linear vs Circular
• Enzyme sets
– Which enzymes, where they cut.
• Gel simulation
– Gel-to-map MUCH harder!!
• Useful for:
– Cloning
– Southern blots
– Specialized mol bio techniques
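The linear versus circular distinction matters because a recognition site can span the origin of a circular molecule. A minimal sketch that locates recognition sites by exact string matching; the test sequence is made up for illustration, and EcoRI's site GAATTC is used as the example enzyme:

```python
def find_sites(seq, site, circular=False):
    """Return 1-based start positions of a recognition site in seq."""
    seq, site = seq.upper(), site.upper()
    n = len(seq)
    # For a circular molecule, extend past the origin so that sites
    # spanning the junction are found as well.
    search = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = search.find(site)
    while start != -1 and start < n:
        positions.append(start + 1)
        start = search.find(site, start + 1)
    return positions

# Hypothetical 18-base sequence: one internal GAATTC, plus one that only
# exists when the ends are joined (…GAA + TTC…).
plasmid = "TTCGGGGAATTCCCCGAA"
linear_sites = find_sites(plasmid, "GAATTC")
circular_sites = find_sites(plasmid, "GAATTC", circular=True)

print(linear_sites, circular_sites)
```

A full restriction map would also need each enzyme's cut offset within its site and, for a gel simulation, the resulting fragment lengths. Both are left out of this sketch.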
![Page 34: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/34.jpg)
Translation/ORFs
• Translation table
– Standard vs non-standard
• Frame (1,2,3,4,5,6)
• Segmental translation (exon-intron)
• Primary translation vs mature polypeptide
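The translation table and the six frames can be sketched as follows. The standard genetic code is built from the usual 64-letter string in TCAG order, and the vertebrate mitochondrial code is included as one example of a non-standard table. Segmental (exon-intron) translation and processing to the mature polypeptide are not covered. The test sequence is the start of the pcg1 coding region (base 91 onward) taken from the FASTA slide.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with the third base varying fastest.
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
# One non-standard example: the vertebrate mitochondrial code.
VERT_MITO = {**STANDARD_CODE, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq, table=STANDARD_CODE):
    """Translate complete codons from the first base; unknown codons become X."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

def six_frames(seq):
    """Frames 1-3 read the given strand; frames 4-6 read its reverse complement."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[offset + 4] = translate(rev[offset:])
    return frames

cds_start = "ATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCC"
protein = translate(cds_start)
frames = six_frames("ATGGCC")
print(protein, frames)
```

The result for `cds_start` agrees with the first residues of the /translation qualifier shown on the CDS slide, which is a useful sanity check on both the table and the frame.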
![Page 35: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/35.jpg)
Sequence Annotation and Editing
• Text editor
– Notepad
– Word processor
– vi
Nonproportional fonts (courier, monospaced…) keep sequence columns aligned:
MWGTCC
IIIIII
• Artemis
• Sequin
– NCBI’s Genbank entry creation/viewing tool
![Page 36: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/36.jpg)
Primer design program: Primer3
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
![Page 38: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/38.jpg)
Other NCBI Databases
• Structure: imported structures (PDB)
– Cn3D viewer, NCBI curation
• CDD: conserved domain database
– Protein families (COGs and KOGs)
– Single domains (PFAM, SMART, CD)
• dbSNP: nucleotide polymorphism
• Gene: gene records
– Unifies LocusLink and Microbial Genomes
![Page 39: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/39.jpg)
Homologene Cluster
![Page 40: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/40.jpg)
Entrez Protein: Derivative Database
Data Source                          Sequences
GenPept                              6,937,176
RefSeq                               3,359,561
Third Party Annotation               5,136
Swiss Prot                           255,159
PIR                                  29,996
PRF                                  12,079
PDB                                  91,116
PAT Division                         669,035
Total                                10,690,223
BLAST nr total (no patents or env)   4,545,310
![Page 41: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/41.jpg)
Redundant Proteins
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
Sources: GenPept (gb), NCBI RefSeq (ref), Swiss-Prot (sp), PRF (prf)
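Redundancy of this kind can be collapsed by grouping records on identical sequence, which is the idea behind a non-redundant set. A minimal sketch, not NCBI's actual nr procedure: the sequences on the slide are truncated, so only the visible first line is used as a stand-in, and the last record is a made-up contrast case.

```python
from collections import defaultdict

def group_identical(records):
    """Map each distinct sequence to the list of headers that carry it."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[seq.upper()].append(header)
    return dict(groups)

# Truncated prefix as shown on the slide; the full sequence is not reproduced.
MLH1_PREFIX = "MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV"

records = [
    ("gi|741682|prf||2007430A", MLH1_PREFIX),
    ("gi|730028|sp|P40692|MLH1_HUMAN", MLH1_PREFIX),
    ("gi|4557757|ref|NP_000240.1|", MLH1_PREFIX),
    ("hypothetical|example", "MKT"),   # invented record, for contrast only
]

groups = group_identical(records)
for seq, headers in groups.items():
    print(len(headers), headers)
```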
![Page 42: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/42.jpg)
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human, mouse, rat, chicken, honeybee, sea urchin, zebrafish, cow, dog, black poplar
• Chromosome records
– Human genome
– microbial
– organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
![Page 43: Biological Databases](https://reader035.vdocument.in/reader035/viewer/2022081503/5681492a550346895db6627b/html5/thumbnails/43.jpg)
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 also Microbial replicons, organelle genomes, human chromosomes
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
http://www.ncbi.nlm.nih.gov/RefSeq/key.html
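Since the two-letter prefix carries the record type, an accession can be classified with a lookup table. A minimal sketch covering only the prefixes listed on this slide; the function name `classify_refseq` is hypothetical:

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Chromosome (also microbial replicons, organelle genomes)",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}

def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if not recognized."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None   # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())

print(classify_refseq("NP_000240.1"), classify_refseq("U30791.1"))
```

The underscore check is what separates RefSeq accessions from GenBank accessions such as U30791, which have no underscore.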