Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Upload: morgan-kelley

Post on 12-Jan-2016


Page 1: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Introduction to Bioinformatics databases: Nucleic Acid

Databases

Neha Jain

Page 2: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

What is a Database?

• General:
– A database is any collection of related data.
– A computerized archive used to store and organize data in such a way that information can be retrieved easily.

• A database is a collection of interrelated data stored together without harmful or unnecessary redundancy (duplicate data) to serve multiple applications.

• Retrieving data is called firing a query.

Page 3: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

DATABASE SYSTEM

A database system is an integrated collection of related files, along with the details of their definition, interpretation, manipulation and maintenance.

A database system is built around the data it holds and is run using software called a DBMS (Database Management System): a collection of programs that enables users to create and maintain a database. The database system also protects the data from unauthorized access.

Page 4: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

What Does a DBMS Do?

Database management systems provide several functions in addition to simple file management:
• allow concurrency
• control security
• maintain data integrity
• provide for backup and recovery
• control redundancy
• allow data independence
• provide a non-procedural query language
• perform automatic query optimization

What is a relational database?
• a database that treats all of its data as a collection of relations
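To make the relational idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `sequence_entry`, its columns and the three rows are invented for illustration and do not come from any real database schema. The SQL query is non-procedural: it states which rows are wanted, and the DBMS decides how to find them. The primary key is one way a DBMS can control redundancy and maintain integrity.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database holding one relation (table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence_entry ("
        " accession TEXT PRIMARY KEY,"  # primary key rejects duplicate entries
        " organism TEXT NOT NULL,"
        " length INTEGER NOT NULL)"
    )
    # Hypothetical rows, for illustration only.
    rows = [
        ("DEMO0001", "Homo sapiens", 1200),
        ("DEMO0002", "Mus musculus", 950),
        ("DEMO0003", "Homo sapiens", 300),
    ]
    conn.executemany("INSERT INTO sequence_entry VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def accessions_for(conn, organism):
    """Fire a query: declare what is wanted, not how to search for it."""
    cur = conn.execute(
        "SELECT accession FROM sequence_entry"
        " WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    demo = build_demo_db()
    print(accessions_for(demo, "Homo sapiens"))
```

Trying to insert a second row with accession `DEMO0001` raises `sqlite3.IntegrityError`, which is the DBMS enforcing integrity rather than the application.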

Page 5: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Biological databases: why?

• Need for storing and communicating large datasets has grown

• Make biological data available to scientists.

• To make biological data available in computer-readable form.

Page 6: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Different classifications of databases

• Type of data
– nucleotide sequences
– protein sequences
– protein sequence patterns or motifs
– macromolecular 3D structure
– gene expression data
– metabolic pathways

Page 7: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Different classifications of databases….

• Primary or derived databases
– Primary databases: experimental results entered directly into the database
– Secondary databases: results of analysis of primary databases
– Aggregate of many databases
  • Links to other data items
  • Combination of data
  • Consolidation of data

Page 8: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Different classifications of databases….

• Availability
– Publicly available, no restrictions
– Available, but with copyright
– Accessible, but not downloadable
– Academic, but not freely available
– Proprietary, commercial; possibly free for academics

Page 9: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


NCBI and Entrez

• NCBI (National Center for Biotechnology Information) hosts one of the largest and most comprehensive collections of databases; it belongs to the NIH, the National Institutes of Health (USA)

• Entrez is the search engine of NCBI

• Search for: genes, proteins, genomes, structures, diseases, publications and more.

• http://www.ncbi.nlm.nih.gov/

Page 10: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Primary Databases

• These databases contain the raw sequence data (nucleic acid or protein) produced and submitted by researchers worldwide.

• Nucleic acid
– EMBL
– GenBank
– DDBJ (DNA Data Bank of Japan)

• Protein
– PIR
– MIPS
– SWISS-PROT
– TrEMBL
– NRL-3D

Page 11: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Nucleotide sequence databases

• EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases

• EMBL www.ebi.ac.uk/embl/

• GenBank www.ncbi.nlm.nih.gov/Genbank/

• DDBJ www.ddbj.nig.ac.jp

• Together they constitute the International Nucleotide Sequence Database Collaboration.

Page 12: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

GenBank

• An annotated collection of all publicly available nucleotide and protein sequences

• Set up in 1979 at LANL (Los Alamos).

• Maintained since 1992 by NCBI (Bethesda).

• http://www.ncbi.nlm.nih.gov

Page 13: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

GenBank file format

Page 14: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

GenBank file format
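The slides that showed the format itself are not part of this transcript. As a rough sketch of how such a flat file is organised: a record starts with a LOCUS line, carries keyword fields such as DEFINITION and ACCESSION, lists the sequence after ORIGIN, and ends with `//`. The record below is a made-up toy (the accession `DEMO0001` and all its values are not real), and the parser is an assumption-laden illustration that handles only these few single-line keywords, not features, references or continuation lines.

```python
# A made-up toy record; none of its values come from a real GenBank entry.
TOY_RECORD = """\
LOCUS       DEMO0001      24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Toy record for illustration only.
ACCESSION   DEMO0001
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""


def parse_genbank_record(text):
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith(("LOCUS", "DEFINITION", "ACCESSION")):
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
    return record


if __name__ == "__main__":
    rec = parse_genbank_record(TOY_RECORD)
    print(rec["accession"], len(rec["sequence"]))
```

For real work an established parser is the safer choice; this sketch only shows why a keyword-per-line flat file is easy for a program to read.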

Page 15: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

EMBL Nucleotide Sequence Database

• An annotated collection of all publicly available nucleotide and protein sequences

• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.

• Maintained since 1994 by the EBI (Cambridge).

• http://www.ebi.ac.uk/embl.html

Page 16: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

DDBJ–DNA Data Bank of Japan

• An annotated collection of all publicly available nucleotide and protein sequences

• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.

• Still maintained at this institute by a team led by Takashi Gojobori.

• http://www.ddbj.nig.ac.jp

Page 17: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Databases related to Genomics

• Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases

• Exist for most organisms important for life science research

• Examples: OMIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.

Page 18: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Other NCBI nucleic acid DBs

• EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).

• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.

• HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.

• SNP database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.

• RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support the data-gathering efforts.

Page 19: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Nucleic acid structure databases

• NDB: Nucleic acid-containing structures http://ndbserver.rutgers.edu/

• NTDB: Thermodynamic data for nucleic acids http://ntdb.chem.cuhk.edu.hk/

• RNABase: RNA-containing structures from PDB and NDB http://www.rnabase.org/

• SCOR: Structural classification of RNA: RNA motifs by structure, function and tertiary interactions http://scor.lbl.gov/

Page 20: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Protein Sequence Databases

One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published until 1978. It became the foundation of the PIR database.

Protein Information Resource: http://pir.georgetown.edu/

Page 21: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Protein Databases

• SWISS-PROT: Annotated sequence database

• TrEMBL: Database of EMBL nucleotide translated sequences

• InterPro: Integrated resource for protein families, domains and functional sites

• CluSTr: Offers an automatic classification of SWISS-PROT and TrEMBL

• IPI: A non-redundant human proteome set constructed from SWISS-PROT, TrEMBL, Ensembl and RefSeq

• GOA: Provides assignments of gene products to the Gene Ontology (GO) resource

• Proteome Analysis: Statistical and comparative analysis of the predicted proteomes of fully sequenced organisms

• Protein Profiles: Tables of SWISS-PROT and TrEMBL entries and alignments for the protein families of the Protein Profile

• IntEnz: The Integrated relational Enzyme database (IntEnz) will contain enzyme data approved by the Nomenclature Committee

Reference site: www.ebi.ac.uk/Databases/protein.html

Page 22: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Swiss-Prot

• A protein sequence database which strives to provide a high level of annotation:
– the function of a protein
– domain structure
– post-translational modifications
– variants

• One entry for each protein

• Complete, curated, non-redundant and cross-referenced with 34 other databases

Page 23: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

UniProt: http://www.uniprot.org/

• The Universal Protein Resource (UniProt) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

• It offers BLAST searches, sequence alignment, sequence retrieval by identifier, and ID mapping from other databases such as GenBank, EMBL and DDBJ.

Page 24: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

TrEMBL (Translation of EMBL)

• Created in 1996 as a computer-annotated supplement to SWISS-PROT.

• Contains translations of all coding sequences (CDS) in EMBL.

• Has 2 main sections:
1. SP-TrEMBL: contains entries that will eventually be incorporated into SWISS-PROT, but that have not yet been manually annotated.
2. REM-TrEMBL: contains sequences that are not destined to be included in SWISS-PROT; these include immunoglobulins and T-cell receptors, synthetic and patented sequences, and codon translations that do not encode real proteins.

• A computer-annotated supplement to SWISS-PROT is needed because manual annotation cannot cope with the flow of data.

• TrEMBL contains everything that is not yet in SWISS-PROT.
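What "translation of a coding sequence" means can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard genetic code, reads the forward strand from the first base, and stops at the first stop codon. The slides do not describe how TrEMBL itself performs translation, and the function name `translate_cds` is made up here.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon until a stop codon.

    Unknown codons (for example those containing N) become 'X';
    a trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```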

Page 25: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Structure Databases

• MSD: The Macromolecular Structure Database. A relational database representation of the clean Protein Data Bank (PDB)

• 3DSeq: 3D sequence alignment server. Annotation of the alignments between sequence databases and the PDB

• FSSP: Based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB)

• DALI: Fold classification based on structure-structure assignments

• 3Dee: Database of protein domain definitions wherein the domains have been clustered on sequence and structural similarity

• NDB: Nucleic Acid Structure Database

Page 26: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain
Page 27: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Protein Data Bank (PDB)

• Important in solving real problems in molecular biology

• Protein Data Bank
– Established in 1971 at Brookhaven National Laboratory (BNL)
– Sole international repository of macromolecular structure data
– Moved to the Research Collaboratory for Structural Bioinformatics (RCSB): http://www.rcsb.org/

Page 28: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………

Page 29: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
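In a real PDB file each ATOM record is a fixed-width line, so fields are read by column position rather than by splitting on spaces (the transcript above has lost the original spacing). The sketch below assumes the commonly documented column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, B-factor 61-66). The helper `format_atom_line` is hypothetical and exists only to rebuild a properly spaced line from the values of atom 2 shown above.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Rebuild a fixed-width ATOM line (hypothetical helper for this sketch)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
    )


def parse_atom_line(line):
    """Read the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),  # empty in old entries without chain IDs
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


if __name__ == "__main__":
    # Values of atom 2 (CA of TRP 5) from the example record above.
    line = format_atom_line(2, "CA", "TRP", " ", 5, 7.743, -1.668, 11.585, 1.00, 13.42)
    print(parse_atom_line(line))
```

Slicing by column keeps working when adjacent numeric fields touch each other, which is where whitespace splitting tends to break.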

Page 30: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Databases related to Proteomics

• Contain information obtained by 2D-PAGE (two-dimensional polyacrylamide gel electrophoresis): master images of the gels and descriptions of the identified proteins

• Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.

• Format: composed of image and text files

• Most 2D-PAGE databases are “federated” and use SWISS-PROT as a master index

• Mass Spectrometry (MS) database

Page 31: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Munich Information Center for Protein Sequences (MIPS)

• A research centre hosted at the Institute for Bioinformatics (IBI) at Neuherberg, Germany.

• Contains information for Systematic analysis of genome information including the development and application of bioinformatics methods in genome annotation, gene expression analysis and proteomics.

• MIPS supports and maintains a set of generic databases as well as the systematic comparative analysis of microbial, fungal, and plant genomes.


Page 32: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

The Institute for Genomic Research (TIGR)

• Maintained by The Center for the Advancement of Genomics (TCAG)

• Its database is TDB

• TDB provides a substantial suite of databases containing DNA and protein sequence, gene expression, cellular role, protein family information, and taxonomic data for microbes, plants and humans.

Page 33: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

HOVERGEN : Homologous Vertebrate Genes Database

• HOVERGEN is a database of homologous vertebrate genes.

• It allows one to select sets of homologous genes among vertebrate species, and to visualize multiple alignments and phylogenetic trees.

• Thus HOVERGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies.

• Divided into 2 parts:
– HOVERGEN contains the protein sequences
– HOVERGENDNA contains the associated nucleotide sequences

• The database contains all vertebrate protein sequences from the UniProt Knowledgebase (Swiss-Prot and TrEMBL).

Page 34: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

The Arabidopsis Information Resource TAIR

• TAIR maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. 

• Data available from TAIR includes the complete genome sequence along with gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community.

• It is an up-to-date database, updated every 2 weeks.

Page 35: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

PlasmoDB: a functional genomic database for malaria parasites

• PlasmoDB (http://PlasmoDB.org) is a functional genomic database for Plasmodium spp. that provides a resource for data analysis and visualization on a gene-by-gene or genome-wide scale.

• The latest release, PlasmoDB 5.5, contains numerous new data types from several broad categories—annotated genomes, evidence of transcription, proteomics evidence, protein function evidence, population biology and evolution.


Page 36: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

 ECDC (European Centre for Disease Prevention and Control) 

• The European Centre for Disease Prevention and Control (ECDC) was established in 2005. It is an EU agency aimed at strengthening Europe's defences against infectious diseases.

• ECDC publishes scientific and technical reports on various issues related to communicable diseases prevention and control, including comprehensive reports from key technical and scientific meetings.


Page 37: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Other Databases

• KEGG (Kyoto Encyclopedia of Genes and Genomes): for pathways

• GeneCards: A database of human genes, their products and their involvement in diseases. It is a secondary database which contains links to many other databases.
– All-in-one database of human genes (a project by the Weizmann Institute)
– Attempts to integrate as many databases, publications and as much available knowledge as possible

• There are many databases available for microarray, SAGE, EST and SNP data.

Page 38: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

FASTA Format

• A popular and commonly used format

>Seq1
ALVLRARLATGPATGCTRTARARLATGALVLRARLATGPARARLATGPATGCTRTARA
RLATGALVLRARRLATGPATGCTRRLATGPATGCTRRARLATGPATGCTRTARARLAT
GALVLRAR
>Seq2
TGCTRTARARLATGALVLRARLATGPARARALVLRARLATGPATGCTRTARATGALVL
RARLATGPARARALVLRARLATG
>Seq3
……..
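The example above shows the whole format: a line starting with `>` names a sequence, and every following line up to the next `>` belongs to that sequence. A minimal reader could look like the sketch below. The function name `parse_fasta` is made up for this illustration, and the sketch keeps the entire header line as the name, which is a simplification.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text.

    Sequence lines belonging to one header are joined into a single string;
    blank lines are skipped.
    """
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()  # drop '>' and surrounding spaces
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    demo = ">Seq1\nALVLRAR\nLATG\n>Seq2\nTGCTR\n"
    for name, seq in parse_fasta(demo).items():
        print(name, len(seq))
```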

Page 39: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Identifiers and Accession numbers

• Identifier: a string of letters and digits that is generally “understandable”
– Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in Swiss-Prot
– The identifier can change (at the curator's discretion)

• Accession code: a string of letters and digits that uniquely identifies an entry in its database
– The accession number for TPIS_CHICK in Swiss-Prot is P00940
– The accession number should not change!
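The two kinds of name can usually be told apart by their shape. The sketch below is illustrative only: the accession pattern is the regular expression given in the UniProt help pages as far as it is remembered here, so it should be checked against the current documentation, and the entry-name pattern is a simplified assumption (a mnemonic, an underscore, a species code).

```python
import re

# Accession pattern: reproduced from memory of the UniProt help pages; verify before use.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Entry-name pattern: a simplified assumption, e.g. TPIS_CHICK.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify(name):
    """Guess whether a string looks like an accession or an entry name."""
    if ACCESSION_RE.fullmatch(name):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(name):
        return "entry name"
    return "unknown"


if __name__ == "__main__":
    for example in ("P00940", "TPIS_CHICK"):
        print(example, "->", classify(example))
```

Because identifiers may be renamed, scripts and cross-references are safer when they store the accession number.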

Page 40: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


Page 41: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain
Page 42: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain
Page 43: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain

Google Scholar: http://scholar.google.com/

Page 44: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


Page 45: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


Exercise

• Retrieve all publications in which the first author is: Mayrose I and the last author is: Pupko T
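One way to approach this exercise is with PubMed field tags: `[1au]` restricts a name to the first author and `[lastau]` to the last author. The sketch below only builds the query string and an NCBI E-utilities `esearch` URL; it does not contact the server. The helper names and the `retmax` default are choices made for this illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_position_query(first_author, last_author):
    """Combine first-author and last-author restrictions with AND."""
    return f"{first_author}[1au] AND {last_author}[lastau]"


def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not send) an esearch request URL for the given term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    query = author_position_query("Mayrose I", "Pupko T")
    print(query)
    print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box.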

Page 46: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


The MOST important of all

1. Google (or any search engine)

Page 47: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


And always remember:

2. RTM: Read the manual!

Page 48: Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain


Help!

• Read the Help section

• Read the FAQ section

• Google the question!