biological databases genbank

Upload: jaineem

Post on 09-Apr-2018


  • 8/8/2019 Biological Databases Genbank

    1/31

    BIOLOGICAL DATABASES


    Sequence Databases


    Other Databases


    The Nucleotide Giants

    GenBank

    DDBJ: DNA Data Bank of Japan

    EMBL: European Molecular Biology Laboratory


    GenBank

    The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. The database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).
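    To make "protein translations" concrete, here is a minimal sketch of how a coding nucleotide sequence maps to a protein under the standard genetic code. The function name `translate` and the toy sequence are illustrative assumptions, not anything taken from GenBank. Real records may specify other genetic codes, which this sketch ignores.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions (64 codons, '*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Codons with ambiguous bases (e.g. N) are reported as 'X'.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna), 3)
    )


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # hypothetical toy sequence
```

    The sketch keeps stop codons visible as `*` rather than truncating at the first stop, so a reader can see where a reading frame ends.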


    History

    Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early 1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook the task of scanning the literature for sequences and manually typing the sequences into the database. Staff then added annotation to these records, based upon information in the published article.

    Today, nearly all sequences are deposited directly by the laboratories that generate them. This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the accession number can be cited and the sequence can be retrieved when the article is published.

    NCBI began accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.


    International Collaboration

    [Figure: the three collaborating databases, GenBank, DDBJ, and EMBL]


    In February 1986, the GenBank database became part of the International Nucleotide Sequence Database Collaboration with the EMBL database (European Bioinformatics Institute [http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL, Los Alamos, NM). Subsequently, the GSDB was removed and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group in 1987. Each database has its own set of submission and retrieval tools, but the three databases exchange data daily so that all three should contain the same set of sequences.

    To avoid conflicting data at the three sites, an entry can be updated only by the database that initially prepared it.
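    The ownership rule and the daily exchange can be pictured with a toy model. This is only a sketch under stated assumptions: the `Entry` class, the `apply_update` and `daily_exchange` functions, and the accession `TOY0001` are all hypothetical, and the real exchange mechanism between the three databases is not described in these slides.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    accession: str
    owner: str      # database that initially prepared the entry
    sequence: str


def apply_update(store, accession, source_db, new_sequence):
    """Accept an update only if it comes from the entry's owning database."""
    entry = store[accession]
    if source_db != entry.owner:
        return False  # non-owners may not modify the record
    entry.sequence = new_sequence
    return True


def daily_exchange(sites):
    """Copy every site's own entries to all other sites (toy daily sync)."""
    for name, store in sites.items():
        for entry in list(store.values()):
            if entry.owner != name:
                continue  # only the owner's copy is authoritative
            for other_name, other_store in sites.items():
                if other_name != name:
                    other_store[entry.accession] = Entry(
                        entry.accession, entry.owner, entry.sequence
                    )


if __name__ == "__main__":
    sites = {
        "GenBank": {"TOY0001": Entry("TOY0001", "GenBank", "ACGT")},
        "EMBL": {},
        "DDBJ": {},
    }
    daily_exchange(sites)
    print(sorted(name for name, store in sites.items() if "TOY0001" in store))
```

    Because only the owner's copy is ever propagated, two sites can never push conflicting versions of the same entry, which is the point of the rule.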


    The Collaboration created a Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html] that outlines the permitted features and syntax for the DDBJ, EMBL, and GenBank feature tables. The purpose of this document is to standardize annotation across the databases. The presentation and format of the data differ among the three databases, but the underlying biological information is the same.

    The International Nucleotide Sequence Database Collaboration also exchanges new and updated records daily. Therefore, all sequences present in GenBank are also present in DDBJ and EMBL.


    How to access them?

    Main Sites

    NCBI: http://www.ncbi.nlm.nih.gov/

    EMBL: http://www.ebi.ac.uk/

    DDBJ: http://www.ddbj.nig.ac.jp


    THE GENBANK FLATFILE: A DISSECTION

    In FASTA format

    The GenBank flatfile (GBFF) is the elementary

    unit of information in the GenBank database. It is

    one of the most commonly used formats in the

    representation of biological sequences.
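    As a rough illustration of what dissecting a flatfile involves, the sketch below reads the top-level keywords and the sequence from a tiny, made-up record and re-emits it in FASTA format. The record `TOY0001`, the function names, and the simplified parsing rules are assumptions for illustration only. Indented sub-keywords and the feature table are simply folded into their parent keyword, which a real parser would not do.

```python
# A hypothetical, heavily shortened flatfile record (not a real GenBank entry).
TOY_RECORD = "\n".join([
    "LOCUS       TOY0001                   12 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Hypothetical toy record for illustration.",
    "ACCESSION   TOY0001",
    "ORIGIN",
    "        1 atggcctaag ct",
    "//",
])


def parse_flatfile(text):
    """Collect top-level keyword values and the sequence from one record."""
    record = {}
    sequence_parts = []
    keyword = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].strip():             # keyword starts in the first column
            keyword, _, value = line.partition(" ")
            if keyword == "ORIGIN":
                in_origin = True
            else:
                record[keyword] = value.strip()
        elif keyword and line.strip():     # indented continuation line
            record[keyword] += " " + line.strip()
    record["sequence"] = "".join(sequence_parts)
    return record


def to_fasta(record, width=60):
    """Render a parsed record as FASTA: one header line, then wrapped sequence."""
    header = ">" + record["ACCESSION"] + " " + record.get("DEFINITION", "")
    seq = record["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header.rstrip()] + lines)


if __name__ == "__main__":
    print(to_fasta(parse_flatfile(TOY_RECORD)))
```

    FASTA keeps only an identifier line and the sequence, so the annotation that makes the flatfile useful is lost in the conversion.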


    EMBL and DDBJ

    The European counterpart to GenBank is the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL), located at the European Bioinformatics Institute (EBI).

    Another primary nucleotide sequence database, the DNA Data Bank of Japan (DDBJ) [ddbj], is operated by the Center for Information Biology (CIB) [cib] in Japan and is the primary nucleotide sequence database for Asia.

    The three database operators, NCBI, EBI, and CIB, form the International Nucleotide Sequence Database Collaboration and synchronize their databases every 24 h. It is therefore not necessary to query all three databases individually, nor to submit a new nucleotide sequence to all three.

    While the database format of DDBJ is identical to that of NCBI, that of EMBL differs somewhat.


    The Sequence Retrieval System

    SRS was developed at EBI to manage primary and secondary biological databases (Etzold et al. 1996). SRS can also facilitate complex queries. Operation of SRS is the same at either DDBJ or EBI, and the following section describes the system at EBI.


    Protein Database

    SWISSPROT

    One of the most important collections of annotated protein sequences is the Swissprot database [swissprot] of the Swiss Institute of Bioinformatics (SIB), which also operates the Expert Protein Analysis System (Expasy) server [expasy]. Swissprot is a high-quality database because it is manually curated.

    Furthermore, Swissprot is part of the UniProt databases (see Sect. 3.2.2 Uniprot), collectively known as the UniProt Knowledgebase (UniProtKB).

    Because SIB specialists cannot keep pace with the growing number of new entries, a supplement to Swissprot has been developed, the TrEMBL database. TrEMBL stands for Translated EMBL and contains all nucleic acid to protein translations of the EMBL database that have not yet been included in Swissprot. All entries are annotated automatically, so their quality is lower than that of the curated entries.

    Both databases can be accessed via the Swissprot main page.


    NCBI Protein Database

    Another well-known protein sequence database is maintained at the NCBI. This database, however, is not a single database but a compilation of entries found in other protein sequence databases. For example, the NCBI database contains entries from Swissprot, the PIR database [pir], the PDB database [pdb], protein translations of the GenBank database, as well as from a number of other sequence databases.

    Its format corresponds to that of GenBank, and queries are carried out analogously to those of GenBank via the Entrez system of NCBI.
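    Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds a request URL for the `efetch` endpoint and performs no network access. The helper name `efetch_url` and the placeholder identifier `ACCESSION_HERE` are assumptions. The parameter names (`db`, `id`, `rettype`, `retmode`) follow the E-utilities interface as commonly documented and should be checked against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, accession, rettype, retmode="text"):
    """Build an efetch URL for one record; nothing is downloaded here."""
    params = {
        "db": db,            # e.g. "protein" or "nuccore"
        "id": accession,     # accession or other record identifier
        "rettype": rettype,  # e.g. "fasta", or "gb"/"gp" for flatfile views
        "retmode": retmode,
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # ACCESSION_HERE is a placeholder, not a real record identifier.
    print(efetch_url("protein", "ACCESSION_HERE", "fasta"))
    print(efetch_url("nuccore", "ACCESSION_HERE", "gb"))
```

    Using the same function for the protein and nucleotide databases mirrors the point made above: both are queried in the same way, and only the `db` value changes.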


    Universal Protein Resource (UniProt)

    UniProt (The UniProt Consortium 2007) unites the information in the three protein databases Swissprot, TrEMBL, and PIR.

    UniProt consists of three parts: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters database (UniRef), and the UniProt Archive (UniParc), a collection of protein sequences and their history.

    UniProtKB is a comprehensive directory of protein annotations and is based on the Swissprot and TrEMBL databases.

    UniRef is a nonredundant sequence database that allows for fast similarity searches. The database exists in three versions: UniRef100, UniRef90, and UniRef50.


    Secondary Databases


    PROSITE

    An important secondary biological database is Prosite (Falquet et al. 2002), resident at the SIB.

    Classification of proteins in Prosite is determined using single conserved motifs, i.e., short sequence regions (10 to 20 amino acids) that are conserved in related proteins and usually have a key role in the protein's function.

    A motif is derived from multiple alignments (see Chap. 4) and saved in the database as a regular expression, for example:

    [GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.

    Besides searching for keywords, one can examine a sequence for the presence of Prosite motifs. Furthermore, using the ScanProsite tool, Prosite offers the possibility to search Swissprot, TrEMBL, and PDB for protein sequences that contain a user-defined pattern.
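    A Prosite pattern maps fairly directly onto a conventional regular expression. The following sketch converts the pattern syntax (`-` separators, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence ends) into a Python regex. The function name `prosite_to_regex`, the test sequences, and the de-duplicated reading of the example pattern are assumptions. This is not how ScanProsite itself is implemented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a Prosite-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")    # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # N-terminal anchor
        anchor_end = element.endswith(">")       # C-terminal anchor
        element = element.strip("<>")
        # Split off an optional repetition such as x(2) or x(2,4).
        body, _, repeat = element.partition("(")
        repeat = repeat.rstrip(")")
        if body == "x":
            regex = "."                          # any residue
        elif body.startswith("{"):
            regex = "[^" + body[1:-1] + "]"      # forbidden residues
        else:
            regex = body                         # one residue or a [..] class
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.")
    print(regex)
    # Hypothetical sequence, chosen only to contain one match.
    print(re.search(regex, "AAGSFKLLPAA"))
```

    Translating to a standard regex makes it possible to scan sequences with any regex engine, at the cost of ignoring the less common parts of the Prosite syntax that this sketch does not handle.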


    PRINTS

    The Prints database [prints] (Attwood et al. 2003) uses fingerprints to classify sequences.

    Fingerprints consist of several sequence motifs, represented in the Prints database by short local ungapped alignments. The Prints database takes advantage of the fact that proteins usually contain functional regions that result in several sequence motifs per protein.

    Besides information on how to derive a fingerprint and judge its quality, the Prints database also offers cross-references to entries in related databases, thus permitting access to more information regarding the protein family.


    Pfam

    The Pfam database [pfam] (Bateman et al. 2002) classifies protein families according to profiles. A profile is a pattern that evaluates the probability of the appearance of a given amino acid, an insertion, or a deletion at every position in a protein sequence.

    Pfam is based on sequence alignments.
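    The idea of a profile can be sketched with a simple position-specific frequency table built from an ungapped alignment. This is a deliberately reduced model: it ignores insertions and deletions, uses a uniform background and a fixed pseudocount, and is not Pfam's actual method. The function names and the toy alignment are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # Pseudocounts keep unseen residues from getting probability zero.
        profile.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, sequence):
    """Sum of log2(profile probability / uniform background) over all positions."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    background = 1.0 / len(AMINO_ACIDS)
    return sum(
        math.log2(column[aa] / background) for column, aa in zip(profile, sequence)
    )


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDE", "ACDF"]   # hypothetical family alignment
    profile = build_profile(toy_alignment)
    print(log_odds_score(profile, "ACDE"))
    print(log_odds_score(profile, "WWWW"))
```

    A sequence that resembles the alignment scores above zero and an unrelated one scores below it, which is the behavior a profile search relies on.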

    Further sequences are then automatically added to the individual alignments of the Swissprot database. The resulting alignments should represent functionally interesting structures and contain evolutionarily related sequences.

    Because of the partly automatic construction of the alignments, however, it is also possible that sequence alignments arise that have no evolutionary relationship to one another. Therefore, results of a search against the Pfam database should be carefully reviewed.


    InterPro

    The Integrated Resource of Protein Families, Domains, and Sites (Interpro) [interpro] (Mulder et al. 2007) integrates important secondary databases into a comprehensive signature database.

    Interpro merges the databases Swissprot, TrEMBL, Prosite, Pfam, Prints, ProDom, Smart, and TIGRFAMs [tigr] and thereby allows a simple and simultaneous query of these databases.

    The result page combines the output of the individual queries. This makes for a fast comparison of the results while taking into account the strengths and weaknesses of the individual databases.


    Other Databases

    Genotype-Phenotype Databases

    For diseases to emerge and progress, several genes or their products are frequently required. The identification of genes relevant to disease is, therefore, of vital importance in a target-based approach for rational drug development.

    A number of genotype-phenotype databases have been established that record relationships between genes and the biological properties of organisms.

    OMIM: Online Mendelian Inheritance in Man

    dbGaP

    OMIA: Online Mendelian Inheritance in Animals (excluding mice and humans)

    Mouse Genome Database

    FlyBase & WormBase


    Molecular Structure Databases

    PDB: Protein Data Bank

    SCOP

    CATH: Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H)


    PDB

    The Protein Data Bank (PDB) is a database of experimentally determined crystal structures of biological macromolecules.

    The PDB was founded at the Brookhaven National Laboratory in 1971, which is reflected in the frequent use of the name Brookhaven Protein Data Bank.

    About 46,000 macromolecule structures are stored in the PDB database (as of September 2007). These are predominantly proteins, but also include DNA and RNA structures and protein-nucleic acid complexes.

    As of 2002, only those crystal structures that have been solved experimentally are stored in the PDB database, whereas data of theoretical protein models are kept in their own section [pdb-models].

    The PDB database offers a number of query options. A text-based search for a PDB-ID or a keyword can be initiated on the main page.


    SCOP

    Proteins that perform a similar biological function and are evolutionarily related must have a similar structural organization, at least in the region of their active centers. It should, therefore, be possible to predict the function of an unknown protein by comparison of its structural organization with that of known proteins. Two databases, SCOP and CATH, provide such predictions.

    SCOP (Structural Classification Of Proteins) [scop] (Murzin et al. 1995) classifies proteins of a known structure in a hierarchical manner. The three main classifications are families, superfamilies, and folds. Families describe proteins with a clear evolutionary relationship to each other and are limited by a sequence identity that must be at least 30% over the total length of the proteins.
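    The 30% criterion can be illustrated with a small calculation of percent identity between two sequences that have already been aligned. This is a sketch under assumptions: the function names are hypothetical, the denominator is the number of alignment columns (one of several conventions in use), and identity alone does not reproduce SCOP's classification, which also relies on expert assessment of structure and function.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Columns where both sequences have a gap carry no information; skip them.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


def meets_family_identity_threshold(aligned_a, aligned_b, threshold=30.0):
    """Check the identity criterion quoted above (default: at least 30%)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Hypothetical aligned sequences, for illustration only.
    print(percent_identity("ACDE", "ACDF"))
    print(meets_family_identity_threshold("ACDEFGHIKL", "ACDWWWWWWW"))
```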