lecture 2 – biological databases...the central dogma of molecular biology structure of protein dna...

Lecture 2 – Biological Databases

14th September 2011

}  Course webpage }  http://courses.cs.ut.ee/2011/Bioinformatics/

}  Lecture 1 – Introduction to Biology and Bioinformatics }  Lecture 2 – Biological Databases

Human genome

Human body

• 10^14 cells (100 trillion)

One cell

• 23 pairs of chromosomes

DNA

• ≈21,000 to 23,000 genes

RNA

• 3 billion pairs of DNA bases

Protein

• ≈1 million different proteins

DNA vs RNA

DNA
}  Deoxyribonucleic acid
}  Sugar is deoxyribose
}  DNA is a polymer of deoxyribonucleotides
}  Bases are adenine (A), guanine (G), cytosine (C) and thymine (T)

RNA
}  Ribonucleic acid
}  Sugar is ribose
}  RNA is a polymer of ribonucleotides
}  Bases are adenine (A), guanine (G), cytosine (C) and uracil (U) (instead of thymine)

RNA vs DNA

Structure of DNA-Watson and Crick 1953

DNA with high GC-content is more stable than DNA with low GC-content: each G-C base pair forms 3 hydrogen bonds, whereas each A-T base pair forms only 2.
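As a rough illustration of the GC-content statement, the sketch below computes the GC fraction of a sequence and counts Watson-Crick hydrogen bonds (3 per G-C pair, 2 per A-T pair). The function names are made up for this example, and the bond count is only a crude proxy for stability.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def hydrogen_bonds(seq: str) -> int:
    """Count hydrogen bonds if every base in seq were paired with its complement."""
    seq = seq.upper()
    gc_pairs = seq.count("G") + seq.count("C")  # 3 bonds per G-C pair
    at_pairs = seq.count("A") + seq.count("T")  # 2 bonds per A-T pair
    return 3 * gc_pairs + 2 * at_pairs


if __name__ == "__main__":
    for s in ("AATT", "ATGC", "GGCC"):
        print(s, gc_content(s), hydrogen_bonds(s))
```

For sequences of equal length, the one with the higher GC fraction always has the higher bond count in this model.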

RNA
}  There are 4 types of RNA, each encoded by its own type of gene:
}  mRNA - Messenger RNA: Encodes amino acid sequence of a polypeptide.
}  tRNA - Transfer RNA: Brings amino acids to ribosomes during translation.
}  rRNA - Ribosomal RNA: With ribosomal proteins, makes up the ribosomes, the organelles that translate the mRNA.
}  snRNA - Small nuclear RNA: With proteins, forms complexes that are used in RNA processing in eukaryotes. (Not found in prokaryotes.)

http://finchtalk.geospiza.com/2009_05_01_archive.html


Amino acids

Codon table

1/Sep/10

Gene Transcription, Translation, and Protein Synthesis

The central dogma of molecular biology

Structure of protein

DNA sequences
}  Genes are encoded in genomic sequences
}  Genes are transcribed into mRNAs (the primary transcript includes coding, intronic, 5' and 3' untranslated regions)
}  Introns are spliced out of mRNAs, which are then translated into proteins
}  mRNAs are copied to cDNAs (including the 5' and 3' UTRs)

Basic units

DNA

RNA

Protein

DNA-RNA-Protein
}  All this information is stored in databases
}  What do we want from the databases?

Structure of an experiment

DNA }  Raw DNA sequence }  Coding or noncoding }  Parse into genes }  4 bases ATGC

        1 atgtctggca gttgctcttc taggaaatgc ttctccgtgc cagccacctc tctctgctcc
       61 actgaggtga gctgtggagg ccccatctgc ctgcccagtt cctgccagag ccagacatgg
      121 cagctggtga cttgtcaaga cagctgtgga tcatccagct gtgggccaca gtgccgtcag
          .......... .......... .......... .......... .......... ..........
     1441 gtggatgtct cagaagaggc tccctgccag cccactgaag ccaaacccat cagcccaacc
     1501 acccgtgagg ccgcagcagc tcagcctgct gccagcaagc ctgccaactg ctaa

Protein sequence

}  20 letter alphabet
}  ACDEFGHIKLMNPQRSTVWY
}  [not BJOUXZ]
}  Proteins are divided into domains

    /translation="MSGSCSSRKCFSVPATSLCSTEVSCGGPICLPSSCQSQTWQLVT
    CQDSCGSSSCGPQCRQPSCPVSSCAQPLCCDPVICEPSCSVSSGCQPVCCEATTCEPS
    CSVSNCYQPVCFEATICEPSCSVSNCCQPVCFEATVCEPSCSVSSCAQPVCCEPAICE
    PSCSVSSCCQPVGSEATSCQPVLCVPTSCQPVLCKSSCCQPVVCEPSCCSAVCTLPSS
    CQPVVCEPSCCQPVCPTPTCSVTSSCQAVCCDPSPCEPSCSESSICQPATCVALVCEP
    VCLRPVCCVQSSCEPPSVPSTCQEPSCCVSSICQPICSEPSPCSPAVCVSSPCQPTCY
    VVKRCPSVCPEPVSCPSTSCRPLSCSPGSSASAICRPTCPRTFYIPSSSKRPCSATIS
    YRPVSRPICRPICSGLLTYRQPYMTSISYRPACYRPCYSILRRPACVTSYSCRPVYFR
    PSCTESDSCKRDCKKSTSSQLDCVDTTPCKVDVSEEAPCQPTEAKPISPTTREAAAAQ
    PAASKPANC"


DNA sequence

5'-CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG-3' Top Strand
   |||||||||||||||||||||||||||||||||||||||||||||||
3'-GTTACCGATCCATGATACATACTCTAGTACTAGAAATGTTTAGGCTC-5' Bottom Strand

DNA

CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG Top Strand
|||||||||||||||||||||||||||||||||||||||||||||||
GUUACCGAUCCAUGAUACAUACUCUAGUACUAGAAAUGUUUAGGCUC RNA

CAAUGGCUAGGUACUAUGUAUGAGAUCAUGAUCUUUACAAAUCCGAG RNA
|||||||||||||||||||||||||||||||||||||||||||||||
GTTACCGATCCATGATACATACTCTAGTACTAGAAATGTTTAGGCTC Bottom Strand
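The strand pairings on this slide can be reproduced with a few string operations. This is a minimal sketch, and the helper names (complement, reverse_complement, transcribe) are chosen for this example only. It assumes an upper- or lower-case sequence containing only A, C, G and T.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

TOP = "CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG"


def complement(dna: str) -> str:
    """Base-by-base complement, read in the same left-to-right order."""
    return dna.upper().translate(DNA_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """The opposite strand written 5' to 3'."""
    return complement(dna)[::-1]


def transcribe(coding_strand: str) -> str:
    """RNA with the same sequence as the given DNA strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(complement(TOP))          # bottom strand, written 3' to 5'
    print(reverse_complement(TOP))  # bottom strand, written 5' to 3'
    print(transcribe(TOP))          # RNA that pairs with the bottom strand
```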

Reading frames

e.g. 5' CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG 3' DNA

Forward Frames

DNA          CAA TGG CTA GGT ACT ATG TAT GAG ATC ATG ATC TTT ACA AAT CCG AG
Amino acids   Q   W   L   G   T   M   Y   E   I   M   I   F   T   N   P

DNA          C AAT GGC TAG GTA CTA TGT ATG AGA TCA TGA TCT TTA CAA ATC CGA G
Amino acids     N   G   *   V   L   C   M   R   S   *   S   L   Q   I   R

DNA          CA ATG GCT AGG TAC TAT GTA TGA GAT CAT GAT CTT TAC AAA TCC GAG
Amino acids      M   A   R   Y   Y   V   *   D   H   D   L   Y   K   S   E

Reverse Frames

5' CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG 3' DNA Top Strand
   |||||||||||||||||||||||||||||||||||||||||||||||
3' GTTACCGATCCATGATACATACTCTAGTACTAGAAATGTTTAGGCTC 5' DNA Bottom (Complementary) Strand

5' CTCGGATTTGTAAAGATCATGATCTCATACATAGTACCTAGCCATTG 3' DNA Bottom (Complementary) Strand Reversed

DNA          CTC GGA TTT GTA AAG ATC ATG ATC TCA TAC ATA GTA CCT AGC CAT TG
Amino acids   L   G   F   V   K   I   M   I   S   Y   I   V   P   S   H   X

DNA          C TCG GAT TTG TAA AGA TCA TGA TCT CAT ACA TAG TAC CTA GCC ATT G
Amino acids     S   D   L   *   R   S   *   S   H   T   *   Y   L   A   I  X

DNA          CT CGG ATT TGT AAA GAT CAT GAT CTC ATA CAT AGT ACC TAG CCA TTG
Amino acids      R   I   C   K   D   H   D   L   I   H   S   T   *   P   L
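The six reading frames above can be reproduced programmatically. The sketch below builds the standard genetic code table and translates each frame of the example sequence. The function names are illustrative, and incomplete trailing codons are simply dropped rather than shown as X as on the slide. It assumes the input contains only A, C, G and T, so any other letter raises a KeyError.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

TOP = "CAATGGCTAGGTACTATGTATGAGATCATGATCTTTACAAATCCGAG"


def translate(dna: str, offset: int = 0) -> str:
    """Translate complete codons starting at offset; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return the three forward and three reverse-strand translations."""
    reverse = dna.upper().translate(COMPLEMENT)[::-1]
    return {
        "forward": [translate(dna, k) for k in range(3)],
        "reverse": [translate(reverse, k) for k in range(3)],
    }


if __name__ == "__main__":
    for strand, frames in six_frames(TOP).items():
        for k, protein in enumerate(frames, start=1):
            print(strand, k, protein)
```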

Outline }  Introduction

}  Data and database types }  Components

}  Sample databases }  Data formats }  How to search biological databases ?

Bioinformatics can be divided into
}  Algorithm / computational method
}  Biological databases
}  Visualization software
}  Biological data
}  Testing biological hypothesis

Biological database
}  Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.

}  contain information from research areas - genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.

}  information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.

}  Relational database concepts of computer science and Information retrieval concepts of digital libraries are important for understanding biological databases.

}  Biological database design, development, and long-term management are a core area of the discipline of bioinformatics.

}  Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data.

}  These are often described as semi-structured data, and can be represented as tables, key delimited records, and XML structures. Cross-references among databases are common, using database accession numbers.

Biological database architecture

© 2003 Nature Publishing Group

ORTHOLOGUE: A homologous gene that is derived from a speciation event or by vertical descent.

338 | MAY 2003 | VOLUME 4 www.nature.com/reviews/genetics

REVIEWS

However, the differences are deeper than the user-interface issues. Consider this scenario. A human geneticist has localized an obesity factor to a 5-Mb region of human chromosome 1. Are there any genes in this region that have homologues that are involved in the regulation of lipid metabolism in any model system? To answer this question, it is necessary to traverse several databases. First, the researcher goes to the UCSC or Ensembl databases to find all the predicted genes in the region. Next, the BLAST search engine at NCBI is visited, to find homologues in the various model-organism systems. Then, the researcher goes to the GO web site to look up the GO terms that are related to lipid metabolism. Finally, the researcher either visits each of the model-organism databases in turn to find out whether any of the homologues are related to one of the GO terms, or downloads the entire list of genes that are related to lipid metabolism from the GO web site and checks whether one or more of the homologues are on this list.

The problem is that each of these database resources contains a different subset of biological knowledge. Although each database can answer questions in its domain, it cannot help with questions that span domain boundaries. For example, the FlyBase database does not know about orthology relationships, The Institute for Genomic Research (TIGR) database can help with orthology but not with map position and the UCSC database knows nothing about the fly. As this example shows, researchers must become adept at ‘database surfing’ to answer many reasonable questions. However, an automated form of this strategy, called data mining, has been raised to a high art in the bioinformatics world.

Integration is difficult

Life would be much simpler if there was a single biological database, but this would be a poor solution. The diverse databases reflect the expertise and interests of the groups that maintain them. A single database would reflect a series of compromises that would ultimately impoverish the information resources that are available to the scientific community. A better solution would maintain the scientific and political independence of the databases, but allow the information that they contain to be easily integrated to enable cross-database queries. Unfortunately, this is not trivial.

There are many integration challenges. One of the most difficult is the one that might seem the most minor: how do you assign and maintain the correct names of biological objects across databases? For example, consider the DNA-damage checkpoint-pathway gene that is named Rad24 in Saccharomyces cerevisiae (budding yeast). Schizosaccharomyces pombe (fission yeast) also has a gene named rad24 that is involved in the checkpoint pathway, but it is not the ORTHOLOGUE of the S. cerevisiae Rad24. Instead, the correct S. pombe orthologue is rad17, which is not to be confused with the similarly named Rad17 gene in S. cerevisiae. Meanwhile, the human checkpoint-pathway genes are sometimes named after the S. cerevisiae orthologues, sometimes after the S. pombe orthologues, and sometimes have

[Figure 1 labels: Database access software; Database management system; Web server; Web browser; Tier 3; Tier 2; Tier 1]

Figure 1 | Biological database architecture. Most biological databases use a three-tier architecture that consists of a database management system, a middleware layer and a web interface.
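To make the three layers in the Figure 1 caption concrete, here is a toy sketch. It uses an in-memory SQLite database as a stand-in for the database management system, a plain function as the middleware layer, and an HTML string as the web interface. The table layout and function names are invented for illustration, and the two gene symbols are borrowed from the GenBank sample record shown later in these slides. A real system would place a web server between the browser and the middleware.

```python
import sqlite3
from html import escape


def make_database() -> sqlite3.Connection:
    """Database management system layer: a tiny in-memory gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, organism TEXT)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?)",
        [
            ("AXL2", "Saccharomyces cerevisiae"),
            ("REV7", "Saccharomyces cerevisiae"),
        ],
    )
    return conn


def find_gene(conn: sqlite3.Connection, symbol: str):
    """Middleware layer: turn a user request into a parameterized query."""
    row = conn.execute(
        "SELECT symbol, organism FROM gene WHERE symbol = ?", (symbol,)
    ).fetchone()
    return None if row is None else {"symbol": row[0], "organism": row[1]}


def render_page(record) -> str:
    """Web interface layer: format the record as HTML for the browser."""
    if record is None:
        return "<p>No such gene</p>"
    return f"<h1>{escape(record['symbol'])}</h1><p>{escape(record['organism'])}</p>"


if __name__ == "__main__":
    db = make_database()
    print(render_page(find_gene(db, "AXL2")))
    print(render_page(find_gene(db, "XYZ1")))
```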

Figure 2 | Searching for a gene in Ensembl.

Data warehouse


that is associated with direct access to the source databases. The failure of these languages to be adopted by the academic bioinformatics community is more puzzling, but might result from the complexity of writing and maintaining the component database drivers. Supporting this view is the fact that a few commercial entities are using K2 to manage in-house data: their large bioinformatics groups can handle the complexity of the system, and their control of in-house databases simplifies the maintenance issues.

Data warehousing. The last general approach can be broadly described as bringing all the data under one roof in a single database (FIG. 5). The first step in data warehousing is to develop a unified data model that can accommodate all the information that is contained in the various source databases. The next step is to develop a series of software programs that will fetch the data from the source databases, transform them to match the unified data model and then load them into the warehouse. The warehouse can then be used as a ‘one-stop shop’ for answering any of the questions that the source databases can handle, as well as those that require integrated knowledge that the individual sources do not have.

It is more difficult to create a data warehouse than it might sound. The single biggest issue is keeping the data warehouse up to date. New information is being continually added to the source databases, which means that the new data must be re-imported into the warehouse in a timely fashion or the warehouse will go out of date. To make matters worse, database designs do not stand still; their maintainers are continually tinkering with the data model by adding new data types, changing fields and nomenclature, and changing the relationships among data types. This constant churn means that dump, transform and load software that has been written for one version of a database will not necessarily work with a later version.

Unfortunately, link integration is problematic on several fronts. First, it is extraordinarily vulnerable to naming clashes and ambiguities. For example, a naive researcher who tries to navigate through the Rad24 gene confusion using SRS or direct database-to-database linking might wander into the wrong gene family. Second, there are update issues. An outgoing web link represents a leap of faith that the page at the other end of the link is still valid, if it exists at all. For example, if the curators of the S. cerevisiae database withdraw or rename Rad24, they have no way of informing all the databases that point to the Rad24 page that an update is required. Third, as we are all too familiar with on a daily basis, link-level integration puts the onus of integration and interpretation on the researcher.

View integration. View integration leaves the information in its source databases, but builds an environment around the databases that makes them all seem to be part of one large system. The most complete attempt at this involved the development of the cross-database query languages Kleisli and K2 at the University of Pennsylvania in the 1990s (REF. 8). If either of these languages is given a query, the language processor analyses the query to discover which databases need to be accessed to satisfy the request, and generates a series of subqueries. The processor then hands the subqueries to several ‘drivers’ that can extract the information from particular databases (for example, the GenBank driver, which can query the NCBI Entrez web interface). After the drivers fetch the data, the Kleisli/K2 query processor transforms and integrates it, and returns the data to the user.

Despite the appeal of this approach, the system has failed to catch on with the community. Researchers might be disappointed in the performance of the system. Because processing a query is limited by the slowest data source, Kleisli and K2 rarely have the performance

[Figure 5 labels: Warehouse; Source database 1; Source database 2; Source database 3; Source database 4; Translation software 1]

Figure 5 | Data warehousing. The data warehouse technique transforms the contents of multiple source databases to fit a common data model. It then integrates the source data into a single large database.
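The fetch, transform and load steps described for data warehousing can be sketched in a few lines. Everything here is hypothetical: the two "source databases" are small in-memory lists with deliberately different field names, and the unified data model is a plain dictionary with symbol, organism and source keys. The gene names reuse the Rad24 example from the text.

```python
# Two made-up source databases that describe genes with different field names.
SOURCE_A = [{"gene_name": "Rad24", "species": "S. cerevisiae"}]
SOURCE_B = [{"symbol": "rad24", "organism": "S. pombe"}]


def transform_a(record: dict) -> dict:
    """Map a SOURCE_A record onto the unified data model."""
    return {"symbol": record["gene_name"], "organism": record["species"], "source": "A"}


def transform_b(record: dict) -> dict:
    """Map a SOURCE_B record onto the unified data model."""
    return {"symbol": record["symbol"], "organism": record["organism"], "source": "B"}


def build_warehouse(sources) -> list:
    """Fetch from every source, transform each record, load into one list."""
    warehouse = []
    for fetch, transform in sources:
        for record in fetch():
            warehouse.append(transform(record))
    return warehouse


WAREHOUSE = build_warehouse(
    [(lambda: SOURCE_A, transform_a), (lambda: SOURCE_B, transform_b)]
)

if __name__ == "__main__":
    # A cross-source query that neither source could answer alone.
    hits = [r for r in WAREHOUSE if r["symbol"].lower() == "rad24"]
    for hit in hits:
        print(hit)
```

The query at the end returns two records with the same name from different organisms, which is exactly the naming ambiguity the article warns about. A change to either source's field names would break its transform function, mirroring the maintenance problem described in the text.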

Information retrieval }  Biological databases contain enormous amounts of data }  Requirement

}  Well annotated database }  Easily searchable }  Simple to retrieve }  Computer readable formats

Biological databases }  Growing steadily in number

}  Growing amazingly in size

}  Specialisations }  Which genome they contain (mouse, human, all of them) }  Which types of information about the genome they contain

}  Contain information such as }  Sequences: of bases and of residues }  Structure: 3d conformations of known proteins }  Families: Which sets of genes are known to be homologous }  Annotations: which processes each gene is involved in

}  And lots of other information }  Conceptual structure

}  How concepts in biology/genetics are related to each other

User interface }  Database search

}  Free text }  Specific fields }  Sequence based search

}  Database output }  Text }  Graphics }  Active

http://nar.oxfordjournals.org/content/39/suppl_1

The 2011 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources

The current 18th Database issue of Nucleic Acids Research includes descriptions of 96 new and 83 updated online databases. The Database Collection now lists 1330 carefully selected databases covering various aspects of molecular and cell biology. While most data resource descriptions remain very brief, the issue includes several longer papers that highlight recent significant developments in such databases as DDBJ, GenBank, Pfam, MetaCyc, UniProt, GEO, PDB, ArrayExpress, etc. The databases described in the Database Issue and Database Collection, however, are far more than a distinct set of resources; they form a network of connected data, concepts and shared technology.

Database Issue, Volume 39 suppl 1 January 2011

Sources for DNA databases }  Direct submission from researchers }  Labs/groups doing Genome sequencing }  Patent applications }  Scientific literature

Data sources of mRNAs
}  Experimental
}  Cloning a new gene
}  Cloning a gene from database
}  Hybrid system
}  Database
}  Full length cDNA
}  EST
}  mRNAs come from labs, RefSeq, full-length sequencing projects, etc.

Protein databases
}  Many different kinds of protein data
}  Primary amino acid sequence
}  Secondary structure
}  3D structure
}  Protein family domains
}  Active sites
}  Data sources of protein
}  Proteins characterized experimentally
}  mRNA products obtained experimentally (no actual protein sequencing done)
}  Translated DNA from mRNA sequences

Examples of public databases
}  Primary sequence databases
}  Genome databases: whole chromosomes (not a specific sequence)
}  Protein sequence/structure/model databases
}  Meta databases
}  Carbohydrate structure databases
}  Protein-protein interactions
}  Signaling pathway databases
}  Metabolic pathway databases
}  Microarray databases
}  Mathematical model databases
}  PCR / real time PCR primer databases
}  Specialized databases (species specific, sequencing technique specific)

DNA database classification
}  PRI - primate (human, monkey)
}  MAM - other mammalian (bovine, cat)
}  ROD - rodent (mouse, rat)
}  VRT - other vertebrate (chicken)
}  INV - invertebrate
}  PLN - plant, fungal, and algal
}  BCT - bacteria
}  VRL - viruses
}  PHG - bacteriophage
}  SYN - synthetic (plasmids, vectors)
}  UNA - unannotated sequences
}  PAT - patent sequences
}  EST - Expressed Sequence Tags
}  STS - Sequence Tagged Sites
}  GSS - Genome Survey Sequences
}  HTG - High Throughput Genomic Sequences
}  HTC - High Throughput cDNA Sequences
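The division codes listed above lend themselves to a simple lookup table. This is a minimal sketch, with the dictionary contents copied from the slide and the function name made up for the example.

```python
# Three-letter division codes as listed on the slide.
GENBANK_DIVISIONS = {
    "PRI": "primate",
    "MAM": "other mammalian",
    "ROD": "rodent",
    "VRT": "other vertebrate",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacteria",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic",
    "UNA": "unannotated sequences",
    "PAT": "patent sequences",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites",
    "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}


def describe_division(code: str) -> str:
    """Return the description for a division code, case-insensitively."""
    return GENBANK_DIVISIONS.get(code.strip().upper(), "unknown division")


if __name__ == "__main__":
    for code in ("PLN", "rod", "XYZ"):
        print(code, "->", describe_division(code))
```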

International DNA databases }  Genbank - NCBI

}  http://www.ncbi.nlm.nih.gov/

}  EMBL - EBI }  http://www.ebi.ac.uk/embl/

}  DDBJ - Japan }  http://www.ddbj.nig.ac.jp/

Completely sequenced genomes

Growth of sequences in GenBank
}  Entries: 356,178,346
}  Base pairs (bp): 342,299,616,138

NCBI
}  National Center for Biotechnology Information (NCBI) provides open access to resources
}  Literature (e.g. PubMed)
}  DNA & RNA (e.g. GenBank, Nucleotide Database)
}  Proteins (e.g. Protein)
}  Genes & Expression (e.g. Gene, Gene Expression Omnibus (GEO))
}  Genomes (e.g. Genome, a database of sequence and map data of over 1000 organisms; Genome Project, a collection of sequencing, assembly, annotation and mapping projects)
}  Variation (e.g. Database of Genotypes and Phenotypes, dbGaP)
}  SNP Database, etc.
}  ENTREZ search tool

www.ncbi.nlm.nih.gov

EBI
}  European Bioinformatics Institute (EBI), maintained by the European Molecular Biology Laboratory (EMBL) in Hinxton, Cambridge, UK.
}  EMBL nucleotide sequence database
}  UniProt – a resource of protein sequence and functional information
}  ArrayExpress – a public archive for functional genomics data
}  Ensembl – vertebrate and other eukaryotic genome database with the Genome Browser
}  InterPro – an integrated database of predictive protein signatures for classification and annotation of proteins and genomes
}  Integr8 – web portal of genomes and corresponding proteomes
}  PDBe – Protein Data Bank in Europe, a collection of PDB studies
}  Literature databases, e.g. MEDLINE, OMIM, Patent Abstracts.

http://www.ebi.ac.uk/

RefSeq (NCBI)
}  The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
}  RefSeq is a foundation for medical, functional, and diversity studies; it provides a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses.
}  July 21, 2010: RefSeq Release 42
}  Proteins: 10,640,515
}  Organisms: 10,728

ftp://ftp.ncbi.nih.gov/refseq/release/

RefSeq (NCBI)
}  Main features
}  Non-redundancy
}  Explicitly linked nucleotide and protein sequences
}  Updates to reflect current knowledge of sequence data and biology
}  Data validation and format consistency
}  Distinct accession series (all accessions include an underscore '_' character)
}  Ongoing curation by NCBI staff and collaborators, with reviewed records indicated

Accession format

Accession    Molecule type
AC_543210    Complete genomic
NG_543210    Incomplete genomic
NM_543210    mRNA
AP_543210    Protein
NT_543210    BAC and/or Whole Genome Shotgun sequence data
NW_543210    BAC or Whole Genome Shotgun sequence data
XM_543210    mRNA (contig)
XR_543210    RNA (genomic contig)
XP_543210    Protein products (genomic contigs)
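Because every RefSeq accession carries a two-letter prefix followed by an underscore, the molecule type can be read off the prefix. The sketch below covers only the prefixes listed in the table above. The regular expression (two capital letters, an underscore, digits, and an optional ".version" suffix) is an assumption made for this example, not an official specification.

```python
import re

# Prefix-to-molecule-type mapping, copied from the table above.
REFSEQ_PREFIXES = {
    "AC": "complete genomic",
    "NG": "incomplete genomic",
    "NM": "mRNA",
    "AP": "protein",
    "NT": "BAC and/or Whole Genome Shotgun sequence data",
    "NW": "BAC or Whole Genome Shotgun sequence data",
    "XM": "mRNA (contig)",
    "XR": "RNA (genomic contig)",
    "XP": "protein products (genomic contigs)",
}

# Assumed shape: two letters, underscore, digits, optional version number.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession: str):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None  # no underscore-style prefix, e.g. a GenBank accession
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_543210", "XP_543210.2", "U49845"):
        print(acc, "->", classify_accession(acc))
```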

Differences between GenBank and RefSeq

GenBank
• Archival database (publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects)
• Accession numbers are assigned to these submitted sequences
• Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ)
• Very redundant for some loci
• Sequence records are owned by the original submitter and cannot be altered by a third party

RefSeq
• Sequences are derived from GenBank and provide non-redundant curated data
• Records represent current knowledge
• Owned by NCBI (can be updated as needed to maintain current annotation or to incorporate additional sequence information)
• Records include additional sequence information (missing in the archival database but available in the literature)
• Sequence records are provided through collaboration (may not be available in any one GenBank record)
• Sequences are not submitted as primary sequences

ESTs
}  ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene.

Role of ESTs

Uses
• Predict coding regions
• Detect alternative splicing
• Clustering to form “genes” and gene families

Problems
• Low copy number genes
• Rare tissues
• Enrichment of 3’ ends of genes
• Incomplete coverage of genes (breaks the regions)

Genome sequencing

Entrez, the Life Sciences Search Engine: Global Query (NCBI)

}  The Entrez page is home to the Entrez Global Query database search engine (the Entrez cross-database search page).

}  The entire group of individual Entrez databases is organized on this page with literature databases at the top including PubMed, PubMed Central, Books, OMIM and OMIA. The NCBI Site Search is also listed. The sequence databases include Nucleotide, EST, GSS, Protein, Genome, Structure, and SNP. The remaining databases are Taxonomy, dbGaP, Genome Project, Gene, UniGene, HomoloGene, Conserved Domains (CDD), 3D Domains, dbVAR, UniSTS, PopSet, GENSAT, GEO Profiles, GEO Datasets, Peptidome, Protein Clusters, PubChem BioAssay, PubChem Compound, PubChem Substance, Cancer Chromosomes, Probe, SRA MeSH, Journals, MeSH, and NLM Catalog. Links to popular NCBI Web pages, such as PubMed, Human Genome, Map Viewer, and BLAST, are on the toolbar. There is also a link to the "GenBank" database, leading to an overview of GenBank. A list of the databases and brief descriptions of each follows the Global Query section below.

http://www.ncbi.nlm.nih.gov/Entrez/
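The slides describe Entrez through its web page. For scripted searches NCBI also offers the E-utilities web service, which is not covered in these slides and is mentioned here only as an aside. The sketch below just builds an ESearch request URL and performs no network access. The helper name esearch_url is invented for the example, and the query term and database name are illustrative.

```python
from urllib.parse import urlencode

# Base address of the ESearch service in NCBI's E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database.

    db     -- Entrez database name, e.g. "nucleotide", "protein", "pubmed"
    term   -- free-text or fielded query
    retmax -- maximum number of record identifiers to return
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    # Search the nucleotide database for the sample accession used later on.
    print(esearch_url("nucleotide", "U49845", retmax=5))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return an XML list of matching record identifiers, subject to NCBI's usage guidelines.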

Data reliability
}  The huge amount of data collected in databases presents a lot of problems
}  Data accuracy
}  Sequence redundancy
}  Inconsistent nomenclature
}  Inaccurate annotation
}  Sequence contamination (vectors, bacterial)
}  Database staff notify the authors about the error (or contamination)
}  It takes time to correct the data, and sometimes the error persists
}  Many sequences in the database are old and have not been updated since they were submitted

UniProt (EBI)
}  The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

http://www.uniprot.org/

UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). Across the three institutes close to 150 people are involved through different tasks such as database curation, software development and support.

Swiss-Prot
}  The 10-Aug-10 release of UniProtKB/Swiss-Prot contains 519,348 sequence entries, comprising 183,273,162 amino acids abstracted from 191,032 references.

Swiss-Prot Database (primary database)
}  Minimal redundancy
}  Description of protein function
}  Protein domain structure
}  Post-translational modifications
}  Protein variants
}  Annotation
}  Integration with other databases
}  Domains and sites
}  Secondary structure
}  Quaternary structure
}  Similarities to other proteins
}  Disease(s) associated with deficiency(s)
}  Sequence conflicts, variants
}  Annotations of families or groups of proteins are updated periodically

http://expasy.org/sprot/

TrEMBL / nr Protein
}  The TrEMBL database is a computer-annotated supplement to SWISS-PROT that contains all the EMBL (DNA) translations.
}  Its entries are those not yet integrated in SWISS-PROT.
}  The NR Protein database (primary databases from NCBI) contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL and DDBJ, as well as protein sequences submitted to PIR, SWISSPROT, PRF and PDB (sequences from solved structures).

Errors in protein sequence data
}  About 30% of proteins in the databases have erroneous sequences due to missing exons in the DNA translation
}  Translations of introns
}  Assigning functions to new proteins based on sequence similarity
}  E.g. sequence similarity

[Diagram: Protein A, Protein B, Protein C and Protein D linked by sequence similarity (=); labels: Error (initially); Corrected later; Error carried forward; Error carried forward]

The Protein Data Bank (PDB)
}  A repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids.
}  The data are typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists.
}  The PDB is a key resource in areas of structural biology, such as structural genomics.
}  There are hundreds of derived (i.e., secondary) databases that categorize the data differently.
}  SCOP and CATH categorize structures according to type of structure and assumed evolutionary relations;
}  GO categorizes structures based on genes.

http://www.pdb.org/

67656 Structures (Tuesday Aug 31, 2010)

Pfam 24.0 (Oct 2009, 11912 families)
}  Proteins are generally comprised of one or more functional regions, commonly termed domains. The presence of different domains in varying combinations in different proteins gives rise to the diverse repertoire of proteins found in nature. Identifying the domains present in a protein can provide insights into the function of that protein.
}  The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
}  Pfam families
}  Pfam-A entries are derived from the underlying sequence database, known as Pfamseq, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases.
}  Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.

http://pfam.janelia.org/

Domain organisation

Pfam entries are classified in one of four ways:
}  Family: A collection of related proteins
}  Domain: A structural unit which can be found in multiple protein contexts
}  Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
}  Motif: A short unit found outside globular domains

}  Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.

Gene Ontology
}  The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases.
}  Its aims are to maintain a controlled vocabulary of gene and gene product attributes; to annotate genes and gene products, and assimilate and disseminate annotation data; and to provide tools to facilitate access to all aspects of the data provided by the Gene Ontology project.
}  The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.
}  The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.

www.geneontology.org

Database formats (sequences) }  Many formats

}  Fasta format }  Genbank format }  EMBL format }  GCG formats }  Clustal formats

Fasta formats
}  Simplest format
}  Limited information
}  Unique names

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Header line (starts with ">", fewer than 80 characters): identifier and description of the sequence

MultiFASTA

http://en.wikipedia.org/wiki/FASTA_format
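Since a FASTA file is just header lines starting with ">" followed by sequence lines, a parser fits in a few lines. This is a minimal sketch with an invented function name. It ignores blank lines and any text before the first header, and it does not validate the sequence alphabet. The example input reuses the calmodulin record and the first line of the cytochrome b record from the slide.

```python
def parse_fasta(text: str) -> list:
    """Parse (multi-)FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE):
        print(name.split()[0], len(seq))
```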

GenBank format

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in S. cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
                     FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
                     KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
                     RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
                     LISGDDKILNGVYSQYEEGESIFGSLF"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
          ...... ...... ...... ...... ...... ...... ...... ...... ......
     4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
     4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
     4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
     4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
//

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
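Two pieces of the GenBank record above are easy to read mechanically: the LOCUS line and simple feature locations such as complement(3300..4037). The sketch below handles only the shapes that appear in this sample record. The function names are invented, and the field layout assumed for the LOCUS line (seven whitespace-separated fields) does not cover every real record, some of which carry extra fields. Compound locations such as join(...) are not handled.

```python
import re


def parse_locus(line: str) -> dict:
    """Split a LOCUS line shaped like the one in the sample record."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


# Simple locations only: optional complement(...), optional '<' or '>' markers.
LOCATION_RE = re.compile(r"^(complement\()?(<?)(\d+)\.\.(>?)(\d+)\)?$")


def parse_location(location: str) -> dict:
    """Parse locations such as '687..3158', '<1..206', 'complement(3300..4037)'."""
    match = LOCATION_RE.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return {
        "start": int(match.group(3)),
        "end": int(match.group(5)),
        "strand": -1 if match.group(1) else 1,
        "partial": bool(match.group(2) or match.group(4)),
    }


if __name__ == "__main__":
    locus = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
    print(parse_locus(locus))
    for loc in ("<1..206", "687..3158", "complement(3300..4037)"):
        print(loc, parse_location(loc))
```

In this record the division field is PLN, the code listed earlier for plant, fungal and algal sequences, which fits a yeast sequence.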

More formats
}  PDB
}  http://www.wwpdb.org/docs.html
}  http://en.wikipedia.org/wiki/Protein_Data_Bank_%28file_format%29

}  Assignments
}  http://courses.cs.ut.ee/2011/Bioinformatics
