bioinformatics – biological databases

©M. Thollesson, 2001

Bioinformatics – Biological databases

Mikael Thollesson

Evolutionary Biology Centre and Linnaeus Centre for Bioinformatics, Uppsala University


What is a database?

• The data itself (i.e., information)

• The organisation of the data - Database structure
  – Flat-file databases
    • Mark-up and tags
  – Relational databases – records and fields
  – Object oriented databases

• Database Management System – DBMS
  – Queries and retrievals

• Interfaces
  – User interfaces, e.g. web pages or dedicated clients
  – Computer interfaces
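
The flat-file idea (fields marked up with tags) can be illustrated with a minimal Python sketch. The two-letter tags are loosely modelled on the line codes used in EMBL-style entries, but the record content and the helper name `parse_record` are invented for illustration; real entries have many more line types and stricter column rules.

```python
# Toy flat-file record: each line starts with a tag naming the field,
# and "//" marks the end of the record. Content is hypothetical.
RECORD = """\
ID  X00001
DE  example sequence entry
OS  Homo sapiens
SQ  acgtacgtac
SQ  ggtacc
//
"""


def parse_record(text):
    """Return a dict mapping each tag to its (joined) content."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        tag, _, value = line.partition("  ")
        # A tag may occur on several lines; collect all of them.
        fields.setdefault(tag, []).append(value.strip())
    return {tag: " ".join(values) for tag, values in fields.items()}


entry = parse_record(RECORD)
print(entry["ID"], "|", entry["OS"])
```

The structure lives entirely in the text itself, which is what distinguishes a flat file from a relational table with declared fields.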


Relational databases

• Consist of tables with homogeneous content, where each table contains records (items) and each record has one or several fields (properties)

Name               Dept.  Room
Mikael Thollesson  EBC    1006
Siv Andersson      EBC    1100


Relational databases

• Records in different tables are related by key fields

• Contents from different tables are brought together using these key values

Name               Dept.  Room
Mikael Thollesson  EBC    1006
Siv Andersson      EBC    1100

Dept.  Visiting address
EBC    Norbyvägen 18C
CTH    Ashebergsgatan 15
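
Bringing the two tables together via the key field can be sketched with Python's built-in `sqlite3` module. The table and column names (`person`, `department`, `dept`, `visiting_address`) are assumptions made for this example, and the address rows follow one reading of the flattened slide table.

```python
import sqlite3

# In-memory database holding the two example tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (
        dept TEXT PRIMARY KEY,          -- key field
        visiting_address TEXT
    );
    CREATE TABLE person (
        name TEXT,
        dept TEXT REFERENCES department(dept),
        room TEXT
    );
    INSERT INTO department VALUES
        ('EBC', 'Norbyvägen 18C'),
        ('CTH', 'Ashebergsgatan 15');
    INSERT INTO person VALUES
        ('Mikael Thollesson', 'EBC', '1006'),
        ('Siv Andersson', 'EBC', '1100');
""")

# The join relates records in the two tables through the shared key value.
rows = conn.execute("""
    SELECT person.name, person.room, department.visiting_address
    FROM person
    JOIN department ON person.dept = department.dept
    ORDER BY person.name
""").fetchall()

for row in rows:
    print(row)
```

The address is stored once per department rather than once per person, so a change of address only has to be made in one record.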


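[Figure: a Client exchanges http/https requests and html with a Server. The server runs an OS (e.g. Linux), a WWW-server (e.g. Apache), interface code (e.g. Perl or PHP) and a DBMS (e.g. mySQL); the interface code sends an SQL query to the DBMS and receives an SQL reply.]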


One view of Bioinformatics

[Figure: diagram with the following labels.
Databases: Sequence databases, Structure databases, Genome databases, Expression databases, Metabolic databases, Phylogenetic databases, Literature databases.
Analyses: BLAST, Pairwise/Multiple alignment, Contig assembly, Phylogenetic inference, Predictions on DNA, Predictions on proteins.
Other labels: Gene; Function, localisation; Phylogenies; Expression patterns; Regulatory mechanism; ?]


Sequence databases

• Nucleic acid sequence databases
  – Contain primary nucleotide sequence data
  – Repositories, i.e. the content of these databases is not curated

• Protein sequence databases
  – Contain secondary and primary protein sequence data
  – Some are curated, others are just extracts from other databases

• Several kinds of interfaces/search engines are available to retrieve data, e.g. SRS (Sequence Retrieval System) and the Entrez browser


Nucleotide sequence repositories

• Three primary centres, which exchange information on a daily basis
  – EMBL / European Molecular Biology Laboratory
  – DDBJ / DNA Data Bank of Japan
  – GenBank

• All three adhere to the DDBJ/EMBL/GenBank Feature Table Definition – http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html, i.e. the content of each record is the same for these databases


EMBL – European Molecular Biology Laboratory

• Europe’s primary nucleotide sequence resource

• Established 1980 in Heidelberg by EMBL, now maintained by EBI (European Bioinformatics Institute) in Cambridge, UK

• Main sources of sequences are direct submissions from individual researchers, genome projects and patent applications

• Contains two main parts
  – A release section (embl_rel) that is issued every three months
  – A new section (embl_new) where new sequences are added daily

• Also split into divisions depending on the origin of the sequence

• http://www.ebi.ac.uk/embl/Access/index.html

• Entries have a format that differs from GenBank and DDBJ


EMBL divisions

Division             Entries      Nucleotides
ESTs               3,693,981    1,421,971,352
Fungi                 37,369       67,929,603
GSSs               1,263,218      648,288,952
HTG                   20,658    2,377,710,044
Human                109,289      689,777,485
Invertebrates         47,169      194,132,983
Other Mammals         23,412       22,195,795
Organelles            60,335       50,831,921
Patents              201,151       64,516,716
Bacteriophage          1,525        3,949,359
Plants                56,133      163,347,453
Prokaryotes           73,966      174,930,528
Rodents               51,595       79,840,999
STSs                 112,145       49,289,439
Synthetic              3,574        8,655,224
Unclassified           1,221        1,842,760
Viruses               87,844       77,168,641
Other Vertebrates     21,157       24,529,423
Total              5,865,742    6,120,908,677


DDBJ – DNA Data Bank of Japan

• Mainly collects data from Japanese activities (but accepts submissions from any researcher in any country)

• Began DNA repository activities in 1986, endorsed by the Ministry of Education, Science, Sports, and Culture

• http://www.ddbj.nig.ac.jp/

• Entries have the same format as GenBank


GenBank

• US primary nucleotide sequence resource

• Established in 1982

• Maintained by National Center for Biotechnology Information (NCBI), Bethesda, MD

• Contains a release section and a new section, as EMBL does

• http://www.ncbi.nlm.nih.gov/

• Entries have a format that is different from EMBL


EST databases

• Expressed Sequence Tags (ESTs) are short sequences from mRNA

• ESTs are useful to get a handle on expressed genes

• dbEST
  – Is a division of GenBank containing ESTs from a number of organisms

• UniGene
  – A non-redundant set of gene-oriented clusters
  – Contains numerous novel ESTs, but also “proper” sequences
  – Presently Homo, Rattus, and Mus have been processed


Protein databases

• SWISS-PROT and TREMBL
  – Developed by Swiss Institute of Bioinformatics (SIB) and European Bioinformatics Institute (EBI)

• PIR-PSD
  – A collaboration between National Biomedical Research Foundation (NBRF), Munich Information Center for Protein Sequences (MIPS) and Japan International Protein Information Database (JIPID)


Protein databases I

• SWISS-PROT (86000 entries June 2000)
  – Is a curated protein sequence database
  – Aims to provide a high level of annotations (e.g., function, domain structure, post-translational modifications)
  – Divided into Swissprot_rel and Swissprot_new
  – Not divided into sections based on species

• TREMBL (ca 300 000 entries June 2000)
  – Contains translated sequences from the EMBL database
  – Divided into
    • SP-TREMBL with sequences that are candidates for incorporation into SWISS-PROT
    • REM-TREMBL that will not be incorporated into SWISS-PROT
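
What "translated sequences" means can be shown with a minimal Python sketch that translates a nucleotide sequence in a single reading frame using the standard genetic code. This only illustrates the principle and is not how TREMBL is actually produced (which works from annotated coding regions in EMBL entries); the function name `translate` and the input sequence are invented for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at position 0.

    Unknown codons (e.g. containing N) become 'X'; a trailing
    incomplete codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGA"))  # prints MAIVMGR*
```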


Protein databases II

• Protein Information Resource - Protein Sequence Database (PIR-PSD) is similar to SWISS-PROT in its aims

• PIR’s stated goal is “to provide a comprehensive, non-redundant, classified, well-annotated, and freely available, protein sequence database, in which entries are classified into family groups and alignments of each group are available”

• Also produces a computer-generated supplemental database of translations, PATCHX, similar to TrEMBL, with sequences not yet incorporated

• New entries in batches from genome sequencing projects or from selected GenBank/EMBL entries

• The PIR database is in constant flux as the level of annotation on entries increases and new entries with minimal annotation are added

• PIR-PSD database is growing at a higher rate than the SWISS-PROT, but has a lower level of annotation per entry.

• The PIR-PSD consists of four sections:
  – PIR1. Fully Classified Entries
  – PIR2. Verified and Classified Entries
  – PIR3. Unverified Entries
  – PIR4. Un-encoded or Un-translated Entries


Interfaces to public databases

Several different databases are usually accessible through the same WWW interface.

For example, the databases below are accessible via the National Institutes of Health/National Center for Biotechnology Information (NIH/NCBI) (http://www.ncbi.nlm.nih.gov/Database/)

• OMIM
• PubMed
• Full-text electronic journals
• 3D structures
• Taxonomy
• Protein sequences
• Nucleotide sequences
• Maps & Genomes


Genome databases

• Differ from sequence databases by being more heterogeneous and diverse

• A genome database organises all information on an organism’s genome, such as
  – Genetic mapping
    • Maps how genes are located relative to each other, with distances measured as percentage recombination
  – Physical mapping
    • Ranges from cytogenetic maps (banding patterns of chromosomes) to the positions of clone contigs
  – Sequence data
    • Nucleotide sequences are (usually) deposited at the nucleotide sequence repositories even before the genome sequencing is finished

• Entry points to genome databases are e.g.
  – Genome Net – http://www.genome.ad.jp/
  – NCBI’s genome section – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
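
The genetic-map distance mentioned above, percentage recombination, is a simple ratio; a minimal Python sketch follows. The function name and the offspring counts are hypothetical, and the sketch makes no correction for double crossovers.

```python
def recombination_percent(recombinant, total):
    """Percentage of offspring that are recombinant for two loci."""
    if total <= 0:
        raise ValueError("total number of offspring must be positive")
    if not 0 <= recombinant <= total:
        raise ValueError("recombinant count must lie between 0 and total")
    return 100.0 * recombinant / total


# Hypothetical cross: 12 recombinants among 200 scored offspring.
print(recombination_percent(12, 200))  # prints 6.0
```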


Protein structures

n-IVTAHAFVMI-c

• Primary structure; the order of amino acids

• Secondary structure; conformations, mainly alpha helices and beta sheets

• Tertiary structure; the complete three dimensional folding of the polypeptide

• Quaternary structure; exists if the protein is composed of two or more polypeptide chains


Structural databases

• Contain information on the three-dimensional structure of molecules, chiefly proteins

• Data is primarily based on x-ray crystallography (>80%), NMR, or theoretical models (<2%)

• Examples of such databases
  – Protein databank (PDB) - http://www.rcsb.org/pdb/
  – Molecular Modelling Database (MMDB) - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure


Metabolic databases

• All metabolic databases use EC-numbers, which are a combination of four numbers that classify the type of reaction the enzyme catalyses

• Example: EC 1.2.3.4 is an oxidoreductase (1) that acts on aldehyde or oxo groups (1.2) with oxygen as acceptor (1.2.3). The last number, 4, is an ordinal number within the class

• Pros and cons
  + EC provides a unique identifier
  + Enables a synonym dictionary
  - Many classes of enzymes are not covered in sufficient detail, especially proteases and nucleases with macromolecules as substrates
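
The hierarchical reading of an EC number described above can be sketched in Python. The helper names (`split_ec`, `ec_levels`, `main_class`) are invented for this example; the main class names are the standard top-level EC classes, of which class 7 was only added in 2018, well after these slides.

```python
# Top-level EC classes. Class 7 (translocases) was introduced in 2018.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def split_ec(ec):
    """Split 'EC 1.2.3.4' (or '1.2.3.4') into four integers."""
    parts = ec.upper().replace("EC", "").strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete EC number: {ec!r}")
    return [int(p) for p in parts]


def ec_levels(ec):
    """Return the successively more specific prefixes of an EC number."""
    numbers = split_ec(ec)
    return [".".join(str(n) for n in numbers[:i]) for i in range(1, 5)]


def main_class(ec):
    """Name of the top-level class given by the first number."""
    return EC_CLASSES[split_ec(ec)[0]]


print(main_class("EC 1.2.3.4"), ec_levels("EC 1.2.3.4"))
```

Because each prefix names a broader class, enzymes can be grouped at any of the four levels simply by comparing prefixes.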


Metabolic databases

• Describe enzymes, reactions, substrates, products, and biochemical reactions

• Data include organism-specific information (“type organisms”) as well as general overviews and links to sequence and structure databases

• Example
  – Kyoto Encyclopedia of Genes and Genomes – http://www.genome.ad.jp/kegg/


Phylogenetic databases

• Primary (repositories) and secondary (data analysis and interpretation) databases

• Primary databases contain information on the result of phylogenetic analyses (trees, taxonomic names), data, and assumptions on which the analyses are based

• Secondary databases contain interpretations and assembled phylogenetic hypotheses for all kinds of taxa

• Examples
  – TreeBase – http://www.herbaria.harvard.edu/treebase/index.html (Primary)
  – Tree of Life – http://phylogeny.arizona.edu/tree/ (Secondary)


Expression databases

• Functional genomics
  – DNA arrays (cDNA probes on a chip) are used to assess the RNA levels of different genes (several hundred at a time)
  – Measurements are taken at intervals after some treatment is initiated
  – Genes are grouped in clusters according to expression profile
  – Reverse engineering of expression levels of these groups is used to propose regulatory genetic networks

• No unified format for DNA-chip data yet, although work is in progress

• Examples of gene expression databases are
  – EBI ArrayExpress database – http://www.ebi.ac.uk/arrayexpress/
  – KEGG Expression Database – http://www.genome.ad.jp/kegg/expression/
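
Grouping genes by expression profile can be illustrated with a small Python sketch that assigns each gene to the first cluster whose seed gene it correlates with. The gene names, the measurements and the 0.9 threshold are all invented, and this greedy scheme is only one simple possibility; real analyses typically use hierarchical or k-means clustering on many more genes.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles.

    Assumes neither profile is constant (otherwise division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster(profiles, threshold=0.9):
    """Greedy clustering: join the first cluster whose seed correlates well."""
    clusters = []
    for gene, profile in profiles.items():
        for members in clusters:
            seed = profiles[members[0]]
            if pearson(profile, seed) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])  # no match: start a new cluster
    return clusters


# Hypothetical expression levels at four time points after treatment.
PROFILES = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [0.5, 1.0, 2.0, 4.0],
    "geneC": [8.0, 4.0, 2.0, 1.0],
    "geneD": [9.0, 5.0, 2.5, 1.5],
}

print(cluster(PROFILES))
```

With these toy data the rising profiles (geneA, geneB) and the falling profiles (geneC, geneD) end up in separate clusters.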
