
Post on 29-May-2020


Page 1: Introduction to biological databases · Why biological databases (db) ? • Exponential growth in biological data • Data are no longer published in a conventional manner, but directly

Introduction to biological databases



What is 'Bioinformatics'?

Bioinformatics is the application of computer science to biology

… an interdisciplinary science

… strives to solve the problems of the life sciences with theoretical, computer-assisted methods

… indispensable for modern biology and medicine

… uses techniques such as applied mathematics and statistics


Some major research areas in bioinformatics

• Sequence analysis and function prediction

• Analysis and prediction of protein structure

• Computational evolutionary biology

• Comparative genomics

• Gene and protein expression

• Protein-Protein Interaction (PPI) analysis

• Systems biology

• Image analysis

• Visualization


Indispensable for bioinformatic studies:

1. Databases

2. Software tools

3. Servers

Introduction


Outline

• Introduction

• Selected categories of life sciences databases

1. Nucleotide sequences

2. Genomics

3. Mutation/polymorphism

4. Protein sequences

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

8. Metabolism/Pathways

9. Bibliography

10. Others

• Concluding remarks

• Practicals


Introduction

What is a database (db) ?

• A collection of related data, which are:

– structured

– searchable (index) -> table of contents

– updated periodically (release) -> new edition

– cross-referenced (hyperlinks) -> links with other db

• Also includes the associated tools (software) necessary for db access, db updating, and insertion or deletion of db information

• Data storage formats: flat files (text, FASTA), relational databases, XML, RDF
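The properties listed above can be sketched in a few lines of Python. This is only an illustration: the accession numbers, field names and cross-reference values are made up and do not come from any real database release.

```python
# Hypothetical records: accession numbers and field names are made up for illustration.
records = [
    {"ac": "X00001", "organism": "Homo sapiens", "xrefs": {"PubMed": "0000001"}},
    {"ac": "X00002", "organism": "Mus musculus", "xrefs": {}},
]

# The index plays the role of a table of contents: accession -> record.
index = {rec["ac"]: rec for rec in records}


def lookup(ac):
    """Return the record for an accession number, or None if it is absent."""
    return index.get(ac)


print(lookup("X00001")["organism"])
```

Each record is structured (fixed fields), the dictionary makes it searchable by accession number, and the `xrefs` field stands in for hyperlinks to other databases.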


Introduction

Why biological databases (db) ?

• Exponential growth in biological data

• Data are no longer published in a conventional manner, but directly submitted to databases (nucleotide and amino acid sequences, 3D structures, 2D gel analyses, MS analyses, microarrays, publications, protein-protein interactions, …)

• Essential tools for biological research


P. Gaudet, 'A community of Biocurators'


Science cover, February 2011


Some statistics and remarks

• More than 1000 different "biological" databases

• Variable size: <100Kb to >100Gb (ENA > 728Gb !)

– DNA: > 100 Gb

– Protein: 2 Gb

– 3D structure: 5 Gb

– Other: smaller

• Update frequency: daily to annually

• Generally accessible through the web (free!?)


Where can we find…

• a video -> YouTube

• info on S. Hawking -> Wikipedia

• a book -> Amazon

• a friend -> Facebook, Google Plus

• DNA sequence -> EMBL

• protein sequence -> UniProtKB, RefSeq

• 3D data -> PDB

• Microarray data -> ArrayExpress, GEO

• Publications -> PubMed


10 most important bioinformatics databases *

* according to "Bioinformatics For Dummies"

Name        URL                          Data type
GenBank     www.ncbi.nlm.nih.gov         Nucleotide sequences
Ensembl     www.ensembl.org              Genomes
PubMed      www.ncbi.nlm.nih.gov         Literature references
NCBI nr     www.ncbi.nlm.nih.gov         Protein sequences
UniProtKB   www.uniprot.org              Protein sequences
InterPro    www.ebi.ac.uk                Protein domains
OMIM        www.omim.org/                Genetic diseases
Enzymes     http://enzyme.expasy.org/    Enzymes
PDB         www.rcsb.org/pdb/            Protein structures
KEGG        www.genome.ad.jp             Metabolic pathways


Databases / Servers

• A server is a computer (from a given institute) that provides services (stores databases and associated tools) to other computers

• Main biological servers:

– ExPASy (www.expasy.org/)

– UniProt (www.uniprot.org)

– NCBI (www.ncbi.nlm.nih.gov/)

– EBI (www.ebi.ac.uk/)

– Japanese GenomeNet (www.genome.jp/)

• Not all servers give access to the same databases and to the same search tools! When servers give access to the same databases, the 'look' is different, and beware the date of the latest release!


The same data on different servers: UniProt, NCBI


How to find a database ?

• The Nucleic Acids Research (NAR) Online Molecular Biology Database Collection 2011: a total of 1,330 databases

http://www.oxfordjournals.org/nar/database/a/

• Expasy Life Science Directory: http://www.expasy.org/links.html (no longer updated)

• Google: http://www.google.com/


http://www.expasy.org/links.html Expasy Life Science Directory


Awareness of the content and usage of knowledge resources is a pre-requisite to do any type of "serious" research in the field of molecular life sciences

(Amos Bairoch, 2007)


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db


Deluge of sequence data

• ~3,200 genomes sequenced (single organism, varying sizes, including viruses)

• ~5,000 ongoing genome sequencing projects

• cDNA sequencing projects (ESTs or cDNAs)

• metagenome sequencing projects (~300) (environmental samples: multiple ‘unknown’ organisms, varying sizes)

– Ecological metagenomics: beach sand, Sargasso Sea, New York air, …

– Organismal metagenomics: human fluids, mouse gut, …

• Personal human genomes


Deluge of sequence data

• Personal human genomes!

http://www.youtube.com/watch?v=mVZI7NBgcWM


Deluge of sequence data

But… we now know that his apoE allele is the one associated with increased risk for Alzheimer's disease and that he has the 'blue eye' allele…


Ensembl genome browser


Deluge of sequence data

http://www.personalgenomes.org


DNA sequence of the telomeric region of human chromosome X


1. Nucleotide sequences db

• The main DNA sequence db are: EMBL/ENA (Europe), GenBank (USA) and DDBJ (Japan) -> INSDC collaboration

• There are also specialized databases for the different types of RNA (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)

• Others: Eukaryotic Promoter Database (EPD), RNA editing sites, ...


1. EMBL-ENA/GenBank/DDBJ http://www.insdc.org/

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.


1. Same data on different servers

EBI (EMBL/ENA), NCBI (GenBank), NIG (DDBJ)


1. EMBL-ENA/GenBank/DDBJ

• Serve as archives: 'nothing goes out'

• Contain all public sequences derived from:

– Genome projects (> 80 % of entries)

– Sequencing centers (cDNAs, ESTs…)

– Individual scientists (15 % of entries)

– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~150 × 10^6 sequences, ~200 × 10^9 bp

• Sequences from > 500,000 different species
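As a quick sanity check on the two totals quoted on this slide, the average entry length they imply can be worked out directly. The calculation below uses only those two approximate figures, so the result is an order-of-magnitude estimate.

```python
n_sequences = 150 * 10**6   # ~150 x 10^6 sequences (figure quoted on the slide)
n_bases = 200 * 10**9       # ~200 x 10^9 bp (figure quoted on the slide)

# Average entry length implied by the two totals.
avg_len = n_bases / n_sequences
print(round(avg_len), "bp per entry on average")
```

An average of roughly 1,300 bp hides a very wide spread, from short fragments to large genomic entries.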


1. Ideal content of a "sequence" db

• Sequences !!

• Unique Accession number (AC)

• References

• Taxonomic data

• ANNOTATION/CURATION

• Keywords

• Cross-references

• Documentation

Minimal requirements !


1. EMBL-ENA entry

Cross-references

accession number

taxonomy

references


1. EMBL-ENA entry (cont.)

Annotation (predicted or experimentally determined)

sequence

CDS: Coding Sequence (proposed by submitters)
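The parts of an entry named on these two slides (accession number, taxonomy, annotation, sequence) are marked in the EMBL flat-file format by two-letter line codes such as ID, AC, OS, FT and SQ, with `//` closing the entry. Below is a minimal sketch of reading those codes; the entry itself is made up for illustration and is not a real ENA record.

```python
# A made-up entry in EMBL flat-file style (contents are illustrative only).
entry = """\
ID   X00001; SV 1; linear; mRNA; STD; HUM; 12 BP.
AC   X00001;
OS   Homo sapiens (human)
FT   CDS             1..12
SQ   Sequence 12 BP;
     atggcttaat ga                                                     12
//
"""


def fields_by_code(text):
    """Group line contents by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        code = line[:2]
        # Sequence lines start with blanks and '//' ends the entry: skip both.
        if code.strip() and code != "//":
            fields.setdefault(code, []).append(line[5:].strip())
    return fields


fields = fields_by_code(entry)
print(fields["AC"][0].rstrip(";"))
```

A real parser would also handle the sequence block and multi-line fields; an established library would be the safer choice for production use.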


1. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> EMBL/ENA, GenBank, DDBJ (with or without annotated CDS provided by the authors)

Data not submitted to public databases: delayed or cancelled…

CDS (coding sequence): portion of DNA/RNA translated into protein (from Met to 'STOP'), experimentally proved or derived from gene prediction

Not so well documented!
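The definition of a CDS (translated from Met to 'STOP') can be illustrated with a short translation routine. This is a simplified sketch: it assumes the standard genetic code, starts at the first ATG it finds, ignores introns and alternative start codons, and the input sequence is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(dna):
    """Translate from the first ATG (Met) to the first in-frame stop codon."""
    dna = dna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ccATGGCTTGGTAAgg"))
```

Gene prediction pipelines do far more than this (splice sites, frame selection, evidence from homologs), which is one reason predicted CDS vary in quality.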


Coding Sequence (CDS): Alignments between a mRNA and a genomic sequence


1. EMBL-ENA vs GenBank format


1. FASTA format
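In FASTA format each record starts with a header line beginning with `>`, followed by one or more lines of sequence. A minimal reader might look like the sketch below; the two records are invented for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example
ATGGCT
TGGTAA
>seq2
ACGT
"""
print(parse_fasta(example))
```

The format carries almost no annotation beyond the header line, which is why it is convenient for tools but poor as an archive format.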


1. EMBL-ENA/GenBank/DDBJ

• Heterogeneous sequence length and quality:

– ESTs, genomes, variants, fragments…

• Sequence sizes:

– max 350,000 bp/entry (! genomic sequences, overlapping)

– min 10 bp/entry

• Archive: nothing goes out -> highly redundant!

• Full of errors: in sequences, in annotations, in CDS attribution; no consistency of annotations:

– most annotations are done by the submitters;

– heterogeneous quality, completeness and updating of the information


1. EMBL-ENA/GenBank/DDBJ

• Unexpected information you can find in these db:

FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban cahibo cigar, gift from
FT President Fidel Castro"

• Or:

FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
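The `/key="value"` qualifiers in such feature-table (FT) lines are easy to pull out programmatically. The sketch below uses the tobacco example from this slide; the column spacing is a reconstruction (the transcript lost the original whitespace), and the regular expression is a simplification that only handles quoted values.

```python
import re

# First example from the slide; column alignment is reconstructed, not original.
ft_lines = '''FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from
FT                   President Fidel Castro"'''


def parse_qualifiers(lines):
    """Collect /key="value" qualifiers, joining continuation lines."""
    # Drop the 'FT' prefix and rejoin wrapped values with single spaces.
    body = " ".join(line[2:].strip() for line in lines.splitlines())
    return dict(re.findall(r'/(\w+)="([^"]*)"', body))


quals = parse_qualifiers(ft_lines)
print(quals["organism"], "|", quals["isolate"])
```

Because qualifier values are free text entered by submitters, anything downstream of such a parser has to expect surprises like the ones shown on this slide.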


1. Other nucleotide sequences databases

http://www.rnaiweb.com/RNAi/RNAi_Web_Resources/siRNA_Collections___Databases/


1. Other nucleotide sequences databases

• EPD is a rigorously selected database. In order to be included in EPD, a promoter must be:

– recognized by eukaryotic RNA polymerase II,

– active in a higher eukaryote,

– experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter,

– biologically functional,

– available in the current ENA release,

– distinct from other promoters in the database.

http://www.epd.isb-sib.ch/


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics


2. 'Genomics databases'

• Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually contain no sequence!

• Exist for most model organisms; usually species-specific.

• Examples: MIM (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), TAIR (Arabidopsis), etc.


2. TAIR

http://www.arabidopsis.org/


2. OMIM: Online Mendelian Inheritance in Man

• ~20,300 human protein-coding genes

• 2,850 protein-coding genes with mutations causing human disorders

• ~1,800 more to be discovered

• ~1,100 loci affecting more than 165 polygenic diseases have been identified (PMID:21307931)


2. OMIM: Online Mendelian Inheritance in Man

http://www.omim.org/


OMIM: 'gene' entry


OMIM: 'disease' entry


2. Genome browser: Ensembl

• Ensembl provides a bioinformatics framework to organize biology around the sequences of large genomes.

http://www.ensembl.org/


Ensembl genome browser


Genome browser: UCSC

http://genome.ucsc.edu/cgi-bin/hgGateway


A eukaryotic gene (UCSC)

5' untranslated region

Initial exon

Final exon

Introns

Internal exons

5’ 3’

Stop Met


Genome browser: UCSC

http://genome.ucsc.edu/cgi-bin/hgGateway


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism


3. Mutation/polymorphism

Single nucleotide polymorphisms (SNPs) are unique genetic differences between individuals that contribute in significant ways to the determination of human variation including physical characteristics like height and appearance as well as less obvious traits such as personality, behaviour and disease susceptibility. SNPs can also significantly influence responses to pharmacotherapy and whether drugs will produce adverse reactions.

SNP Technologies for Drug Discovery: A Current Review. DOI: 10.2174/157016308785739811

Each human genome contains ~3,000,000 single nucleotide polymorphism (SNP) variants (1 per 1,000 bp).
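The two numbers on this slide are consistent with each other: about three million variants at a density of one per thousand base pairs implies a genome of about three billion base pairs. A short check, using only the slide's rounded figures:

```python
snps_per_genome = 3_000_000   # ~3,000,000 SNP variants (figure from the slide)
bp_per_snp = 1_000            # one variant per 1,000 bp (figure from the slide)

# Genome size implied by the two figures.
implied_genome_bp = snps_per_genome * bp_per_snp
print(f"{implied_genome_bp:,} bp")
```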


S.E. Antonarakis


3. Mutation/polymorphism db

• Contain information on sequence variations, whether or not they are linked to genetic diseases;

• General db:

– dbSNP - Human single nucleotide polymorphism (SNP) db (variants with frequency > 1 %; !!! a disease mutation is rare -> dbSNP contains few 'disease-linked mutations')

• Disease-specific db: most of these databases are linked either to a single gene or to a single disease;

– p53 mutation db

– ADB - Albinism db (mutations in human genes causing albinism)

– Asthma and Allergy gene db

– …


http://www.ncbi.nlm.nih.gov/SNP/


Blue eye allele… dbSNP: rs12913832 -> link to the ALFRED database

Blue eyes, brown eyes


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db


4. Protein sequences – Eukaryotic cell

Cell elemental composition

Cells are made of 90% water.

The remainder is approximately:

• 50% protein (3.5 kg)

• 15% carbohydrate

• 15% nucleic acid (1.3 kg)

• 10% lipid

• 10% miscellaneous


Amino acid sequence (1-letter code) of human titin


4. Protein sequence origin

• About 180 billion proteins (?)

• > 15 million 'known' protein sequences in 2011

• More than 99 % of the protein sequences are derived from the translation of nucleotide sequences

• Less than 1 % come from direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where the protein sequence comes from (sequencing & gene prediction quality)!


http://www.nature.com/news/2010/100922/full/467380a.html

(US$30 million per year)


4. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> nucleic acid databases (ENA, GenBank, DDBJ)

Data not submitted to public databases: delayed or cancelled…

Nucleic acid databases -> protein sequence databases …if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)

No CDS -> gene prediction (RefSeq, Ensembl)


4. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ

Data not submitted to public databases: delayed or cancelled…

Diagram labels, protein sequence resources: TrEMBL, GenPept, RefSeq, PRF, Swiss-Prot, UniProtKB, Ensembl, CCDS, UniParc, PDB, (PIR), (IPI), UniMES, + all 'species'-specific databases (EcoGene, TAIR, …)

Diagram labels, sequence sources: coding sequences provided by submitters; coding sequences provided by submitters and gene prediction; sequences derived from scientific publications


4. Major protein sequence db 'sources'

1. UniProtKB: Swiss-Prot + TrEMBL

2. NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (12,500 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (380,000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (380,000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (16,000 species)

PIR PDB PRF Integrated resources

'cross-references'

Separated resources


Look for toll-like receptor 4 (Homo sapiens) at www.uniprot.org: results from Swiss-Prot and TrEMBL


Look for toll-like receptor 4 (Homo sapiens) at http://www.ncbi.nlm.nih.gov/: results from GenPept (several entries), Swiss-Prot and RefSeq


4. UniProt - The Universal Protein Resource is maintained by the UniProt consortium: SIB + EBI + PIR

http://www.uniprot.org/

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112.


4. UniProtKB: from ENA to TrEMBL

ENA (DNA): translated CDS, product name, tissue, reference

TrEMBL: translated CDS, protein name, reference + tissue

Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.


UniProtKB/TrEMBL

Automatic annotation

Protein sequence

- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).

- 100% identical sequences (same length, same organism) are merged automatically.

Biological information: sources of annotation

- Provided by the submitter (EMBL, PDB, TAIR…)

- From automated annotation: automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule)
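
The automatic merging rule (identical sequence, same organism) can be illustrated with a minimal sketch. The tuple layout, the accessions and the sequences below are hypothetical and only serve the example; the actual TrEMBL pipeline is not described in these slides.

```python
from collections import OrderedDict


def merge_identical(entries):
    """Merge entries whose organism and full sequence are identical.

    `entries` is a list of (accession, organism, sequence) tuples
    (a hypothetical layout chosen for this sketch).
    """
    merged = OrderedDict()
    for accession, organism, sequence in entries:
        # Same organism + 100% identical sequence (hence same length) -> same key
        key = (organism, sequence)
        merged.setdefault(key, []).append(accession)
    return [
        {"organism": org, "sequence": seq, "sources": accs}
        for (org, seq), accs in merged.items()
    ]


entries = [
    ("CDS_A", "Homo sapiens", "MKTLLV"),
    ("CDS_B", "Homo sapiens", "MKTLLV"),   # identical to CDS_A -> merged
    ("CDS_C", "Mus musculus", "MKTLLV"),   # same sequence, other organism -> kept apart
    ("CDS_D", "Homo sapiens", "MKTLLVA"),  # one residue longer -> kept apart
]
records = merge_identical(entries)
```

Keying on the pair (organism, sequence) keeps near-identical sequences and orthologs as separate records, which matches the rule stated above.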


4. UniProtKB: from TrEMBL to Swiss-Prot

TrEMBL: translated CDS, reference, protein name

Swiss-Prot: protein names, many more references, translated CDS + polymorphisms + isoforms + …, full annotation

Manual annotation of the sequence and manual review of associated biological information.

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy


UniProtKB/Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)


Protein and gene names


…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org


Human protein manual annotation: some statistics (June 2011)


Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org


Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.

Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential
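
The qualifier table can be turned into a small lookup. This is a sketch only: it assumes the qualifier is written in parentheses at the end of an annotation, as in "Phosphoserine (By similarity)"; the function name and that layout are assumptions made for the example.

```python
import re

# Qualifier -> type of evidence, following the table above.
QUALIFIERS = {
    "Probable": "light experimental evidence",
    "By similarity": "inferred by similarity with a homologous protein",
    "Potential": "inferred by prediction",
}


def evidence_for(annotation):
    """Return the type of evidence attached to an annotation string.

    Assumption: the qualifier, if any, is the last parenthesised group.
    No recognised qualifier is read as strong experimental evidence.
    """
    match = re.search(r"\(([^()]+)\)\.?\s*$", annotation)
    if match and match.group(1) in QUALIFIERS:
        return QUALIFIERS[match.group(1)]
    return "strong experimental evidence"
```

For example, `evidence_for("Phosphoserine (Potential)")` reports a prediction, while `evidence_for("Phosphoserine")` is read as experimentally supported.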


'Protein existence' tag

• The 'Protein existence' tag indicates the evidence for the existence of a given protein;

• Different qualifiers:

1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location),…)

2. Evidence at transcript level (~19%)

3. Inferred from homology (~58%)

4. Predicted (~5%)

5. Uncertain (mainly in TrEMBL)

http://www.uniprot.org/docs/pe_criteria


The UniProt web site

www.uniprot.org

• Powerful search engine, Google-like and easy to use, but also supports very directed field searches

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Search, Blast, Align, Retrieve, ID mapping
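
Because the URL reflects the query, a search can be built programmatically. The sketch below only constructs the URL and makes no network call. The endpoint, the parameter names (`query`, `format`, `fields`) and the query fields (`gene`, `organism_id`) are assumptions based on the current public UniProt REST interface; they may differ from the site as it was when these slides were written.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint (current UniProt REST interface, not taken from the slides)
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="tsv", fields=("accession", "id", "protein_name")):
    """Return a bookmarkable search URL that encodes the whole query."""
    params = {"query": query, "format": fmt, "fields": ",".join(fields)}
    return f"{BASE_URL}?{urlencode(params)}"


# Toll-like receptor 4 in Homo sapiens (NCBI taxonomy ID 9606)
url = build_search_url("gene:TLR4 AND organism_id:9606")

# The query can be read back from the URL, which is what makes it bookmarkable
decoded = parse_qs(urlsplit(url).query)
```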


Search

A very powerful text search tool with autocompletion and refinement options that lets you look for UniProt entries and documentation by biological information


The search interface guides users with helpful suggestions and hints


Advanced Search

A very powerful search tool

To be used when you know in which entry section the information is stored


Find all the proteins localized in the cytoplasm (experimentally proven) that are phosphorylated on a serine (experimentally proven)
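
The logic of this advanced search (two conditions, each restricted to experimental evidence) can be sketched on in-memory records. The record layout and the entry names are hypothetical, and the sketch treats any annotation carrying a qualifier (Probable, By similarity, Potential) as not experimentally proven, which is an assumption made for the example.

```python
QUALIFIERS = ("Probable", "By similarity", "Potential")


def is_experimental(annotation):
    """True when the annotation carries none of the qualifiers."""
    return not any(f"({q})" in annotation for q in QUALIFIERS)


def matches(entry):
    """Cytoplasmic AND phosphorylated on a serine, both experimentally proven."""
    in_cytoplasm = any(
        "Cytoplasm" in loc and is_experimental(loc) for loc in entry["location"]
    )
    phosphoserine = any(
        "Phosphoserine" in feat and is_experimental(feat)
        for feat in entry["features"]
    )
    return in_cytoplasm and phosphoserine


# Hypothetical entries
entries = [
    {"id": "PROT_A", "location": ["Cytoplasm"], "features": ["Phosphoserine"]},
    {"id": "PROT_B", "location": ["Cytoplasm (By similarity)"], "features": ["Phosphoserine"]},
    {"id": "PROT_C", "location": ["Nucleus", "Cytoplasm"], "features": ["Phosphoserine (Potential)"]},
]
hits = [entry["id"] for entry in entries if matches(entry)]
```

Only PROT_A satisfies both conditions with experimental evidence; the other two fail on one qualified annotation each.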


Result pages: highly customizable


Result pages: downloadable


4. Major protein sequence db 'sources'

1. UniProtKB: Swiss-Prot + TrEMBL

2. NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (12'500 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (380'000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (380'000 species?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank; 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (16'000 species)

PIR PDB PRF Integrated resources

'cross-references'

Separated resources


4. NCBI nr - Entrez 'protein'

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein


4. Protein sequences: NCBI nr

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

• GenPept: contains all CDS annotated in GenBank/ENA/DDBJ sequences ('translations from annotated coding regions in GenBank'); equivalent to TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)

• PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

• PDB: 3D structure database; all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

• PRF: sequences derived from scientific publications (« journal scan ») (integrated into TrEMBL)


4. RefSeq

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Tightly linked to Entrez Gene ("interdependent curated resources")


AC

Taxonomy

References

4. RefSeq

Protein: NP_

mRNA: NM_

DNA: NC_
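
The accession prefix is enough to tell which kind of molecule a RefSeq record describes. A minimal sketch covering only the three prefixes listed here (RefSeq uses other prefixes as well, which this sketch simply reports as "unknown"); the function name is an assumption:

```python
# Prefix -> molecule type, as listed above
REFSEQ_PREFIXES = {"NP_": "protein", "NM_": "mRNA", "NC_": "DNA"}


def refseq_molecule_type(accession):
    """Return the molecule type encoded by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")
```

For instance, `refseq_molecule_type("NP_000537")` returns "protein".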


Status and Genbank source

Annotation

- automated,

- derived from Swiss-Prot

- in-house


Annotation

- automated,

- derived from Swiss-Prot

- in-house

Sequence

Cross-references


4. RefSeq

http://www.ncbi.nlm.nih.gov/RefSeq/

Curation status and manual annotation:

Status                           Manual annotation
GENOME ANNOTATION                No
INFERRED                         No
MODEL                            No
PREDICTED                        No
PROVISIONAL                      No
REVIEWED                         Yes (sequence + functional information and features)
VALIDATED                        Yes (initial sequence)
Whole Genome Sequencing (WGS)    No


These identifiers all point to a TP53 (p53) protein sequence!

P04637, NP_000537, NP_001119584.1, NP_001119585.1,

NP_001119584.1, NP_001119584.1, NP_001119584.1,

NP_001119584.1, ENSG00000141510, CCDS11118,

UPI000002ED67, IPI00025087, etc.

4. Accession number (AC) mapping
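
Before mapping identifiers, it helps to recognise which resource each one comes from. The sketch below classifies the identifiers listed above by their format. The UniProtKB pattern follows the accession format published by UniProt; the other patterns are assumptions inferred from the examples on this slide, and the function name is hypothetical.

```python
import re

# (resource, pattern) pairs; only the UniProtKB pattern is the documented one,
# the others are inferred from the examples above.
PATTERNS = [
    ("UniProtKB accession",
     re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("RefSeq protein", re.compile(r"NP_\d+(?:\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d{11}")),
    ("CCDS", re.compile(r"CCDS\d+(?:\.\d+)?")),
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("IPI", re.compile(r"IPI\d{8}(?:\.\d+)?")),
]


def classify(identifier):
    """Return the name of the first resource whose pattern matches fully."""
    for resource, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return resource
    return "unknown"


ids = ["P04637", "NP_000537", "NP_001119584.1", "ENSG00000141510",
       "CCDS11118", "UPI000002ED67", "IPI00025087"]
by_resource = {identifier: classify(identifier) for identifier in ids}
```

Classification only says where an identifier comes from; the actual correspondence between identifiers still has to come from a mapping service.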


http://www.uniprot.org/mapping/


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family


5. Protein domain/family: some definitions

• Most proteins have « modular » conserved structures

• Estimation: ~ 3 domains / protein

• Estimation: ~ 6000 'known' domains

• Estimation: ~ 80% of proteins have at least one 'known' domain

-> Predicting the domain content of an unknown protein sequence may help to find a 'function'


Example of conserved regions (PPID family)

- 1 CSA_PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) (domain)

- 3 TPR repeats (tetratricopeptide repeat)

- 1 active site (Cys 181: active site residue)

- Binding cleft (motif)


Domain signature methods: derived from 'modelled' multiple sequence alignments (MSA)

• Pattern

• Fingerprint

• Sequence clustering

• Profile

• HMM


How to build a PROSITE pattern ?

• Start with a multiple sequence alignment (MSA)

Information lost: 4D 1E
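
The step from an alignment to a pattern, and the information it loses, can be sketched as follows. This is a simplified illustration, not the procedure used by the PROSITE curators: each alignment column becomes either a single residue or a bracketed set of allowed residues. The toy alignment is made up so that one column contains 4 D and 1 E.

```python
from collections import Counter


def build_pattern(alignment):
    """Derive a PROSITE-style pattern from equal-length aligned sequences."""
    elements = []
    for column in zip(*alignment):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                     # fully conserved position
        else:
            elements.append("[" + "".join(residues) + "]")   # set of allowed residues
    return "-".join(elements)


# Toy, made-up alignment (no gaps)
alignment = ["ADLGAV", "ADLGAV", "AELGAI", "ADLGAV", "ADLGAV"]
pattern = build_pattern(alignment)

# Residue counts in the second column: this is the information the pattern drops
counts = Counter(sequence[1] for sequence in alignment)
```

The second position becomes `[DE]`: the pattern records that D and E are both allowed, but no longer that D was seen four times and E once. Profiles and HMMs keep this kind of frequency information.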


5. Protein domain/family db

Database      Method
PROSITE       Patterns / Profiles
ProDom        Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS        Aligned motifs
Pfam          HMM (Hidden Markov Models)
SMART         HMM
TIGRfam       HMM
DOMO          Aligned motifs
BLOCKS        Aligned motifs (PSI-BLAST)
CDD (CDART)   PSI-BLAST (PSSM) of Pfam and SMART

[Figure label alongside the list: InterPro]


InterPro scan results

Part of the protein sequence which has been 'recognized' by different modelled MSAs


What makes Bee special?


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)


6. Proteomics db

• Mass Spectrometry (MS) database: PRIDE

• SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.

– Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins

– Composed of image and text files


6. PRIDE

http://www.ebi.ac.uk/pride/


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure


3D structure

• Only one database: PDB (Protein Data Bank), but several servers…

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been experimentally obtained by X-ray or NMR studies; also a few models.

• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)


7. PDB: Protein Data Bank

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Associated specialized programs allow visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, RasMol).

• Currently (September 28, 2011) there are 75'000 structures for about 20'000 different proteins (highly redundant)!

http://www.pdb.org/ (RCSB)

http://www.ebi.ac.uk/pdbe/

http://www.pdbj.org/


7. PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2

COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3

COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4

SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5

AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6

REVDAT 1 15-OCT-92 12CA 0 12CA 7

JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8

JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9

JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10

JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11

JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12

JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13

REMARK 1 12CA 14

REMARK 2 12CA 15

REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16

REMARK 3 12CA 17

REMARK 3 REFINEMENT. 12CA 18

REMARK 3 PROGRAM PROLSQ 12CA 19

REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20

REMARK 3 R VALUE 0.170 12CA 21

REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22

REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23

REMARK 4 12CA 24

REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25

REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26

REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

………


7. PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68

SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69

SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70

SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71

SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72

SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73

SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74

SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75

TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76

TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77

TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78

TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79

TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80

TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81

CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82

ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83

ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84

ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85

SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86

SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87

SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88

ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89

ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90

ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91

ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92

ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93

ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94

ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95

ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96

ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97

ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98

ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99

ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100

ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101

ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102

…….

Coordinates <x; y; z> of each atom
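
Reading the coordinates out of ATOM records can be sketched in a few lines. Caution: real PDB files use fixed column positions (and usually carry a chain identifier), so a production parser should slice by column or rely on an existing library. The whitespace split below is an assumption that only holds for the simplified layout printed on this slide; the `Atom` type and function name are hypothetical.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Optional[Atom]:
    """Parse one ATOM line in the simplified, space-separated layout shown above."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        return None  # HEADER, REMARK, SHEET, TURN, ... are ignored
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
    )


atom = parse_atom_line("ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90")
```

Here `atom` describes the alpha carbon of Trp 5, with its x, y, z coordinates taken from the sixth to eighth fields.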


The same PDB entry "visualized" with Chime


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

8. Metabolism/Pathways


8. Databases: metabolic

• Contain information describing enzymes, biochemical reactions and metabolic pathways;

• Nomenclature databases store information on enzyme names and reactions: ENZYME, BRENDA, IntEnz;

• Metabolic databases: MetaCyc, KEGG, UniPathway, Rhea;

• Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes;

• Ligands and chemicals: ChEBI, KEGG LIGAND.


8. BRENDA

Useful for preparing lab experiments!

http://www.brenda-enzymes.org/


http://www.genome.ad.jp/kegg

8. KEGG


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

9. Bibliography


9. Bibliography

• Bibliographic reference databases contain citations and abstract information of published life science articles;

• Examples: PubMed, PubMed Central

• Other more specialized databases also exist:

Agricola (http://agricola.nal.usda.gov/)

EMBASE (not free)


9. PubMed / Medline

• Established in 1950;

• Database of citations and abstracts to biomedical and other life science journal literature;

• Encompasses MedLine;

• Gives access to:

– > 21 million papers (dating back to the 1860s),

– > 20'400 life science journals,

– ~ 55 languages (17'751 journals in English, 2'000 in French, 372 in Chinese, 29 in Latin, 1 in Azerbaijani, etc.).

PMID: 10923642 (PubMed ID)

UI: 20378145 (Medline ID)

DOI : 10.1016/S0960-9822(03)00148-9 (Digital Object Identifier)

http://www.ncbi.nlm.nih.gov/pubmed/


PubMed Central

• Free digital archive of free-access full texts (since 2000)

• ~700 journals (list: http://www.ncbi.nlm.nih.gov/pmc/journals/), most of which have a corresponding entry in PubMed

• Free access to the full text either immediately after publication or within a 12-month period.

http://www.ncbi.nlm.nih.gov/pmc/


10. Others

• There are many databases that cannot be classified in the categories listed previously;

• Examples:

– ReBase (restriction enzymes)

– TRANSFAC (transcription factors)

– CarbBank

– GlycoSuiteDB (linked sugars)

– Protein-protein interaction db (DIP, ProNet, BIND, MINT, String),

– Protease db (MEROPS), biotechnology patents db, Microarrays, etc.;

• As well as many other resources concerning any aspect of macromolecules and molecular biology.


10. Interactome

Protein/protein interaction: from 1 to more than 20'000 interactions described per publication

Several databases: IntAct, BIND, DIP, String

Estimation: 10'000 fundamental interaction types


10. IntAct

http://www.ebi.ac.uk/intact/


10. Gene Ontology

• The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information.

• In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary.

• The Gene Ontology ensures that the flood of information produced can be effectively utilized by standardization of biological data/information

http://www.geneontology.org
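
What "structured vocabulary" means in practice: because terms are linked to their parents, an annotation to a specific term also implies all of its more general ancestors. The sketch below uses a toy hierarchy written for this example; the term names and parent links are illustrative and are not copied from the actual ontology.

```python
# Toy hierarchy (illustrative only): term -> list of parent terms.
# A term may have several parents, so this is a graph rather than a tree.
PARENTS = {
    "nucleus": ["intracellular organelle"],
    "intracellular organelle": ["organelle", "intracellular part"],
    "organelle": ["cellular component"],
    "intracellular part": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


nucleus_ancestors = ancestors("nucleus")
```

A protein annotated with "nucleus" is therefore also retrieved by a search on "organelle" or on "cellular component".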


http://www.geneontology.org


10. Gene Ontology

About 30'000 terms (with definition and hierarchy)

biological process

• broad biological phenomena, e.g. mitosis, growth, digestion (including PTMs).

molecular function

• molecular role, e.g. catalytic activity, binding

cellular component

• subcellular location, e.g. nucleus, ribosome, origin recognition complex


http://www.ebi.ac.uk/QuickGO/


10. Gene Ontology annotation

Annotation is the process of assigning/mapping GO terms to gene products…

!!! Electronic vs Manual annotation…


Example with EPO

Histone H4

!!! Large-scale derived data (‘proteome’)

Essential link between biological knowledge and high-throughput genomic and proteomic datasets…

‘summary of the gene ontology classifications for all mapped ESTs…’

10. Gene Ontology

Genome-Wide RNAi screens identify genes required for ricin and Pseudomonas exotoxin intoxications

DOI 10.1016/j.devcel.2011.06.014

Gene Ontology analysis of the 2038-gene hit list.
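One common form of GO analysis on a hit list is an over-representation test: is a GO term annotated to more hits than expected by chance? The slide does not say which method the cited study used, so the sketch below is only an illustration based on the hypergeometric distribution. Apart from the hit-list size of 2038, every number is a hypothetical assumption, and `enrichment_p_value` is an illustrative name.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def enrichment_p_value(hits_with_term, hits_total, genes_with_term, genes_total):
    """P(X >= hits_with_term) when hits_total genes are drawn at random
    from genes_total genes, genes_with_term of which carry the GO term."""
    genes_without_term = genes_total - genes_with_term
    log_denominator = log_comb(genes_total, hits_total)
    p = 0.0
    for i in range(hits_with_term, min(hits_total, genes_with_term) + 1):
        if hits_total - i > genes_without_term:
            continue  # impossible draw: not enough genes without the term
        p += exp(
            log_comb(genes_with_term, i)
            + log_comb(genes_without_term, hits_total - i)
            - log_denominator
        )
    return min(p, 1.0)


# 2038 is the hit-list size from the slide; the other numbers are made up.
p = enrichment_p_value(
    hits_with_term=60, hits_total=2038, genes_with_term=300, genes_total=20000
)
print(f"p = {p:.3g}")
```

In practice one such test is run per GO term, so the resulting p-values need a multiple-testing correction before any term is called enriched.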

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

9. Bibliography

10. Others

• Concluding remarks

Proliferation of databases

• Which contains the highest-quality data?

• Which is the most comprehensive?

• Which is the most up-to-date?

• Which is the least redundant?

• Which is the best indexed (allows complex queries)?

• Which Web server responds most quickly?

• …?

Some important practical remarks

• Databases contain many errors (automated annotation)!

• Not all db are available on all servers.

• The update frequency is not the same for all servers.

• Some servers automatically add cross-references to an entry (implicit links) in addition to the already existing links (explicit links), so the same entry may look different from one server to another.

Before the introduction to databases…

After the introduction to databases…

Marie-Claude Blatter

Swiss-Prot, Geneva

SIB Swiss Institute of Bioinformatics

[email protected]

Credits