
Post on 29-May-2020


Introduction to Biological Databases

2, Introduction to biological databases

What is 'bioinformatics'?

Bioinformatics is the application of computer science to biology

… an interdisciplinary science

… strives to solve the problems of the life sciences with theoretical computer-assisted methods

… indispensable for modern biology and medicine

… uses techniques such as applied mathematics and statistics


Some major research areas in bioinformatics

• Sequence analysis and function prediction

• Analysis and prediction of protein structure

• Computational evolutionary biology

• Comparative genomics

• Gene and protein expression

• Protein-Protein Interaction (PPI) analysis

• Systems biology

• Image analysis

• Visualization


Indispensable for bioinformatic studies:

1. Databases

2. Software tools

3. Servers

Introduction


Outline

• Introduction

• Selected categories of life sciences databases

1. Nucleotide sequences

2. Genomics

3. Mutation/polymorphism

4. Protein sequences

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

8. Metabolism/Pathways

9. Bibliography

10. Others

• Concluding remarks

• Practicals


Introduction

What is a database (db) ?

• A collection of related data, which are:

– structured

– searchable (index) -> table of contents

– updated periodically (release) -> new edition

– cross-referenced (hyperlinks) -> links with other db

• Also includes the associated tools (software) needed for db access, updating, and insertion or deletion of information

• Data storage formats: flat files (plain text, FASTA), relational tables, or structured formats such as XML and RDF


Introduction

Why biological databases (db) ?

• Exponential growth in biological data

• Data are no longer published in a conventional manner, but directly submitted to databases (nucleotide and amino acid sequences, 3D structures, 2D gel analyses, MS analyses, microarrays, publications, protein-protein interactions, …)

• Essential tools for biological research

P. Gaudet, 'A community of Biocurators'


Science cover, February 2011


Some statistics and remarks

• More than 1000 different "biological" databases

• Variable size: <100 Kb to >100 Gb (ENA > 728 Gb!)

– DNA: > 100 Gb

– Protein: 2 Gb

– 3D structure: 5 Gb

– Other: smaller

• Update frequency: daily to annually

• Generally accessible through the web (free!?)


Where can we find…

• a video -> YouTube

• info on S. Hawking -> Wikipedia

• a book -> Amazon

• a friend -> Facebook, Google plus

• DNA sequence -> EMBL

• protein sequence -> UniProtKB, RefSeq

• 3D data -> PDB

• Microarray data -> ArrayExpress, GEO

• Publications -> PubMed

10 most important bioinformatics databases *

* according to "Bioinformatics for Dummies"

Name        URL                          Data type
GenBank     www.ncbi.nlm.nih.gov         Nucleotide sequences
Ensembl     www.ensembl.org              Genomes
PubMed      www.ncbi.nlm.nih.gov         Literature references
NCBI nr     www.ncbi.nlm.nih.gov         Protein sequences
UniProtKB   www.uniprot.org              Protein sequences
InterPro    www.ebi.ac.uk                Protein domains
OMIM        www.omim.org/                Genetic diseases
Enzymes     http://enzyme.expasy.org/    Enzymes
PDB         www.rcsb.org/pdb/            Protein structures
KEGG        www.genome.ad.jp             Metabolic pathways


Databases / Servers

• A server is a computer (from a given institute) that provides services (stores databases and associated tools) to other computers

• Main biological servers:

– ExPASy (www.expasy.org/)

– UniProt (www.uniprot.org)

– NCBI (www.ncbi.nlm.nih.gov/)

– EBI (www.ebi.ac.uk/)

– Japanese GenomeNet (www.genome.jp/)

• Not all servers give access to the same databases and to the same search tools! When servers give access to the same databases, the 'look' is different, and beware of the date of the latest release!

The same data on different servers: UniProt, NCBI


How to find a database ?

• The Nucleic Acids Research (NAR) Online Molecular Biology Database Collection 2011: a total of 1,330 databases

http://www.oxfordjournals.org/nar/database/a/

• Expasy Life Science Directory: http://www.expasy.org/links.html (no longer updated)

• Google: http://www.google.com/


http://www.expasy.org/links.html Expasy Life Science Directory


Awareness of the content

and usage of knowledge resources

is a pre-requisite to do any type of "serious" research

in the field of molecular life sciences

(Amos Bairoch, 2007)

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db


Deluge of sequence data

• ~3,200 genomes sequenced (single organism, varying sizes, including viruses)

• ~5,000 ongoing genome sequencing projects

• cDNA sequencing projects (ESTs or cDNAs)

• metagenome sequencing projects (~300) (environmental samples: multiple 'unknown' organisms, varying sizes)

– Ecological metagenomics: beach sand, Sargasso Sea, New-York air, …

– Organismal metagenomics: human fluids, mouse gut, …

• Personal Human genomes


Deluge of sequence data

• Personal human genomes!

http://www.youtube.com/watch?v=mVZI7NBgcWM

Deluge of sequence data

But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer's disease and that he has the 'blue eye' allele…

Ensembl genome browser

Deluge of sequence data

http://www.personalgenomes.org

DNA sequence of the telomeric region of human chromosome X

1. Nucleotide sequences db

• The main DNA sequence db are:

EMBL/ENA (Europe) / GenBank (USA) / DDBJ (Japan)

-> INSDC collaboration

• There are also specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)

• Others:

Eukaryotic Promoter Database (EPD); RNA editing sites, ...

1. EMBL-ENA/GenBank/DDBJ http://www.insdc.org/

Archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.

1. Same data on different servers

EBI (EMBL/ENA) NCBI (GenBank)

NIG (DDBJ)

1. EMBL-ENA/GenBank/DDBJ

• Serve as archives: 'nothing goes out'

• Contain all public sequences derived from:

– Genome projects (> 80% of entries)

– Sequencing centers (cDNAs, ESTs, …)

– Individual scientists (15% of entries)

– Patent offices (e.g. European Patent Office, EPO)

• Currently: ~150 × 10^6 sequences, ~200 × 10^9 bp

• Sequences from > 500,000 different species


1. Ideal content of a "sequence" db

• Sequences !!

• Unique Accession number (AC)

• References

• Taxonomic data

• ANNOTATION/CURATION

• Keywords

• Cross-references

• Documentation

Minimal requirements !


1. EMBL-ENA entry

Cross-references

accession number

taxonomy

references

1. EMBL-ENA entry (cont.)

Annotation (predicted or experimentally determined)

Sequence

CDS: coding sequence (proposed by submitters)

1. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> EMBL/ENA, GenBank, DDBJ (with or without an annotated CDS provided by the authors)

Data not submitted to public databases: delayed or cancelled…

CDS (coding sequence): portion of DNA/RNA translated into protein (from Met to 'STOP'). Experimentally proven or derived from gene prediction. Not so well documented!
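The CDS definition above (the stretch translated from Met to 'STOP') can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a single forward reading frame starting at the first ATG; the function name and the toy sequence are hypothetical, and real CDS annotation also has to handle introns, strands and alternative genetic codes.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(dna):
    """Translate from the first ATG up to (not including) the first in-frame stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the CDS
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy (hypothetical) sequence: two flanking bases, ATG GCC AAA TAG, two flanking bases.
print(translate_cds("ccATGGCCAAATAGgg"))  # MAK
```

Translating a submitted CDS in this way is, in essence, how a nucleotide entry gives rise to a protein sequence entry.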


Coding Sequence (CDS): Alignments between a mRNA and a genomic sequence

1. EMBL-ENA vs GenBank format


1. Fasta format
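As a rough illustration of the FASTA format named on this slide: each record starts with a '>' header line, followed by one or more lines of sequence. The Python sketch below reads such text into (header, sequence) pairs; the function name and the two example records are hypothetical and not taken from any database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example records, for illustration only.
EXAMPLE = """>seq1 hypothetical example record
ATGGCCAAA
TAG
>seq2 another hypothetical record
MAK
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence))
```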

1. EMBL-ENA/GenBank/DDBJ

• Heterogeneous sequence length and quality:

– ESTs, genomes, variants, fragments, …

• Sequence sizes:

– max 350,000 bp/entry (! genomic sequences, overlapping)

– min 10 bp/entry

• Archive: nothing goes out -> highly redundant!

• Full of errors: in sequences, in annotations, in CDS attribution; no consistency of annotations:

– most annotations are done by the submitters;

– heterogeneous quality, completeness and updating of the information

1. EMBL-ENA/GenBank/DDBJ

• Unexpected information you can find in these db:

FT source 1..124

FT /db_xref="taxon:4097"

FT /organelle="plastid:chloroplast"

FT /organism="Nicotiana tabacum"

FT /isolate="Cuban cahibo cigar, gift from

FT President Fidel Castro"

• Or:

FT source 1..17084

FT /chromosome="complete mitochondrial genome"

FT /db_xref="taxon:9267"

FT /organelle="mitochondrion"

FT /organism="Didelphis virginiana"

FT /dev_stage="adult"

FT /isolate="fresh road killed individual"

FT /tissue_type="liver"
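The FT lines above follow a simple /key="value" qualifier layout, with long values wrapped onto a continuation line. The Python sketch below reads the qualifiers of one such block into a dictionary. It is a simplification for illustration: the function name is hypothetical, and it assumes a single feature whose lines all begin with the FT tag, as in the excerpt.

```python
import re


def parse_ft_qualifiers(lines):
    """Collect /key="value" qualifiers from the FT lines of a single feature."""
    chunks = []
    for line in lines:
        text = line[2:].strip() if line.startswith("FT") else line.strip()
        if not text:
            continue
        if text.startswith("/") or not chunks:
            chunks.append(text)  # feature key line or a new qualifier
        else:
            chunks[-1] += " " + text  # continuation of a wrapped value
    qualifiers = {}
    for chunk in chunks:
        match = re.match(r'/(\w+)="(.*)"$', chunk)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


# The first excerpt from the slide, one FT line per list item.
FT_LINES = [
    'FT source 1..124',
    'FT /db_xref="taxon:4097"',
    'FT /organelle="plastid:chloroplast"',
    'FT /organism="Nicotiana tabacum"',
    'FT /isolate="Cuban cahibo cigar, gift from',
    'FT President Fidel Castro"',
]

for key, value in parse_ft_qualifiers(FT_LINES).items():
    print(key, "=", value)
```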


1. Other nucleotide sequences databases

http://www.rnaiweb.com/RNAi/RNAi_Web_Resources/siRNA_Collections___Databases/


1. Other nucleotide sequences databases

• EPD is a rigorously selected database. In order to be included in EPD, a promoter must be:

– recognized by eukaryotic RNA POL II,

– active in a higher eukaryote,

– experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter,

– biologically functional,

– available in the current ENA release,

– distinct from other promoters in the database.

http://www.epd.isb-sib.ch/

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

2. 'Genomics databases'

• Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually contain no sequence!

• Exist for most model organisms; usually species-specific.

• Examples: MIM (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), TAIR (Arabidopsis), etc.


2. TAIR

http://www.arabidopsis.org/

2. OMIM: Online Mendelian Inheritance in Man

• ~20,300 human protein-coding genes

• 2,850 protein-coding genes with mutations causing human disorders

• ~1,800 more to be discovered

• ~1,100 loci affecting more than 165 polygenic diseases have been identified (PMID:21307931)


2. OMIM: Online Mendelian Inheritance in Man

http://www.omim.org/

OMIM: 'gene' entry

OMIM: 'disease' entry

2. Genome browser: Ensembl

• Ensembl provides a bioinformatics framework to organize biology around the sequences of large genomes.

http://www.ensembl.org/

Ensembl genome browser

Genome browser: UCSC

http://genome.ucsc.edu/cgi-bin/hgGateway

A eukaryotic gene (UCSC)

Figure labels (5' to 3'): 5' untranslated region, initial exon, internal exons, introns, final exon; Met and Stop mark the ends of the coding region


Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

3. Mutation/polymorphism

Single nucleotide polymorphisms (SNPs) are unique genetic differences between individuals that contribute in significant ways to the determination of human variation including physical characteristics like height and appearance as well as less obvious traits such as personality, behaviour and disease susceptibility. SNPs can also significantly influence responses to pharmacotherapy and whether drugs will produce adverse reactions.

SNP Technologies for Drug Discovery: A Current Review. DOI: 10.2174/157016308785739811

Each human genome contains ~3,000,000 single nucleotide polymorphism (SNP) variants (1 per 1,000 bp).

S.E. Antonarakis

3. Mutation/polymorphism db

• Contain information on sequence variations, whether or not they are linked to genetic diseases;

• General db:

– dbSNP - Human single nucleotide polymorphism (SNP) db (variants with frequency > 1%; !!! disease mutations are rare -> dbSNP contains few 'disease-linked mutations')

• Disease-specific db: most of these databases are linked either to a single gene or to a single disease;

– p53 mutation db

– ADB - Albinism db (Mutations in human genes causing albinism)

– Asthma and Allergy gene db

– ….


http://www.ncbi.nlm.nih.gov/SNP/


Blue eye allele… dbSNP: rs12913832 -> link to the ALFRED database

Blue eyes / Brown eyes

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db


4. Protein sequences – Eukaryotic cell

Cell elemental composition

Cells are made of 90% water.

The remainder is approximately:

• 50% protein (3.5kg)

• 15% carbohydrate

• 15% nucleic acid (1.3kg)

• 10% lipid

• 10% miscellaneous

Amino acid sequence (1-letter code) of human titin

4. Protein sequence origin

• About 180 billion proteins (?)

• > 15 million 'known' protein sequences in 2011

• More than 99% of the protein sequences are derived from the translation of nucleotide sequences

• Less than 1% come from direct protein sequencing (Edman, MS/MS, …)

-> It is important that users know where a protein sequence comes from (sequencing and gene prediction quality)!


http://www.nature.com/news/2010/100922/full/467380a.html

(US$30 million per year)

4. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> nucleic acid databases (ENA, GenBank, DDBJ)

Data not submitted to public databases: delayed or cancelled…

Nucleic acid databases -> protein sequence databases … if the submitters provide an annotated coding sequence (CDS) (1/10 ENA entries)

No CDS -> gene prediction (RefSeq, Ensembl)

4. The hectic life of a sequence

cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ

Data not submitted to public databases: delayed or cancelled…

Diagram labels: TrEMBL, GenPept, RefSeq, PRF; sequences derived from scientific publications; Swiss-Prot; coding sequences provided by submitters; coding sequences provided by submitters and gene prediction; UniProtKB, Ensembl, CCDS, UniParc, PDB, (PIR), (IPI), UniMES; + all 'species'-specific databases (EcoGene, TAIR, …)

4. Major protein sequence db 'sources'

1. UniProtKB: Swiss-Prot + TrEMBL

2. NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (12,500 species)

UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (380,000 species)

GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (380,000 species ?)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Data Bank: 3D data and associated sequences

PRF: journal scan of 'published' peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (16,000 species)

Legend: PIR, PDB, PRF; integrated resources; 'cross-references'; separated resources

Look for toll-like receptor 4 (Homo sapiens)

www.uniprot.org

Result labels: Swiss-Prot, TrEMBL

Look for toll-like receptor 4 (Homo sapiens)

http://www.ncbi.nlm.nih.gov/

Result labels: GenPept (8 entries), Swiss-Prot, RefSeq

4. UniProt - The Universal Protein Resource

is maintained by the UniProt consortium: SIB + EBI + PIR

http://www.uniprot.org/

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG02712-04. Additional support for the EBI's involvement in UniProt comes from the European Commission (EC)'s FELICS grant (021902RII3) and from the NIH grant 1R01HGO2273-01. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts FELICS (021902RII3) and SLING (226073). PIR activities are also supported by the NIH grants and contracts HHSN266200400061C, NCI-caBIG, and 5R01GM080646-04, and the Department of Defense grant W81XWH0720112.

4. UniProtKB: from ENA to TrEMBL

ENA (DNA): translated CDS, product name, tissue, reference

TrEMBL: translated CDS, protein name, reference + tissue

Automated extraction of protein sequence (translated CDS), gene name and references. Automated annotation.

UniProtKB/TrEMBL

Automatic annotation

Protein sequence

- The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl).

- 100% identical sequences (same length, same organism) are merged automatically.

Biological information: sources of annotation

- Provided by the submitter (EMBL, PDB, TAIR, …)

- From automated annotation (automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule))
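The merging rule mentioned above (100% identical sequences from the same organism collapse into one entry) can be sketched in Python as a grouping on an (organism, sequence) key. This is only an illustration of the idea, not the actual TrEMBL pipeline; the function name, the entry fields and the source identifiers are hypothetical.

```python
def merge_identical(entries):
    """Merge entries that share the same organism and exactly the same sequence."""
    merged = {}
    for entry in entries:
        key = (entry["organism"], entry["sequence"])  # identical sequence implies identical length
        if key not in merged:
            merged[key] = {
                "organism": entry["organism"],
                "sequence": entry["sequence"],
                "sources": [],
            }
        merged[key]["sources"].append(entry["source"])  # keep track of every origin
    return list(merged.values())


# Hypothetical translated CDS entries, for illustration only.
ENTRIES = [
    {"source": "CDS_A", "organism": "Homo sapiens", "sequence": "MAK"},
    {"source": "CDS_B", "organism": "Homo sapiens", "sequence": "MAK"},
    {"source": "CDS_C", "organism": "Mus musculus", "sequence": "MAK"},
    {"source": "CDS_D", "organism": "Homo sapiens", "sequence": "MAKL"},
]

for record in merge_identical(ENTRIES):
    print(record["organism"], record["sequence"], record["sources"])
```

In this toy input only CDS_A and CDS_B are merged: CDS_C comes from another organism and CDS_D differs in length.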

4. UniProtKB: from TrEMBL to Swiss-Prot

TrEMBL: translated CDS, reference, protein name

Swiss-Prot: translated CDS + polymorphisms + isoforms + …; protein names; many more references; full annotation

Manual annotation of the sequence and manual review of associated biological information

Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy

UniProtKB/Swiss-Prot

Manual annotation

1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes, …)

2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)


Protein and gene names


…enable researchers to obtain a summary of what is known about a protein…

General annotation

(Comments)

www.uniprot.org

Human protein manual annotation: some statistics (June 2011)

Sequence annotation

(Features)

…enable researchers to obtain a summary of what is known about a protein…

www.uniprot.org

Non-experimental qualifiers

UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two

Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential

'Protein existence' tag

• The 'Protein existence' tag indicates the evidence for the existence of a given protein;

• Different qualifiers:

1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)

2. Evidence at transcript level (~19%)

3. Inferred from homology (~58%)

4. Predicted (~5%)

5. Uncertain (mainly in TrEMBL)

http://www.uniprot.org/docs/pe_criteria


The UniProt web site

www.uniprot.org

• Powerful search engine, Google-like and easy to use, but also supports very targeted field searches

• Scoring mechanism presenting relevant matches first

• Entry views, search result views and downloads are customizable

• The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access

• Search, Blast, Align, Retrieve, ID mapping

Search

A very powerful text search tool with autocompletion and refinement options for finding UniProt entries and documentation by biological information

The search interface guides users with helpful suggestions and hints


Advanced Search

A very powerful search tool

To be used when you know in which entry section the information is stored

Find all the proteins localized in the cytoplasm (experimentally proven) that are phosphorylated on a serine (experimentally proven)


Result pages: highly customizable


Result pages: downloadable


4. NCBI nr - Entrez 'protein'

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

4. Protein sequences: NCBI nr

NCBI-nr: Swiss-Prot + GenPept + (PIR) + RefSeq + PDB + PRF

• GenPept: contains all CDS annotated in GenBank/ENA/DDBJ sequences ('translations from annotated coding regions in GenBank'); equivalent to TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR, …)

• PIR: all PIR data have been integrated into Swiss-Prot and TrEMBL (UniProt)

• PDB: 3D structure database: all the protein sequences which have been crystallized (Swiss-Prot/TrEMBL are cross-linked to PDB)

• PRF: sequences derived from scientific publications, « journal scan » (integrated into TrEMBL)

4. RefSeq

RefSeq: The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Tightly linked to Entrez Gene ("interdependent curated resources")

4. RefSeq

Entry sections: AC, taxonomy, references

Accession prefixes: protein: NP_; mRNA: NM_; DNA: NC_

Entry sections (cont.): status and GenBank source; annotation (automated, derived from Swiss-Prot, in-house); sequence; cross-references

4. RefSeq

http://www.ncbi.nlm.nih.gov/RefSeq/

Curation status                   Manual annotation
GENOME ANNOTATION                 No
INFERRED                          No
MODEL                             No
PREDICTED                         No
PROVISIONAL                       No
REVIEWED                          Yes (sequence + functional information and features)
VALIDATED                         Yes (initial sequence)
Whole Genome Sequencing (WGS)     No

4. Accession number (AC) mapping

These identifiers all point to a TP53 (p53) protein sequence!

P04637, NP_000537, NP_001119584.1, NP_001119585.1, NP_001119584.1, NP_001119584.1, NP_001119584.1, NP_001119584.1, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
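To make the variety of identifiers above more concrete, the Python sketch below guesses the source database of each identifier from its shape and removes repeats. The regular expressions are simplified assumptions based on the examples on this slide (the UniProtKB one follows the accession format published by UniProt); they do not replace a real ID mapping service.

```python
import re

# Simplified, assumed patterns; first match wins.
PATTERNS = [
    ("UniProtKB", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq protein", re.compile(r"^NP_\d+(\.\d+)?$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("CCDS", re.compile(r"^CCDS\d+(\.\d+)?$")),
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}(\.\d+)?$")),
]


def classify(identifier):
    """Return the name of the first matching pattern, or 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


def group_identifiers(identifiers):
    """Group identifiers by guessed database, dropping repeats but keeping order."""
    groups = {}
    for identifier in identifiers:
        members = groups.setdefault(classify(identifier), [])
        if identifier not in members:
            members.append(identifier)
    return groups


# Identifiers as listed on the slide (including the repeated ones).
TP53_IDS = [
    "P04637", "NP_000537", "NP_001119584.1", "NP_001119585.1",
    "NP_001119584.1", "NP_001119584.1", "NP_001119584.1", "NP_001119584.1",
    "ENSG00000141510", "CCDS11118", "UPI000002ED67", "IPI00025087",
]

for database, members in group_identifiers(TP53_IDS).items():
    print(database, members)
```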


http://www.uniprot.org/mapping/

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

5. Protein domain/family: some definitions

• Most proteins have « modular » conserved structures

• Estimation: ~3 domains / protein

• Estimation: ~6,000 'known' domains

• Estimation: ~80% of proteins have at least one 'known' domain

-> Predicting the domain content of an unknown protein sequence may help to find a 'function'

Example of conserved regions (PPID family)

- 1 CSA_PPIASE (cyclophilin-type peptidyl-prolyl cis-trans isomerase) (domain)

- 3 TPR repeats (tetratricopeptide repeat)

- 1 active site (Cys 181: active site residue)

- Binding cleft (motif)

Domain signature methods:

derived from 'modelled' multiple sequence alignments (MSA)

• Pattern

• Fingerprint

• Sequence clustering

• Profile

• HMM


How to build a PROSITE pattern ?

• Start with a multiple sequence alignment (MSA)

Information lost: 4D 1E
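A rough Python sketch of the idea on this slide: each alignment column is reduced to a pattern element, so a column holding four D and one E becomes [DE] and the 4-to-1 frequency is lost. The sketch uses PROSITE-style notation (a single letter, [..] for alternatives, x for any residue, elements joined by '-'). The toy alignment, the function name and the `max_alternatives` cut-off are assumptions for illustration; real PROSITE patterns are curated by hand and tested against the sequence database.

```python
def prosite_pattern(alignment, max_alternatives=3):
    """Derive a naive PROSITE-style pattern from equal-length aligned sequences."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("aligned sequences must have the same length")
    elements = []
    for column in zip(*alignment):  # iterate over alignment columns
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_alternatives:
            elements.append("x")  # gap or too variable: any residue
        elif len(residues) == 1:
            elements.append(residues[0])  # strictly conserved position
        else:
            elements.append("[" + "".join(residues) + "]")  # frequencies are lost here
    return "-".join(elements)


# Toy (hypothetical) alignment; the third column contains 4 D and 1 E.
ALIGNMENT = [
    "CADHG",
    "CLDHA",
    "CKDKG",
    "CPDHG",
    "CWEHS",
]

print(prosite_pattern(ALIGNMENT))  # C-x-[DE]-[HK]-[AGS]
```

Profiles and HMMs, listed among the signature methods, keep the per-column frequencies that a pattern discards.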

5. Protein domain/family db

Database       Method
PROSITE        Patterns / Profiles
ProDom         Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS         Aligned motifs
Pfam           HMM (Hidden Markov Models)
SMART          HMM
TIGRfam        HMM
DOMO           Aligned motifs
BLOCKS         Aligned motifs (PSI-BLAST)
CDD (CDART)    PSI-BLAST (PSSM) of Pfam and SMART

InterPro


InterPro scan results

?

Part of the protein sequence which has been 'recognized' by different modelled MSA

What makes Bee special?

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

6. Proteomics db

• Mass spectrometry (MS) database: PRIDE

• SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.

– Contain information obtained by 2D-PAGE: images of master gels and descriptions of identified proteins

– Composed of image and text files


6. PRIDE

http://www.ebi.ac.uk/pride/

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

3D structure

• Only one database: PDB (Protein Data Bank), but several servers…

• Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been obtained experimentally by X-ray or NMR studies; also a few models.

• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes, …)

7. PDB: Protein Data Bank

• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).

• Associated specialized programs allow visualization of the corresponding 3D structure (e.g. SwissPDB-viewer, Chime, RasMol).

• Currently (September 28, 2011) there are 75,000 structures for about 20,000 different proteins (highly redundant)!

http://www.pdb.org/ (RCSB)

http://www.ebi.ac.uk/pdbe/

http://www.pdbj.org/


7. PDB: example

HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2

COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3

COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4

SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5

AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6

REVDAT 1 15-OCT-92 12CA 0 12CA 7

JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8

JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9

JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10

JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11

JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12

JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13

REMARK 1 12CA 14

REMARK 2 12CA 15

REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16

REMARK 3 12CA 17

REMARK 3 REFINEMENT. 12CA 18

REMARK 3 PROGRAM PROLSQ 12CA 19

REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20

REMARK 3 R VALUE 0.170 12CA 21

REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22

REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23

REMARK 4 12CA 24

REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25

REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26

REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27

………


7. PDB (cont.)

SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68

SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69

SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70

SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71

SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72

SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73

SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74

SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75

TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76

TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77

TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78

TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79

TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80

TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81

CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82

ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83

ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84

ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85

SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86

SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87

SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88

ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89

ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90

ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91

ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92

ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93

ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94

ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95

ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96

ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97

ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98

ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99

ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100

ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101

ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102

…….

Coordinates <x; y; z> of each atom

The same PDB entry "visualized" with Chime
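The ATOM records above can be read programmatically. The following is a minimal Python sketch, assuming the whitespace-separated layout printed in this transcript (real PDB files use fixed-width columns, so splitting on whitespace is a simplification). The names `Atom` and `parse_atom_line` are hypothetical, and the trailing entry code and line counter ("12CA 89") are ignored.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str            # atom name, e.g. "CA"
    residue: str         # residue name, e.g. "TRP"
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM line as printed on the slide."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(
        serial=int(fields[1]),
        name=fields[2],
        residue=fields[3],
        residue_number=int(fields[4]),
        x=float(fields[5]),
        y=float(fields[6]),
        z=float(fields[7]),
    )


# Two records copied from the 12CA excerpt above.
records = [
    "ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89",
    "ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90",
]
atoms = [parse_atom_line(record) for record in records]

# Distance between the backbone N and CA atoms of TRP 5, in angstroms.
n_atom, ca_atom = atoms
distance = math.dist(
    (n_atom.x, n_atom.y, n_atom.z),
    (ca_atom.x, ca_atom.y, ca_atom.z),
)
print(f"{n_atom.name}-{ca_atom.name} distance: {distance:.2f}")
```

For these two atoms the distance comes out at about 1.47, which is the kind of quantity a viewer such as Chime derives from the coordinates when drawing bonds.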

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

8. Metabolism/Pathways

8. Databases: metabolic

• Contain information describing enzymes, biochemical reactions and metabolic pathways;

• Nomenclature databases store information on enzyme names and reactions: ENZYME, BRENDA, IntEnz;

• Metabolic databases: MetaCyc, KEGG, UniPathway, Rhea;

• These databases are usually tightly coupled with query software that lets the user visualise reaction schemes;

• Ligands and chemicals: ChEBI, KEGG LIGAND.

8. BRENDA

http://www.brenda-enzymes.org/

Useful for preparing lab experiments!

8. KEGG

http://www.genome.ad.jp/kegg

Outline

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

9. Bibliography

9. Bibliography

• Bibliographic reference databases contain citations and abstracts of published life science articles;

• Examples: PubMed, PubMed Central;

• Other, more specialized databases also exist:

– Agricola (http://agricola.nal.usda.gov/)

– EMBASE (not free)

9. PubMed / Medline

• Established in 1950;

• Database of citations and abstracts from biomedical and other life science journal literature;

• Encompasses Medline;

• Gives access to:

– > 21 million papers (dating back to the 1860s),

– > 20,400 life science journals,

– ~ 55 languages (17,751 journals in English, 2,000 in French, 372 in Chinese, 29 in Latin, 1 in Azerbaijani, etc.).

PMID: 10923642 (PubMed ID)

UI: 20378145 (Medline ID)

DOI : 10.1016/S0960-9822(03)00148-9 (Digital Object Identifier)

http://www.ncbi.nlm.nih.gov/pubmed/
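The identifiers listed above can be turned into links. This is a small Python sketch under stated assumptions: that appending a PMID to the PubMed address on the slide addresses a single record, and that DOIs are resolved through the public `https://doi.org/` resolver (not mentioned on the slide). The helper names `pubmed_url` and `doi_url` are hypothetical, and the DOI pattern is only a loose shape check, not a full validator.

```python
import re

PUBMED_BASE = "http://www.ncbi.nlm.nih.gov/pubmed/"  # address given on the slide
DOI_RESOLVER = "https://doi.org/"  # assumption: public DOI resolver

# Loose shape check only: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def pubmed_url(pmid: str) -> str:
    """Build a record link from a PubMed ID (digits only)."""
    if not (pmid.isascii() and pmid.isdigit()):
        raise ValueError(f"PMID must be numeric: {pmid!r}")
    return PUBMED_BASE + pmid


def doi_url(doi: str) -> str:
    """Build a resolver link from a DOI after a loose shape check."""
    if not DOI_PATTERN.fullmatch(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return DOI_RESOLVER + doi


# Identifiers taken from the slide.
print(pubmed_url("10923642"))
print(doi_url("10.1016/S0960-9822(03)00148-9"))
```

The PMID and the Medline UI are plain numbers assigned by the database, whereas the DOI is assigned by the publisher and stays valid independently of any one bibliographic database.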

PubMed Central

• Free digital archive of free-access full-text articles (since 2000);

• ~700 journals (list: http://www.ncbi.nlm.nih.gov/pmc/journals/), most of which have a corresponding entry in PubMed;

• Free access to the full text either immediately after publication or within a 12-month period.

http://www.ncbi.nlm.nih.gov/pmc/

10. Others

• Many databases do not fit into any of the categories listed previously;

• Examples:

– REBASE (restriction enzymes)

– TRANSFAC (transcription factors)

– CarbBank

– GlycoSuiteDB (linked sugars)

– Protein-protein interaction db (DIP, ProNet, BIND, MINT, STRING)

– Protease db (MEROPS), biotechnology patents db, microarrays, etc.;

• As well as many other resources covering every aspect of macromolecules and molecular biology.

10. Interactome

Protein-protein interactions: from 1 to more than 20,000 interactions described per publication

Several databases: IntAct, BIND, DIP, STRING

Estimate: 10,000 fundamental interaction types

10. IntAct

http://www.ebi.ac.uk/intact/

10. Gene Ontology

• The Gene Ontology (GO) is a controlled vocabulary: a set of standard terms (words and phrases) used for indexing and retrieving information.

• In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary.

• By standardizing biological data, the Gene Ontology helps ensure that the flood of information being produced can be used effectively.

http://www.geneontology.org
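What "structured vocabulary" means in practice can be sketched with a tiny graph in Python. The terms and parent links below are a simplified toy example, not the actual GO hierarchy, and the names `IS_A` and `ancestors` are hypothetical. The point is that a term may have more than one parent, so retrieving everything a term implies requires walking the relationships.

```python
# Toy "is a" relationships: each term maps to its direct parents.
# Simplified for illustration; not copied from the real Gene Ontology.
IS_A = {
    "mitosis": ["cell cycle process", "nuclear division"],
    "cell cycle process": ["cellular process"],
    "nuclear division": ["organelle fission"],
    "organelle fission": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


for name in sorted(ancestors("mitosis", IS_A)):
    print(name)
```

Because of these links, a gene product indexed with a specific term can also be retrieved by a query on any of that term's more general ancestors.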

10. Gene Ontology

About 30,000 terms (with definition and hierarchy) in three categories:

biological process

• broad biological phenomena, e.g. mitosis, growth, digestion (including PTMs)

molecular function

• molecular role, e.g. catalytic activity, binding

cellular component

• subcellular location, e.g. nucleus, ribosome, origin recognition complex

http://www.ebi.ac.uk/QuickGO/

10. Gene Ontology annotation

Annotation is the process of assigning (mapping) GO terms to gene products.

!!! Electronic vs. manual annotation…
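GO annotations carry an evidence code, and the code IEA ("Inferred from Electronic Annotation") marks assignments made without curator review. The Python sketch below separates the two kinds. The gene products and annotations are hypothetical, the evidence codes are not given on the slide, and treating every non-IEA code as "manual" is a simplification.

```python
from collections import defaultdict

ELECTRONIC_CODES = {"IEA"}  # Inferred from Electronic Annotation

# Hypothetical (gene product, GO term, evidence code) triples.
annotations = [
    ("geneA", "catalytic activity", "IDA"),  # IDA: Inferred from Direct Assay
    ("geneA", "nucleus", "IEA"),
    ("geneB", "binding", "IEA"),
    ("geneB", "mitosis", "TAS"),             # TAS: Traceable Author Statement
]


def split_by_origin(records):
    """Group annotation triples into 'electronic' and 'manual' lists."""
    groups = defaultdict(list)
    for gene, term, code in records:
        origin = "electronic" if code in ELECTRONIC_CODES else "manual"
        groups[origin].append((gene, term, code))
    return dict(groups)


groups = split_by_origin(annotations)
for origin, items in groups.items():
    print(origin, len(items))
```

Filtering on the evidence code is a simple way to decide how much weight to give an annotation before relying on it in an analysis.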

Example with EPO

Histone H4

!!! Large-scale derived data ('proteome')

10. Gene Ontology

Essential link between biological knowledge and high-throughput genomic and proteomic datasets…

'summary of the gene ontology classifications for all mapped ESTs…'

Genome-Wide RNAi screens identify genes required for ricin and Pseudomonas exotoxin intoxications

DOI 10.1016/j.devcel.2011.06.014

Gene Ontology analysis of the hit list of 2,038 genes.
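The slide does not say which analysis was run on the hit list. One common approach is an over-representation test: given how many genes in the whole screen carry a GO term, how surprising is the number of hits that carry it? The Python sketch below uses the hypergeometric distribution with toy numbers; the function name `enrichment_p_value` and all counts are hypothetical and unrelated to the cited study.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: genes screened in total
    annotated:  genes in the population carrying the GO term
    sample:     size of the hit list
    hits:       genes in the hit list carrying the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Toy numbers: 20 genes screened, 5 carry the term, 6 hits, 3 hits carry it.
p = enrichment_p_value(population=20, annotated=5, sample=6, hits=3)
print(f"p = {p:.4f}")
```

In a real analysis this test is repeated for many GO terms, so the resulting p-values would also need a multiple-testing correction.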

• Selected categories of life sciences db

1. Nucleotide sequences -> Primary db

2. Genomics

3. Mutation/polymorphism

4. Protein sequences -> Primary db

5. Protein domain/family

6. Proteomics (2D gel, Mass Spectrometry)

7. 3D structure

9. Bibliography

10. Others

• Concluding remarks

Proliferation of databases

• Which one contains the highest-quality data?

• Which is the most comprehensive?

• Which is the most up to date?

• Which is the least redundant?

• Which is the best indexed (allows complex queries)?

• Which Web server responds most quickly?

• …?

Some important practical remarks

• Databases contain many errors (automated annotation)!

• Not all db are available on all servers;

• The update frequency is not the same for all servers;

• Some servers automatically add cross-references to an entry (implicit links) in addition to the already existing links (explicit links), so the same entry can look different from one server to another.

Before the introduction to databases…

After the introduction to databases…

Marie-Claude Blatter

Swiss-Prot, Geneva

SIB Swiss Institute of Bioinformatics

Marie-Claude.Blatter@isb-sib.ch

Credits
