biological databases an introduction

Biological databases an introduction By Dr. Erik Bongcam-Rudloff LCB-UU/SLU ILRI 2007


Post on 13-Jan-2016



Page 1: Biological databases an introduction

Biological databases an introduction

By Dr. Erik Bongcam-Rudloff

LCB-UU/SLU

ILRI 2007


Page 2: Biological databases an introduction

Biological Databases

• Sequence Databases
• Genome Databases
• Structure Databases

Page 3: Biological databases an introduction

Sequence Databases

The sequence databases are the oldest type of biological database and also the most widely used.

Page 4: Biological databases an introduction

Sequence Databases

Nucleotide: ATGC

Protein: MERITSAPLG
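The two alphabets above can be told apart with a simple heuristic. The sketch below is only an illustration: the function name and the set of accepted nucleotide letters are assumptions, and a short peptide made up solely of A, C, G and T would be misclassified.

```python
# Letters accepted as nucleotides (assumption: A, C, G, T, plus U for RNA and N for unknown).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence):
    """Guess whether a sequence is nucleotide or protein from its alphabet.

    Heuristic only: a sequence using nothing but nucleotide letters is
    reported as 'nucleotide'; anything else is reported as 'protein'.
    """
    letters = set(sequence.upper())
    if letters and letters <= NUCLEOTIDE_LETTERS:
        return "nucleotide"
    return "protein"


print(guess_sequence_type("ATGC"))
print(guess_sequence_type("MERITSAPLG"))
```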

Page 5: Biological databases an introduction

The nucleotide sequence repositories

There are three main repositories for nucleotide sequences: EMBL, GenBank, and DDBJ.

All of these should in theory contain "all" known public DNA or RNA sequences.

These repositories collaborate, so that any data submitted to one of the databases is redistributed to the others.

Page 6: Biological databases an introduction

These three databases are the only ones that can issue sequence accession numbers.

Accession numbers are unique identifiers which permanently identify sequences in the databases.

These accession numbers are required by many biological journals before manuscripts are accepted.

Page 7: Biological databases an introduction

It should be noted that during the last decade several commercial companies have engaged in sequencing ESTs and genomes that they have not made public.

Page 8: Biological databases an introduction

EST databases

Expressed sequence tags (ESTs) are short sequences from expressed mRNAs.

The basic idea is to get a handle on the parts of the genome that are expressed as mRNA (often called the transcriptome).

ESTs are generated by end-sequencing clones from cDNA libraries from different sources.

Page 9: Biological databases an introduction

EST cluster databases

UniGene

UniGene is a database at NCBI that contains clusters (UniGene clusters) of sequences that represent unique genes. These clusters are made automatically by partitioning GenBank sequences into a non-redundant set of gene-oriented clusters.
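The idea of partitioning sequences into clusters can be illustrated with a toy example. This is not the UniGene procedure: the sketch below simply puts two reads in the same cluster when they share at least one exact k-mer, using a union-find structure. The function names, the value of k and the example reads are all made up for illustration.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def cluster_by_shared_kmers(sequences, k=8):
    """Partition sequences into clusters of reads linked by a shared k-mer.

    Toy single-linkage clustering with union-find; real EST clustering
    uses alignment-based similarity and many extra filters.
    """
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # k-mer -> index of the first sequence containing it
    for index, sequence in enumerate(sequences):
        for kmer in kmers(sequence, k):
            if kmer in first_owner:
                parent[find(index)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = index

    clusters = {}
    for index in range(len(sequences)):
        clusters.setdefault(find(index), []).append(index)
    return sorted(clusters.values())


# Hypothetical reads: the first two overlap, the third is unrelated.
reads = ["ACGTACGTGGA", "TACGTGGATTC", "GGGCCCAAATT"]
print(cluster_by_shared_kmers(reads))
```

Each inner list holds the indices of the reads assigned to one cluster.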

Page 10: Biological databases an introduction

Ideal minimal content of a « sequence » db

• Sequences!!
• Accession number (AC)
• References
• Taxonomic data
• ANNOTATION/CURATION
• Keywords
• Cross-references
• Documentation

Page 11: Biological databases an introduction

Example: Swiss-Prot entry

sequence

Accession number

Entry name

Page 12: Biological databases an introduction

Protein name

Gene name

Taxonomy

Page 13: Biological databases an introduction

References

Page 14: Biological databases an introduction

Comments

Page 15: Biological databases an introduction

Cross-references

Page 16: Biological databases an introduction

Keywords

Page 17: Biological databases an introduction

Feature table (sequence description)

Page 18: Biological databases an introduction

Sequence database: example

…a SWISS-PROT entry, in FASTA format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens(Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
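A FASTA record like the one above is easy to read programmatically. The following is a minimal sketch: the function names are illustrative, and the header is assumed to follow the `db|accession|entry_name description` layout seen in this entry (other sources use other header layouts).

```python
FASTA_TEXT = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens(Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split a 'db|accession|entry_name description' header into its parts."""
    identifier, _, description = header.partition(" ")
    database, accession, entry_name = identifier.split("|")
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


header, sequence = parse_fasta(FASTA_TEXT)[0]
info = split_header(header)
print(info["accession"], info["entry_name"], len(sequence))
```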

Page 19: Biological databases an introduction

SWISS-PROT knowledgebase

• Created by Amos Bairoch in 1986
• Collaboration between the SIB (CH) and EBI (UK)
• Annotated (manually), non-redundant, cross-referenced, documented protein sequence database
• ~122'000 sequences from more than 7'700 different species; 192'000 references (publications); 958'000 cross-references (databases); ~400 Mb of annotations
• Weekly releases; available from more than 50 servers across the world, the main source being ExPASy

Page 20: Biological databases an introduction

SWISS-PROT: species

• 7'700 different species
• 20 species represent about 42% of all sequences in the database
• 5'000 species are represented by only one to three sequences. In most cases, these are sequences which were obtained in the context of a phylogenetic study

Page 21: Biological databases an introduction

SWISS-PROT

• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb
• Nucleotide sequence db: EMBL, GenBank, DDBJ
• 2D and 3D structural dbs: HSSP, PDB
• Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish
• Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
• 2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
• Human diseases: MIM
• PTM: CarbBank, GlycoSuiteDB

Page 22: Biological databases an introduction

Annotations

• Function(s)
• Post-translational modifications (PTM)
• Domains
• Quaternary structure
• Similarities
• Diseases, mutagenesis
• Conflicts, variants
• Cross-references

Page 23: Biological databases an introduction

Annotation schema

Amos Bairoch

Head annotator 1, Head annotator 2, …, Head annotator n

Annotators

Experts

SwissProt

Page 24: Biological databases an introduction

Code  Content                       Occurrence in an entry
----  ----------------------------  ----------------------
ID    Identification                One; starts the entry
AC    Accession number(s)           One or more
DT    Date                          Three times
DE    Description                   One or more
GN    Gene name(s)                  Optional
OS    Organism species              One or more
OG    Organelle                     Optional
OC    Organism classification       One or more
OX    Taxonomy cross-references     One or more
RN    Reference number              One or more
RP    Reference position            One or more
RC    Reference comment(s)          Optional
RX    Reference cross-reference(s)  Optional
RA    Reference authors             One or more
RT    Reference title               Optional
RL    Reference location            One or more
CC    Comments or notes             Optional
DR    Database cross-references     Optional
KW    Keywords                      Optional
FT    Feature table data            Optional
SQ    Sequence header               One
      Amino Acid Sequence           One or more
//    Termination line              One; ends the entry

Manual annotation
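The two-letter line codes make the flat file easy to split into fields. The sketch below is a simplified illustration, not a complete parser: the sample entry is heavily abbreviated (built from the EPO_HUMAN identifiers shown earlier, with a truncated sequence and a shortened SQ line), and the function name is an assumption.

```python
from collections import defaultdict

# Abbreviated, illustrative entry: only a few line types, sequence truncated.
SAMPLE_ENTRY = """\
ID   EPO_HUMAN
AC   P01588;
DE   ERYTHROPOIETIN PRECURSOR.
OS   Homo sapiens (Human).
SQ   SEQUENCE
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP
//
"""


def parse_flat_entry(text):
    """Group the lines of one entry by line code.

    Returns (fields, sequence): 'fields' maps each two-letter code to the
    list of its data strings; 'sequence' is the concatenated residues from
    the lines that have a blank code. Parsing stops at the '//' line.
    """
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if line.startswith("//"):
            break
        code, data = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(data)
        else:
            # Sequence lines carry no code; drop the spaces between blocks.
            residues.append(data.replace(" ", ""))
    return dict(fields), "".join(residues)


fields, sequence = parse_flat_entry(SAMPLE_ENTRY)
print(fields["ID"], fields["AC"], sequence)
```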

Page 25: Biological databases an introduction

TrEMBL (Translated EMBL)

TrEMBL: created in 1996;

Computer-annotated supplement to SWISS-PROT, as it is impossible to cope with the flow of data…

Well-structured SWISS-PROT-like resource

Derived from automated EMBL CDS translation (maintained at the EBI (UK))

TrEMBL is automatically generated and annotated using software tools (not comparable to SWISS-PROT in terms of quality)

TrEMBL contains everything that is not yet in SWISS-PROT

Yerk!! But there is no choice, and these software tools are becoming quite good!
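To make "automated CDS translation" concrete, here is a minimal sketch of translating a coding sequence with the standard genetic code. It is not the pipeline used to build TrEMBL; the function name is an assumption, and the example CDS is a short illustrative sequence chosen to encode the first residues of the EPO_HUMAN entry shown earlier.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Unknown codons (for example those containing N) become 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative CDS fragment (assumption), written codon by codon for readability.
example_cds = "ATG GGG GTG CAC GAA TGT CCT GCC TGG TAA".replace(" ", "")
print(translate_cds(example_cds))
```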

Page 26: Biological databases an introduction

The simplified story of a Sprot entry

cDNAs, genomes, …

EMBLnew, EMBL

CDS

TrEMBLnew, TrEMBL

« Automatic »
• Redundancy check (merge)
• InterPro (family attribution)
• Annotation

SWISS-PROT

« Manual »
• Redundancy (merge, conflicts)
• Annotation
• Sprot tools (macros…)
• Sprot documentation
• Medline
• Databases (MIM, MGD…)
• Brain storming

Once in Sprot, the entry is no longer in TrEMBL, but it is still in EMBL (archive).

Page 27: Biological databases an introduction

TrEMBL: example

Original TrEMBL entry which has been integrated into the SWISS-PROT EPO_HUMAN entry and is therefore no longer found in TrEMBL.

Page 28: Biological databases an introduction

Some protein motif databases

• PROSITE - Regular expressions built from SWISS-PROT
• PRINTS - Aligned motif consensus built from OWL (http://bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html)
• BLOCKS - PRINTS-like, generated from PROSITE families (http://www.blocks.fhcrc.org/)
• IDENTIFY - Fuzzy regular expressions derived from PROSITE
• Pfam - Hidden Markov Models built from SWISS-PROT (http://www.sanger.ac.uk/Software/Pfam)
• Profiles - Weight matrix profiles built from SWISS-PROT
• InterPro - All of the above (almost) (http://www.ebi.ac.uk/InterPro)
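Because PROSITE patterns are essentially regular expressions, they can be converted to the syntax of a regex engine. The sketch below handles only the common elements of the pattern syntax (`x`, `[...]`, `{...}`, repeats such as `x(2,4)`, and the `<` / `>` anchors outside brackets). The function names are assumptions; the pattern used is the N-glycosylation site `N-{P}-[ST]-{P}`, and the test sequence is a fragment of the EPO_HUMAN entry shown earlier.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression.

    Simplified: anchors inside square brackets (e.g. '[G>]') are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repetition counts
        element = element.replace("x", ".")                      # any residue
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(sequence)]


fragment = "NITTGCAEHCSLNENITVPD"  # residues 51-70 of the EPO_HUMAN entry
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", fragment))
```

The lookahead wrapper is a design choice: it lets overlapping sites be reported, which a plain left-to-right scan would skip.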

Page 29: Biological databases an introduction

A domain database synchronised with SWISS-PROT

Page 30: Biological databases an introduction

History

Founded by Amos Bairoch

1988: First release in the PC/Gene software

1990: Synchronisation with Swiss-Prot

1994: Integration of « profiles »

1999: PROSITE joins InterPro

January 2003: Current release 17.32

The database

Page 31: Biological databases an introduction

Database content

Official Release
~1330  Patterns        PSxxxxx    PATTERN
~252   Profiles        PSxxxxx    MATRIX
4      Rules           PSxxxxx    RULE
~1156  Documentations  PDOCxxxxx

Pre-Release
~150   Profiles        PSxxxxx    MATRIX
~100   Documentations  QDOCxxxxx

Page 32: Biological databases an introduction

Prosite (pattern): example

Page 33: Biological databases an introduction

Prosite (pattern): example

Page 34: Biological databases an introduction

Database content: documentation

Page 35: Biological databases an introduction

Other protein domain/family db

PROSITE      Patterns / Profiles
ProDom       Aligned motifs (PSI-BLAST) (Pfam B)
PRINTS       Aligned motifs
Pfam         HMM (Hidden Markov Models)
SMART        HMM
TIGRfam      HMM
DOMO         Aligned motifs
BLOCKS       Aligned motifs (PSI-BLAST)
CDD (CDART)  PSI-BLAST (PSSM) of Pfam and SMART

InterPro

Page 36: Biological databases an introduction

InterPro: www.ebi.ac.uk/interpro

Page 37: Biological databases an introduction

InterPro example

Page 38: Biological databases an introduction

InterPro example

Page 39: Biological databases an introduction

InterPro graphic example

Page 40: Biological databases an introduction

Genomic Databases

Genome databases differ from sequence databases in that the data contained in them are much more diverse.

The idea behind a genome database is to organize all information on an organism (or as much as possible).

In many cases they stem from the need for a centralized resource for a particular genome project. But of course they are also important resources for the research community.

Page 41: Biological databases an introduction

Genomic Databases

Ensembl Genome Browser NCBI

Page 42: Biological databases an introduction

Structure Databases

• PDB
• SCOP

Page 43: Biological databases an introduction

PDB

The Protein Data Bank (PDB) was established at Brookhaven National Laboratory (BNL) (1) in 1971 as an archive for biological macromolecular crystal structures.

The three-dimensional structures in PDB are primarily derived from experimental data obtained by X-ray crystallography and NMR.

Page 44: Biological databases an introduction

SCOP

The SCOP database groups different protein structures according to their evolutionary relationship. The evolutionary relationships of all known protein structures have been determined by manual inspection and automated methods.

The goal of SCOP is to provide detailed information about the close relatives of proteins and to provide an evolutionary-based protein classification resource.

Page 45: Biological databases an introduction

UniProt: United Protein database

SWISS-PROT + TrEMBL + PIR = UniProt

Born in October 2002

NIH pledges cash for global protein database

The United States is turning to European bioinformatics facilities to help it meet its researchers' future needs for databases of protein sequences.

European institutions are set to be the main recipients of a $15-million, three-year grant from the US National Institutes of Health (NIH), to set up a global database of information on protein sequence and function known as the United Protein Databases, or UniProt (Nature, 419, 101 (2002)).

Page 46: Biological databases an introduction

Some examples of integrated biological database resources are:

• SRS (Sequence Retrieval System)
• Entrez Browser (at NCBI)
• ExPASy (home of SwissProt)
• Ensembl (Open Source based system)
• Human Genome Browser (Jim Kent's creation)

Page 47: Biological databases an introduction

THANKS

Laurent Falquet, SIB and EMBnet-CH for slides and information on SwissProt and Prosite