Other biological databases

Post on 29-May-2020

Types of biological data (diagram):

• Biological systems

• Taxonomic data

• Literature

• Protein folding and 3D structure

• Small molecules

• Pathways and networks

• Protein families and domains

• Whole genome data

• Sequence data

• Ontologies (GO)

Other Biological Databases

• Genome databases: Ensembl, FlyBase, WormBase

• Transcription factor binding sites: TRANSFAC

• Protein structure databases: PDB, SCOP, CATH

• Protein family databases: Pfam, PRINTS, PROSITE etc.

• Chemicals and small molecules: ChEBI

• Gene expression databases: GEO, ArrayExpress

• Metabolic pathways: Reactome, KEGG

• Human genetics-related databases: HapMap, dbSNP

BRCA1

• Viewed the nucleotide entry in ENA

• Viewed protein entry in Swiss-Prot

– Found links to other databases

– Linked to disease

– Has 3D structures solved

– Has many natural variants

– Etc.

• What else can we find out about it?

Genome browsers

• Integrate sequence & functional data for a genome

• Ensembl (genome browser for major eukaryotic genomes, e.g. human and mouse): http://www.ensembl.org

• UCSC browser: http://genome.ucsc.edu/

• FlyBase (Drosophila genome database): http://www.ebi.ac.uk/flybase

• WormBase (C. elegans): http://www.wormbase.org

• PlasmoDB (Plasmodium, malaria): http://plasmodb.org

• EuPathDB

Ensembl gene page

• Gene within context on chromosome

• Additional data can be mapped onto the viewer using tracks

• Visualizing intron/exon structure and variations

• Variation data for human is stored in dbSNP

dbSNP

http://www.ncbi.nlm.nih.gov/SNP/

Repository of known mutations (human and other organisms)

Human genetics databases

• GeneCards (http://www.genecards.org/)

• HapMap (http://hapmap.ncbi.nlm.nih.gov/)

• HGDP, Human Genome Diversity Project (http://hagsc.org/hgdp/files.html)

• gnomAD, Genome Aggregation Database (https://gnomad.broadinstitute.org/)

• OMIM http://www.ncbi.nlm.nih.gov/omim

BRCA1 in OMIM

Protein family databases

• Databases that produce signatures for identifying protein families or domains

• Used for functional classification of proteins

• E.g. Pfam, PROSITE, PRINTS, SMART, TIGRFAMs etc.

• Integrated into a single resource, InterPro (http://www.ebi.ac.uk/interpro)
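To make the idea of a "signature" concrete, here is a minimal sketch that translates a pattern written in simplified PROSITE-style syntax into a Python regular expression and scans a sequence with it. The function name `prosite_to_regex`, the example pattern and the toy sequence are illustrative assumptions, and the translation covers only the basic syntax (`x`, `[...]`, `{...}` and repeat counts), not the `<` and `>` anchors. Other member databases use different signature types, such as profile HMMs in Pfam, which this sketch does not cover.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split each element into its core and an optional repeat, e.g. x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {AB} means any residue except A or B
        # [AB] (A or B) and single letters are already valid regex syntax
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# Illustrative zinc-finger-like pattern and a made-up toy sequence
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
TOY_SEQUENCE = "MKCAACAAALAAAAAAAAHAAAHK"

hit = re.search(prosite_to_regex(PATTERN), TOY_SEQUENCE)
print(hit.span() if hit else "no match")
```

In practice, submitting a sequence to InterPro runs the member databases' own scanning tools, so a hand-rolled matcher like this is useful only for understanding what a pattern-type signature encodes.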

InterPro search

• Search: sequence or text (keyword, protein accession or InterPro accession)

• Results for protein accession or sequence

• Example InterPro entry

Protein structure databases

• Main resource is Protein Data Bank (PDB): http://www.rcsb.org/pdb/

• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies

• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)

• Can search by PDB code
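As a sketch of what "spatial coordinates of atoms" look like in practice, the snippet below reads ATOM and HETATM records from a file in the legacy fixed-column PDB text format and extracts the atom name, residue, chain and x/y/z coordinates. The `Atom` type and `parse_atoms` function are names chosen for this illustration. It assumes well-formed records and ignores details such as alternate locations and multiple models.

```python
from typing import Iterable, List, NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(lines: Iterable[str]) -> List[Atom]:
    """Extract atoms from ATOM/HETATM records of a legacy-format PDB file."""
    atoms = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, remarks, connectivity records, etc.
        atoms.append(
            Atom(
                name=line[12:16].strip(),          # columns 13-16
                residue=line[17:20].strip(),       # columns 18-20
                chain=line[21:22].strip(),         # column 22
                residue_number=int(line[22:26]),   # columns 23-26
                x=float(line[30:38]),              # columns 31-38
                y=float(line[38:46]),              # columns 39-46
                z=float(line[46:54]),              # columns 47-54
            )
        )
    return atoms
```

A typical use would be `with open("structure.pdb") as handle: atoms = parse_atoms(handle)`, where `structure.pdb` is a placeholder for a file downloaded by PDB code. For real work an established parser (for example the one in Biopython) is a safer choice than slicing columns by hand.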

BRCA1 in European PDB

Protein structure-related databases

• Structural family databases based on PDB: SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) and CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)

• Predicted structures in SWISS-MODEL (http://swissmodel.expasy.org//SWISS-MODEL.html)

Chemicals and small molecules

• Chemical Abstracts: http://www.cas.org/

• ChEBI: http://www.ebi.ac.uk/chebi

• KEGG (part of it includes chemicals): http://www.genome.jp/kegg

• ChemIDplus (chemicals cited in NLM databases): http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp

• MSD-Chem: ligands and chemicals in MSD

ChEBI example entry

Hierarchy for chemicals

BRCA1 in Binding DB

Protein-protein interaction databases

• Protein-protein interaction databases store pairwise interactions or complexes

• A single publication can report anywhere from 1 to more than 20,000 interactions

• IntAct: http://www.ebi.ac.uk/intact

• DIP (Database of Interacting Proteins): http://dip.doe-mbi.ucla.edu/

• BIND (Biomolecular Interaction Network Database): http://submit.bind.ca:8080/bind/
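Pairwise interactions map naturally onto an undirected graph. The sketch below builds one from a list of protein pairs and looks up the partners of a protein. The function name `build_network` is an assumption for this example, and the three pairs are hand-written for illustration, not exported from IntAct, DIP or BIND.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected interaction network from pairwise interactions."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for protein_a, protein_b in pairs:
        # Store each interaction in both directions
        network[protein_a].add(protein_b)
        network[protein_b].add(protein_a)
    return network


# Hand-written example pairs (illustrative only)
example_pairs = [
    ("BRCA1", "BARD1"),
    ("BRCA1", "PALB2"),
    ("PALB2", "BRCA2"),
]

network = build_network(example_pairs)
print(sorted(network["BRCA1"]))
```

Real exports from these databases carry far more than the two identifiers (detection method, publication, confidence score), so a production version would keep that evidence alongside each edge.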

BRCA1 interactions in UniProt

BRCA1 interactions in IntAct

Gene expression databases

• NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/

• ArrayExpress: http://www.ebi.ac.uk/arrayexpress

• Stanford microarray database: http://genome-www5.stanford.edu/

• Can usually search for experiments or particular expression profiles

GEO search page

Profiles search results

Specific entry and experiment info

BRCA1 in ArrayExpress

What does the array data look like?

• Info on experiment, array used, etc.

• Raw or processed tab-delimited file containing spots and their intensities (cy3/cy5 ratios) across different samples

• Files with metadata, e.g. sample info, annotation and coordinates of each spot on the array
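The tab-delimited layout described above can be read with a few lines of code. This sketch assumes a hypothetical file with the column headers `spot_id`, `cy3` and `cy5` (real files from GEO or ArrayExpress use their own column names) and computes a log2 ratio per spot. The direction of the ratio, cy5 over cy3, is also an assumption and depends on the experimental design.

```python
import csv
import io
import math
from typing import Dict, TextIO


def log_ratios(handle: TextIO) -> Dict[str, float]:
    """Return log2(cy5 / cy3) for each spot in a tab-delimited file."""
    ratios = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        cy3 = float(row["cy3"])
        cy5 = float(row["cy5"])
        if cy3 <= 0 or cy5 <= 0:
            continue  # a log ratio is undefined for non-positive intensities
        ratios[row["spot_id"]] = math.log2(cy5 / cy3)
    return ratios


# Made-up intensities, for illustration only
example_file = io.StringIO(
    "spot_id\tcy3\tcy5\n"
    "S1\t100\t400\n"
    "S2\t200\t100\n"
    "S3\t0\t50\n"
)
print(log_ratios(example_file))
```

Spots with a zero or negative intensity are skipped here for simplicity. Real pipelines normally apply background correction and normalization before computing ratios.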

Proteomics: SWISS-2DPAGE

Proteomics: PRIDE

www.ebi.ac.uk/pride/ (MS data)

Enzymes and metabolic pathways

• Contain information describing enzymes, biochemical reactions and metabolic pathways

• ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions

• IntEnz: Integrated relational Enzyme database

Metabolic Pathway databases

• Pathguide: lists more than 200 pathway resources

• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg, includes:

– Database of chemicals, genes and networks (metabolic, regulatory etc.)

– Well-curated and quite specific

• EcoCyc (Encyclopedia of E. coli K-12 genes and metabolism): http://ecocyc.org (curation of the entire genome)

• Reactome (curated biological pathways): http://www.reactome.org/

• UniPathway: developed and used by Swiss-Prot

Example of a pathway in BioCyc

One of BRCA1’s pathways in Reactome

Transcription factor binding sites

• TRANSFAC: database of eukaryotic transcription factors (not openly accessible)

• DBD, transcription factor prediction database: http://www.transcriptionfactor.org/index.cgi?Home

• TFsearch, for searching transcription factor binding sites: http://www.cbrc.jp/papia/howtouse/howtouse_tfsearch.html

What have we learnt about BRCA1?

• We found its genomic position on chromosome 17

• We found its genomic and mRNA sequences

• We identified the corresponding protein, its domains and 3D structure

• We determined where and when the gene is expressed and the protein is found

• We determined which proteins it interacts with and in which pathways

• We found info on SNPs and relation to human disease

Where to find the databases

• Table of addresses for major databases and tools

• Nucleic Acids Research Database issue, published each January

• Nucleic Acids Research Software issue (new)

• Expasy list of tools (not updated): http://ca.expasy.org/links.html

Large scale data retrieval

• Programmatic access to many databases

• MySQL access to some

• BioMart access: public and private

• FTP sites: large data downloads
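As one example of programmatic access, the sketch below looks up a gene by symbol through the Ensembl REST service at https://rest.ensembl.org, using its `lookup/symbol` endpoint. The helper names `lookup_url` and `lookup_gene` are assumptions made for this example, and only the Python standard library is used. The fields in the returned record are defined by Ensembl and may change, so check the service documentation before relying on specific keys.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Build the REST URL for looking up a gene by its symbol."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def lookup_gene(symbol: str, species: str = "homo_sapiens", timeout: int = 30) -> dict:
    """Fetch the gene record as a dictionary (requires network access)."""
    with urlopen(lookup_url(symbol, species), timeout=timeout) as response:
        return json.load(response)


print(lookup_url("BRCA1"))
```

Calling `lookup_gene("BRCA1")` performs the request itself. For genome-scale retrieval, the bulk routes listed above (BioMart, MySQL, FTP) are usually more appropriate than issuing one request per gene.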

Other tutorials

• http://www.ensembl.org/info/website/tutorials/index.html

• http://www.ebi.ac.uk/training/online/

• http://www.ncbi.nlm.nih.gov/guide/training-tutorials/