other biological databases - south african national...
[Diagram of biological data types: taxonomic data; literature; protein folding and 3D structure; small molecules; pathways and networks; biological systems; protein families and domains; whole genome data; sequence data; ontologies (GO)]
Other Biological Databases
• Genome databases: Ensembl, FlyBase, WormBase
• Transcription factor binding sites: TRANSFAC
• Protein structure databases: PDB, SCOP, CATH
• Protein family databases: Pfam, Prints, PROSITE etc.
• Chemicals and small molecules: ChEBI
• Gene expression databases: GEO, ArrayExpress
• Metabolic pathways: Reactome, KEGG
• Human genetics-related databases: HapMap, dbSNP
BRCA1
• Viewed the nucleotide entry in ENA
• Viewed protein entry in Swiss-Prot
– Found links to other database
– Linked to disease
– Has 3D structures solved
– Has many natural variants
– Etc.
• What else can we find out about it?
Genome browsers
• Integrate sequence & functional data for a genome
• Ensembl –genome browser for major eukaryotic genomes, e.g. human, mouse etc. http://www.ensembl.org
• UCSC browser -http://genome.ucsc.edu/
• FlyBase –Drosophila genome database: http://www.ebi.ac.uk/flybase
• WormBase –C. elegans: http://www.wormbase.org
• PlasmoDB –Plasmodium (malaria): http://plasmodb.org
• EuPathDB
Human genetics databases
• GeneCards (http://www.genecards.org/)
• HapMap (http://hapmap.ncbi.nlm.nih.gov/)
• HGDP Human Genome Diversity Project
(http://hagsc.org/hgdp/files.html)
• GnomAD Genome Aggregation Database
https://gnomad.broadinstitute.org/
• OMIM http://www.ncbi.nlm.nih.gov/omim
Protein family databases
• Databases that produce signatures for identifying
protein families or domains
• Used for functional classification of proteins
• E.g. Pfam, PROSITE, Prints, SMART,
TIGRFAMs etc.
• Integrated into single resource InterPro
(http://www.ebi.ac.uk/interpro)
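To make the idea of a "signature" concrete, here is a minimal sketch that converts a PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern is a C2H2-zinc-finger-style pattern used only as an illustration, the test sequence is made up, and the function name `prosite_to_regex` is hypothetical. Real resources such as InterPro also use profiles and hidden Markov models, which this sketch does not cover.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles x (any residue), [..] (allowed), {..} (forbidden),
    (n) / (n,m) repeats, and < / > anchors outside brackets.
    Anchors written inside brackets (e.g. [G>]) are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # any residue
        element = element.replace("{", "[^").replace("}", "]")    # forbidden residues
        element = element.replace("(", "{").replace(")", "}")     # repeat counts
        element = element.replace("<", "^").replace(">", "$")     # sequence ends
        parts.append(element)
    return "".join(parts)

# Illustrative pattern and a made-up sequence that contains one match.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
SEQUENCE = "MKC" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"

regex = prosite_to_regex(PATTERN)
match = re.search(regex, SEQUENCE)
print(regex)
print(match.start() if match else None)
```

A match only says the sequence carries the signature; functional classification still depends on how well the signature has been curated.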
Protein structure databases
• Main resource is Protein Data Bank (PDB): http://www.rcsb.org/pdb/
• Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
• Can search by PDB code
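The atomic coordinates mentioned above are stored, in the legacy PDB text format, as fixed-width ATOM records. The sketch below reads a few fields by column position. The two ATOM lines are made up for illustration and are not taken from a real entry, and `parse_atom_line` is a hypothetical helper; production code would normally rely on an established parser instead.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM record.

    Column positions follow the legacy PDB format (1-based):
    atom name 13-16, residue name 18-20, chain 22,
    residue number 23-26, x 31-38, y 39-46, z 47-54.
    """
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

# Made-up records laid out in the fixed-width ATOM format.
EXAMPLE_LINES = [
    "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00",
    "ATOM      2  CA  MET A   1      12.560   6.350  -6.200  1.00  0.00",
]

atoms = [parse_atom_line(l) for l in EXAMPLE_LINES if l.startswith("ATOM")]
for atom in atoms:
    print(atom["atom"], atom["residue"], atom["chain"], atom["res_seq"], atom["xyz"])
```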
Protein structure-related databases
• Structural family databases based on PDB –
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
and CATH
(http://www.biochem.ucl.ac.uk/bsm/cath/)
• Predicted structures in SWISS-MODEL
(http://swissmodel.expasy.org//SWISS-MODEL.html)
Chemicals and small molecules
• Chemical abstracts- http://www.cas.org/
• ChEBI- http://www.ebi.ac.uk/chebi
• KEGG –part of it includes chemicals http://www.genome.jp/kegg
• ChemID plus -chemicals cited in NLM databases http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
• MSD-Chem –ligands and chemicals in MSD
Protein-protein interaction databases
• Protein-protein interaction databases store pairwise
interactions or complexes
• A single publication can report anywhere from 1 to more than 20,000 interactions
• IntAct http://www.ebi.ac.uk/intact
• DIP (Database of Interacting Proteins) http://dip.doe-mbi.ucla.edu/
• BIND (Biomolecular Interaction Network Database)
http://submit.bind.ca:8080/bind/
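Pairwise interactions of the kind these databases store map naturally onto an undirected network. As a minimal sketch, the code below builds one from a list of pairs. The identifiers A to D are placeholders rather than real database records, and `build_network` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy pairwise interactions; the identifiers are placeholders, not real records.
PAIRS = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def build_network(pairs):
    """Return {protein: set of interaction partners} from pairwise interactions."""
    network = defaultdict(set)
    for first, second in pairs:
        network[first].add(second)   # interactions are treated as undirected
        network[second].add(first)
    return network

network = build_network(PAIRS)
partners_of_c = sorted(network["C"])
print(partners_of_c)
```

Complexes with more than two members need an extra modelling decision (for example, expanding them into all pairs), which this sketch leaves out.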
Gene expression databases
• NCBI Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress http://www.ebi.ac.uk/arrayexpress
• Stanford microarray database http://genome-www5.stanford.edu/
• Can usually search for experiments or particular
expression profiles
What does the array data look like?
• Info on experiment, array used, etc.
• Raw or processed tab delimited file containing
spots and their intensities (cy3/cy5 ratios) across
different samples
• Files with meta data e.g. sample info, annotation
and coordinates of each spot on array
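As a rough illustration of the tab-delimited layout described above, the sketch below parses a small made-up table of spots and computes a log2 ratio per spot. The column names (`spot_id`, `cy3`, `cy5`) and the values are assumptions for illustration only; real files from GEO or ArrayExpress use their own headers. The ratio is taken as Cy5/Cy3, one common convention; swap the channels if the inverse is needed.

```python
import csv
import io
import math

# Hypothetical processed array file: one row per spot, tab-delimited.
EXAMPLE_TSV = (
    "spot_id\tcy3\tcy5\n"
    "spot_1\t100\t400\n"
    "spot_2\t250\t250\n"
    "spot_3\t800\t200\n"
)

def log2_ratios(tsv_text):
    """Return {spot_id: log2(cy5 / cy3)} for each row of a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    ratios = {}
    for row in reader:
        cy3 = float(row["cy3"])
        cy5 = float(row["cy5"])
        ratios[row["spot_id"]] = math.log2(cy5 / cy3)
    return ratios

ratios = log2_ratios(EXAMPLE_TSV)
for spot, value in ratios.items():
    print(spot, round(value, 2))
```

Positive values mean a stronger Cy5 signal, negative values a stronger Cy3 signal, and zero means equal intensity in both channels.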
Enzymes and metabolic pathways
• Contain information describing enzymes,
biochemical reactions and metabolic pathways;
• ENZYME and BRENDA: nomenclature databases
that store information on enzyme names and
reactions;
• IntEnz: Integrated relational Enzyme database
Metabolic Pathway databases
• Pathguide: a catalogue of >200 pathway resources
• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg, which includes:
– Database of chemicals, genes and networks (metabolic, regulatory etc.)
– Well-curated and quite specific
• EcoCyc (Encyclopedia of E. coli K12 genes and metabolism): http://ecocyc.org, curation of the entire genome
• Reactome –curated biological pathways: http://www.reactome.org/
• UniPathway –developed and used by SwissProt
Transcription factor binding sites
• TRANSFAC –database of eukaryotic transcription
factors: not openly accessible
• DBD: Transcription factor prediction database:
http://www.transcriptionfactor.org/index.cgi?Home
• TFsearch: for searching transcription factor binding sites:
http://www.cbrc.jp/papia/howtouse/howtouse_tfsearch.html
What have we learnt about BRCA1?
• We found its genomic position on chromosome 17
• We found its genomic and mRNA sequences
• We identified the corresponding protein, its domains and 3D structure
• We determined where and when the gene is expressed and the protein is found
• We determined which proteins it interacts with and in which pathways
• We found info on SNPs and relation to human disease
Where to find the databases
• Table of addresses for major databases and tools
• Nucleic Acids Research Database issue, published in January each year
• Nucleic Acids Research Software issue –new
• Expasy list of tools (not updated):
http://ca.expasy.org/links.html
Large scale data retrieval
• Programmatic access to many databases
• MySQL access to some
• BioMart access –public and private
• FTP sites –large data downloads
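As one example of programmatic access, the sketch below builds a request URL for a gene lookup by symbol against the Ensembl REST service and shows how the JSON reply could be fetched. The base URL and the `lookup/symbol` path are assumptions based on the public Ensembl REST interface and should be checked against its current documentation. `lookup_url` and `fetch_json` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Ensembl REST service; verify before use.
ENSEMBL_REST = "https://rest.ensembl.org"

def lookup_url(symbol, species="homo_sapiens"):
    """Build a URL asking for the gene record that matches a symbol."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    query = urllib.parse.urlencode({"content-type": "application/json"})
    return ENSEMBL_REST + path + "?" + query

def fetch_json(url, timeout=30):
    """Download a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

url = lookup_url("BRCA1")
print(url)
# To query the service for real (network required):
# record = fetch_json(url)
```

For bulk downloads, the FTP sites or BioMart are usually a better fit than issuing many individual requests like this one.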