Bio-Trac 25 (Proteomics: Principles and Methods)
March 25, 2005
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist
Protein Information Resource
National Biomedical Research Foundation, GUMC
Tutorial: Bioinformatics Resources (http://pir.georgetown.edu/~huz/class/bioinfo_resource.html)
What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Molecular Biology Database Collection (http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5)
719 key databases in 14 categories
Database Collection in Nucleic Acids Res.
[Bar chart: NAR Molecular Biology Database Collection. Y-axis: Database number; x-axis: Year. 1999: 202; 2000: 226; 2001: 281; 2002: 335; 2003: 386; 2004: 548; 2005: 719]
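The yearly counts in the chart (202 databases in 1999 through 719 in 2005) can be turned into year-over-year increases and an overall fold change. The following is a minimal sketch that uses only those seven counts; the variable names are illustrative.

```python
# Database counts from the NAR Molecular Biology Database Collection chart.
counts = {
    1999: 202, 2000: 226, 2001: 281, 2002: 335,
    2003: 386, 2004: 548, 2005: 719,
}

years = sorted(counts)

# Year-over-year increase in the number of listed databases.
increase = {y: counts[y] - counts[y - 1] for y in years[1:]}

# Overall fold change between the first and last year on the chart.
fold_change = counts[years[-1]] / counts[years[0]]

for y in years[1:]:
    print(f"{y}: +{increase[y]} databases")
print(f"{years[0]}-{years[-1]}: {fold_change:.2f}-fold")
```

The two largest yearly increases fall in 2004 and 2005, which is the steep rise visible at the right of the chart.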
(http://pir.georgetown.edu/~huz/class/2005_database_update.html)
Overview
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. Proteomics databases
Database Contents, Search and Retrieval
Entrez Text Searches (http://www.ncbi.nlm.nih.gov/Entrez/)
PubMed Literature Database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
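A PubMed text search like the one on this slide can also be prepared from a script. The sketch below only builds a search URL and does not send it. It assumes NCBI's E-utilities `esearch` endpoint, which is not part of the slides; the function name `pubmed_search_url` and the example query are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint, which is not mentioned
# in the slides. The tutorial itself uses the Entrez web form.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Build (but do not send) a PubMed text-search URL for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical query in the spirit of the crystallin examples in this tutorial.
url = pubmed_search_url("alpha crystallin AND rabbit")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect before it is sent.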
UniProt Text Search
(http://www.pir.uniprot.org/cgi-bin/textSearch)
PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
What is the difference between CRAA_RABIT and CYRBAA?
How about Search: Crystallin and SuperFamily?
PIR Text Search (II)
Can you use the PIR text search to find which crystallins have a determined 3D structure?
I. Sequence & Genomics Databases
GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
UniProt Consortium Database: Universal protein knowledgebase, a central resource of protein sequence and function from Swiss-Prot, TrEMBL and PIR.
Entrez Gene: Gene-centered information at NCBI.
UniGene: Unified clusters of ESTs and full-length mRNA sequences.
OMIM: Online Mendelian inheritance in man: a catalog of human genetic and genomic disorders.
Model Organism Genome Databases: MGD, RGD, SGD, Flybase…
GeneCards: Integrated database of human genes, maps, proteins and diseases.
SNP Consortium Database
UniProt Consortium Database
(http://www.uniprot.org)
UniProtKB (knowledgebase)
UniRef (100, 90, 50)
UniParc (archive)
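The three UniRef levels group sequences at 100%, 90% and 50% sequence identity. The toy sketch below illustrates only the idea of identity-threshold clustering. It is not the production UniRef pipeline: the greedy first-fit scheme, the position-by-position identity measure on equal-length strings, and the five made-up sequences are all assumptions chosen for brevity.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    assert len(a) == len(b), "toy example: sequences must be pre-aligned"
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Put each sequence into the first cluster whose representative (its
    first member) is at least `threshold` identical; else start a new one."""
    clusters = []  # each cluster is a list of sequence names
    for name, seq in seqs.items():
        for members in clusters:
            if identity(seqs[members[0]], seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Made-up 10-residue sequences, not real crystallin data.
toy = {
    "s1": "MDVTIQHPWF",
    "s2": "MDVTIQHPWF",
    "s3": "MDVTIQHPWY",
    "s4": "MDIAIHHPYF",
    "s5": "AAAAAAAAAA",
}

for level, threshold in [(100, 1.0), (90, 0.9), (50, 0.5)]:
    clusters = greedy_cluster(toy, threshold)
    print(f"toy level {level}: {len(clusters)} clusters -> {clusters}")
```

On this toy input the number of clusters falls from 4 to 3 to 2 as the threshold is relaxed, which is why the lower-identity sets are smaller and faster to search.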
UniProt Sequence Report (I)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRAA_RABIT)
UniProt Sequence Report (II)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)
Entrez Gene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq
OMIM: Online Mendelian inheritance in man
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)
II. Protein Family Databases
Whole Proteins
  PIRSF: A Network Classification System of Protein Families
  COG (Clusters of Orthologous Groups) of Complete Genomes
  ProtoNet: Automated Hierarchical Classification of Proteins
Protein Domains
  Pfam: Alignments and HMM Models of Protein Domains
  SMART: Protein Domain Families
  CDD: Conserved Domain Database
Protein Motifs
  PROSITE: Protein Patterns and Profiles
  BLOCKS: Protein Sequence Motifs and Alignments
  PRINTS: Protein Sequence Motifs and Signatures
Integrated Family Databases
  iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
  InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily
Protein Clustering
COGs: (http://www.ncbi.nlm.nih.gov/COG/)
KOGs: Eukaryotic Clusters
(http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi?KOG3591)
Domain Classification
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=CRAA_RABIT)
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRAA_RABIT)
Pfam Domain (http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF. (http://www.ebi.ac.uk/interpro/search.html)
PIRSF: Full Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)
Protein Motifs
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles. (http://us.expasy.org/prosite/)
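PROSITE patterns are written in a compact syntax: `x` matches any residue, `[ST]` lists allowed residues, `{P}` lists excluded residues, and `x(2,4)` gives a repeat range. As a sketch of how such a pattern can be applied to a sequence, the hypothetical helper `prosite_to_regex` below translates the common syntax elements into a Python regular expression. It is a simplified translation that does not handle anchors inside brackets, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # database entries end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


# N-glycosylation site pattern: N, then not P, then S or T, then not P.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")

# Made-up sequence for illustration only.
sequence = "MKNGTAANPSL"
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Positions are 0-based offsets into the sequence. `re.finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search.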
III. Databases of Protein Functions
Metabolic Pathways, Enzymes, and Compounds
  Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
  KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
  LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
  EcoCyc: Encyclopedia of E. coli Genes and Metabolism
  MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
  WIT: Functional Curation and Metabolic Models
  BRENDA: Enzyme Database
  UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
Cellular Regulation and Gene Networks
  EpoDB: Genes Expressed during Human Erythropoiesis
  BIND: Descriptions of interactions, molecular complexes and pathways
  DIP: Catalogs experimentally determined interactions between proteins
  BioCarta: Biological pathways of human and mouse
  GO: Gene Ontology Consortium Database
KEGG Metabolic & Regulatory Pathways
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html)
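The `show_pathway` URL above carries two pieces of information in its query string: a pathway identifier (`hsa00220`) and an EC number to highlight (`4.3.2.1`). The sketch below splits them apart and looks up the top-level enzyme class. The reading of the query as "organism code + map number + EC number" is an assumption based on this single example, and `parse_show_pathway` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Top-level classes of the EC-IUBMB enzyme classification (first EC digit).
EC_CLASSES = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
    4: "Lyases", 5: "Isomerases", 6: "Ligases",
}


def parse_show_pathway(url):
    """Split a KEGG show_pathway URL into organism, map number and EC number."""
    query = urlsplit(url).query                  # e.g. "hsa00220+4.3.2.1"
    pathway_id, ec_number = query.split("+", 1)
    # Assumed layout: lowercase organism code followed by a numeric map id.
    organism, map_number = re.fullmatch(r"([a-z]+)(\d+)", pathway_id).groups()
    first_digit = int(ec_number.split(".")[0])
    return {
        "organism": organism,
        "map_number": map_number,
        "ec_number": ec_number,
        "ec_class": EC_CLASSES.get(first_digit, "unknown"),
    }


info = parse_show_pathway(
    "http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1"
)
print(info)
```

EC 4.3.2.1 is argininosuccinate lyase, the enzyme activity associated with delta crystallin II in the lab exercise, so the class lookup returns "Lyases".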
BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/)
BioCarta Cellular Pathways (http://www.biocarta.com/index.asp)
Protein-Protein Interaction: BIND (http://www.bind.ca/)
Gene Ontology (http://www.geneontology.org/)
Three GOs: Molecular Function, Biological Process, Cellular Component
IV. Databases of Protein Structures
Protein Structure
  PDB: Structure Determined by X-ray Crystallography and NMR
  PDBsum: Summaries and analyses of PDB structures
  MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
  SWISS-MODEL Repository: Database of annotated protein 3D models
  ModBase: Annotated comparative protein structure models
Structure Classification
  CATH: Hierarchical Classification of Protein Domain Structures
  SCOP: Familial and Structural Protein Relationships
  FSSP: Protein Fold Classification Based on Structure-Structure Alignment
PDB 3D Structure
(http://www.rcsb.org/pdb/)
Rat gamma-crystallin, chain A, B.
Can you do a text search at PIR to find this?
PDBsum: Summary and Analysis (http://www.biochem.ucl.ac.uk/bsm/pdbsum)
Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures (http://www.biochem.ucl.ac.uk/bsm/cath_new/)
Protein Structural Classification (2)
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)
SCOP: comprehensive description of structural and evolutionary relationships between all proteins whose structure is known.
SWISS-MODEL Repository
A database of annotated three-dimensional comparative protein structure models (http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2)
VI. Proteomic Resources
GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes; SWISS-2DPAGE (http://www.expasy.org/ch2d/)
PEP: Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences
Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes
Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes
Expression Profiling databases: GNF (http://expression.gnf.org/cgi-bin/index.cgi, human and mouse transcriptome), SMD (http://genome-www5.stanford.edu/MicroArray/SMD/, Stanford microarray data analysis), EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html, managing, storing and analyzing microarray data)
2D-Gel Image Databases (1) (http://us.expasy.org/ch2d/2d-index.html)
(http://us.expasy.org/cgi-bin/nice2dpage.pl?P02489)
2D-Gel Image Databases (2) (http://gelbank.anl.gov/2dgels/index.asp)
Expression Profiling
Human and Mouse Transcriptome
(http://expression.gnf.org/cgi-bin/index.cgi)
(http://genome-www.stanford.edu/serum/)
(http://expression.gnf.org/cgi-bin/index.cgi/)
Lab:
Choose additional protein IDs to browse the variety of molecular biology databases each sequence report links to.
Delta crystallin II (Argininosuccinate lyase) (UniProt: CRD2_ANAPL)
Alpha crystallin (UniProt: CRAA_RABIT)
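For the lab, report links for any protein ID can be generated rather than typed by hand. The sketch below reuses the two entry-report URL patterns that appear earlier in this tutorial (`unipEntry` and `ipcEntry`); these 2005-era addresses may no longer resolve. It also splits a UniProt entry name into its protein and species mnemonics. The helper names are hypothetical.

```python
# URL templates copied from earlier slides in this tutorial; these 2005-era
# addresses may no longer resolve.
REPORT_URLS = {
    "unipEntry": "http://www.pir.uniprot.org/cgi-bin/unipEntry?id={id}",
    "ipcEntry": "http://pir.georgetown.edu/cgi-bin/ipcEntry?id={id}",
}

# Protein IDs given in the lab exercise.
LAB_IDS = ["CRD2_ANAPL", "CRAA_RABIT"]


def split_entry_name(entry_name):
    """Split a UniProt entry name into (protein, species) mnemonics."""
    protein, species = entry_name.split("_", 1)
    return protein, species


def report_links(entry_name):
    """Fill each URL template with the given entry name."""
    return {key: tpl.format(id=entry_name) for key, tpl in REPORT_URLS.items()}


for entry in LAB_IDS:
    protein, species = split_entry_name(entry)
    print(f"{entry}: protein={protein}, species={species}")
    for key, link in report_links(entry).items():
        print(f"  {key}: {link}")
```

Adding another protein ID to `LAB_IDS` is enough to extend the exercise to more sequence reports.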