1
Protein Bioinformatics – Advances and Challenges
Sona Vasudevan, Peter McGarvey
2
Outline
• What is Bioinformatics? Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR's leading role in caBIG, Biodefense and Ontology
3
What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000)
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computer (Information) + Mouse (Biology) = Bioinformatics
5
Dr. Margaret Oakley Dayhoff (1925 – 1983)
The origin of the single-letter code for the amino acids
Evolution of Protein databases
(Georgetown University)
6
Challenges we are facing today!
• Total number of sequences in NR: ~4,919,302
• Total number of environmental sequences: ~6,028,191 (NCBI)
• Number of domain families (Pfam): ~8957
• Number of domain families (SMART): ~665
• Number of structures (PDB): ~43339
• Number of COGs: ~4873 (Unicellular), ~4852 (Eukaryote)
7
Molecular Biology Databases
719 Databases in 14 categories
The DNA sequence database has exceeded 100 gigabases.
11
Protein Information Resource
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis
http://pir.georgetown.edu
Integrated Protein Informatics Resource for Proteomics Research
12
UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Non-redundant Reference Databases for Sequence Search
[Figure: UniProt database architecture. Labels: source data (Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data); UniParc (Archive); Merging; Classification, Literature-Based & Automated Annotation; UniProt (Knowledgebase); Clustering at 100, 90, 50% Identity; UniRef100 (NREF), UniRef90, UniRef50]
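The idea of clustering at 100, 90 and 50% identity can be illustrated with a toy sketch. This is not the actual UniRef procedure: the identity measure below is a simplified ungapped, position-by-position comparison, the greedy scheme (longest sequence first, join the first representative that passes the threshold) is an assumption made for illustration, and the sequences and IDs are invented.

```python
def identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity relative to the longer sequence.

    A simplification for illustration; real pipelines use alignments.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Return {representative_id: [member_ids]} for one identity threshold."""
    clusters = {}
    # Longest sequences first, so each representative is the longest member.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]  # no representative was close enough
    return clusters


# Hypothetical toy sequences (not real database entries).
toy = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",  # identical to P1
    "P3": "MKTAYIAKQK",  # 9 of 10 positions match P1
    "P4": "MKTGGGAKQK",  # 6 of 10 positions match P1
    "P5": "WWWWWWWWWW",  # unrelated
}

for level in (1.0, 0.9, 0.5):
    print(f"{int(level * 100)}% identity:", greedy_cluster(toy, level))
```

Lowering the threshold merges more sequences into each cluster, which is why the 50% set is the smallest and most useful for fast, broad searches.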
13
UniProt Knowledgebase
• Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
• Information Content
  • Isoform Presentation
  • Nomenclature
  • Family Classification and Domain Identification
  • Functional Annotation
• Approaches
  • Full Classification
  • Automated Annotation
  • Literature-Based Curation
  • Database Cross-References
  • Controlled Vocabularies & Ontologies
  • Evidence Attribution
14
PIRSF Classification System
• PIRSF:
  • Reflects evolutionary relationships of full-length proteins
  • A network structure from superfamilies to subfamilies
• Definitions:
  • Homeomorphic Family (HF): Basic Unit
  • Homologous: Common ancestry, inferred by sequence similarity
  • Homeomorphic: Full-length similarity & common domain architecture
  • Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  • Network Structure: Allows multiple parents
• Advantages:
  • Annotate both general biochemical and specific biological functions
  • Accurate propagation of annotation and development of standardized protein nomenclature and ontology
Credit AN Nikolskaya
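A network that allows multiple parents can be sketched as a directed graph rather than a strict tree. The class and all node names below (SF_A, SF_B, HF_1, SUB_1a) are hypothetical and only illustrate the data structure, not the PIRSF database schema.

```python
from collections import defaultdict


class FamilyNetwork:
    """Classification network in which a node may have several parents."""

    def __init__(self):
        self.parents = defaultdict(set)

    def add_link(self, child: str, parent: str) -> None:
        self.parents[child].add(parent)

    def ancestors(self, node: str) -> set:
        """All nodes reachable by following parent links upward."""
        seen = set()
        stack = [node]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical example: one homeomorphic family under two superfamilies.
net = FamilyNetwork()
net.add_link("HF_1", "SF_A")
net.add_link("HF_1", "SF_B")   # second parent: not possible in a plain tree
net.add_link("SUB_1a", "HF_1")

print(sorted(net.ancestors("SUB_1a")))
```

Storing parents as a set per node keeps the number of levels flexible: a subfamily can sit any number of steps below a superfamily.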
15
PIRSF Classification System
Protein Classification and Functional Annotation
(http://pir.georgetown.edu/pirsf/)
• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools
  • Taxonomy Distribution and Phylogenetic Pattern
  • Iterative BlastClust Tree with Annotation Table, MSA & Phylogenetic tree
16
Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
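Single-linkage clustering over pairwise hits can be sketched with a union-find structure. This is a minimal illustration, not the BlastClust implementation: the hit tuples (query, subject, percent identity, length coverage) are invented, and the iteration shown simply relaxes the identity cutoff while keeping length coverage fixed.

```python
def single_linkage(ids, hits, min_identity, min_coverage):
    """Group ids so that any chain of passing hits ends up in one cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, ident, coverage in hits:
        if ident >= min_identity and coverage >= min_coverage:
            parent[find(query)] = find(subject)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise hits: (query, subject, % identity, length coverage).
ids = ["A", "B", "C", "D", "E"]
hits = [
    ("A", "B", 80, 0.95),
    ("B", "C", 60, 0.92),
    ("C", "D", 70, 0.40),  # fails coverage, e.g. only one shared domain
    ("D", "E", 30, 0.99),
]

# Iterate from strict to relaxed identity at a fixed length coverage.
for cutoff in (50, 25):
    print(cutoff, single_linkage(ids, hits, cutoff, min_coverage=0.9))
```

A and C are never compared directly, yet they share a cluster through B. That chaining is what makes the method single-linkage, and it is also why curator guidance on the thresholds matters.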
17
PIRSF-Based Protein Annotation
Classification-Driven Rule-Based Annotation
• Provides Consistent Annotation and Database Integrity Check
• Includes:
  • Site Rule (PIRSR): Position-Specific Site Feature (FT)
  • Name Rule (PIRNR): transfer name from PIRSF to individual proteins
    • Protein Name (DE) with Synonym, EC, Misnomer
    • GO Term

Rule ID | Rule Condition | Rule Description (Name Rule Interface)
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-1 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
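The three name rules in the table can be written as data plus a small matching function. The rule IDs, conditions, names and EC numbers come from the table. The `NameRule` class, the `match_name_rule` function, and the use of a lineage tuple containing "Vertebrata" to decide the taxonomic condition are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NameRule:
    rule_id: str
    pirsf: str
    vertebrate: Optional[bool]  # True: vertebrates only, False: non-vertebrates, None: any
    name: str
    ec: Optional[str] = None
    misnomer: Optional[str] = None


RULES = [
    NameRule("PIRNR000881-1", "PIRSF000881", True,
             "S-acyl fatty acid synthase thioesterase", ec="3.1.2.14"),
    NameRule("PIRNR000881-2", "PIRSF000881", False,
             "Type II thioesterase", ec="3.1.2.-"),
    NameRule("PIRNR025624-1", "PIRSF025624", None,
             "ACT domain protein", misnomer="chorismate mutase"),
]


def match_name_rule(pirsf: str, lineage: tuple) -> Optional[NameRule]:
    """Return the first rule whose family and taxonomic condition both hold."""
    is_vertebrate = "Vertebrata" in lineage  # assumed lineage representation
    for rule in RULES:
        if rule.pirsf != pirsf:
            continue
        if rule.vertebrate is None or rule.vertebrate == is_vertebrate:
            return rule
    return None


human_like = ("Eukaryota", "Metazoa", "Vertebrata")
bacterial = ("Bacteria",)
print(match_name_rule("PIRSF000881", human_like).name)
print(match_name_rule("PIRSF000881", bacterial).name)
print(match_name_rule("PIRSF025624", bacterial).misnomer)
```

Keeping the rules as data rather than code means a misnomer such as "chorismate mutase" can be flagged for every member of the family in one pass.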
18
Rule-based Annotation of Protein Entries Using PIRSF
[Figure labels: Structure; Binding/active sites; Identification of residues]
19
Methodology
• Defining a Rule
  • Select template structure
  • Align curated PIRSF seed members and structural template
  • Structure-based sequence alignment of seeds
  • Edit MSA retaining conserved regions covering all site residues
  • Build Site HMM from concatenated conserved regions
• Rule Condition
  • Membership Check (PIRSF HMM threshold)
  • Conserved Region Check (site HMM threshold)
  • Site Residue Check (position-specific residue in HMMAlign)
• Rule Propagation
  • Propagate conserved feature annotation to all members that fit the rule
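The three rule conditions and the propagation step can be sketched as one function. The HMM searches and the alignment are assumed to have been run already; the scores, thresholds, rule ID, alignment columns and the toy aligned sequences below are all hypothetical, and higher scores are assumed to be better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRule:
    rule_id: str
    family_threshold: float  # PIRSF HMM score cutoff (membership check)
    site_threshold: float    # site HMM score cutoff (conserved region check)
    sites: dict              # 0-based alignment column -> allowed residues
    feature: str             # annotation to propagate


def residue_position(aligned_seq: str, column: int) -> int:
    """1-based position in the ungapped sequence of the residue at a column."""
    return sum(1 for c in aligned_seq[: column + 1] if c != "-")


def apply_site_rule(rule, family_score, site_score, aligned_seq):
    """Return [(position, residue, feature)] if every check passes, else None."""
    if family_score < rule.family_threshold:   # 1. membership check
        return None
    if site_score < rule.site_threshold:       # 2. conserved region check
        return None
    features = []
    for column, allowed in sorted(rule.sites.items()):
        residue = aligned_seq[column]
        if residue not in allowed:              # 3. site residue check (gaps fail)
            return None
        features.append((residue_position(aligned_seq, column), residue, rule.feature))
    return features                             # propagate the annotation


rule = SiteRule("EXAMPLE-RULE", 50.0, 20.0, {3: "ST", 4: "H"}, "active site")
print(apply_site_rule(rule, 80.0, 35.0, "MK-SHD"))
print(apply_site_rule(rule, 80.0, 35.0, "MK-AHD"))
```

The column-to-position mapping matters because site rules are defined on the alignment, while the propagated features must refer to positions in each member's own sequence.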
21
PIRSF Protein Classification provides a platform for protein annotation
• Improves Annotation Quality
  • Annotation of biological function of whole proteins
  • Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
  • Correction of annotation errors
  • Improvement of under- or over-annotated proteins
• Standardization of Protein Names
22
Data Integration
• Data Warehouse
  • Local Copy of Databases in a Unified Database Schema
  • Allows Local Control of Data; Update Problem
• Hypertext Navigation
  • Browsing Model with Hypertext Links
  • Allows Direct Interaction; Easily Lost in Cyberspace
• iProClass Approach
  • Data Warehouse + Hypertext Navigation
  • Rich Links (Links + Executive Summaries)
  • Modular and Open Framework for Adding New Components in Distributed Networking Environment
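A rich link, read as a hypertext link plus a locally stored executive summary, might be modeled as below. The class names, fields, identifiers and example.org URLs are all hypothetical and are not the iProClass schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RichLink:
    database: str
    identifier: str
    url: str      # hypertext navigation: points to the source database
    summary: str  # warehouse part: short summary kept in the local copy


@dataclass
class ProteinEntry:
    accession: str
    links: list = field(default_factory=list)

    def report(self) -> str:
        """One line per link, so a reader can decide before following it."""
        lines = [f"Protein {self.accession}"]
        for link in sorted(self.links, key=lambda l: (l.database, l.identifier)):
            lines.append(f"  {link.database}:{link.identifier} - {link.summary} <{link.url}>")
        return "\n".join(lines)


# Hypothetical entry with placeholder identifiers and URLs.
entry = ProteinEntry("EXAMPLE_1")
entry.links.append(RichLink("StructureDB", "struct-001",
                            "https://example.org/structure/struct-001",
                            "crystal structure, 2.0 A"))
entry.links.append(RichLink("FamilyDB", "fam-042",
                            "https://example.org/family/fam-042",
                            "thioesterase family"))
print(entry.report())
```

New source databases can be added as further `RichLink` records without changing the entry class, which is the sense in which the framework stays modular.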
23
iProClass Database
• ~5,000,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt
• Integrated Protein Family, Function, Structure Information
[Figure: iProClass at the center (Protein Sequence, Superfamily/Domain/Motif, Protein Structure, Protein Function/Pathway, Protein Interaction, Protein Modification, Protein Expression), linked to source databases by category:]
• Protein Sequence: PIR-NREF, PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept
• Family: PIR Superfamily, PIR-ASDB, InterPro, Pfam, PROSITE, COG, BLOCKS, ProClass, MetaFam
• Structure: PDB, SCOP, CATH, PDBSum, MMDB, FSSP
• Function/Pathway: EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc, EcoCyc, Gene Ontology
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, GDB, OMIM, SGD, MGI, FlyBase, MIPS, TIGR
• Interaction: DIP, BIND
• Modification: RESID, PhosphoBase, PhosphorylationSite
• Expression: PMG
• Taxonomy: NCBI Taxon
• Literature: PubMed
24
iProClass Views
Sequence Report
Family Report
26
1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis
[Figure labels: Scripps, Caprion, Myriad, Harvard, U of Michigan, Albert Einstein, PNNL; Resource Center (SSS, PIR, VBI); DATA]
28
Currently contains 3,733 ORF Clones out of 3,784 Proteins
Master Protein Directory
29 Colonization Pathway Proteins
29
• Protein Summary Report
• Clone Sequences
• Order Clones from Repositories
• Protein and Reagent Information
• Search for Related Proteins in Catalog by Family Classification or Similarity Searches
NCI caBIG Initiative
cancer Biomedical Informatics Grid:
• Informatics platform to enable sharing of research, data and tools
• Designed and built by an open federation of organizations
• Facilitate connectivity via common standards and unifying architecture
• Open source and open access principles
• Domain Workspaces
  • Clinical Trial Management Systems
  • Integrative Cancer Research
  • Imaging
  • Tissue Banks and Pathology Tools
• Cross Cutting Workspaces
  • Architecture
  • Vocabularies and Common Data Elements
PIR Activities in caBIG™
• Integrative Cancer Research Workspace
  • Developer
    • Grid-enablement of PIR
  • Adopter
    • SEED Genome Annotation Tool (completed)
    • GeneConnect Genomic Identifier Mapping Service
• Vocabularies and Common Data Elements
  • Participant