Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology
In-Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot
Fortaleza, Brazil, August 4, 2006
Cathy H. Wu, Ph.D.
Director, Protein Information Resource
Professor, Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
2
Wu CH, Zhao S, Chen HL. (1996)
A protein class database organized with PROSITE protein groups and PIR superfamilies.
Journal of Computational Biology, 3 (4), 547-562.
3
Protein Information Resource (PIR)
UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
PIRSF Family Classification System: Protein Classification and Functional Annotation
iProClass Integrated Protein Database: Data Integration and Protein Mapping
iProLINK Literature Mining Resource: Annotation Extraction
Other Projects: NIAID Proteomics, caBIG Grid-Enablement
Integrated Protein Informatics Resource for Genomic/Proteomic Research
http://pir.georgetown.edu
4
PIR Protein Sequence Database
• The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3.
• Margaret Dayhoff collected all the known protein sequences to study protein evolution.
• The first Atlas contained 65 proteins; the final volume had 1081 proteins.
• The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins).
• PIR-PSD has been integrated with UniProt since 2002.
[Figure: Number of Sequences (0 to 300,000) versus PIR-PSD Release Number (1 to 76); annotation: Joined UniProt (Jan 2002)]
5
UniProt Activities at PIR
• Integration of PIR-PSD into UniProtKB
  • Incorporation of unique PIR entries
  • Incorporation of PIR annotations: references, experimental features with literature evidence tag
• Functional annotation of UniProtKB proteins
  • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
  • Development of rule-based annotation system & PIRNR (name rule) / PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature)
• Production of UniRef100/90/50 databases
• Creation of UniProt web site and help system => Unified UniProt web site & user community interaction
6
PIRSF Classification System
Protein Classification and Functional Annotation
• PIRSF: Evolutionary relationships of proteins from super- to sub-families
• Curated families with name rules and site rules
• Curation platform with classification/visualization tools
• Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform
[Figure: PIRSF classification network, with example nodes at each level]
Domain Superfamily
• One common Pfam domain
• PF02153: Prephenate dehydrogenase (PDH)
• PF01817: Chorismate mutase (CM)
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PF02735: Ku70/Ku80 beta-barrel domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
• PIRSF800001: Ku70/80 autoantigen
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
• With PF02153 (PDH): PIRSF001499: Bifunctional CM/PDH (T-protein); PIRSF006786: PDH, feedback inhibition-insensitive; PIRSF005547: PDH, feedback inhibition-sensitive
• With PF01817 (CM): PIRSF017318: CM of AroQ class, eukaryotic type; PIRSF001501: CM of AroQ class, prokaryotic type; PIRSF026640: Periplasmic CM; PIRSF001500: Bifunctional CM/PDT (P-protein); PIRSF001499: Bifunctional CM/PDH (T-protein)
• With PF00219 (IGFBP): PIRSF001969: IGFBP; PIRSF018239: IGFBP-related protein, MAC25 type
• With PF02735 (Ku): PIRSF003033: Ku70 autoantigen; PIRSF016570: Ku80 autoantigen; PIRSF006493: Ku, prokaryotic type
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
• PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
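The network shown in the figure can be sketched as a small directed acyclic graph in which a node may have more than one parent. This is only an illustration: the `Node` class, the level names, and the helper functions are hypothetical, and the parent links are an assumed reading of the figure rather than data taken from PIRSF.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    name: str
    level: str  # "domain", "superfamily", "family" or "subfamily"
    parents: list = field(default_factory=list)  # a network: several parents allowed


def build_example():
    """Build a few nodes named on the slide (parent links are assumed)."""
    nodes = {}

    def add(node_id, name, level, parents=()):
        nodes[node_id] = Node(node_id, name, level, list(parents))

    add("PF02153", "Prephenate dehydrogenase (PDH)", "domain")
    add("PF01817", "Chorismate mutase (CM)", "domain")
    add("PF00219", "Insulin-like growth factor binding protein (IGFBP)", "domain")
    # The bifunctional T-protein is listed under both Pfam domains.
    add("PIRSF001499", "Bifunctional CM/PDH (T-protein)", "family",
        ["PF02153", "PF01817"])
    add("PIRSF001969", "IGFBP", "family", ["PF00219"])
    add("PIRSF500001", "IGFBP-1", "subfamily", ["PIRSF001969"])
    return nodes


def ancestors(nodes, node_id):
    """Return every node reachable by following parent links upward."""
    seen, stack = set(), list(nodes[node_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(nodes[current].parents)
    return seen


def family_level_count(nodes, node_id):
    """Count homeomorphic-family nodes in a lineage (the slide says exactly one)."""
    lineage = ancestors(nodes, node_id) | {node_id}
    return sum(1 for n in lineage if nodes[n].level == "family")


nodes = build_example()
print(sorted(ancestors(nodes, "PIRSF001499")))
print(family_level_count(nodes, "PIRSF500001"))
```

Storing a list of parents rather than a single parent is what makes this a network rather than a tree, which is how a bifunctional protein can sit under two domain superfamilies.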
7
iProClass Integrated Protein Database
Data Integration and Protein Mapping
• Data integration from >90 databases
• Underlying data warehouse for protein ID/name/bibliography mapping & pre-computed BLAST results
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value-added reports of UniProt proteins
[Figure: databases integrated into the iProClass Integrated Protein Knowledgebase, by category]
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
• Protein Expression: Swiss-2DPAGE, PMG, …
• Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
• Family: PIRSF, InterPro, Pfam, Prosite, COG, …
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Interaction: DIP, BIND, …
• Modification: RESID, PhosphoBase, …
• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Taxonomy: NCBI Taxon, NEWT
• Literature: PubMed
Other figure labels: NCBI X-Refs, Gene/Genome, Gene Ontology, KEGG Pathway, Structure Homolog, PTM, EC, Additional Refs
8
iProLINK Text Mining Resource
Annotation Extraction and Literature-Based Protein Annotation
• Curated datasets and literature corpus for development of literature mining and annotation extraction tools
• RLIMS-P text-mining tool for extracting protein phosphorylation data
• BioThesaurus of gene/protein names to resolve synonyms and ambiguity
[Figure: iProLINK (integrated Protein Literature, INformation and Knowledge) overview: Literature Mining & Protein Curation]
http://pir.georgetown.edu/iprolink
• Databases: UniProt, PIRSF, iProClass, GO
• Bibliography: PubMed
• NLP Text Mining Research; Literature-Based Curation
• Bibliography Mapping; Text Categorization; Annotation Extraction; Named Entity Recognition
• Literature Corpus: Mapping to Proteins/Features; Annotation-Tagged; Name-Tagged
• Guidelines: Protein/Family Naming Guidelines; Name Tagging Guidelines
• Dictionary and Ontology: Protein Names and Synonyms; PIRSF Family Names in DAG
• Bibliography Display: Mapping of PubMed IDs to Proteins; Papers Categorized by Annotations
9
NIAID Biodefense Proteomic Program Goals
• Characterize proteomes of pathogens and host cells
• Identify proteins associated with the biology of the microbes
• Elucidate mechanisms of microbial pathogenesis
• Understand immune responses and non-immune mediated host responses
[Figure labels: Adm Ctr, PRC, Data Type, Organism]
10
[Figure: NIAID proteomics data flow]
• Multiple Data Types from Proteomics Research Centers
• Data Integration at NIAID Admin Center
• Integrated Data at VBI
• Data Exchange Format, Controlled Vocabulary, Ontology
• Master Protein Directory & Complete Proteomes at GU-PIR
• iProClass, UniProt, PIRSF
• Protein ID; Peptide/Protein Sequence Mapping
• Rich annotation: capture experimental data and scientific conclusion; integrate with major databases
http://pir.georgetown.edu/proteomics/
11
NCI caBIG Initiative
• caBIG (cancer Biomedical Informatics Grid)
  • Cancer research platform to enable sharing of research infrastructure, data, tools
  • Designed and built by an open federation of organizations
  • Based on common standards and open source/open access principles
• One of four caBIG grid reference projects
• PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research
• caBIG Workspaces
  • Integrative Cancer Research
    • PIR Developer Project: Grid Enablement of PIR
    • PIR Adopter Project: SEED Genome Annotation
    • PIR Adopter Project: GeneConnect ID mapping
  • Vocabularies and Common Data Elements
    • PIR Participant Project: Protein models, objects, vocabularies, ontologies
  • caGrid Architecture
12
UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
• Family Classification-Driven and Rule-Based Curation
  • Functional inference of uncharacterized hypothetical proteins
  • Systematic detection and correction of genome annotation errors
  • Improvement of under- or over-annotated proteins
• Text Mining-Assisted and Literature-Based Curation
  • Annotation extraction from scientific literature
  • Attribution of experimental evidence
• Ontology and Controlled Vocabulary-Based Curation
  • Standardization of protein/gene/family names and annotation terms
  • Annotation of specific protein entities
13
PIR Superfamily Classification
Tree of Life and Evolution of Protein Families (Dayhoff)
The protein superfamily concept (1976) was based on sequence similarity, where sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.
14
PIRSF Classification System
• A network classification system from superfamily to subfamily levels to reflect the evolutionary relationships of full-length proteins and domains
• Basic unit is homeomorphic family: Full-length similarity, common domain architecture
• Provide annotation of generic biochemical and specific biological functions
• Basis for evolutionary and comparative genomics research
• Basis for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites)
• Basis for standardization of protein names and development of ontology for protein evolution
15
[Figure: PIRSF classification network with example nodes at the Domain Superfamily, PIRSF Superfamily, PIRSF Homeomorphic Family and PIRSF Homeomorphic Subfamily levels, repeated from slide 6]
16
PIRSF Classification/Curation Workflow
[Figure: workflow diagram with numbered steps 1-8]
• Stages: Unclassified UniProtKB proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies; New Proteins; Unassigned Proteins
• Operations: Automatic Clustering; Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Protein Name Rules/Site Rules; Build and Test HMMs; Automatic Placement
• Modes: Automatic Procedure; Computer-assisted Manual Curation
1. Computational generation of homeomorphic clusters
2. Computational domain mapping and annotation of preliminary clusters
3. Automatic placement of new proteins into families
4. Computer-assisted expert analysis to define homeomorphic families
5. Family hierarchy created as needed
6. Expert annotation
7. Name rules and optional site rules created
8. Seed members to generate family HMMs
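Step 1 (computational generation of homeomorphic clusters) can be sketched as single-linkage clustering over pairwise hits, linking two proteins only when the hit is similar enough and covers most of both sequences. This is a minimal sketch, not the PIRSF procedure: the function name, the input tuple layout, and both thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict


def homeomorphic_clusters(proteins, hits, min_identity=0.3, min_coverage=0.8):
    """Group proteins by single linkage over qualifying pairwise hits.

    hits: iterable of (a, b, identity, coverage_of_a, coverage_of_b), all
    fractions in [0, 1]. Requiring high coverage of BOTH sequences is a
    stand-in for "full-length similarity". Thresholds are placeholders.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, identity, cov_a, cov_b in hits:
        if identity >= min_identity and min(cov_a, cov_b) >= min_coverage:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)

    clusters = [g for g in groups.values() if len(g) > 1]
    orphans = {p for g in groups.values() if len(g) == 1 for p in g}
    return clusters, orphans


proteins = ["A", "B", "C", "D"]
hits = [
    ("A", "B", 0.60, 0.95, 0.90),  # similar over the full length: linked
    ("B", "C", 0.50, 0.40, 0.90),  # only a partial (domain-level) match
    ("C", "D", 0.20, 0.90, 0.90),  # full length but too dissimilar
]
clusters, orphans = homeomorphic_clusters(proteins, hits)
print(clusters, orphans)
```

Proteins that end up alone correspond to the "Orphans" box in the diagram; in the workflow described above, the resulting clusters would still go through domain mapping and expert curation before becoming families.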
17
PIRSF Classification Tools
• Iterative BlastClust
• Tree with Annotation Table
• Multiple Alignment and Phylogenetic Tree
• PIRSF Classification in DAG Editor
[Figure: Phylogenetic Tree, Classification/Annotation and Alignment views; labels: HPS, KGPDC]
ISMB: PIRSF Protein Classification System Demo
18
PIRSF Analysis/Visualization Tools
• Taxonomy Distribution and Phylogenetic Pattern
• Domain Display
• Family Hierarchy (DAG Browser)
19
PIRSF Family Report
Curated family name
Description of family
Sequence analysis tools
20
Classification and Functional Annotation
Example: Phosphofructokinase (PFK) classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue.
[Figure: classification tree of PFK families: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274]
• E. coli (P06998): Gly105, Gly125
• ATP-PFK: Gly105 + Gly125
• PPi-PFK: Gly/Asp105 + Lys125
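The residue pattern on this slide can be written as a small check. The sketch below assumes the caller has already aligned a sequence to E. coli PFK and read off the residues at the columns equivalent to positions 105 and 125; the function name and the "undetermined" fallback are hypothetical.

```python
def pfk_phosphoryl_donor(res105: str, res125: str) -> str:
    """Suggest ATP- or PPi-dependence from two aligned residues.

    res105 / res125: one-letter amino-acid codes at the alignment columns
    equivalent to E. coli PFK positions 105 and 125 (alignment is assumed
    to have been done elsewhere).
    """
    res105, res125 = res105.upper(), res125.upper()
    if res105 == "G" and res125 == "G":
        return "ATP-PFK"   # Gly105 + Gly125
    if res105 in ("G", "D") and res125 == "K":
        return "PPi-PFK"   # Gly/Asp105 + Lys125
    return "undetermined"  # pattern not covered by the slide


print(pfk_phosphoryl_donor("G", "G"))
print(pfk_phosphoryl_donor("D", "K"))
```

Because position 105 may be Gly in both classes, the call depends on position 125 in that case, which is the single-residue difference the example highlights.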
21
Family-Based Rules for Annotation
• Functional Site Rule: tags active site, binding, other residue-specific information
• Functional Name Rule: gives name, EC, GO, other function-specific information
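As an illustration of how the two rule types might be applied to a family member, the sketch below propagates a name rule unconditionally and a site rule only when the required residues are present. The class names, fields, family ID, EC number, and GO term are all hypothetical placeholders, not the PIRNR/PIRSR formats; positions are taken directly on the query sequence, whereas a real system would presumably map them through an alignment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameRule:
    family_id: str
    protein_name: str
    ec: Optional[str] = None
    go_terms: tuple = ()


@dataclass
class SiteRule:
    family_id: str
    feature: str
    required: dict  # 1-based position -> string of allowed residues


def annotate(sequence, family_id, name_rule, site_rules):
    """Return annotations a family member would inherit from its rules."""
    annotation = {"name": None, "ec": None, "go": [], "sites": []}
    if name_rule.family_id == family_id:
        annotation["name"] = name_rule.protein_name
        annotation["ec"] = name_rule.ec
        annotation["go"] = list(name_rule.go_terms)
    for rule in site_rules:
        if rule.family_id != family_id:
            continue
        # Propagate the feature only if every required residue is conserved.
        conserved = all(
            pos <= len(sequence) and sequence[pos - 1] in allowed
            for pos, allowed in rule.required.items()
        )
        if conserved:
            annotation["sites"].append((rule.feature, sorted(rule.required)))
    return annotation


# Placeholder rules and sequences, for illustration only.
name_rule = NameRule("FAM1", "Example enzyme", "0.0.0.0", ("GO:0000000",))
site_rule = SiteRule("FAM1", "active site", {2: "H", 5: "DE"})
print(annotate("MHKLD", "FAM1", name_rule, [site_rule]))
print(annotate("MAKLD", "FAM1", name_rule, [site_rule]))
```

The second call keeps the name but drops the site feature, since residue-specific information is only transferred when the residues themselves are present.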
22
iProLINK Literature Mining Resource
[Figure: iProLINK overview diagram, repeated from slide 8]
23
iProLINK Literature Mining Resource
[Figure: iProLINK overview diagram, repeated from slide 8, with callouts 1-5]
1. UniProtKB bibliography mapping in iProClass
2. RLIMS-P: rule-based NLP method for extracting protein phosphorylation data
3. Substring-based machine learning method for PTM text categorization
4. BioThesaurus of protein/gene names with UniProtKB association
5. Named entity tagging guide
24
Literature Corpus for Text Mining
• Literature survey and manual tagging for evidence attribution
• Training and benchmarking sets for information retrieval and extraction
Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information
The five PTM datasets used to develop a machine learning algorithm for text categorization
25
Online RLIMS-P
1. Summary table: PMIDs & top-ranking annotation
2. Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry
3. Name mapping searches BioThesaurus
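To give a feel for rule-based extraction of phosphorylation data, here is a toy pattern that pulls a kinase, a substrate, and an optional site out of one sentence shape. It is not the RLIMS-P rule set, which the slides do not describe; the regular expression, the function name, and the example sentence are assumptions made for illustration.

```python
import re

# One hand-written pattern: "<kinase> phosphorylates <substrate> [at|on <site>]".
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+)\s+phosphorylates\s+(?P<substrate>[A-Za-z0-9-]+)"
    r"(?:\s+(?:at|on)\s+(?P<site>(?:Ser|Thr|Tyr)-?\d+))?"
)


def extract_phosphorylation(sentence):
    """Return one dict per match; 'site' is None when no site is stated."""
    return [
        {
            "kinase": m.group("kinase"),
            "substrate": m.group("substrate"),
            "site": m.group("site"),
        }
        for m in PATTERN.finditer(sentence)
    ]


print(extract_phosphorylation("PKA phosphorylates CREB at Ser133."))
```

A single pattern like this misses passive voice, nominalizations, and multi-word protein names, which is why a practical tool needs a much richer rule set plus a name resource for mapping the extracted names to database entries.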
26
BioThesaurus
[Figure: BioThesaurus construction, from source databases to UniProtKB-associated names]
• Source databases, linked through iProClass:
  • NCBI: Entrez Gene, RefSeq, GenPept
  • UniProt: UniProtKB, UniRef90/50, PIR-PSD
  • Genome: FlyBase, WormBase, MGD, SGD, RGD
  • Other: HUGO, EC, OMIM
• Name Extraction: Raw Thesaurus
• Name Filtering: Highly Ambiguous, Nonsensical Terms
• Semantic Typing: UMLS
• BioThesaurus: Protein/Gene Names & Synonyms for UniProtKB Entries
• Comprehensive collection of protein/gene names from 23 databases
• Associate names (~3.2 million) with UniProtKB entries (>2 million)
• Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage
• FTP download for automatic dictionary-based named entity tagging
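The two lookups described here, synonyms of an entry and entries sharing a name, can be sketched with a pair of dictionaries. This is an illustrative data structure only: the function names are hypothetical, the accessions `X00001` and `X00002` are made up, and the name-to-entry pairs are placeholders rather than BioThesaurus content.

```python
from collections import defaultdict


def build_thesaurus(records):
    """records: iterable of (accession, name) pairs."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for accession, name in records:
        cleaned = name.strip()
        name_to_entries[cleaned.lower()].add(accession)  # case-insensitive key
        entry_to_names[accession].add(cleaned)
    return name_to_entries, entry_to_names


def synonyms(entry_to_names, accession):
    """All names recorded for one entry."""
    return sorted(entry_to_names.get(accession, ()))


def entries_for(name_to_entries, name):
    """All entries that share a name; more than one means the name is ambiguous."""
    return sorted(name_to_entries.get(name.strip().lower(), ()))


# Made-up accessions; names echo the examples mentioned on the slides.
records = [
    ("X00001", "Metalloproteinase inhibitor 3"),
    ("X00001", "TIMP-3"),
    ("X00002", "TIMP-3"),
]
name_to_entries, entry_to_names = build_thesaurus(records)
print(synonyms(entry_to_names, "X00001"))
print(entries_for(name_to_entries, "timp-3"))
```

Keeping both directions of the mapping lets the same table serve synonym retrieval and ambiguity checks, and a downloaded copy could back a dictionary-based tagger.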
27
Online BioThesaurus
1. Search protein entries sharing the same names
2. Retrieve BioThesaurus report
• Name ambiguity of CLIM1
• Annotation error detection
28
BioThesaurus Report: Gene/Protein Name Mapping
1. Search Synonyms: synonyms for Metalloproteinase inhibitor 3
2. Resolve Name Ambiguity: name ambiguity of TIMP-3
3. Underlying ID Mapping
29
Protein Ontology (PRO)
• PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework
• Two sub-ontologies:
  • Ontology for Protein Evolution (ProEvo) for the classification of proteins on the basis of evolutionary relationships
  • Ontology for Protein Modified Forms (ProMod) to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification)
• Why PRO?
  • Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
  • Facilitate precise protein annotation of specific proteins/classes
• The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).
30
PRO Conceptual Framework
[Figure: PRO (Protein Ontology) terms, levels, and links to other ontologies]
Levels
• Root level: protein
• Unit Level: The two types of evolutionary units; not substituted by any other terms
• Domain Family Level (structure): Related by structural similarity. Source: SCOP Superfamily
• Domain Family Level (sequence): Related by sequence similarity. Source: Pfam domain
• Protein Family Level: Evolutionarily-related full-length protein; may contain finer-grain sub-categories. Sources: PIRSF family/subfamily, Panther subfamily
• Gene level: All protein products encoded by one gene. Source: UniProtKB
• Transcript level: Possible transcript forms. Source: UniProtKB
• Post-translation level: Protein as modified after translation. Source: UniProtKB
Terms
• ProEvo: evolutionary unit, domain, structure domain, sequence domain, homeomorphic protein
• ProMod: gene product, reference protein, genetic variant, splice variant, modified product, cleaved product
Relations within PRO: is_a, has_part, derives_from, lacks
Links to other ontologies
• GO (Gene Ontology), molecular function: has_ancestral_property, has_function, lacks_function
• GO, cellular component: has_ancestral_property, part_of (for complexes), located_in (for compartments)
• GO, biological process: has_ancestral_property, participates_in
• DO/UMLS (Disease Ontology/Term), disease: agent_of
• PSI-MOD (Modification), protein modification: has_modification
• HGNC/MGI (Gene Name), gene name: encoded_by
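Typed relations such as is_a and derives_from can be held in a small graph and queried by following one relation type transitively. The sketch below is generic: the `Ontology` class is hypothetical, and the three is_a edges are an assumed reading of the figure, not statements taken from PRO.

```python
from collections import defaultdict


class Ontology:
    """Minimal store of (subject, relation, object) statements."""

    def __init__(self):
        self.edges = defaultdict(set)  # (subject, relation) -> set of objects

    def add(self, subject, relation, obj):
        self.edges[(subject, relation)].add(obj)

    def closure(self, term, relation):
        """All terms reachable from `term` by repeatedly following `relation`."""
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            for target in self.edges.get((current, relation), ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


pro = Ontology()
# Assumed edges for illustration only.
pro.add("structure domain", "is_a", "domain")
pro.add("sequence domain", "is_a", "domain")
pro.add("domain", "is_a", "evolutionary unit")

print(sorted(pro.closure("structure domain", "is_a")))
```

Keying edges by relation type keeps is_a traversal separate from relations such as derives_from or has_part, so each can be followed on its own.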
31
Protein Ontology (PRO)
32
Acknowledgements
PIR Team
• Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan
• Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata
• Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank
Collaborators
• UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams
• NIAID: Margaret Moore (SSS), Bruno Sobral (VBI)
• Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U)
Funding Support
• NHGRI/NIGMS (UniProt)
• NCI caBIG
• NIAID (Proteomic Admin Center)
• NSF: iProClass, text mining