proteome analyst
Post on 21-Jan-2016
Proteome Analyst
Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors
Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell
Proteome Analyst
Proteome: one of many ‘-omes’; the set of all proteins in an organism
Analysis: prediction of protein function or localization from sequence data
Analyze a Protein
We have examples of annotated proteins in various protein classes.
We have more examples of unannotated proteins.
What do we do?
Find homologues to each protein and assume similar function.
Find characteristics of each protein that affect function.
Analyzing Proteins
One protein? Just do it.
5 proteins? Post-doc familiar with protein classes.
50 proteins? Grad student.
5000 proteins? Summer students.
Proteome Analyst
High-throughput Transparent Prediction of
Protein Function
Protein Localization
Custom Classification
Machine Learning Task
Training
INPUT: sequences, classes
OUTPUT: Classifier
Analysis
INPUT: sequences, Classifier
OUTPUT: classes, explanation
Training
INPUT: sequences, classes
PA Tools: sequences → features
ML Algorithm: features, classes → Classifier
OUTPUT: Classifier
Training: INPUT
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGNPIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSPVQASACRNI...
...
Each header line carries the class; the line below it is the protein sequence.
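A minimal sketch of reading the labelled training input shown above. The exact file format is not specified on the slide, so the layout assumed here (a header line `>CLASS<NAME` followed by one or more residue lines) and the function name `parse_labelled_fasta` are hypothetical. The sequences are the truncated prefixes visible on the slide.

```python
from typing import List, Tuple


def parse_labelled_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return (class_label, name, sequence) triples.

    Assumed layout (inferred from the slide, not confirmed): a header
    line '>CLASS<NAME' followed by one or more lines of residues.
    """
    records: List[Tuple[str, str, str]] = []
    label = name = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if label is not None:
                records.append((label, name, "".join(chunks)))
            label, _, name = line[1:].partition("<")
            label, name = label.strip(), name.strip()
            chunks = []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        records.append((label, name, "".join(chunks)))
    return records


# Truncated sequence prefixes copied from the slide.
example = """>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGNPIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS
"""
records = parse_labelled_fasta(example)
```

Keeping the class next to each sequence makes it easy to hand both to the training step later.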
Training: PA Tools
sequences → features
Homology Tools (BLAST)
sequence → homologues
homologues → annotations
annotations → features
Homology Tool
sequence → features
[Diagram: sequence → BLAST (against seq DB) → homologues → retrieve → annotations → parse → features]
Example annotations retrieved for a homologue:
DBSOURCE swissprot: locus MPPB_NEUCR, ...
xrefs (non-sequence databases): ...InterPro IPR001431,...
KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
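The "parse" step of the homology tool turns retrieved annotation text into features. The sketch below is one way this could look, not Proteome Analyst's actual parser: it assumes features are lower-cased KEYWORDS entries plus InterPro accessions, and the function name `annotation_features` is hypothetical. The example entry is the slide's sample with the elided parts left out.

```python
import re
from typing import Set


def annotation_features(entry: str) -> Set[str]:
    """Collect keyword and InterPro features from annotation text.

    Assumption: keywords sit on a line starting with 'KEYWORDS',
    separated by ';' and ending with '.', as in the slide's sample.
    """
    features: Set[str] = set()
    for line in entry.splitlines():
        line = line.strip()
        if line.startswith("KEYWORDS"):
            body = line[len("KEYWORDS"):].strip().rstrip(".")
            features.update(k.strip().lower() for k in body.split(";") if k.strip())
        # InterPro accessions have the form IPR followed by six digits.
        features.update(re.findall(r"IPR\d{6}", line))
    return features


entry = """DBSOURCE swissprot: locus MPPB_NEUCR
xrefs (non-sequence databases): InterPro IPR001431
KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
"""
features = annotation_features(entry)
```

Each distinct token becomes a present-or-absent feature for the classifier.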
Training: PA Tools
sequences → features
Homology Tools (BLAST)
sequence → homologues
homologues → annotations
annotations → features
Pattern Tools (PFAM, ProSite, …)
sequences → motifs
motifs → features
Pattern Tool
sequence → features
[Diagram: sequence → find (in pattern DB) → patterns → parse → features]
Example pattern matches:
Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
Note: pattern features are not included in current results.
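The pattern matches above can be turned into features in a similar way. This is a sketch under assumptions: each line is read as `database; accession; name; count.`, the last field is taken to be a match count, and the function name `pattern_features` is hypothetical.

```python
from typing import Dict, List


def pattern_features(lines: List[str]) -> Dict[str, int]:
    """Map 'DATABASE:ACCESSION' to the match count on each line."""
    features: Dict[str, int] = {}
    for line in lines:
        fields = [f.strip() for f in line.strip().rstrip(".").split(";")]
        # Skip anything that does not look like the assumed 4-field layout.
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        database, accession, _name, count = fields
        features[f"{database}:{accession}"] = int(count)
    return features


hits = [
    "Pfam; PF00234; tryp_alpha_amyl; 1.",
    "PROSITE; PS00940; GAMMA_THIONIN; 1.",
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.",
]
features = pattern_features(hits)
```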
Training: ML Algorithm
features, classes → Classifier
any ML algorithm may be used
default = naïve Bayes
consistently near-best accuracy (SVM, ANN slightly better)
efficient (for high-throughput)
easy to interpret
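To make the training step concrete, here is a minimal naïve Bayes trainer over present-or-absent features. It is an illustrative sketch, not Proteome Analyst's implementation: the Bernoulli feature model, the Laplace smoothing parameter `alpha`, the function name `train_naive_bayes` and the toy examples are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """Train on a list of (set_of_features, class_label) pairs.

    Returns log priors and smoothed P(feature present | class).
    """
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for feats, label in examples:
        vocab |= feats
        for f in feats:
            feature_counts[label][f] += 1
    total = len(examples)
    model = {"log_prior": {}, "p_feature": {}}
    for label, n in class_counts.items():
        model["log_prior"][label] = math.log(n / total)
        # Laplace smoothing keeps every probability strictly in (0, 1).
        model["p_feature"][label] = {
            f: (feature_counts[label][f] + alpha) / (n + 2 * alpha) for f in vocab
        }
    return model


# Toy, made-up training data for illustration only.
examples = [
    ({"zinc", "hydrolase"}, "A"),
    ({"zinc"}, "A"),
    ({"transport"}, "B"),
]
model = train_naive_bayes(examples)
```

Training is a single counting pass over the examples, which fits the slide's point about efficiency for high-throughput use.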
Training: OUTPUT
Classifier
Analysis (Classification)
INPUT: sequences
PA Tools: sequences → features
Classifier: features → classes, explanation
OUTPUT: classes
Analysis: INPUT
>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIVSSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVLRNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVFLHCEIHLCD...
...
The input is protein sequences only, with no class labels.
Analysis: PA Tools
sequences → features
Homology Tools (BLAST)
sequence → homologues
homologues → annotations
annotations → features
Pattern Tools (PFAM, ProSite, …)
sequences → motifs
motifs → features
Analysis: Classification
features → classes, explanation
naïve Bayes
returns probabilities of each class for each sequence
efficient (for high-throughput)
easy to interpret
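The sketch below shows how a naïve Bayes classifier can return a probability for each class together with a simple explanation. The explanation used here (ranking the present features by the log-ratio of their likelihood under the predicted class versus the runner-up) is an assumption chosen for illustration, not necessarily how Proteome Analyst builds its explanations. The model parameters are made up, and the code expects at least two classes and probabilities strictly between 0 and 1.

```python
import math


def classify(model, feats):
    """Return (class probabilities, predicted class, feature evidence)."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        s = log_prior
        for f, p in model["p_feature"][label].items():
            # Bernoulli model: absent features count as evidence too.
            s += math.log(p if f in feats else 1.0 - p)
        scores[label] = s
    # Normalise in log space to avoid underflow.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    probs = {label: math.exp(s - top) / z for label, s in scores.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    evidence = sorted(
        (
            (f, math.log(model["p_feature"][best][f] / model["p_feature"][runner_up][f]))
            for f in feats
            if f in model["p_feature"][best]
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return probs, best, evidence


# Hypothetical, hand-written parameters for two classes.
model = {
    "log_prior": {"A": math.log(0.5), "B": math.log(0.5)},
    "p_feature": {
        "A": {"zinc": 0.8, "transport": 0.1},
        "B": {"zinc": 0.2, "transport": 0.7},
    },
}
probs, best, evidence = classify(model, {"zinc"})
```

Because each feature contributes an independent term to the score, the contributions can be listed one by one, which is what makes this kind of classifier easy to interpret.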
Results: General Function
GeneQuiz classification
5-fold cross-validation accuracy on 14 classes
E. coli (2370) 82.5%
Yeast (2359) 78.8%
Fly (3842) 76.6%
Results: Specific Function
K+ Ion Channel Proteins
5-fold cross-validation accuracy on 78 sequences, 4 classes
Accuracy
1st effort 97.4%
2nd effort 100%
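The accuracies above are reported from 5-fold cross-validation. A minimal sketch of that procedure follows, assuming a plain shuffled split (the slides do not say whether the folds were stratified). The helper names and the majority-class baseline used to exercise the function are hypothetical.

```python
import random
from collections import Counter


def k_fold_accuracy(examples, train_fn, predict_fn, k=5, seed=0):
    """Average accuracy over k held-out folds of (input, label) pairs."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(training)
        correct += sum(predict_fn(model, x) == y for x, y in held_out)
    return correct / len(items)


def train_majority(training):
    """Toy baseline: remember the most common class."""
    return Counter(y for _, y in training).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Made-up data: 8 examples of class A, 2 of class B.
data = [(i, "A") for i in range(8)] + [(i, "B") for i in range(2)]
acc = k_fold_accuracy(data, train_majority, predict_majority)
```

Any classifier can be plugged in through `train_fn` and `predict_fn`, so the same loop could compare naïve Bayes with other algorithms.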
Results: Localization
Sub-cellular localization prediction
3146 sequences from 10 classes
                  Accuracy  Coverage
Nair and Rost     81.5%     36.9%
Proteome Analyst  87.8%     100%
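The table reports accuracy and coverage separately. The slides do not define the two terms, so the sketch below assumes the common reading: coverage is the fraction of sequences for which a prediction is made at all, and accuracy is the fraction of those predictions that are correct. The labels in the example are invented.

```python
def accuracy_and_coverage(predictions, truth):
    """predictions: one label per sequence, or None when no prediction is made."""
    made = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    coverage = len(made) / len(truth)
    # Accuracy is measured only over the sequences that received a prediction.
    accuracy = sum(p == t for p, t in made) / len(made) if made else 0.0
    return accuracy, coverage


# Made-up example: one of four sequences gets no prediction.
acc, cov = accuracy_and_coverage(
    ["nucleus", None, "cytoplasm", "nucleus"],
    ["nucleus", "nucleus", "cytoplasm", "cytoplasm"],
)
```

Under this reading, a method can score high accuracy while leaving many sequences unlabelled, which is why the two columns are shown side by side.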
Proteome Analyst
High-throughput Transparent Prediction of
Protein Function
Protein Localization
Custom Classification
Acknowledgements
Student developers: Cynthia Luk, Samer Nassar, Kevin McKee
Biologists: Warren Gallin, Kathy Magor
Data: Nair and Rost
Funding:
PENCE - Protein Engineering Network of Centres of Excellence
NSERC - Natural Sciences and Engineering Research Council
Sun Microsystems
AICML - Alberta Ingenuity Centre for Machine Learning
Many ‘-ome’ jokes: my wife, Jen
Contact
http://www.cs.ualberta.ca/~bioinfo/PA
poulin@cs.ualberta.ca