proteome analyst

Post on 21-Jan-2016


Proteome Analyst

Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell

Proteome Analyst

Proteome: one of many ‘-omes’; the set of all proteins in an organism

Analysis: prediction of protein function or localization from sequence data

Analyze a Protein

We have examples of annotated proteins in various protein classes.

We have more examples of unannotated proteins.

What do we do?

Find homologues to each protein and assume similar function.

Find characteristics of each protein that affect function.

Analyzing Proteins

One protein? Just do it.

5 proteins? Post-doc familiar with protein classes.

50 proteins? Grad student.

5000 proteins? Summer students.

Proteome Analyst

High-throughput, transparent prediction of:

Protein function

Protein localization

Custom classification

Machine Learning Task

Training

INPUT: sequences, classes

OUTPUT: Classifier

Analysis

INPUT: sequences, Classifier

OUTPUT: classes, explanation

Training

INPUT: sequences, classes

PA Tools: sequences -> features

ML Algorithm: features, classes -> Classifier

OUTPUT: Classifier

Training: INPUT

Each header line carries the class; the line below it is the protein sequence.

>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGNPIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSPVQASACRNI...
...
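The labelled input on this slide looks like FASTA with the class written between `>` and `<` in the header. The exact file format that Proteome Analyst accepts is not given in the slides, so the following is only a sketch under that assumption. The regular expression and the function name `read_labelled_sequences` are hypothetical, and the sequences are the truncated prefixes shown on the slide, not full proteins.

```python
import re
from typing import Iterator, Tuple

# Assumed header shape, inferred from the slide: ">class A<Training Seq 1".
HEADER = re.compile(r"^>(?P<label>[^<]+)<(?P<name>.*)$")


def read_labelled_sequences(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (class_label, sequence_name, sequence) triples from labelled text."""
    label = name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            # A new header closes the previous record, if there was one.
            if label is not None:
                yield label, name, "".join(chunks)
            label = match.group("label").strip()
            name = match.group("name").strip()
            chunks = []
        elif label is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if label is not None:
        yield label, name, "".join(chunks)


# Truncated sequence prefixes copied from the slide, used only as toy data.
example = """\
>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGNPIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS
"""
records = list(read_labelled_sequences(example))
```

The same reader would work for the unlabelled analysis input if the header pattern were relaxed to accept a plain `>Seq 1` line.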

Training: PA Tools

sequences -> features

Homology Tools (BLAST)

sequence -> homologues

homologues -> annotations

annotations -> features

Homology Tool

sequence -> features

[Diagram: sequence -> (BLAST against seq DB) -> homologues -> (retrieve) -> annotations -> (parse) -> features]

Example of a retrieved annotation:

DBSOURCE swissprot: locus MPPB_NEUCR, ...
xrefs (non-sequence databases): ... InterPro IPR001431, ...
KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
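The "parse" step turns a homologue's annotation into features. The slides do not say exactly which annotation fields Proteome Analyst uses, so the sketch below assumes, for illustration only, that each SWISS-PROT keyword becomes one feature token. The function name `keyword_features` is hypothetical; the keyword line is the one shown on the slide.

```python
import re
from typing import Set


def keyword_features(record: str) -> Set[str]:
    """Extract a set of lower-cased keyword tokens from an annotation record.

    Assumes the keywords follow the word KEYWORDS, are separated by
    semicolons, and end at the first period, as in the slide's example.
    """
    match = re.search(r"KEYWORDS\s+(.*?)\.", record, flags=re.DOTALL)
    if not match:
        return set()
    return {kw.strip().lower() for kw in match.group(1).split(";") if kw.strip()}


record = (
    "KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; "
    "Transit peptide; Oxidoreductase; Electron transport; Respiratory chain."
)
features = keyword_features(record)
```

Multi-word keywords such as "transit peptide" are kept as single tokens, since splitting them would lose the annotation's meaning.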

Training: PA Tools

sequences -> features

Homology Tools (BLAST)

sequence -> homologues

homologues -> annotations

annotations -> features

Pattern Tools (PFAM, ProSite, …)

sequences -> motifs

motifs -> features

Pattern Tool

sequence -> features

Not included in current results.

[Diagram: sequence -> (find, using patternDB) -> patterns -> (parse) -> features]

Example pattern hits:

Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
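The pattern hits on this slide each have four semicolon-separated fields ending in a period. A minimal sketch of the "parse" step, assuming one hit per line and assuming the fields mean database, accession, entry name and hit count, could look like this. `PatternHit` and `parse_pattern_hits` are hypothetical names, and the slides do not confirm how Proteome Analyst encodes these hits as features.

```python
from typing import List, NamedTuple


class PatternHit(NamedTuple):
    database: str
    accession: str
    name: str
    count: int


def parse_pattern_hits(text: str) -> List[PatternHit]:
    """Parse lines such as 'Pfam; PF00234; tryp_alpha_amyl; 1.' into hits."""
    hits = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.strip().rstrip(".").split(";")]
        # Skip anything that does not have the assumed four-field shape.
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        database, accession, name, count = fields
        hits.append(PatternHit(database, accession, name, int(count)))
    return hits


hits = parse_pattern_hits(
    "Pfam; PF00234; tryp_alpha_amyl; 1.\n"
    "PROSITE; PS00940; GAMMA_THIONIN; 1.\n"
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.\n"
)
# One possible feature encoding: "<database>:<accession>".
pattern_features = {f"{hit.database}:{hit.accession}" for hit in hits}
```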

Training: ML Algorithm

features, classes -> Classifier

Any ML algorithm may be used.

Default = naïve Bayes:

consistently near-best accuracy (SVM, ANN slightly better)

efficient (for high-throughput)

easy to interpret
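To make the default learner concrete, here is a minimal naïve Bayes sketch over sets of feature tokens. It assumes a Bernoulli model (each feature is present or absent) with add-one smoothing; the slides do not state which naïve Bayes variant or smoothing Proteome Analyst uses. The training examples and class names are made up for illustration, with tokens borrowed from the keyword slide.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def train_naive_bayes(examples: List[Tuple[Set[str], str]]):
    """Return (log_prior, feature_prob, vocabulary) for a Bernoulli model."""
    class_counts = Counter(label for _, label in examples)
    feature_counts: Dict[str, Counter] = defaultdict(Counter)
    vocabulary: Set[str] = set()
    for features, label in examples:
        vocabulary |= features
        feature_counts[label].update(features)
    total = len(examples)
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    # P(feature present | class), add-one smoothed so it is never 0 or 1.
    feature_prob = {
        c: {f: (feature_counts[c][f] + 1) / (class_counts[c] + 2) for f in vocabulary}
        for c in class_counts
    }
    return log_prior, feature_prob, vocabulary


def predict(model, features: Set[str]) -> Dict[str, float]:
    """Return the probability of each class for one feature set."""
    log_prior, feature_prob, vocabulary = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for f in vocabulary:
            p = feature_prob[c][f]
            score += math.log(p if f in features else 1.0 - p)
        scores[c] = score
    # Normalise in log space to avoid underflow.
    top = max(scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}


# Hypothetical toy training set.
training = [
    ({"hydrolase", "zinc"}, "class A"),
    ({"hydrolase", "metalloprotease"}, "class A"),
    ({"electron transport", "oxidoreductase"}, "class B"),
    ({"respiratory chain", "oxidoreductase"}, "class B"),
]
model = train_naive_bayes(training)
probabilities = predict(model, {"hydrolase"})
```

Training is a single counting pass over the examples, which is one reason the approach suits high-throughput use.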

Training: OUTPUT

Classifier

Analysis (Classification)

INPUT: sequences

PA Tools: sequences -> features

Classifier: features -> classes, explanation

OUTPUT: classes

Analysis: INPUT

Unlabelled protein sequences:

>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIVSSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVLRNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVFLHCEIHLCD...
...

Analysis: PA Tools

sequences -> features

Homology Tools (BLAST)

sequence -> homologues

homologues -> annotations

annotations -> features

Pattern Tools (PFAM, ProSite, …)

sequences -> motifs

motifs -> features

Analysis: Classification

features -> classes, explanation

naïve Bayes

returns probabilities of each class for each sequence

efficient (for high-throughput)

easy to interpret
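The transcript does not preserve how the explanation is presented, so the following is only one common way to explain a naïve Bayes decision, not necessarily Proteome Analyst's. It ranks the features present in a sequence by their log-likelihood ratio between the predicted class and a rival class. The probabilities in `toy_prob` are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def explain(
    feature_prob: Dict[str, Dict[str, float]],
    present: List[str],
    winner: str,
    rival: str,
) -> List[Tuple[str, float]]:
    """Rank present features by how strongly they favour `winner` over `rival`.

    Each score is log(P(feature | winner) / P(feature | rival)); positive
    values support the winner, negative values support the rival.
    """
    contributions = [
        (f, math.log(feature_prob[winner][f] / feature_prob[rival][f]))
        for f in present
    ]
    return sorted(contributions, key=lambda item: item[1], reverse=True)


# Hypothetical P(feature present | class) values.
toy_prob = {
    "class A": {"hydrolase": 0.75, "zinc": 0.50, "oxidoreductase": 0.25},
    "class B": {"hydrolase": 0.25, "zinc": 0.25, "oxidoreductase": 0.75},
}
ranking = explain(
    toy_prob, ["zinc", "oxidoreductase", "hydrolase"], "class A", "class B"
)
```

Because naïve Bayes sums independent per-feature terms in log space, each feature's contribution can be read off directly, which is what makes the classifier easy to interpret.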

Results: General Function

GeneQuiz classification: 5-fold cross-validation accuracy on 14 classes

Organism           Accuracy
E. coli (2370)     82.5%
Yeast (2359)       78.8%
Fly (3842)         76.6%
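For readers unfamiliar with the evaluation, 5-fold cross-validation can be sketched as follows. The slides do not say whether accuracy was pooled over all held-out sequences or averaged per fold, or how folds were drawn; this sketch assumes random folds and pooled accuracy. The function name and the majority-class baseline used in the demonstration are hypothetical stand-ins for a real learner.

```python
import random
from collections import Counter
from typing import Any, Callable, List, Tuple

Example = Tuple[Any, str]  # (features, class label)


def k_fold_accuracy(
    examples: List[Example],
    train: Callable[[List[Example]], Any],
    classify: Callable[[Any, Any], str],
    k: int = 5,
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        training = [examples[i] for i in order if i not in held]
        model = train(training)
        correct += sum(
            classify(model, examples[i][0]) == examples[i][1] for i in held_out
        )
    return correct / len(examples)


# Toy demonstration with a majority-class baseline (not a real predictor).
toy = [(None, "A")] * 8 + [(None, "B")] * 2
majority = lambda data: Counter(label for _, label in data).most_common(1)[0][0]
accuracy = k_fold_accuracy(toy, train=majority, classify=lambda model, x: model)
```

Every example is tested exactly once by a model that never saw it during training, so the pooled figure estimates accuracy on unseen sequences.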

Results: Specific Function

K+ ion channel proteins: 5-fold cross-validation accuracy on 78 sequences, 4 classes

             Accuracy
1st effort   97.4%
2nd effort   100%

Results: Localization

Sub-cellular localization prediction: 3146 sequences from 10 classes

                   Accuracy   Coverage
Nair and Rost      81.5%      36.9%
Proteome Analyst   87.8%      100%
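The table reports two numbers per method. The slides do not define them, so the sketch below assumes the usual meanings: coverage is the fraction of sequences for which a prediction is made at all, and accuracy is the fraction of those predictions that are correct. The function name and the toy labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def accuracy_and_coverage(
    predicted: Sequence[Optional[str]], actual: Sequence[str]
) -> Tuple[float, float]:
    """Return (accuracy, coverage); a prediction of None means 'no answer'."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    answered = [(p, a) for p, a in zip(predicted, actual) if p is not None]
    coverage = len(answered) / len(actual) if actual else 0.0
    accuracy = (
        sum(p == a for p, a in answered) / len(answered) if answered else 0.0
    )
    return accuracy, coverage


# Toy data: one sequence is left unanswered, one answer is wrong.
predicted = ["nucleus", None, "cytoplasm", "nucleus"]
actual = ["nucleus", "nucleus", "cytoplasm", "cytoplasm"]
accuracy, coverage = accuracy_and_coverage(predicted, actual)
```

Under these definitions a method can raise its accuracy by declining to answer hard cases, which is why the two columns need to be read together.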

Proteome Analyst

High-throughput, transparent prediction of:

Protein function

Protein localization

Custom classification

Acknowledgements

Student developers: Cynthia Luk, Samer Nassar, Kevin McKee

Biologists: Warren Gallin, Kathy Magor

Data: Nair and Rost

Funding:

PENCE - Protein Engineering Network of Centres of Excellence

NSERC - Natural Sciences and Engineering Research Council

Sun Microsystems

AICML - Alberta Ingenuity Centre for Machine Learning

Many ‘-ome’ jokes: my wife, Jen

Contact

http://www.cs.ualberta.ca/~bioinfo/PA

poulin@cs.ualberta.ca
