
Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS

Qiang Kou, Shan Xiao, Xiaohui Yao, Zongliang Yue

qkou@umail.iu.edu

April 29, 2014

Qiang Kou (qkou@umail.iu.edu) Ensemble Learning 1/25 April 29, 2014 1 / 25

Background

Mass spectrometry has become the most widely used tool for the characterization of proteins

Many database search tools and algorithms have been developed, including SEQUEST, MASCOT, X!Tandem, InsPecT, and MS-Align+

The scores of correct and incorrect identifications overlap significantly

Background

[Figure: distributions of the discriminant score (D) for "correct" and "incorrect" identifications; x-axis from -3.9 to 7.3. From Brian C. Searle.]

Background

Available Software

PeptideProphet [1]

\[ F(x_1, x_2, \ldots, x_n) = c_0 + \sum_{i=1}^{n} c_i x_i \]

\[ p(+ \mid F) = \frac{p(F \mid +)\, p(+)}{p(F \mid +)\, p(+) + p(F \mid -)\, p(-)} \]
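The two PeptideProphet formulas can be tried out numerically. The following is only a minimal sketch: the function names, and the idea of passing the two class likelihoods in as plain numbers, are illustrative assumptions and not part of PeptideProphet itself.

```python
def discriminant(features, coeffs, intercept):
    """Linear discriminant F = c0 + sum_i c_i * x_i."""
    return intercept + sum(c * x for c, x in zip(coeffs, features))


def posterior_correct(lik_pos, lik_neg, prior_pos):
    """Bayes' rule for p(+|F).

    lik_pos   -- p(F|+), likelihood of the score among correct identifications
    lik_neg   -- p(F|-), likelihood of the score among incorrect identifications
    prior_pos -- p(+); p(-) is taken as 1 - p(+)
    """
    numerator = lik_pos * prior_pos
    denominator = numerator + lik_neg * (1.0 - prior_pos)
    return numerator / denominator


# Hypothetical numbers, chosen only to exercise the functions.
score = discriminant([1.0, 2.0], coeffs=[0.5, 0.25], intercept=1.0)
probability = posterior_correct(lik_pos=0.6, lik_neg=0.2, prior_pos=0.5)
```

In practice the likelihoods p(F|+) and p(F|-) would come from distributions fitted to the observed scores; here they are simply supplied by the caller.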

Percolator [2, 3]

\[ F(x) = \sum_i w_i h_i(x) + b, \quad \text{where } h_i(x) = \tanh\big((w^{i})^{T} x + b^{i}\big) \]

Sigmoid loss function, with labels y in {+1, -1}:

\[ L(F(x), y) = \frac{1}{1 + \exp(y F(x))} \]
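The Percolator score F(x) is the output of a network with one hidden layer of tanh units. A minimal sketch of the forward pass follows; the list-based weight layout and the function names are assumptions made for illustration, not Percolator's actual code.

```python
import math


def hidden_unit(x, weights, bias):
    """One hidden unit: tanh of a weighted sum of the features plus a bias."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)


def network_score(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """F(x): weighted sum of the hidden-unit outputs plus the output bias b."""
    total = out_bias
    for w_out, w_hidden, b_hidden in zip(out_weights, hidden_weights, hidden_biases):
        total += w_out * hidden_unit(x, w_hidden, b_hidden)
    return total


# Two features, two hidden units; all numbers are hypothetical.
f = network_score(
    x=[1.0, -1.0],
    hidden_weights=[[0.5, 0.5], [1.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    out_weights=[2.0, 1.0],
    out_bias=0.1,
)
```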

Background

Trans-Proteomic Pipeline

mzXML -> X!Tandem -> (PeptideProphet | Percolator | Ensemble Learning) -> ProteinProphet -> Proteins

Ensemble Learning

Homogeneous: learners from the same category

boosting, bagging, random forest

Heterogeneous: learners from different categories

Ensemble Learning

Example of Ensemble Learning

Two real variables

Two random pseudo variables

Three methods: linear model, SVM, and random forest

Ensemble Learning

Results of Three Methods

Ensemble Learning

Average of Three Methods
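Averaging is the simplest way to combine the three methods: each model scores every example, and the ensemble score is the mean of those scores. A minimal sketch, assuming each model's predictions are available as a list of numbers in the same example order (the function name is hypothetical):

```python
def average_ensemble(predictions):
    """Average per-example scores over models.

    predictions -- one list of scores per model, all of the same length.
    """
    n_models = len(predictions)
    return [sum(example_scores) / n_models for example_scores in zip(*predictions)]


# Hypothetical scores from a linear model, an SVM, and a random forest.
combined = average_ensemble([
    [0.2, 0.8],
    [0.4, 0.6],
    [0.6, 1.0],
])
```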

Ensemble Strategy

Non-negative Least Squares and Logistic Regression

Non-negative least squares regression

\[ f_e(X) = \sum_{i=1}^{k} \alpha_i f_i(X), \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0 \]

Non-negative logistic regression

\[ f_e(X) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{k} \alpha_i f_i(X)\right)}, \quad \alpha_i \ge 0 \]
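Both combination rules can be sketched in a few lines. The fitting routine below uses projected gradient descent onto the simplex, which is a stand-in chosen for brevity and not necessarily the solver used for the slides; all function names, the step size, and the iteration count are assumptions.

```python
import math


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        running += u_j
        t = (running - 1.0) / j
        if u_j - t > 0:
            theta = t  # shift from the last index that stays positive
    return [max(x - theta, 0.0) for x in v]


def fit_simplex_weights(model_preds, targets, steps=500, lr=0.1):
    """Least-squares weights alpha with alpha_i >= 0 and sum(alpha) = 1."""
    k, n = len(model_preds), len(targets)
    alpha = [1.0 / k] * k
    for _ in range(steps):
        ensemble = [sum(alpha[i] * model_preds[i][j] for i in range(k)) for j in range(n)]
        grad = [
            2.0 / n * sum((ensemble[j] - targets[j]) * model_preds[i][j] for j in range(n))
            for i in range(k)
        ]
        alpha = project_to_simplex([a - lr * g for a, g in zip(alpha, grad)])
    return alpha


def logistic_ensemble(alpha, scores):
    """Logistic combination 1 / (1 + exp(-sum_i alpha_i * f_i(X)))."""
    return 1.0 / (1.0 + math.exp(-sum(a * s for a, s in zip(alpha, scores))))


# Toy data: model 0 reproduces the targets, model 1 inverts them.
targets = [1.0, 0.0, 1.0, 0.0]
preds = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
alpha = fit_simplex_weights(preds, targets)
```

On this toy data nearly all of the weight moves to the first model, which is the behaviour the non-negativity constraint is meant to give: unhelpful learners are driven to zero weight instead of receiving negative weights.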

Ensemble Strategy

Greedy Strategy

1 Start with the empty ensemble;

2 Add the model that maximizes the ensemble's classification performance on the training dataset;

3 Repeat Step 2 for a fixed number of iterations;

4 Return the final ensemble.
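The four steps above can be sketched as follows. Several details are assumptions that the slide does not specify: the ensemble prediction is the average of the selected models' scores, performance is measured as accuracy at a 0.5 threshold, and a model may be selected more than once. The function names are hypothetical.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def greedy_ensemble(model_scores, labels, n_iter=5):
    """Greedy forward selection of models.

    model_scores -- dict mapping model name to its training-set scores
    labels       -- 0/1 training labels
    Returns the list of selected model names, in selection order.
    """
    selected = []                    # Step 1: start with the empty ensemble
    totals = [0.0] * len(labels)     # running sum of the selected models' scores
    for _ in range(n_iter):          # Step 3: fixed number of iterations
        best_name, best_acc = None, -1.0
        for name, scores in model_scores.items():
            size = len(selected) + 1
            candidate = [(t + s) / size for t, s in zip(totals, scores)]
            acc = accuracy(candidate, labels)
            if acc > best_acc:       # Step 2: keep the best addition
                best_name, best_acc = name, acc
        selected.append(best_name)
        totals = [t + s for t, s in zip(totals, model_scores[best_name])]
    return selected                  # Step 4: return the final ensemble


# Hypothetical training scores for two models.
labels = [1, 0, 1, 0]
models = {
    "good": [0.9, 0.1, 0.8, 0.2],
    "bad": [0.1, 0.9, 0.2, 0.8],
}
chosen = greedy_ensemble(models, labels, n_iter=2)
```

Because the selection is scored on the training data, it can overfit; scoring candidates on a held-out set is a common variation.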

Application in MS/MS

Available Features

Symbol                                   Description
mass                                     precursor neutral mass
time                                     retention time
∆M                                       mass difference
#match                                   number of matched ions
pepLen                                   peptide length
charge                                   charge state
exp                                      E-value
#missed                                  number of missed cleavages
enzN                                     whether preceded by an enzymatic site
enzC                                     whether there is an enzymatic C-terminus
#consistent                              number of peptide termini consistent with cleavage
#ions                                    number of fragment ions predicted for the peptide
#proteins                                number of proteins containing the peptide
Arg, ..., Val                            count of each kind of amino acid
Hyperscore, Nextscore, BScore, YScore    scoring functions in X!Tandem

Weights in Regularized Generalized Linear Model

Description    Weights    Description    Weights
#missed        -1.923     Arg            -1.321
charge          1.246     Cys            -1.062
Lys            -0.990     His             0.790
Trp             0.726     #consistent    -0.494
Pro             0.407     Asp             0.388
Met            -0.369     Val             0.350
bscore         -0.347     Tyr             0.238
#ions           0.210

Models Used

Algorithm       Description            R Package
glm             linear model           stats
randomForest    random forest          randomForest
knn             k-nearest neighbour    stats
glmnet          elastic net            glmnet
svm             SVM                    e1071
step            stepwise glm           stats

Training and Testing Dataset

Paola Picotti, et al. Nature 494:266-270, 2013 [4]

ROC Curves

[Figure: ROC curves; x-axis: false positive rate (0.0 to 1.0). Legend values: Ensemble Learning 0.873, Percolator 0.821, PeptideProphet 0.789.]
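If the legend values are read as areas under the ROC curves (an assumption; the slide does not label them), such a number can be computed from scores and labels as the probability that a randomly chosen correct identification outscores a randomly chosen incorrect one. A minimal sketch with a hypothetical function name:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (correct, incorrect) pairs, how often the correct
    identification has the higher score; ties count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: three of the four (correct, incorrect) pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of identifications; for large datasets a rank-based computation gives the same value more cheaply.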

Relation between FDR and Ensemble Score with LOESS

[Figure: FDR plotted against ensemble score (x-axis: Ensemble Score, 0.00 to 1.00), with a LOESS fit.]

Number of Correct/Incorrect Identifications with 0.05 FDR

[Figure: number of correct and incorrect identifications for each method: PeptideProphet, Percolator, Ensemble.]
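Counts like these can be obtained by walking down the score-ranked list and keeping the largest accepted set whose FDR stays within the limit. A minimal sketch, assuming the true labels of the test identifications are known and taking FDR as incorrect / (correct + incorrect) among the accepted identifications; the function name is hypothetical and tied scores are not treated specially.

```python
def accept_at_fdr(scores, labels, max_fdr=0.05):
    """Return (correct, incorrect) counts for the largest accepted set within max_fdr.

    Identifications are accepted from the highest score downward.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = (0, 0)
    correct = incorrect = 0
    for _, is_correct in ranked:
        if is_correct:
            correct += 1
        else:
            incorrect += 1
        if incorrect / (correct + incorrect) <= max_fdr:
            best = (correct, incorrect)  # deepest cut-off that still meets the limit
    return best


# Hypothetical scores and labels (1 = correct, 0 = incorrect).
strict = accept_at_fdr([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], max_fdr=0.05)
loose = accept_at_fdr([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], max_fdr=0.3)
```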

Some Conclusions

Ensemble learning methods often have better results

Very easy to overfit the training data

Time-consuming for model training


References

[1] Andrew Keller, Alexey I Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 2002.

[2] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11), 2007.

[3] Marina Spivak, Jason Weston, Léon Bottou, Lukas Käll, and William Stafford Noble. Improvements to the percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009.

[4] Paola Picotti, Mathieu Clément-Ziza, Henry Lam, David S Campbell, Alexander Schmidt, Eric W Deutsch, Hannes Röst, Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436), 2013.

Thank you!

http://qkou.info/sl.pdf
