ensemble learning model for ms/ms
Post on 26-Jan-2015
Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS
Qiang Kou Shan Xiao Xiaohui Yao Zongliang Yue
qkou@umail.iu.edu
April 29, 2014
Qiang Kou (qkou@umail.iu.edu) Ensemble Learning 1/25 April 29, 2014 1 / 25
Background
Mass spectrometry has become the most widely used tool for the characterization of proteins
Many database search tools and algorithms have been developed, including SEQUEST, MASCOT, X!Tandem, InsPecT, and MS-Align+
The scores of correct and incorrect identifications typically overlap significantly
[Figure: histogram of the number of spectra in each bin (0 to 200) against the discriminant score D (-3.9 to 7.3), with separate "Correct" and "Incorrect" distributions. From Brian C. Searle.]
Available Software
PeptideProphet [1]
$F(x_1, x_2, \ldots, x_n) = c_0 + \sum_{i=1}^{n} c_i x_i$
$p(+ \mid F) = \frac{p(F \mid +)\,p(+)}{p(F \mid +)\,p(+) + p(F \mid -)\,p(-)}$
Percolator [2, 3]
$F(x) = \sum_i w_i h_i(x) + b$, where $h_k(x) = \tanh((w_k)^T x + b_k)$
Sigmoid loss function: $L(F(x), y) = 1 / (1 + \exp(y F(x)))$
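As a rough illustration of the two scoring schemes on this slide, the sketch below evaluates a linear discriminant score, converts class-conditional likelihoods into a posterior with Bayes' rule, and evaluates a one-hidden-layer tanh network with a sigmoid loss. The function names, the toy inputs, and the loss form 1 / (1 + exp(y F(x))) with y in {-1, +1} are assumptions made for illustration; this is not the code of either tool.

```python
import math


def discriminant(x, c0, c):
    """Linear discriminant F(x) = c0 + sum_i c_i * x_i."""
    return c0 + sum(ci * xi for ci, xi in zip(c, x))


def posterior_correct(lik_pos, lik_neg, prior_pos):
    """Bayes' rule: p(+|F) from p(F|+), p(F|-) and the prior p(+)."""
    numerator = lik_pos * prior_pos
    return numerator / (numerator + lik_neg * (1.0 - prior_pos))


def network_score(x, hidden, out_weights, out_bias):
    """F(x) = sum_k w_k * tanh(v_k . x + b_k) + b.

    `hidden` is a list of (v_k, b_k) pairs, one per hidden unit
    (hypothetical layout chosen for this sketch).
    """
    total = out_bias
    for (v_k, b_k), w_k in zip(hidden, out_weights):
        activation = sum(v * xi for v, xi in zip(v_k, x)) + b_k
        total += w_k * math.tanh(activation)
    return total


def sigmoid_loss(score, y):
    """Assumed sigmoid loss 1 / (1 + exp(y * F(x))), with y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(y * score))
```

With this loss, a confidently correct score (large positive F for y = +1) costs close to 0, and a confidently wrong one costs close to 1.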
Trans-Proteomic Pipeline
[Diagram: mzXML -> X!Tandem -> PeptideProphet / Percolator / Ensemble Learning -> ProteinProphet -> Proteins]
Ensemble Learning
Homogeneous: learners from the same category (boosting, bagging, random forest)
Heterogeneous: learners from different categories
Example of Ensemble Learning
Two real variables
Two random pseudo variables
Three methods: linear model, SVM, and random forest
Results of Three Methods
Average of Three Methods
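The averaging step can be sketched as follows. The three functions are hypothetical stand-ins for already-fitted models (they are not the linear model, SVM, and random forest from the slides), and each input is assumed to hold two real variables followed by two pseudo variables.

```python
def average_ensemble(models, x):
    """Unweighted mean of the models' predictions for one input."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical stand-ins for three fitted models.
# x = (real_1, real_2, pseudo_1, pseudo_2)
def linear_like(x):
    return 0.5 * x[0] + 0.5 * x[1]


def svm_like(x):
    # This stand-in wrongly picks up some signal from a pseudo variable.
    return 0.6 * x[0] + 0.4 * x[1] + 0.3 * x[2]


def forest_like(x):
    return 1.0 if x[0] + x[1] > 1.0 else 0.0


models = [linear_like, svm_like, forest_like]
```

Averaging dilutes an error made by only one model: the pseudo-variable contribution in `svm_like` is divided by three in the ensemble output.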
Ensemble Strategy
Non-negative Least Squares and Logistic Regression
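Non-negative least squares regression
$f_e(X) = \sum_{i=1}^{k} \alpha_i f_i(X), \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0$
Non-negative logistic regression
$f_e(X) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{k} \alpha_i f_i(X)\right)}, \quad \alpha_i \ge 0$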
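A minimal sketch of the two combiners, assuming the base-model outputs f_i(X) are already available as plain numbers. The fitting routine uses projected gradient descent (a gradient step on the log-loss, then clipping at zero); that optimizer, the function names, and the step settings are choices made for this example and are not taken from the slides.

```python
import math


def nnls_ensemble(preds, alphas):
    """Convex combination f_e = sum_i alpha_i * f_i, with alpha_i >= 0, sum = 1."""
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must be non-negative and sum to 1")
    return sum(a * p for a, p in zip(alphas, preds))


def logistic_ensemble(preds, alphas):
    """f_e = 1 / (1 + exp(-sum_i alpha_i * f_i)), with alpha_i >= 0."""
    if any(a < 0 for a in alphas):
        raise ValueError("alphas must be non-negative")
    z = sum(a * p for a, p in zip(alphas, preds))
    return 1.0 / (1.0 + math.exp(-z))


def fit_nonneg_logistic(rows, labels, steps=500, lr=0.1):
    """Fit alphas by projected gradient descent on the mean log-loss.

    rows[j] holds the k base-model outputs for example j; labels are 0 or 1.
    """
    k = len(rows[0])
    alphas = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for preds, y in zip(rows, labels):
            err = logistic_ensemble(preds, alphas) - y
            for i, p in enumerate(preds):
                grad[i] += err * p / len(rows)
        # Projection onto the feasible set: clip negative weights to zero.
        alphas = [max(0.0, a - lr * g) for a, g in zip(alphas, grad)]
    return alphas


# Toy data: model 0 is informative, model 1 points the wrong way.
rows = [(2.0, -1.0), (1.5, -0.5), (-2.0, 1.0), (-1.0, 0.8)]
labels = [1, 1, 0, 0]
alphas = fit_nonneg_logistic(rows, labels)
```

The non-negativity constraint means a base model that would need a negative weight is dropped from the ensemble (its weight is driven to zero) instead of being inverted.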
Greedy Strategy
1. Start with an empty ensemble;
2. Add the model that maximizes the ensemble's classification performance on the training dataset;
3. Repeat Step 2 for a fixed number of iterations;
4. Return the final ensemble.
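The four steps above can be sketched as a forward selection loop. Several details are assumptions for this example and are not stated on the slide: the ensemble output is the mean of the selected models' scores, performance is accuracy at a 0.5 threshold, and a model may be selected more than once.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def greedy_ensemble(model_preds, labels, n_iter=5):
    """Greedy forward selection of models by training accuracy.

    model_preds maps a model name to its scores on the training set.
    Returns the list of selected model names, in selection order.
    """
    selected = []                       # Step 1: empty ensemble
    running_sum = [0.0] * len(labels)
    for _ in range(n_iter):             # Step 3: fixed number of iterations
        best_name, best_acc = None, -1.0
        for name, preds in model_preds.items():
            size = len(selected) + 1
            trial = [(s + p) / size for s, p in zip(running_sum, preds)]
            acc = accuracy(trial, labels)
            if acc > best_acc:          # Step 2: keep the best addition
                best_name, best_acc = name, acc
        selected.append(best_name)
        running_sum = [s + p for s, p in
                       zip(running_sum, model_preds[best_name])]
    return selected                     # Step 4: final ensemble


# Toy training scores from three hypothetical models.
model_preds = {
    "a": [0.9, 0.8, 0.2, 0.6],
    "b": [0.4, 0.7, 0.1, 0.3],
    "c": [0.5, 0.5, 0.5, 0.5],
}
labels = [1, 1, 0, 0]
chosen = greedy_ensemble(model_preds, labels, n_iter=2)
```

Because selection is scored on the training data itself, a long run of iterations can overfit; scoring on a held-out set is a common alternative.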
Application in MS/MS
Available Features
Symbol                                 Description
mass                                   precursor neutral mass
time                                   retention time
∆M                                     mass difference
#match                                 number of matched ions
pepLen                                 peptide length
charge                                 charge state
exp                                    E-value
#missed                                number of missed cleavages
enzN                                   whether preceded by an enzymatic site
enzC                                   whether there is an enzymatic C-terminus
#consistent                            number of peptide termini consistent with cleavage
#ions                                  number of fragment ions predicted for the peptide
#proteins                              number of proteins containing the peptide
Arg, ..., Val                          count of each kind of amino acid
Hyperscore, Nextscore, BScore, YScore  scoring functions in X!Tandem
Weights in Regularized Generalized Linear Model
Description   Weight    Description   Weight
#missed       -1.923    Arg           -1.321
charge         1.246    Cys           -1.062
Lys           -0.990    His            0.790
Trp            0.726    #consistent   -0.494
Pro            0.407    Asp            0.388
Met           -0.369    Val            0.350
bscore        -0.347    Tyr            0.238
#ions          0.210
Models Used
Algorithm      Description                R Package
glm            generalized linear model   stats
randomForest   random forest              randomForest
knn            k-nearest neighbour        class
glmnet         elastic net                glmnet
svm            SVM                        e1071
step           stepwise glm               stats
Training and Testing Dataset
Paola Picotti, et al. Nature 494:266-270, 2013 [4]
ROC Curves
[Figure: ROC curves, true positive rate against false positive rate, both from 0.0 to 1.0. Legend: Ensemble Learning 0.873, Percolator 0.821, PeptideProphet 0.789]
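The legend values read like areas under the ROC curve, although the slide does not say so. Under that assumption, the sketch below shows one way such an area can be computed from scores and 0/1 labels, using the pairwise-ranking form of the AUC. The function name and the toy data are made up for this example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one positive is ranked below one negative.
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The double loop is quadratic in the number of identifications; for large datasets a sort-based rank computation gives the same value faster.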
Relation between FDR and Ensemble Score with LOESS
[Figure: scatter plot of FDR against ensemble score, both axes from 0.00 to 1.00, with a LOESS fit]
Number of Correct/Incorrect Identifications with 0.05 FDR
[Figure: bar chart of the number of correct and incorrect identifications (y-axis 0 to 1000) for each method: PeptideProphet, Percolator, Ensemble]
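A comparison at a fixed FDR can be sketched as below, assuming each identification has a score and a known correct/incorrect label and that FDR is the fraction of incorrect identifications among those accepted. How the FDR was actually estimated for these slides is not stated, so the function is illustrative only.

```python
def counts_at_fdr(scores, is_correct, max_fdr=0.05):
    """Return (correct, incorrect) counts for the largest accepted set
    of top-scoring identifications whose FDR is at most max_fdr.

    Tied scores are not grouped in this sketch.
    """
    ranked = sorted(zip(scores, is_correct),
                    key=lambda pair: pair[0], reverse=True)
    best = (0, 0)
    correct = incorrect = 0
    for _, ok in ranked:
        if ok:
            correct += 1
        else:
            incorrect += 1
        # FDR of the set accepted so far.
        if incorrect / (correct + incorrect) <= max_fdr:
            best = (correct, incorrect)
    return best


# Toy data: the third-ranked identification is incorrect.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_correct = [True, True, False, True, True]
strict = counts_at_fdr(toy_scores, toy_correct, max_fdr=0.05)
loose = counts_at_fdr(toy_scores, toy_correct, max_fdr=0.25)
```

At the same FDR, a better-separating score accepts more correct identifications, which is the quantity the bar chart compares across methods.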
Conclusions
Ensemble learning methods often give better results
They overfit the training data very easily
Model training is time-consuming
References
[1] Andrew Keller, Alexey I Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 2002.
[2] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11), 2007.
[3] Marina Spivak, Jason Weston, Léon Bottou, Lukas Käll, and William Stafford Noble. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009.
[4] Paola Picotti, Mathieu Clément-Ziza, Henry Lam, David S Campbell, Alexander Schmidt, Eric W Deutsch, Hannes Röst, Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436):266–270, 2013.
Thank you!
http://qkou.info/sl.pdf