BITS - Search engines for mass spec data
DESCRIPTION
This is the third presentation of the BITS training on 'Mass spec data processing'. It reviews methods for matching mass spectrometry data with protein sequences and surveys useful tools. Thanks to the Compomics Lab of the VIB for their contribution.
TRANSCRIPT
![Page 1: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/1.jpg)
http://www.bits.vib.be/training
![Page 2: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/2.jpg)
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens
Proteomics Services Group European Bioinformatics Institute
Hinxton, Cambridge United Kingdom www.ebi.ac.uk
search engines
lennart martens
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University
Ghent, Belgium
![Page 3: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/3.jpg)
THREE TYPICAL PRE-PROCESSING STEPS
![Page 4: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/4.jpg)
[Figure: two spectra comparing global thresholding and local thresholding; the precursor peak is marked in each.]
Noise thresholding
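The difference between the two thresholding styles named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names, the 5% base-peak cut-off, the 100 m/z window, and the "keep the top N peaks per window" rule are all assumptions chosen for the example.

```python
from collections import defaultdict


def global_threshold(peaks, fraction=0.05):
    """Keep peaks whose intensity is at least `fraction` of the most intense peak.

    `peaks` is a list of (mz, intensity) tuples. One cut-off is applied to the
    whole spectrum (assumed definition of 'global' thresholding).
    """
    if not peaks:
        return []
    cutoff = fraction * max(intensity for _, intensity in peaks)
    return [(mz, intensity) for mz, intensity in peaks if intensity >= cutoff]


def local_threshold(peaks, window=100.0, top_n=2):
    """Keep the `top_n` most intense peaks inside each m/z window.

    The cut-off adapts to each region of the spectrum (assumed definition of
    'local' thresholding), so weak but informative regions are not wiped out
    by one dominant peak elsewhere.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))
    kept = []
    for bin_peaks in bins.values():
        kept.extend(sorted(bin_peaks, key=lambda p: p[1], reverse=True)[:top_n])
    return sorted(kept)


# Hypothetical spectrum: one dominant peak at m/z 450 and a weaker low-mass region.
spectrum = [(110.0, 3.0), (150.0, 50.0), (180.0, 4.0),
            (420.0, 2.0), (450.0, 1000.0), (480.0, 30.0)]
print(global_threshold(spectrum))
print(local_threshold(spectrum))
```

With a dominant peak in the spectrum, the global cut-off removes most of the low-mass region, whereas the local variant keeps the strongest peaks of every window.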
![Page 5: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/5.jpg)
From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
Charge deconvolution (peptides)
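For peptides, charge deconvolution usually comes down to two small calculations: estimating the charge from the spacing of the isotope peaks, and converting the observed m/z into a neutral mass. The sketch below illustrates both. The function names are hypothetical, and the constants are the standard proton mass and the 13C-12C mass difference; the slide itself does not prescribe this procedure.

```python
PROTON_MASS = 1.007276        # Da
ISOTOPE_SPACING = 1.00335     # Da, mass difference between 13C and 12C


def charge_from_isotope_spacing(mz_first, mz_second):
    """Estimate the charge from two adjacent isotope peaks.

    Adjacent isotope peaks of an ion with charge z lie about 1.00335 / z apart
    on the m/z axis.
    """
    return round(ISOTOPE_SPACING / abs(mz_second - mz_first))


def neutral_mass(mz, charge):
    """Convert an observed m/z of a protonated ion into its neutral mass."""
    return charge * (mz - PROTON_MASS)


# Hypothetical doubly charged peptide ion with isotope peaks 0.5017 m/z apart.
z = charge_from_isotope_spacing(500.2500, 500.7517)
print(z, neutral_mass(500.2500, z))
```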
![Page 6: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/6.jpg)
From: Gill et al, EMBO Journal, 2000
Charge deconvolution (proteins)
![Page 7: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/7.jpg)
[Figure: isotope envelope with the monoisotopic mass and the average mass marked.]
Centroiding (peak picking)
![Page 8: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/8.jpg)
From: Last et al, Nature Rev. Mol. Cell Bio., 2007
A total ion current chromatogram, corrected by typical pre-processing steps.
Combined results
![Page 9: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/9.jpg)
[Figure: bar chart of file size (MB) per data type (RAW, RAW GZIPped, peak lists, peak lists GZIPped) for two instruments, Q-TOF I and Esquire HCT. Bar labels in extraction order: 51.4, 25.8, 0.7, 0.3, 24.5, 23.7, 0.2, 0.1.]
See: Martens et al., Proteomics, 2005
Data size reduction
![Page 10: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/10.jpg)
MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING
![Page 11: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/11.jpg)
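[Figure: fragment ladders of the example peptide LENNART, drawn as a spectrum with m/z on the x-axis and intensity on the y-axis.]
N-terminal fragments: L, LE, LEN, LENN, LENNA, LENNAR, LENNART
C-terminal fragments: T, RT, ART, NART, NNART, ENNART, LENNART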
Peptide sequences and MS/MS spectra
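The link between a peptide sequence and the peak positions in its MS/MS spectrum can be made concrete by computing the m/z of every N-terminal and C-terminal fragment of the example peptide LENNART. The sketch below assumes singly charged b-type (N-terminal) and y-type (C-terminal) ions and standard monoisotopic residue masses; the ion types and the function name are assumptions for illustration, and the full-length entries are included only to mirror the ladders on the slide.

```python
# Monoisotopic residue masses (Da) for the residues of the example peptide only.
RESIDUE_MASS = {
    "L": 113.08406, "E": 129.04259, "N": 114.04293,
    "A": 71.03711, "R": 156.10111, "T": 101.04768,
}
WATER = 18.010565    # Da, added to C-terminal (y-type) fragments
PROTON = 1.007276    # Da, charge carrier for a singly charged ion


def fragment_ladder(peptide):
    """Return (prefix, b_mz, suffix, y_mz) for fragment lengths 1..len(peptide)."""
    ladder = []
    for k in range(1, len(peptide) + 1):
        prefix, suffix = peptide[:k], peptide[-k:]
        b_mz = sum(RESIDUE_MASS[aa] for aa in prefix) + PROTON
        y_mz = sum(RESIDUE_MASS[aa] for aa in suffix) + WATER + PROTON
        ladder.append((prefix, b_mz, suffix, y_mz))
    return ladder


for prefix, b_mz, suffix, y_mz in fragment_ladder("LENNART"):
    print(f"{prefix:<8}{b_mz:10.4f}   {suffix:<8}{y_mz:10.4f}")
```

Differences between consecutive values in either ladder equal the mass of one residue, which is what allows a sequence to be read back from a spectrum.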
![Page 12: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/12.jpg)
[Figure: the peptide fragment fingerprinting workflow.]
• protein sequence database → in silico digest → peptide sequences (YSFVATAER, HETSINGK, MILQEESTVYYR, SEFASTPINK, …)
• peptide sequences → in silico MS/MS → theoretical MS/MS spectra
• theoretical MS/MS spectra + experimental MS/MS spectrum → in silico matching → peptide scores
• example peptide scores: 1) YSFVATAER 34, 2) YSFVSAIR 12, 3) FFLIGGGGK 12
Peptide fragment fingerprinting (PFF)
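The first step of this workflow, the in silico digest, can be sketched as follows. The slide does not name the protease; the example assumes trypsin with the common rule "cleave after K or R, but not before P", and the function name and the concatenated test protein are hypothetical.

```python
import re


def tryptic_digest(protein, min_length=1):
    """Split a protein sequence after every K or R that is not followed by P.

    This is the textbook trypsin rule without missed cleavages; real search
    engines typically also allow a configurable number of missed cleavages.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [piece for piece in pieces if len(piece) >= min_length]


# Hypothetical protein built by concatenating the peptides shown on the slide.
protein = "YSFVATAERHETSINGKMILQEESTVYYRSEFASTPINK"
print(tryptic_digest(protein))
```

Each resulting peptide would then be turned into a theoretical spectrum and compared with the experimental spectrum to produce the peptide scores.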
![Page 13: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/13.jpg)
• Spectral comparison: database sequence → theoretical spectrum, which is compared with the experimental spectrum
• Sequential comparison: experimental spectrum → de novo sequence, which is compared with the database sequence
• Threading comparison: the database sequence is threaded onto the experimental spectrum
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
Three types of PFF identification
![Page 14: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/14.jpg)
• MASCOT (Matrix Science) http://www.matrixscience.com
• SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest
• X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM
• OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
The most popular algorithms
![Page 15: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/15.jpg)
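[Figure: overlapping score distributions of incorrect and correct identifications, split by a threshold score; the parts of the two distributions on the wrong side of the threshold are the false positives and the false negatives.]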
Adapted from: www.proteomesoftware.com – Wiki pages
Overall concept of scores and cut-offs
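The effect of a score cut-off can be illustrated by counting how a list of scored identifications falls on either side of a threshold. The sketch below is only a toy: the function name and the scores are hypothetical, and in real data the true status of an identification is unknown and has to be estimated, for example with decoy searches.

```python
def confusion_at_threshold(scored_ids, threshold):
    """Count outcomes for (score, is_correct) pairs at a given score threshold.

    An identification is accepted when its score is at least `threshold`.
    Accepted incorrect hits are false positives; rejected correct hits are
    false negatives.
    """
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for score, is_correct in scored_ids:
        accepted = score >= threshold
        if accepted and is_correct:
            counts["true_pos"] += 1
        elif accepted:
            counts["false_pos"] += 1
        elif is_correct:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts


# Hypothetical scored identifications: (score, whether the match is really correct).
hits = [(50, True), (40, True), (35, False), (20, True), (15, False), (10, False)]
print(confusion_at_threshold(hits, 30))
print(confusion_at_threshold(hits, 45))
```

Raising the threshold from 30 to 45 removes the false positive in this toy set, but at the price of an extra false negative.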
![Page 16: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/16.jpg)
[Figure: false positives (axis 0% to 6%) and identifications (axis 0% to 100%) at cut-offs p = 0.05, 0.01, 0.005 and 0.0005, ordered towards higher stringency.]
Playing with probabilistic cut-off scores
![Page 17: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/17.jpg)
• Very well established search engine
• Can be used for MS/MS (PFF) identifications
• Based on a cross-correlation score (includes peak height)
• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr), and score difference between the top two ranks (deltaCn, ∆Cn)
• Thresholding is up to the user, and is commonly done per charge state
• Many extensions exist to perform a more automatic validation of results
SEQUEST
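$$\mathrm{XCorr} = R_0 - \frac{1}{151}\sum_{i=-75}^{+75} R_i, \qquad R_i = \sum_{j=1}^{n} x_j \cdot y_{(j+i)}$$

$$\mathrm{deltaCn} = \frac{\mathrm{XCorr}_1 - \mathrm{XCorr}_2}{\mathrm{XCorr}_1}$$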
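The two SEQUEST scores on this slide can be sketched directly from their definitions: XCorr is the correlation of the experimental and theoretical spectra at zero offset, minus the mean correlation over offsets from -75 to +75 (151 values), and deltaCn is the relative XCorr difference between the top two ranked candidates. The code below is a plain-Python illustration under the assumption that both spectra are already binned into equal-length intensity vectors; it is not the SEQUEST implementation, which also pre-processes the spectra and uses faster transforms.

```python
def correlation(x, y, shift):
    """R_i: sum of x[j] * y[j + shift], treating out-of-range bins as zero."""
    total = 0.0
    for j, x_value in enumerate(x):
        k = j + shift
        if 0 <= k < len(y):
            total += x_value * y[k]
    return total


def xcorr(x, y, max_shift=75):
    """Correlation at zero offset minus the mean correlation over all offsets."""
    shifts = range(-max_shift, max_shift + 1)          # 151 offsets by default
    background = sum(correlation(x, y, i) for i in shifts) / len(shifts)
    return correlation(x, y, 0) - background


def delta_cn(xcorr_first, xcorr_second):
    """Relative XCorr difference between the best and second-best candidate."""
    return (xcorr_first - xcorr_second) / xcorr_first


# Hypothetical binned spectra: a single matching peak in bin 100.
experimental = [0.0] * 200
experimental[100] = 1.0
theoretical = list(experimental)
print(xcorr(experimental, theoretical))
print(delta_cn(3.0, 2.4))
```

Subtracting the averaged background penalises spectra that correlate about equally well at any offset, which is the signature of a random match.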
![Page 18: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/18.jpg)
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res.. 2002
SEQUEST: some additional pictures
![Page 19: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/19.jpg)
• Very well established search engine, Perkins, Electrophoresis 1999
• Can do MS (PMF) and MS/MS (PFF) identifications
• Based on the MOWSE score
• Unpublished core algorithm (trade secret)
• Predicts an a priori threshold score that identifications need to pass
• From version 2.2, Mascot allows integrated decoy searches
• Provides rank, score, threshold and expectation value per identification
• Customizable confidence level for the threshold score
Mascot
![Page 20: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/20.jpg)
[Figure: average identity threshold versus log10(number of AA), with linear fit y = 8.3761x - 34.089, R² = 0.9985.]
[Figure: false positives (axis 0% to 6%) and identifications (axis 0% to 100%) at cut-offs p = 0.05, 0.01, 0.005 and 0.0005.]
Mascot: some additional pictures
![Page 21: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/21.jpg)
• A successful open source search engine, Craig and Beavis, RCMS 2003
• Can be used for MS/MS (PFF) identifications
• Based on a hyperscore (Pi is either 0 or 1):
• Relies on a hypergeometric distribution (hence hyperscore)
• Published core algorithm, and is freely available
• Provides hyperscore and expectancy score (the discriminating one)
• X!Tandem is fast and can handle modifications in an iterative fashion
• Has rapidly gained popularity as (auxiliary) search engine
X!Tandem
$$\mathrm{HyperScore} = \left(\sum_{i=0}^{n} I_i \cdot P_i\right) \cdot N_b! \cdot N_y!$$
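The hyperscore can be sketched as the summed intensity of the matched peaks (P_i is 1 for a matched peak and 0 otherwise) multiplied by the factorials of the numbers of matched b and y ions. The function name and the example values below are hypothetical, and the sketch leaves out everything X!Tandem does around this formula, such as spectrum normalisation and the conversion to an expectation value.

```python
import math


def hyperscore(intensities, matched, n_b, n_y):
    """Summed intensity of matched peaks times N_b! times N_y!.

    `intensities` holds the peak intensities I_i, `matched` holds the
    indicators P_i (0 or 1), and n_b / n_y are the numbers of matched
    b and y ions.
    """
    dot_product = sum(i * p for i, p in zip(intensities, matched))
    return dot_product * math.factorial(n_b) * math.factorial(n_y)


# Hypothetical spectrum with three peaks, two of which match (two b ions... and
# one y ion are assumed for the factorial terms).
print(hyperscore([10.0, 20.0, 5.0], [1, 0, 1], n_b=2, n_y=1))
```

The factorial terms grow quickly, so a candidate that explains many fragment ions is rewarded far more than one that matches a few intense peaks.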
![Page 22: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/22.jpg)
[Figure: three panels on the hyperscore distribution: number of results versus hyperscore, and log(number of results) versus hyperscore on two scales, with the significance threshold marked and annotated E-value = e^(-8.2).]
Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
X!Tandem: some additional pictures
![Page 23: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/23.jpg)
A note on how the scores differ
| Search engine | Accuracy score | Relative score |
|---|---|---|
| SEQUEST | XCorr | DeltaCn |
| X!Tandem | HyperScore | E-Value |
Adapted from: Brian Searle, ProteomeSoftware
![Page 24: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/24.jpg)
• A successful open source search engine, Geer, JPR 2004
• Can be used for MS/MS (PFF) identifications
• Relies on a Poisson distribution
• Published core algorithm, and is freely available
• Provides an expectancy score, similar to the BLAST E-value
• OMSSA was recently upgraded to take peak intensity into account
• Received very good marks in a recently published comparative study
OMSSA
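How a Poisson model leads to an E-value can be sketched as follows: if random matches between a spectrum and a candidate peptide produce on average mu matching peaks, the probability of seeing at least k matches by chance is the Poisson tail, and multiplying that probability by the number of candidates tested gives an expectation value in the spirit of BLAST. This is a simplified illustration with hypothetical function names and numbers; OMSSA's actual model derives the mean from the spectrum and the mass tolerance and includes further refinements.

```python
import math


def poisson_tail(k, mu):
    """P(X >= k) for a Poisson-distributed number of random matches with mean mu."""
    cdf_below_k = sum(math.exp(-mu) * mu ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k


def expectation_value(k, mu, n_candidates):
    """Expected number of candidates reaching k or more matches by chance."""
    return n_candidates * poisson_tail(k, mu)


# Hypothetical case: 8 matched peaks where 1.5 are expected at random,
# with 10,000 candidate peptides tested for this spectrum.
print(expectation_value(8, 1.5, 10_000))
```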
![Page 25: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/25.jpg)
Yeast lysate spectrum: number of fragment peak m/z matches against the entire NCBI nr sequence library, with a fitted Poisson distribution.
Validation of the Poisson distribution model: mean number of modelled and measured
matching peaks (against the NCBI nr database) for two mass tolerances.
Adapted from: Geer et al., J. Prot. Res., 2004
OMSSA: some additional pictures
![Page 26: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/26.jpg)
COMPARATIVE STUDIES
![Page 27: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/27.jpg)
Kapp et al., Proteomics, 2005
![Page 28: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/28.jpg)
1.6x more?!
Balgley et al., Mol. Cell. Proteomics, 2007
![Page 29: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/29.jpg)
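[Figure: four-way Venn diagram of the identifications made by Mascot, SEQUEST, Phenyx and ProteinSolver. Values as extracted (their assignment to regions is not recoverable from this transcript): 1776, 501, 40, 212 (+4.2%), 486 (+9.6%), 329 (+6.5%), 380 (+7.5%), 3203, 3229, 3792, 3186, 168, 348, 179, 96, 146, 139, 77, 195.]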
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
Combining the output of search algorithms
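One simple way to combine the output of several search engines is to count, for each identified peptide, how many engines reported it, and then keep either the union or only the hits confirmed by a minimum number of engines. The sketch below shows that idea only; the function name, the voting rule, and the peptide sets are hypothetical and are not the procedure behind the figure.

```python
from collections import Counter


def combine_identifications(results, min_engines=2):
    """Return the peptides reported by at least `min_engines` search engines.

    `results` maps an engine name to the set of peptides it identified.
    min_engines=1 gives the union, min_engines=len(results) the intersection.
    """
    votes = Counter(peptide for ids in results.values() for peptide in set(ids))
    return {peptide for peptide, count in votes.items() if count >= min_engines}


# Hypothetical identifications from three engines.
results = {
    "Mascot": {"YSFVATAER", "HETSINGK"},
    "SEQUEST": {"YSFVATAER", "SEFASTPINK"},
    "Phenyx": {"YSFVATAER", "HETSINGK", "MILQEESTVYYR"},
}
print(sorted(combine_identifications(results, min_engines=2)))
```

Requiring agreement raises confidence in each retained hit, whereas the union maximises the number of identifications at the cost of accumulating each engine's false positives.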
![Page 30: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/30.jpg)
SEQUENTIAL COMPARISON ALGORITHMS
![Page 31: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/31.jpg)
Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
sequence tag
The concept of sequence tags was introduced by Mann and Wilm (Mann and Wilm, Anal. Chem. 1994, 66: 4390-4399).
Sequence tags
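A sequence tag combines a short stretch of sequence read from the spectrum with the masses that flank it on either side. Database peptides are then retained only if they contain the tag and the residues before and after it add up to the flanking masses. The sketch below is a simplified illustration of that filter: the function names and the tolerance are assumptions, and the flanking masses are expressed as plain sums of monoisotopic residue masses rather than as the fragment ion m/z values used in practice.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_mass(sequence):
    """Sum of the residue masses of a (partial) peptide sequence."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def match_tag(candidates, prefix_mass, tag, suffix_mass, tolerance=0.01):
    """Return candidates that contain `tag` flanked by the two given masses."""
    hits = []
    for peptide in candidates:
        start = peptide.find(tag)
        while start != -1:
            left = residue_mass(peptide[:start])
            right = residue_mass(peptide[start + len(tag):])
            if abs(left - prefix_mass) <= tolerance and abs(right - suffix_mass) <= tolerance:
                hits.append(peptide)
                break
            start = peptide.find(tag, start + 1)
    return hits


# Hypothetical tag read from a spectrum of LENNART: 242.12665 Da, then NNA, then 257.14879 Da.
candidates = ["LENNART", "GGNNART", "YSFVATAER"]
print(match_tag(candidates, 242.12665, "NNA", 257.14879))
```

Because only the flanking masses are checked, a peptide such as ELNNATR would pass the same filter, which is why tag-based tools refine their hits in a second stage.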
![Page 32: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/32.jpg)
• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
• Recent implementations of the sequence tag approach
• Refine hits by peak mapping in a second stage to resolve ambiguities
• Rely on an empirical fragmentation model
• Published core algorithms, DirecTag and TagRecon freely available
• Most useful to retrieve unexpected peptides (modifications, variations)
• Entire workflows exist (e.g., combination with IDPicker)
GutenTag, DirecTag, TagRecon
![Page 33: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/33.jpg)
From: Tabb et al., Anal. Chem., 2003
GutenTag: some additional pictures
![Page 34: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/34.jpg)
Example of a manual de novo interpretation of an MS/MS spectrum. A database is no longer necessary to extract a sequence!
Algorithms: Lutefisk, Sherenga, PEAKS, PepNovo, …
References: Dancik 1999, Taylor 2000, Fernandez-de-Cossio 2000, Ma 2003, Zhang 2004, Frank 2005, Grossmann 2005, …
De novo compared to sequence tags
![Page 35: BITS - Search engines for mass spec data](https://reader034.vdocument.in/reader034/viewer/2022042713/548531aeb4af9f910d8b4d0d/html5/thumbnails/35.jpg)
Thank you!
Questions?