bits - search engines for mass spec data
Post on 05-Dec-2014
1.207 Views
Preview:
DESCRIPTION
TRANSCRIPT
http://www.bits.vib.be/training
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Lennart MARTENS lennart.martens@ebi.ac.uk
Proteomics Services Group European Bioinformatics Institute
Hinxton, Cambridge United Kingdom www.ebi.ac.uk
search engines
lennart martens
lennart.martens@ugent.be
Computational Omics and Systems Biology Group
Department of Medical Protein Research, VIB Department of Biochemistry, Ghent University
Ghent, Belgium
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
THREE TYPICAL PRE-PROCESSING STEPS
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Global thresholding
Local thresholding
precursor
precursor
Noise thresholding
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
From: http://www.purdue.edu/dp/bioscience/images/spectrum.jpg
Charge deconvolution (peptides)
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
From: Gill et al, EMBO Journal, 2000
Charge deconvolution (proteins)
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
x x
Monoisotopic mass Average mass
Centroiding (peak picking)
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
From: Last et al, Nature Rev. Mol. Cell Bio., 2007
A total ion current chromatogram, corrected by typical pre-processing steps.
Combined results
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
51.4
25.8
0.7 0.3
24.5 23.7
0.2 0.10
10
20
30
40
50
60
RAW RAW GZIPped Peak lists Peak lists GZIPpedData type
File
siz
e (M
B)
Q-TOF I Esquire HCT
Data type
File size (MB)
Q-TOF I Esquire HCT
See: Martens et al., Proteomics, 2005
Data size reduction
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
MS/MS IDENTIFICATION
PEPTIDE FRAGMENTATION FINGERPRINTING
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
L E N N A R T
L LE
LEN
LENN
LENNA
LENNAR
LENNART
E N N A R T L
T
RT
ART
NART NNART
ENNART
LENNART
m/z
intensity
Peptide sequences and MS/MS spectra
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
protein sequence database
in silico
digest
YSFVATAER
HETSINGK
MILQEESTVYYR
SEFASTPINK
…
peptide sequences
m/z
Int
m/z
Int
m/z
Int m/z
Int in silico
MS/MS
theoretical MS/MS spectra
experimental MS/MS spectrum
in silico
matching
1) YSFVATAER 34 2) YSFVSAIR 12 3) FFLIGGGGK 12
peptide scores
Peptide fragment fingerprinting (PFF)
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Spectral comparison
Sequencial comparison
Threading comparison
database sequence theoretical spectrum
experimental spectrum
compare
database sequence experimental spectrum
compare de novo sequence
database sequence experimental spectrum
thread
From: Eidhammer, Flikka, Martens, Mikalsen – Wiley 2007
Three types of PFF identification
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• MASCOT (Matrix Science) http://www.matrixscience.com • SEQUEST (Scripps, Thermo Fisher Scientific) http://fields.scripps.edu/sequest • X!Tandem (The Global Proteome Machine Organization) http://www.thegpm.org/TANDEM • OMSSA (NCBI) http://pubchem.ncbi.nlm.nih.gov/omssa/
The most popular algorithms
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Incorrect identifications
Correct identifications
False positives False negatives
Threshold score
Adapted from: www.proteomesoftware.com – Wiki pages
Overall concept of scores and cut-offs
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
0%
1%
2%
3%
4%
5%
6%
p=0.05 p=0.01 p=0.005 p=0.00050%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
false positives
identifications
higher stringency
Playing with probabilistic cut-off scores
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• Very well established search engine
• Can be used for MS/MS (PFF) identifications
• Based on a cross-correlation score (includes peak height)
• Published core algorithm (patented, licensed to Thermo), Eng, JASMS 1994
• Provides preliminary (Sp) score, rank, cross-correlation score (XCorr),
and score difference between the top tow ranks (deltaCn, ∆Cn)
• Thresholding is up to the user, and is commonly done per charge state
• Many extensions exist to perform a more automatic validation of results
SEQUEST
XCorr = deltaCn= XCorr1− XCorr 2
XCorr1𝑅0 −
1151
� 𝑅𝑅+75
𝑖=−75
𝑅𝑖 = �𝑥𝑗 ∙ 𝑦(𝑗+𝑖)
𝑛
𝑗=1
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
From: MacCoss et al., Anal. Chem. 2002
From: Peng et al., J. Prot. Res.. 2002
SEQUEST: some additional pictures
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• Very well established search engine, Perkins, Electrophoresis 1999
• Can do MS (PMF) and MS/MS (PFF) identifications
• Based on the MOWSE score,
• Unpublished core algorithm (trade secret)
• Predicts an a priori threshold score that identifications need to pass
• From version 2.2, Mascot allows integrated decoy searches
• Provides rank, score, threshold and expectation value per identification
• Customizable confidence level for the threshold score
Mascot
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
y = 8.3761x - 34.089R2 = 0.9985
0
5
10
15
20
25
30
35
40
6.50 7.00 7.50 8.00 8.50log10(number of AA)
Ave
rage
iden
tity
thre
shol
dA
vera
ge id
enti
ty t
hres
hold
Mascot: some additional pictures
0%
1%
2%
3%
4%
5%
6%
p=0.05 p=0.01 p=0.005 p=0.00050%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
false positives
identifications
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• A successful open source search engine, Craig and Beavis, RCMS 2003
• Can be used for MS/MS (PFF) identifications
• Based on a hyperscore (Pi is either 0 or 1):
• Relies on a hypergeometric distribution (hence hyperscore)
• Published core algorithm, and is freely available
• Provides hyperscore and expectancy score (the discriminating one)
• X!Tandem is fast and can handle modifications in an iterative fashion
• Has rapidly gained popularity as (auxiliary) search engine
X!Tandem
*0
* !* !n
i i b yi
HyperScore I P N N=
= ∑
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
-10
-8
-6
-4
-2
0
2
4
6
0 20 40 60 80 100
hyperscore
log(
# re
sults
)
log(
# re
sults
)
0
0.5
1
1.5
2
2.5
3
3.5
4
20 25 30 35 40 45 50
hyperscore 0
10
20
30
40
50
60
0 20 40 60 80 100
hyperscore
# re
sults
Adapted from: Brian Searle, ProteomeSoftware, http://www.proteomesoftware.com/XTandem_edited.pdf
significance threshold
E-value=e-8.2
X!Tandem: some additional pictures
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
A note on how the scores differ
X! T
ande
m
SEQ
UES
T
XCorr
HyperScore
DeltaCn
E-Value
Accuracy Score Relative Score
Adapted from: Brian Searle, ProteomeSoftware
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• A successful open source search engine, Geer, JPR 2004
• Can be used for MS/MS (PFF) identifications
• Relies on a Poisson distribution
• Published core algorithm, and is freely available
• Provides an expectancy score, similar to the BLAST E-value
• OMSSA was recently upgraded to take peak intensity into account
• Good really good marks in a recently published comparative study
OMSSA
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Yeast lysate spectrum, m/z matches of fragment peak matches versus all NCBI nr sequence library. Poisson distribution fitted.
Validation of the Poisson distribution model: mean number of modelled and measured
matching peaks (against the NCBI nr database) for two mass tolerances.
Adapted from: Geer et al., J. Prot. Res., 2004
OMSSA: some additional pictures
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
COMPARATIVE STUDIES
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Kapp et al., Proteomics, 2005
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
1.6x more?!
Balgley et al., Mol. Cell. Proteomics, 2007
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
1776
Mascot SEQUEST
Phenyx
ProteinSolver
501
40
212 (+4,2%)
486 (+9,6%)
329 (+6,5%)
380 (+7,5%)
3203
3229 3792
3186 168
348
179
96
146
139 77 195
Figure courtesy of Dr. Christian Stephan, Medizinisches Proteom-Center, Ruhr-Universität Bochum; Human Brain Proteome Project
Combining the output of search algorithms
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
SEQUENCIAL COMPARISON
ALGORITHMS
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Image from: Matthias Wilm, EMBL Heidelberg, Germany http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/activities/SeqTag.html
sequence tag
The concept of sequence tags was introduced by Mann and Wilm (Mann,and Wilm, Anal. Chem. 1994, 66: 4390-4399).
Sequence tags
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
• Tabb, Anal. Chem. 2003, Tabb, JPR 2008, Dasari, JPR 2010
• Recent implementations of the sequence tag approach
• Refine hits by peak mapping in a second stage to resolve ambiguities
• Rely on a empirical fragmentation model
• Published core algorithms, DirecTag and TagRecon freely available
• Most useful to retrieve unexpected peptides (modifications, variations)
• Entire workflows exist (e.g., combination with IDPicker)
GutenTag, DirecTag, TagRecon
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
From: Tabb et al., Anal. Chem., 2003
GutenTag: some additional pictures
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Example of a manual de novo of an MS/MS spectrum No more database necessary to extract a sequence!
Algorithms
Lutefisk Sherenga
PEAKS PepNovo
…
References
Dancik 1999, Taylor 2000 Fernandez-de-Cossio 2000
Ma 2003, Zhang 2004 Frank 2005, Grossmann 2005
…
De novo compared to sequence tags
BITS MS Data Processing – Search Engines UGent, Gent, Belgium – 19 September 2011
Lennart Martens lennart.martens@UGent.be
Thank you!
Questions?
top related