REVIEW
Unrestricted identification of modified proteins using MS/MS
Erik Ahrné, Markus Müller* and Frédérique Lisacek
Swiss Institute of Bioinformatics, Proteome Informatics Group, Geneva, Switzerland
Received: July 13, 2009
Revised: September 21, 2009
Accepted: October 19, 2009
Proteins undergo PTM, which modulates their structure and regulates their function. Esti-
mates of the PTM occurrence vary but it is safe to assume that there is an important gap
between what is currently known and what remains to be discovered. The highest throughput
and most comprehensive efforts to catalogue protein mixtures have so far been using MS-
based shotgun proteomics. The standard approach to analyse MS/MS data is to use Peptide
Fragment Fingerprinting tools such as Sequest, MASCOT or Phenyx. These tools commonly
identify 5–30% of the spectra in an MS/MS data set while only a limited list of predefined
protein modifications can be screened. An important part of the unidentified spectra is likely
to be spectra of peptides carrying modifications not considered in the search. Bioinformatics
for PTM discovery is an active area of research. In this review we focus on software solutions
developed for unrestricted identification of modifications in MS/MS data, here referred to as
open modification search tools. We give an overview of the conceptually different algorithmic
solutions to evaluate the large number of candidate peptides per spectrum when accounting
for modifications of unrestricted size and demonstrate the value of results of large-scale open
modification search studies. Efficient and easy-to-use tools for protein modification discovery
should prove valuable in the quest for mapping the dynamics of proteomes.
Keywords:
Bioinformatics / MS/MS / Protein identification / PTM
1 Introduction
Proteins undergo PTM, which modulates their structure
and regulates their function. The identification of protein
modifications is of paramount importance to understand
the regulation and dynamics of a proteome. A range of
methodologies has been designed for the discovery of
PTMs in the past decades. The detection of a single PTM
has hinged on structural methods like X-ray or NMR and on
chemical methods involving labeling and separation techniques
(e.g. LC). In addition, PTM annotation in protein
sequences can be produced with algorithms that
attempt to predict the presence of certain modifications
based on sequence patterns (see http://www.expasy.org/tools/#ptm
for a comprehensive list of such tools).
Today MS is a central technology for the identification of
PTMs [1–4]. The highest throughput and most compre-
hensive efforts to catalog protein mixtures, including the
identification of PTMs, have so far been based on shotgun
proteomics [5]. For instance protein phosphorylation,
playing a major role in signaling networks, was exten-
sively mapped in large-scale MS studies [6–8]. Similarly,
the role of glycosylation as a functional modulation of
secreted or membrane proteins has been investigated using
MS/MS [9].
Abbreviations: CAD, collision-activated dissociation; ECD, electron capture dissociation; ETD, electron transfer dissociation; FDR, false discovery rate; HCD, higher energy C-trap dissociation; OMS, open modification search; PFF, peptide fragment fingerprinting; PSM, peptide spectrum match; SIMS, sequential interval motif search
*Additional corresponding author: Dr. Markus Müller
E-mail: [email protected]
Correspondence: Erik Ahrné, Swiss Institute of Bioinformatics, 1 rue Michel Servet, CH-1211 Genève 4, Switzerland
E-mail: [email protected]
Fax: +41-22-379-58-58
© 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com
Proteomics 2010, 10, 671–686, DOI 10.1002/pmic.200900502
In a study by MacCoss et al. [10] it was estimated that
proteins on average carry three PTMs. In another paper
[11] the number of modified variants in proteomic samples
was predicted to be as many as 8–12 per unmodified peptide
although most of these modified species are presumed to be
present at very low concentration. On the other hand, less
than 1% of all proteins in UniProtKB/Swiss-Prot are anno-
tated with a PTM [12]. The protein modification databases
Unimod [13] and RESID [14] contain approximately 500
different modification entries each. Estimates of the PTM
occurrence vary but it is safe to assume that there is an
important gap between what is currently known and what
remains to be discovered.
The analysis of high-throughput data depends heavily on
bioinformatics. The standard approach to analyse MS/MS
data is to use a peptide fragment fingerprinting (PFF) tool
such as Sequest, MASCOT, Phenyx, X!Tandem, Sonar or
OMSSA [15–20]. These tools all have the limitation that the
user has to define potential modifications prior to the
search, and therefore often fail to identify an important
fraction of the MS/MS data set.
In this review we focus on software solutions developed
for unrestricted identification of modifications, here referred
to as open modification search (OMS) tools, where no a priori assumptions on the modification state of the sample
need to be made by the user. These tools are designed to
identify already known modification types annotated in
databases as well as previously unknown post-translational
and chemically induced modifications.
We will first raise issues relating to experimental set-ups
for PTM detection as well as to conventional identification
methods. We will then detail the various OMS strategies
defined by different authors to discover PTMs in high-
throughput data. Finally we will discuss the results of some
large-scale OMS studies.
2 Background
2.1 Finding PTMs using high-throughput MS/MS
In a standard bottom-up shotgun MS/MS experiment the
protein sample is fractionated. The proteins are excised and
typically digested into peptides using a protease such as tryp-
sin. In the next step, peptides in the peptide mixture are
usually separated by reversed-phase LC coupled online to a
mass spectrometer where the peptides are ionized and some of
them fragmented by collision-activated dissociation, CAD (or
CID) MS/MS. The mass to charge ratios of possible peptide
fragments (annotated as b- and y-ions, etc.), predominantly
formed through backbone cleavage at the amide bond, are
then calculated and matched against the experimental spectra.
In an MS/MS experiment a modified variant of a peptide can
be distinguished from the unmodified variant. For a peptide
with one modified amino acid typically 50% of the peaks in the
mass spectrum will be shifted by the m/z value of the modi-
fication compared with the spectrum of the unmodified
peptide (see Fig. 1A). The spectrum of a modified peptide may
also contain modification specific neutral loss peaks and
diagnostic ions (see Fig. 1B) [21].
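The 50% figure can be checked with a small calculation. The following is a minimal sketch, assuming singly charged b- and y-ions only and standard monoisotopic residue masses; the function name `fragment_ions` is hypothetical and chosen for illustration. It uses the peptide ADLMLYVSK with an oxidised methionine (the example of Fig. 1A).

```python
# Monoisotopic residue masses (Da) for the amino acids used in this example.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "L": 113.08406, "M": 131.04049,
    "Y": 163.06333, "V": 99.06841, "S": 87.03203, "K": 128.09496,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide, mod_index=None, mod_mass=0.0):
    """Return singly charged b- and y-ion m/z values keyed by ion name."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_index is not None:
        masses[mod_index] += mod_mass  # place the modification on one residue
    n = len(masses)
    ions = {}
    for i in range(1, n):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        ions[f"y{i}"] = sum(masses[n - i:]) + WATER + PROTON
    return ions


plain = fragment_ions("ADLMLYVSK")
# Methionine oxidation (+15.994915 Da) on the fourth residue (index 3).
oxidised = fragment_ions("ADLMLYVSK", mod_index=3, mod_mass=15.994915)

shifted = sorted(n for n in plain if abs(oxidised[n] - plain[n]) > 1e-6)
print(shifted)
print(len(shifted), "of", len(plain), "fragment ions are shifted")
```

In this sketch the shifted ions are b4 to b8 and y6 to y8, i.e. 8 of the 16 b- and y-ions, which agrees with the annotation of Fig. 1A.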
2.2 Problems detecting PTMs with MS
Important limitations exist when it comes to detecting post-
translationally modified proteins using MS/MS. Many
PTMs are only present at low concentrations and the mass
spectrometer may fail to select these peptides for fragmentation
[22]. Some modifications are known to hamper
enzymatic protein cleavage, leading to the generation of long
and highly charged peptides and consequently to spectra
that are difficult to interpret; e.g. when glucose attaches to
K or R, these tryptic cleavage sites are likely to be missed.
Furthermore, some modifications induce unexpected
fragmentation of the peptide, yielding mass spectra that are
difficult to analyse [23, 24].
A shotgun experiment produces too much data for
manual interpretation of each spectrum. Several algorithms
have been developed to automate the analysis of the MS
output data where peptide candidates are assigned to an
experimental spectrum ranked by an empirical or statistical
peptide spectrum match (PSM) score [10–34]. An MS/MS
spectrum can typically be explained by an unmodified
peptide, a peptide modified during the sample preparation
or a peptide carrying one or more PTMs. As will be
discussed later in this review, the presence of modifications
complicates the bioinformatic analysis, partly because the
number of candidates per spectrum increases dramatically
and partly because the fragmentation patterns of certain
modified peptides are difficult to predict.
3 Limitations in the classic approach to MS/MS data analysis
3.1 Restricted modification searches
Commonly used restricted PFF search tools screen the
experimental MS/MS data against a user-selected protein
database. The protein sequences are digested into peptides in silico in accordance with the cleavage rules of the protease
used in the sample preparation step of the experimental
workflow. For each peptide a theoretical spectrum is gener-
ated and the similarity between an experimental spectrum
and all candidate theoretical spectra is calculated. This
approach has proven to be very successful for the identifica-
tion of unmodified peptides and their corresponding proteins.
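The in silico digestion step can be illustrated as follows. This is a minimal sketch, assuming the commonly used trypsin rule (cleavage C-terminal to K or R, except when the next residue is P); the function name `tryptic_digest`, its parameters and the toy protein sequence are hypothetical and not taken from any of the tools discussed here.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return tryptic peptides, allowing up to `missed_cleavages` missed sites.

    Assumed rule: cleave after K or R unless the following residue is P.
    """
    # Positions at which the sequence is cut (index of the residue after the cut).
    sites = [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Join up to `missed_cleavages` neighbouring fragments.
        last = min(i + 1 + missed_cleavages, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides


toy_protein = "MKADLMLYVSKRPEPTIDER"  # hypothetical sequence for illustration
print(tryptic_digest(toy_protein))
print(tryptic_digest(toy_protein, missed_cleavages=1))
```

Allowing missed cleavages already enlarges the candidate list, which anticipates the search space considerations discussed later in this review.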
Figure 1. A spectrum of the non-modified peptide ADLMLYVSK (top) aligned with a spectrum of a modified variant of the same peptide (bottom), carrying a methionine oxidation (16 Da). (A) b4- to b8-ions and y6- to y8-ions in the modified spectrum, annotated with an asterisk, are shifted by the m/z of the modification relative to the corresponding non-modified fragment ions. (B) A spectrum of the non-modified peptide VSFELFADK (top) aligned with a spectrum of a modified variant of the same peptide (bottom), phosphorylated at serine (80 Da). Spectra of phosphorylated peptides are typically dominated by ions resulting from the neutral loss of phosphoric acid, whereas sequence-specific fragment ions formed through cleavage of the peptide backbone amide bonds are of low intensity. Spectra downloaded from http://www.peptideatlas.org/speclib/ (ISB_Hs_plasma_20070706_PUBLIC.zip and ISB_Hs-phospho_20080428.zip).
Classical PFF tools can screen for a restricted list of
modifications. Before initiating the search the user is asked
to specify a list of search parameters. These parameters may
include protein sequence database, taxonomy, precursor
mass tolerance, peptide fragment mass tolerance, etc. The
user may then configure the search tool to look for a list of
known amino acid modifications, where a modification can
be specified as fixed or variable. A fixed modification is
assumed to be present on all instances of the residue that
carries it. Typically, cysteines are carboxyamidomethylated
during sample preparation; because this reaction is close to
100% complete, the modification should be specified as
fixed. Including fixed modifications does not increase the
complexity of the search. In contrast, variable modifications
are not necessarily present on a specific residue. In almost
all cases one would have to specify methionine oxidation as
a variable modification. When setting up the search tool to
look for variable modifications more candidate peptides will
be considered per spectrum, leading to longer search times
and possibly more false-positive identifications [35], as will
be described in more detail below. In a typical LC/MS experiment
where the data is analysed with a classical PFF tool, 5–30%
of the spectra are expected to be identified [36–38]. Many
reasons could explain the failure of a software-driven inter-
pretation among which the most common are:
1 noisy spectra or spectra from impurities
2 database errors (erroneous or missing sequences, incorrect
annotation, etc.)
3 unexpectedly large parent mass measurement error, e.g. by
detecting the wrong isotope
4 unusual enzymatic cleavage
5 modification/mutations. In fact, an important part of the
unidentified spectra could be spectra of peptides carrying
modifications or mutations, not considered in the search.
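The difference between fixed and variable modifications in terms of search complexity can be made concrete with a short enumeration. This is a minimal sketch under simplifying assumptions (at most one variable modification type per residue, no limit on the number of modified residues per peptide); the function name `modified_forms` and the peptide CMSMK are hypothetical.

```python
from itertools import product


def modified_forms(peptide, fixed, variable):
    """Enumerate per-residue mass offsets for every modified form of a peptide.

    fixed:    residue -> mass offset that is always applied
    variable: residue -> mass offset that may or may not be present
    """
    choices = []
    for aa in peptide:
        if aa in fixed:
            choices.append([fixed[aa]])          # one option: always modified
        elif aa in variable:
            choices.append([0.0, variable[aa]])  # two options: absent or present
        else:
            choices.append([0.0])
    return list(product(*choices))


peptide = "CMSMK"              # hypothetical peptide chosen for illustration
fixed = {"C": 57.021464}       # carboxyamidomethylation of cysteine
oxidation = {"M": 15.994915}   # methionine oxidation
phospho = {"S": 79.966331}     # serine phosphorylation

print(len(modified_forms(peptide, fixed, {})))                      # fixed only
print(len(modified_forms(peptide, fixed, oxidation)))               # + oxidation
print(len(modified_forms(peptide, fixed, {**oxidation, **phospho})))
```

In this sketch the fixed modification leaves a single candidate, whereas each residue that can carry a variable modification doubles the number of candidate forms (1, 4 and 8 forms for the three calls above).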
3.2 The search space explosion
There are limitations on the number of variable modifications
that can reasonably be included when using a standard PFF
tool. To look for modifications in a more or less
unsupervised way, one could simply imagine including all
known modifications as variable. However, this becomes
problematic for two reasons. First, search times scale linearly
with the number of candidate peptides considered
per experimental spectrum and would become much longer.
Second, more candidates generate more random high-scoring
matches, leading to a worse separation between the score
distributions of true and false-positive matches. A large overlap
between these distributions means fewer identifications at a
given false discovery rate (FDR) [39]. The search space
explosion limits the use of a typical PFF tool for PTM
discovery and is a major issue to be tackled when designing an
MS/MS identification algorithm that can identify protein
modifications in an unsupervised manner. Figure 2 illustrates
how allowing for more variable modifications may lead to a
loss in the total number of identified spectra when calculating
the global FDR based on a typical decoy search [40]. A
manually annotated test data set of 3269 spectra, from a
sample containing over 200 yeast proteins, acquired on a
QqTOF instrument [41] was analysed with MASCOT search-
ing a concatenated target decoy database. Six searches were
performed allowing for none to five common variable modi-
fications, and the total number of confident PSMs for each
individual search was registered at an estimated FDR of 0.05.
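How the number of confident PSMs is obtained at a given FDR can be sketched as follows. The estimator used here (number of decoy matches divided by number of target matches above a score threshold) is an assumption made for illustration; it is one common convention for concatenated target-decoy searches and is not necessarily the exact procedure used to produce Fig. 2. The function name `psms_at_fdr` and the toy scores are hypothetical.

```python
def psms_at_fdr(psms, max_fdr=0.05):
    """Count target PSMs in the largest top-scoring set with estimated FDR <= max_fdr.

    psms: iterable of (score, is_decoy) pairs from a concatenated
          target-decoy search. FDR is estimated as decoys / targets.
    """
    targets = decoys = accepted = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            accepted = targets  # threshold can be lowered to this score
    return accepted


# Toy data: 40 high-scoring targets, then decoys start to appear.
toy = [(100 - i, False) for i in range(40)]   # targets, scores 100..61
toy += [(60, True)]                           # first decoy
toy += [(59 - i, False) for i in range(5)]    # targets, scores 59..55
toy += [(54, True), (53, True), (52, False)]  # more decoys, one more target

print(psms_at_fdr(toy, max_fdr=0.05))
```

With a larger search space, decoy matches reach high scores more often, so the threshold has to be raised and fewer target PSMs are accepted at the same FDR.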
4 Unrestricted identification of modified peptides
4.1 Software workflow approaches to PTM discovery
An extensive collection of identification tools has been
developed to perform OMS. A comprehensive selection of
these tools is presented in Table 1. As will be described in
the following sections a workflow approach is commonly
used when searching for modifications in an unrestricted
manner. Some of the software developed for this purpose is
fully integrated in MS/MS peptide identification platforms
including multiple steps of an identification workflow,
whereas other tools are isolated modules handling only a
part of the identification process.
Figure 3 shows the three main steps of a typical PTM
identification workflow. The first step includes a database
reduction, where the original database is reduced to a list of
candidate peptides or proteins that are potentially present in
the sample. Next, the spectra are matched and scored
against this database. A delicate part of this step is assigning
the modification to the correct residue. In a third
post-processing step the results are either manually or
automatically validated.
Figure 2. Allowing for more variable modifications may lead to a loss in the total number of identified spectra when calculating the global FDR based on a typical decoy search. A manually annotated test data set of 3269 spectra, from a sample containing over 200 yeast proteins, acquired on a QqTOF instrument was analysed with MASCOT searching a concatenated target decoy database. Six searches were performed allowing for none to up to five common variable modifications and the total number of confident PSMs for each individual search was registered at an estimated FDR of 0.05.
Below we review the different parts of an OMS workflow
and present the various solutions suggested by the tools
listed in Table 1. Our selection of OMS tools reflects on the
one hand tool popularity and on the other hand the extent of
conceptually different algorithmic strategies implemented
for the MS/MS data analysis.
4.2 Step 1: Database reduction
Screening a large experimental data set for modifications of
unrestricted mass against the full proteome of an organism
would in most cases be impractical in terms of both
computational time and FDR, as described above. It is
therefore meaningful to limit the search to a list of peptides
or proteins that are likely candidates for modifications. A
drastic but accurate filtering allows for more sophisticated
and computationally intensive scoring of the experimental
spectrum against the few remaining candidates. We refer to two
filtering strategies: sequence tag extraction and multiple
round processing.
4.2.1 Sequence tag extraction
Mann et al. [42] presented a filtering approach based on
sequence tag extraction. Tag extraction is a database-independent
peptide sequencing strategy where peptide sub-sequences
are derived directly from the spectrum by linking
peaks with a mass difference corresponding to the mass of
an amino acid. In theory the full peptide sequence could be
found in this manner, a technique known as de novo sequencing [43–49], but the accuracy is strongly limited by
the quality of the data. However, extracted peptide sub-sequences
of three to four amino acids, so called sequence
tags, have proven to be effective filters in order to reduce the
number of candidate peptides for a spectrum [42–52]. The
OMS tools Popitam [53], OpenSea [54] and MODi [55] use
tag extraction to narrow down the list of candidate peptides
per spectrum. To take full advantage of tag extraction
filtering the retrieval of peptides from the reference database
matching the sequence tags has to be fast. InsPecT [56], a
highly cited identification platform, implements an efficient
tag extraction algorithm followed by a rapid trie-based scan
of the database, to extract the peptide candidates for a
spectrum.
Table 1. OMS tools
Search tool | Filtering | Modif. specific scoring | Test data a) | Download | Citations b)
Popitam [53] | Multiple rounds c) + tags | No | Q-TOF | www.expasy.ch/tools/popitam/ | ***
MODi [55] | Multiple rounds c) + tags | No | Q-TOF | prix.uos.ac.kr/modi/ | *
InsPecT d) [56] | Multiple rounds + tags | Yes | Q-TOF, ion-trap | proteomics.ucsd.edu/Software/Inspect.html | ****
P-Mod [65] | Multiple rounds c) | No | Ion-trap, SIM data | www.mc.vanderbilt.edu/lieblerlab/p-mod.php | ***
Modiro [71] | Multiple rounds c) | Yes | Ion-trap | www.modiro.com:8080/licenseserver/home.seam | *
VEMS 3.0 [21] | Multiple rounds | Yes | Q-TOF | yass.sdu.dk | ***
ProteinProspector [72] | Multiple rounds | Yes | QSTAR, Q-TOF, ion-trap | prospector.ucsf.edu | *
Bonanza [62] | Multiple rounds c) + spectral library | No | Q-TOF | Contact authors | *
ModifiComb [37] | Multiple rounds c) + peptide DB | No | LTQ-FT | www.bmms.uu.se/Software.htm | ***
SeMoP [69] | Multiple rounds | No | Ion-trap | biomed.umit.at/upload/semop.zip | *
TwinPeaks [68] | – | No | Ion-trap | www.utoronto.ca/emililab/twinpeaks.htm | *
Interrogator [70] | – | No | QSTAR | Contact authors | **
OpenSea [54] | Tags | No | Q-TOF, ion-trap | Contact authors | ***
SIMS [57] | Peak intervals, tags c) | No | Q-TOF, ion-trap | webprod1.ccbr.utoronto.ca | *
a) Refers to the data type the tool was tested on in the original publication.
b) Reflects the number of citations per year since the year of the original publication (*, **, ***, ****) = (0–4, 5–9, 10–14, 15 or more).
c) Filtering with external tool.
d) With MS alignment.
Figure 3. A typical OMS workflow includes a database reduction step followed by enumeration and scoring of all peptide candidates. A list of post-processing algorithms has been developed to further refine the search results output.
The recently published OMS algorithm sequential interval
motif search, SIMS, filters the database based on ranked
single amino acid inter-peak intervals, ordered by the inten-
sities of the two associated peaks [57]. Every spectrum is
subjected to strict filtering, keeping only the high intensity
peaks, and converted to a symmetrical spectrum by introdu-
cing ‘‘ghost’’ peaks to generate more complete b- or y-ion
ladders. In default mode, 400 intervals are extracted per spectrum
and used to retrieve the 500 most probable peptide candidates
per spectrum from the reference protein database. The
authors demonstrate that this filtering process can be further
improved by combining it with tag extraction, coupling SIMS
with the publicly available PepNovo algorithm [47].
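The principle of sequence tag extraction described in this section (linking peaks whose mass difference corresponds to an amino acid) can be sketched as follows. This is a simplified illustration and not the algorithm of any specific tool: it ignores peak intensities, assumes singly charged ions, represents the isobaric pair I/L by L only, and uses an arbitrarily chosen tolerance. The function name `extract_tags` is hypothetical.

```python
# Monoisotopic residue masses (Da); isoleucine is omitted because it is
# isobaric with leucine and cannot be distinguished by mass.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extract_tags(peaks, tag_length=3, tol=0.01):
    """Return all sequence tags of `tag_length` residues found in a peak list."""
    peaks = sorted(peaks)
    # edges[i] lists (j, residue) where peaks[j] - peaks[i] matches a residue mass.
    edges = {i: [] for i in range(len(peaks))}
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))

    tags = set()

    def walk(i, tag):
        # Depth-first walk along chains of linked peaks.
        if len(tag) == tag_length:
            tags.add(tag)
            return
        for j, aa in edges[i]:
            walk(j, tag + aa)

    for i in range(len(peaks)):
        walk(i, "")
    return tags


# Toy input: the b-ion ladder (b1 to b8) of ADLMLYVSK.
b_ions, total = [], 1.007276
for aa in "ADLMLYVSK"[:-1]:
    total += RESIDUE_MASS[aa]
    b_ions.append(total)

print(sorted(extract_tags(b_ions)))
```

Each extracted tag can then be used to retrieve only those database peptides that contain it, which is the filtering step exploited by tag-based OMS tools.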
4.2.2 Multiple round processing
Another common database filtering technique is to discard
all proteins that cannot be identified in a very sensitive,
fast and restricted PFF search [58–60]. In a first step the
data set is screened against the full database with strict
search parameters meaning at most one missed cleavage
and one or two variable modifications. A reduced database
is compiled from all proteins with at least one confidently
matched peptide in this first round search. A second
round search where PTM parameters are loosened can
then be launched. Most of the OMS tools listed in Table 1
employ this strategy, and some in combination with tag
extraction.
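The database reduction of a multiple round search can be sketched in a few lines. This is a minimal illustration, assuming that the first round search has already produced a list of confidently identified peptide sequences; the function name `reduce_database`, the `min_peptides` parameter and the toy accessions are hypothetical.

```python
def reduce_database(proteins, confident_peptides, min_peptides=1):
    """Keep proteins supported by at least `min_peptides` first-round peptides.

    proteins:           dict mapping accession -> protein sequence
    confident_peptides: peptide sequences confidently identified in round one
    """
    confident = set(confident_peptides)
    reduced = {}
    for accession, sequence in proteins.items():
        hits = sum(1 for peptide in confident if peptide in sequence)
        if hits >= min_peptides:
            reduced[accession] = sequence
    return reduced


# Toy database with hypothetical accessions and sequences.
database = {
    "PROT_A": "MKADLMLYVSKRPEPTIDER",
    "PROT_B": "MSTVSFELFADKGGR",
    "PROT_C": "MAAAAAAKCCCCR",
}
first_round = ["ADLMLYVSK", "VSFELFADK"]

second_round_db = reduce_database(database, first_round)
print(sorted(second_round_db))
```

The `min_peptides` threshold is one simple way of expressing the protein selection criteria; stricter values shrink the second round database further at the risk of discarding proteins that carry modifications.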
ModifiComb [61], a simple and fast OMS tool, uses an
even stronger database filter and screens for
modifications only on peptides confidently identified in a
first round search. The underlying assumption is that most
PTMs are present at sub-stoichiometric ratios. Consequently,
no new peptide species will be identified in the modification
search. Similar approaches were used by the Bonanza
algorithm [62] and presented in Ahrné et al. [63]; these take
full advantage of the fact that the fragmentation patterns of
many modified peptides are similar to those of the unmodified
variants (see Fig. 1A). Here, a spectral library, which is a
list of annotated spectra identified in a prior PFF search, is
exhaustively screened for modifications.
Several of the OMS tools do not include a database-
filtering step, but it is then recommended that this is done
externally. The Swiss Protein Identification Toolbox, SwissPIT
[64], provides an automated solution where several
identification tools can be combined in multiple-round
workflows. The results are merged to create a reduced but
comprehensive protein database, which is passed on to a
second identification step where it is explored for modifi-
cations with the OMS tools Popitam [53] and InsPecT [56].
Efficient filtering will speed up the OMS search while
boosting its discriminatory power. It is important to point
out that filtering criteria largely influence the outcome of the
modification search. When multiple round processing is
used, the overall results of the OMS workflow will depend
on the quality of the reduced size protein database. An
important question that arises, but which is given little
attention in the literature, is what selection criteria to apply
on the proteins included in this database. The identification
of modified peptides may increase the sequence coverage of
individual proteins, thus validating new proteins and resol-
ving ambiguities where single peptides map to multiple
protein entries. Using strict protein selection criteria means
ignoring modifications that may occur on certain proteins
while limiting protein validation based on the discovery of
modified peptides. In contrast, loose selection criteria may
lead to an increased number of false-positive PSMs and
longer search times. How this trade-off should be dealt with
remains to be investigated.
4.2.3 Enumerating the candidate peptides
For most OMS tools the mass range of the modifications to
be included in the search is a user-defined variable. Typically
the mass range of a modification search is set between −100
and 300 Da. For each experimental spectrum all peptides in
the database within the given modification mass range are
evaluated. It is assumed that the difference in precursor
mass between the query spectrum and the unmodified
candidate peptide corresponds to the mass of one or more
modifications. However, allowing for more than one modi-
fication per peptide is not recommended as many spectra
can find a high-scoring match by chance if multiple
unrestricted modifications are included. An OMS in its
simplest form enumerates all modification scenarios given a
peptide sequence and a modification mass, where a modi-
fication scenario corresponds to every possible location of
the modification along the peptide sequence. Next, a theo-
retical spectrum is generated for each modified peptide
candidate and scored against the experimental spectrum
(see Fig. 4). This exhaustive approach is slow and the time
complexity for a single modification is quadratic in the
length of the peptide. Some tools apply empirical rules in
order to restrict the number of considered scenarios. For
example, P-mod [65] does not consider scenarios where the
absolute value of a negative mass shift is greater than the
amino acid side chain mass at a specific sequence location.
MS-Alignment [66], the OMS algorithm of the InsPecT
platform, speeds up the search for the optimal attachment
position of a PTM using dynamic programming (imple-
mented in linear time), based on an improved version of the
spectral alignment algorithm introduced by Pevzner et al. [67].
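The exhaustive enumeration of modification scenarios described above can be sketched as follows. This is a deliberately naive illustration, not the implementation of any of the reviewed tools: it assumes a single modification per peptide, singly charged b- and y-ions, a shared peak count as score and an arbitrarily chosen fragment tolerance. All function names are hypothetical.

```python
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "E": 129.04259, "F": 147.06841,
    "K": 128.09496, "L": 113.08406, "M": 131.04049, "S": 87.03203,
    "V": 99.06841, "Y": 163.06333,
}
PROTON, WATER = 1.007276, 18.010565


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_ions(peptide, position, delta):
    """b- and y-ions of `peptide` carrying a mass shift `delta` at `position`."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    masses[position] += delta
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b + y


def shared_peak_count(theoretical, observed, tol=0.02):
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def open_search(observed, precursor_mass, candidates, window=(-100.0, 300.0)):
    """Try every candidate peptide and every modification position."""
    best = None
    for peptide in candidates:
        # The precursor mass difference is taken as the modification mass.
        delta = precursor_mass - peptide_mass(peptide)
        if not window[0] <= delta <= window[1]:
            continue
        for position in range(len(peptide)):  # one scenario per residue
            ions = theoretical_ions(peptide, position, delta)
            score = shared_peak_count(ions, observed)
            if best is None or score > best[0]:
                best = (score, peptide, position, round(delta, 3))
    return best


# Simulated query: ADLMLYVSK oxidised at M (index 3), noise-free.
observed = theoretical_ions("ADLMLYVSK", 3, 15.994915)
precursor = peptide_mass("ADLMLYVSK") + 15.994915
best = open_search(observed, precursor, ["VSFELFADK", "ADLMLYVSK"])
print(best)
```

For every candidate the sketch generates one theoretical spectrum per residue position, and each spectrum contains a number of fragments proportional to the peptide length, which illustrates why the exhaustive approach scales at least quadratically with peptide length. On real, noisy spectra several positions often obtain similar scores, which is the localisation problem discussed below.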
When tag extraction filtering has been used the number
of scenarios can be limited based on the extracted sequence
tags. Popitam [53] uses sequence tags and full candidate
peptide sequences to construct a spectrum graph where a
path represents a possible modification or mutation
scenario. Each path of the graph is evaluated in order to find
the optimal peptide candidate. Another approach employed
by OpenSea [54] and MODi [55] aligns multiple sequence
tags extracted from a single spectrum and regions that
cannot be matched to a database sequence are assumed to
be either amino acid substitutions or modifications.
TwinPeaks [68] and the recently published search tool
SeMoP, Search for Modified Peptides, [69] use a concep-
tually different algorithm where constant mass shifts
between the peaks of an experimental and a theoretical
spectrum are sought, which would indicate that the
candidate peptide is modified at a given position.
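The idea of searching for a constant mass shift between experimental and theoretical peaks can be illustrated with a simple histogram of pairwise mass differences. This is only a sketch of the general principle and not the published TwinPeaks or SeMoP algorithms; the bin width, the minimum shift and the function names are assumptions made for illustration.

```python
from collections import Counter

RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "L": 113.08406, "M": 131.04049,
    "Y": 163.06333, "V": 99.06841, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER = 1.007276, 18.010565


def ions(peptide, mod_index=None, mod_mass=0.0):
    """Singly charged b- and y-ion m/z values, optionally with one modification."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_index is not None:
        masses[mod_index] += mod_mass
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b + y


def dominant_shift(experimental, theoretical,
                   bin_width=0.02, min_shift=0.5, max_shift=300.0):
    """Return (shift, support) for the most frequent non-zero mass difference."""
    counts = Counter()
    for e in experimental:
        for t in theoretical:
            shift = e - t
            if min_shift <= abs(shift) <= max_shift:
                counts[round(shift / bin_width)] += 1  # histogram bin index
    if not counts:
        return None
    bin_index, support = counts.most_common(1)[0]
    return bin_index * bin_width, support


experimental = ions("ADLMLYVSK", mod_index=3, mod_mass=15.994915)
theoretical = ions("ADLMLYVSK")
shift, support = dominant_shift(experimental, theoretical)
print(round(shift, 2), support)
```

In this noise-free example the most frequent difference is the oxidation mass, supported by the eight fragment ions that contain the modified residue; which ion series members are shifted then indicates the modified position.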
The assignment of a modification to a specific amino acid
residue is prone to errors. It is common that a spectrum
does not contain enough information to pinpoint the posi-
tion of a modification and the positional assignment leads to
so-called delta correct identifications: correct peptide and
correct modification mass but erroneous site. An early OMS
tool, Interrogator [70], designed for fast processing by
effectively indexing a sequence database, focused on
assigning a modification to a region of the peptide sequence
rather than a specific amino acid. Labile modifications are in
general difficult to position without further empirical data
since the resulting spectra often contain few shifted peaks
relative to the unmodified peptide spectrum. Examples are
O-glycosylation, sulphation and phosphorylation that are
commonly eliminated as neutral losses during fragmenta-
tion.
4.3 Step 2: Matching
4.3.1 Similarity scoring
A number of scoring algorithms have been developed in
order to determine the similarity between an experimental
spectrum and a theoretical spectrum. In its simplest form
the theoretical spectrum contains the calculated b- and y-ion
fragments and the similarity score is based on the shared
peak count between the compared spectra [20, 60]. In
contrast, some scoring schemes include a multitude of
fragment types including a-ions, x-ions and internal fragments
and extract several features in addition to the shared
peak count such as the ratio of experimental to theoretical b-
and y-ions, the length of continuous ion-series, matching
peak intensity, etc. A score based on the combined measure
of these features is then derived [15, 17].
Figure 4. An illustration of a simple exhaustive OMS search. The list of peptides within the specified modification mass tolerance, typically [−150, 300] Da, are extracted (I). A tag extraction step will further narrow down the number of candidate peptides per spectrum. An OMS in its simplest form enumerates all modification scenarios given a peptide sequence and a modification mass, where a modification scenario corresponds to every possible location of the modification along the peptide sequence (II). Next, a theoretical spectrum is generated for each modified peptide candidate and scored against the experimental spectrum (III).
Spectral library search tools typically use scoring schemes
taking into account the intensity of the spectrum peaks.
Bonanza [62] uses a dot product-based scoring, which is the
normalised scalar product of the compared spectra repre-
sented as multidimensional vectors.
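A normalised dot product between two spectra can be computed as in the following minimal sketch. The binning scheme (fixed-width m/z bins with summed intensities) and the function names are assumptions made for illustration and do not reproduce the exact scoring of Bonanza.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Convert (m/z, intensity) pairs into a sparse vector indexed by m/z bin."""
    vector = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        vector[index] = vector.get(index, 0.0) + intensity
    return vector


def normalised_dot(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = disjoint, 1 = identical)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy spectra with hypothetical intensities.
library = [(147.1, 30.0), (234.1, 80.0), (333.2, 100.0), (496.3, 60.0)]
query_same = [(147.1, 15.0), (234.1, 40.0), (333.2, 50.0), (496.3, 30.0)]
query_other = [(120.1, 50.0), (250.2, 70.0), (410.3, 90.0)]

print(round(normalised_dot(library, query_same), 3))
print(round(normalised_dot(library, query_other), 3))
```

Because of the normalisation the score depends on relative rather than absolute intensities, so a query with the same peak pattern at half the intensity still obtains the maximal score.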
4.3.2 Modification specific scoring
As mentioned earlier, simply assuming that a modification
induces a shift of fragment masses is not a valid model
for all types of modifications. Although a lot remains to be
understood when it comes to peptide fragmentation
patterns, more sophisticated scoring schemes should be
defined, especially when considering that different peptides
may have the same, or very similar, theoretical b- and
y-ion series, e.g. the mass of a methylated aspartic acid
equals the mass of glutamic acid. Some software tools
include modification-specific scoring that takes into
account fragment types associated with a particular modi-
fication. The commercially distributed OMS tool Modiro
(formerly PTM-Explorer) (Protagen AG) [71] uses predefined
search strategies to look for some of the common modifi-
cations annotated in Unimod [13]. A specialised phosphor-
ylation scoring considers the presence of the usually
observed neutral loss signals of the phosphate group in
the fragmentation spectrum. VEMS 3.0 [21] is another
identification algorithm that considers an extensive list
of PTM-specific neutral losses and diagnostic fragment
ions, designed to distinguish between near isobaric modifications
such as lysine acetylation and lysine tri-methylation.
Lysine acetylation exhibits a diagnostic ion at m/z 126.0913 whereas the spectrum of a lysine tri-methylated
peptide commonly contains a neutral loss peak at m/z 59.0735. Other tools like ProteinProspector [72] can be
configured to look for unknown modifications while
targeting labile modifications. InsPecT also has a sophisti-
cated scoring algorithm accounting for the fragmentation
probabilities of different instrument types and the effects of
certain PTMs.
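The two mass coincidences mentioned in this section can be verified from elemental compositions. The following sketch uses standard monoisotopic atomic masses; the helper name `mass` and the 1000 Da precursor used to express the difference in ppm are assumptions chosen for illustration.

```python
# Monoisotopic atomic masses (Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def mass(formula):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())


asp = {"C": 4, "H": 5, "N": 1, "O": 3}   # aspartic acid residue
glu = {"C": 5, "H": 7, "N": 1, "O": 3}   # glutamic acid residue
methyl = {"C": 1, "H": 2}                # net change on methylation
acetyl = {"C": 2, "H": 2, "O": 1}        # net change on acetylation
trimethyl = {"C": 3, "H": 6}             # net change on tri-methylation

# Methylated Asp and Glu share the composition C5H7NO3: exactly isobaric.
print(round(mass(asp) + mass(methyl), 5), round(mass(glu), 5))

# Acetylation and tri-methylation differ by C2H2O versus C3H6.
difference = mass(trimethyl) - mass(acetyl)
print(round(difference, 4), "Da")
print(round(difference / 1000.0 * 1e6, 1), "ppm at a precursor mass of 1000 Da")
```

The first pair cannot be separated by mass at all, whereas the second pair differs by about 0.036 Da, which can only be resolved with high mass accuracy; this motivates the use of diagnostic ions and neutral losses.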
4.4 Step 3: Post-processing
Most identification tools attempt to assign a statistical
quality measure to a PSM such as a p-value/e-value or FDR.
These measures are commonly estimated by screening the
experimental data set against a database of randomised
peptide sequences. The performance of a tool can be evaluated
based on the trade-off between error rate and sensitivity,
often visualised in a receiver operating characteristic (ROC) curve [17].
Despite efforts to reduce the number of candidate peptides
in the database and the development of improved scoring
algorithms, high-scoring random matches remain a
problem when looking for PTMs in an unsupervised
manner. Strict error rate thresholds often lead to an important
loss of sensitivity, whereas loose thresholds return disputable
matches that have to be manually validated.
A range of post-processing algorithms has been devel-
oped to further refine the search results. In addition to
presenting a sophisticated alignment algorithm to compare
theoretical and experimental spectra Tsur et al. [66] proposed
a new way to tackle loss of discriminatory power in open
modification searches. Their approach relies on the tabula-
tion of all the mass shifts reported by the software for each
amino acid in a large data set. Assuming that incorrect
modification mass assignments will distribute randomly
across all amino acids, those matches containing modifica-
tions of residues reported multiple times are more likely to
represent true modified peptides.
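The tabulation idea can be illustrated with a short Python sketch. It is not the implementation of Tsur et al. [66]: the input format (residue and mass-shift pairs taken from OMS output), the 1 Da binning, the minimum count and the function names are assumptions made for this example.

```python
from collections import Counter

def tabulate_shifts(psms, bin_width=1.0):
    """Count how often each (residue, binned mass shift) pair is reported.

    psms: list of (residue, mass_shift) tuples taken from OMS output.
    """
    return Counter(
        (res, round(shift / bin_width) * bin_width) for res, shift in psms
    )

def recurrent_shifts(psms, min_count=3, bin_width=1.0):
    """Keep only (residue, shift) pairs seen at least min_count times."""
    table = tabulate_shifts(psms, bin_width)
    return {key: n for key, n in table.items() if n >= min_count}
```

Pairs that recur, such as a shift of about 16 Da on methionine in a typical data set, stand out against the background of randomly distributed incorrect assignments.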
PTMFinder [73] elaborates on the same idea of studying
the global evidence for modifications found in the experi-
mental data. Here the focus is shifted from the significance
of individual PSMs to modification site scoring. The authors
acknowledge that open modification searches are error
prone, but try to make use of the fact that large data sets
tend to contain a lot of redundant and complementary
spectra. The post-processing tool is designed to be a plug-
gable module that can handle the output of any OMS tool,
although demonstrated in combination with MS-Alignment
and the InsPecT scoring. The basic idea is to group the
identification results by modification site and extract the
evidence for each occurrence. Features such as the number
of overlapping peptides carrying the same modification and
the same modified peptide found in multiple charge states
are used to train a Support Vector Machine used to distin-
guish between false and correct modification site assign-
ments.
ComByne [74] is another interesting post-processing
module scoring modification sites. In a first step, peptide
match probabilities are adjusted for peptide length, missed
cleavages and modifications. The rationale here is that
chances of randomly matching a semi-tryptic, modified, and
unmodified peptide with an elevated score differ, since, e.g.
the database may contain substantially more modified
peptide candidates than unmodified peptides. Furthermore,
short peptides may randomly be assigned high scores solely
based on correct prefix or suffix amino acids. Similar
reasoning is used by other post-processing tools such as
PeptideProphet [32] and Panoramics [75]. A novelty in
ComByne is that it refines the p-value of a peptide match
based on the difference in measured retention time and the
predicted retention time of the peptide candidate. In addi-
tion, p-values are adjusted based on corroboration, where in
analogy with PTMFinder the global evidence of a modification
site is accounted for. The discovery of overlapping
peptides would improve the p-values of both peptides. Similarly,
when the unmodified and modified variants of the same
peptide species are found the p-values are recalculated.
ComByne can also score phosphorylation sites. An identification
of ASLGS[−18]LEGEASSPK becomes more believable
if another spectrum matches ASLGS[+80]LEGEASSPK,
because phosphorylated serine has a common
neutral loss of 98 Da.
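The corroboration between a neutral-loss form and the intact phosphorylated form of the same site can be sketched as follows. This is a simplified illustration, not ComByne's scoring: the tuple format (sequence, zero-based site index, mass shift), the nominal shifts of −18 and +80 Da, the 0.5 Da tolerance and the function name are assumptions of this example.

```python
def corroborated_neutral_loss(idents, loss_shift=-18, intact_shift=80, tol=0.5):
    """Return the (sequence, site) pairs reported with both mass shifts.

    idents: list of (sequence, site_index, mass_shift) tuples.
    """
    def sites_with(shift):
        # Collect every site carrying a mass shift close to the given value.
        return {(seq, site) for seq, site, s in idents if abs(s - shift) <= tol}

    return sites_with(loss_shift) & sites_with(intact_shift)
```

A post-processing module could then raise the confidence of every identification whose site appears in the returned set.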
5 Improving modification discovery
5.1 Mass-accuracy and modification discovery
The mass spectrometer type used to analyse samples
dramatically influences the number of candidate peptides
per spectrum in a restricted PFF search. An experimental
spectrum with a precursor mass of 1000.48 Da has
approximately 90 times more unmodified peptide candi-
dates in UniProtKB/Swiss-Prot (Yeast) if produced on a low
mass accuracy ion-trap instrument (precursor mass accuracy
±2 Da) compared with an FT-ICR (precursor mass
accuracy ±0.006 Da) (see Table 2). Exact precursor mass
measurements naturally speed up the bioinformatic part of
an LC-MS workflow, when analysing the data with a classical
PFF tool. However, substantially higher precursor
mass precision does not by default lead to a dramatic
increase in the number of identified peptides and proteins.
Haas et al. [76] investigated the benefits of high-mass
accuracy measurements analysing different data sets with
an LTQ-FT instrument where the FT-ICR part of the
instrument was either not exploited or used for MS survey
scans. The MS/MS data of a complex peptide mixture
derived from the yeast proteome was submitted to
SEQUEST [15] with search parameters adapted to the
instrumentation. Interestingly, despite a dramatic search
space reduction only 10% more peptide identifications were
produced when the FT was turned on. The advantage of the
FT-ICR proved important mainly for assigning MS/MS
spectra with low signal-to-noise ratios. SEQUEST returned
100% more confident peptide matches in the high mass
accuracy data when analysing a yeast sample enriched for
phosphorylated peptides. These spectra are often dominated
by ions resulting from the neutral loss of phosphoric acid (as
seen in Fig. 1B), whereas sequence-specific fragment ions
formed through cleavage of the peptide backbone amide
bonds can be of low intensity. In an OMS, higher precursor
mass accuracy does not reduce the number of candidate
peptides per spectrum. Consequently search times are not
reduced but more discriminative grouping of PSMs with the
same modification mass can be obtained and modification
masses can be accurately mapped to known modifications
annotated in databases. To the best of our knowledge no
study has been published investigating the benefits of high
precursor mass accuracy measurements and OMS.
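To make the search-space argument concrete, the following Python sketch counts fully tryptic candidate peptides for a given precursor mass. It is an illustration under stated assumptions, not the procedure used to produce Table 2: the residue masses are standard monoisotopic values, while the function names, the cleavage rule (after K or R, not before P, no missed cleavages) and the way a modification window is simply added to the precursor tolerance are choices made for this example.

```python
import re

# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(protein):
    """Fully tryptic peptides without missed cleavages (needs Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

def count_candidates(proteins, precursor_mass, precursor_tol, max_mod=0.0):
    """Count peptides that match precursor_mass within precursor_tol when a
    modification mass anywhere in [-max_mod, +max_mod] is allowed."""
    window = precursor_tol + max_mod
    return sum(
        1
        for prot in proteins
        for pep in tryptic_peptides(prot)
        if abs(peptide_mass(pep) - precursor_mass) <= window
    )
```

With `max_mod=0.0` the count depends strongly on `precursor_tol`, whereas with a modification window of 100 Da the precursor tolerance contributes little to the total window, which mirrors the observation that higher precursor mass accuracy does not reduce the number of candidates in an OMS.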
High mass accuracy measurements of fragment ions have
been shown to be of great importance for peptide identification
based on de novo sequencing [49, 77]. In a recent publication
[78], the benefits of fragment ions acquired at high
mass accuracy, using an LTQ-Orbitrap, were further investigated.
Data acquired on an Orbitrap and linear ion trap data,
from the same Pseudomonas aeruginosa sample, were analysed
in restricted searches with Phenyx [33] (Genebio SA), setting
the fragment mass tolerance to 12 ppm and 0.5 Da, respec-
tively. As the overall search space was increased, by allowing
for multiple missed tryptic cleavage sites and a variable
(methyl-ester) modification on amino acids D, E, S and T,
high accuracy fragment mass measurements resulted in
more identified spectra, at low FDRs. The spectra not confi-
dently identified in the Phenyx searches were extracted and
submitted to the OMS tool Popitam [53] searching the
Orbitrap data with a fragment mass tolerance of 0.01 Da
(minimum accepted by the software tool) and 0.5 Da for the
ion-trap data. While no additional confident identifications
were found in the ion-trap data set, peptides with mass shifts
corresponding to oxidation, adduction of sodium, methyla-
tion and dethiomethylation were frequently observed in the
high fragment mass accuracy Orbitrap data.
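The difference between the two fragment tolerances quoted above is easier to appreciate once both are expressed in the same unit. The Python sketch below converts a relative (ppm) tolerance into an absolute one and applies either kind of tolerance to a fragment match; the function names are assumptions of this example and do not correspond to parameters of Phenyx or Popitam.

```python
def ppm_to_da(mz, ppm):
    """Absolute tolerance corresponding to a ppm tolerance at a given m/z."""
    return mz * ppm * 1e-6

def within_tolerance(observed_mz, theoretical_mz, tol, unit="ppm"):
    """True if the observed fragment matches the theoretical one within tol."""
    delta = abs(observed_mz - theoretical_mz)
    if unit == "ppm":
        return delta <= ppm_to_da(theoretical_mz, tol)
    if unit == "Da":
        return delta <= tol
    raise ValueError("unit must be 'ppm' or 'Da'")
```

At m/z 1000, a tolerance of 12 ppm corresponds to 0.012 Da, roughly forty times narrower than the 0.5 Da window used for the ion-trap data.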
Fortunately, high mass accuracy instruments are becoming
increasingly common in proteomics labs. For
an in-depth review of mass accuracy in proteomics
experiments we recommend [79].
5.2 Extended identification workflows
A number of clever data pre-processing steps have been
proposed. They are worth considering for reducing
computational time and possibly increasing PTM identifi-
cation rate. MS/MS experiments often generate redundant
data sets containing multiple spectra of the same peptides.

Table 2. A comparison of the number of precursors considered for three types of searches a)

Instrument     Unmodified    Five variable modifications b)    OMS c)
Ion-trap d)    442           2838                              232 216
FT e)          10            110                               232 216

a) Listing the number of fully tryptic candidate peptides, in the Uniprot database (Yeast, 6594 proteins), for a 1000.48 Da precursor, produced on an ion-trap and a high mass accuracy LTQ-FT instrument, respectively, and analysed in three types of searches.
b) Deamidation (N, Q), Methylation (H, K), Oxidation (M, W), Acetylation (L), Sodium adduct (D, E).
c) ±100 Da modification mass tolerance.
d) ±2 Da precursor mass tolerance.
e) ±0.006 Da precursor mass tolerance.

On this basis, a fast clustering algorithm was presented,
grouping similar spectra and replacing them with a single
representative spectrum [80]. It is demonstrated how data
sets of over ten million spectra could be reduced by a factor
of ten, significantly speeding up the following database
search. This of course is especially meaningful when using
OMS tools as the analysis time per spectrum is much larger
than when performing a restricted search with a classical
PFF tool.
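The principle of replacing redundant spectra by one representative can be sketched with a simple greedy clustering. This is not the algorithm of [80]: the cosine similarity on 1 Da bins, the precursor tolerance, the similarity threshold and all function names are assumptions made for this illustration.

```python
import math
from collections import defaultdict

def binned(peaks, bin_width=1.0):
    """Sum intensities into m/z bins; peaks is a list of (mz, intensity)."""
    vec = defaultdict(float)
    for mz, inten in peaks:
        vec[int(mz / bin_width)] += inten
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster_spectra(spectra, precursor_tol=2.0, min_cosine=0.8):
    """Greedy clustering: each spectrum joins the first cluster whose
    representative has a similar precursor m/z and fragment pattern.

    spectra: list of (precursor_mz, peaks). Returns a list of clusters, each a
    list of indices into spectra; the first index acts as the representative.
    """
    clusters = []  # entries are (precursor_mz, binned_vector, member_indices)
    for i, (prec, peaks) in enumerate(spectra):
        vec = binned(peaks)
        for rep_prec, rep_vec, members in clusters:
            close = abs(prec - rep_prec) <= precursor_tol
            if close and cosine(vec, rep_vec) >= min_cosine:
                members.append(i)
                break
        else:
            clusters.append((prec, vec, [i]))
    return [members for _, _, members in clusters]
```

Only the representative of each cluster would then be submitted to the OMS tool, and its identification propagated to the other members.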
Bern et al. [81] describe a spectrum quality assessment
tool and show how it may be of particular interest in a pre-
processing step preceding a modification search: spectra
assigned a high quality but not identified in a restricted
search can often be explained by modified peptides,
although many modifications produce low-quality spectra.
We strongly recommend reading Tanner et al. [82],
which provides a well-written user manual of the InsPecT software
platform. The authors suggest that an OMS could be
followed by a restrictive follow-up search. Here the data is
further explored for some of the more frequent or especially
interesting modifications listed in the OMS output. The
restrictive search can be more sensitive in detecting modi-
fications with known effects on fragmentation and multiple
modifications per peptide are allowed. SeMoP [69] automates
such a three-step strategy. First, a standard database search
is performed with SEQUEST [15]. Second, all peptides
corresponding to the identified proteins or only the identi-
fied peptides from step one are exhaustively explored for
modifications. Finally, data are re-submitted to SEQUEST
for a targeted search for specific modifications found in the
unrestricted search, allowing for multiple modifications per
peptide.
5.3 Experimental set-up for PTM identification
Studying the results presented in the papers describing the
OMS tools discussed above it becomes clear that the vast
majority of the modifications identified are in fact not post-
translational but rather chemical modifications induced by
sample preparation such as Cysteine Carbox-
yamidomethylation, N-terminus and Lysine Carbamylation,
Oxidation of Methionine and Sodium and Potassium
adducts. PTMs, especially those previously unknown, can be
expected to be of low abundance. This is illustrated in Fig. 5
showing the modification mass distribution of a typical
OMS search. The histogram displays the confident PSMs
returned when screening a human blood plasma sample
data set produced on an Orbitrap instrument, and analysed
with a novel library search-based OMS tool, QuickMod
(Ahrne et al. manuscript in preparation).
The detection of low-abundant PTM peptides is very
limited when analysing complex mixtures because these
peptides are overshadowed by unmodified peptides and
peptides modified during the sample preparation. In addi-
tion, post-processing algorithms tend to favour the identifi-
cation of abundant modifications for which extensive
evidence can be found. These factors complicate the
successful discovery of rare PTMs. The detection problem
can be partly improved by various promising sample
preparation techniques such as anti-phosphoamino acid
antibodies for protein isolation [83, 7] and affinity-based
enrichment of modified proteins or peptides [6, 84, 85].
Seo et al. present a protocol targeting low-abundance
PTMs [86]. The authors describe a clever LC-MS workflow,
Selectively Excluded Mass Screening Analysis (SEMSA)
where samples are analysed by an LC-ESI-qTOF in multiple
rounds. For each round, the precursor masses of spectra
confidently identified, using MODi [55], are added to a mass
exclusion list allowing for the fragmentation of precursor
ions of low intensities. A similar set-up, presented in
Schmidt et al. [87] where unidentified MS features were
added to an inclusion list for targeted fragmentation, leads
to extensive identification of phosphorylation sites in a
protein mixture obtained from Drosophila melanogaster
lysates.
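The iterative exclusion-list principle can be illustrated with a small simulation. This sketch is not the SEMSA protocol itself: the feature representation as (m/z, intensity) pairs, the `identify` callback standing in for the search engine, the number of precursors selected per round and the function names are all assumptions of this example.

```python
def select_precursors(features, exclusion, top_n=3, tol=0.01):
    """Pick the top_n most intense precursors not on the exclusion list.

    features: list of (mz, intensity); exclusion: list of excluded m/z values.
    """
    allowed = [f for f in features if all(abs(f[0] - x) > tol for x in exclusion)]
    return sorted(allowed, key=lambda f: f[1], reverse=True)[:top_n]

def semsa_rounds(features, identify, n_rounds=3, top_n=3):
    """Run several acquisition rounds, excluding identified precursors each time.

    identify: callable taking an m/z and returning True if the corresponding
    spectrum was confidently identified (a stand-in for the search engine).
    """
    exclusion, selected_per_round = [], []
    for _ in range(n_rounds):
        picked = select_precursors(features, exclusion, top_n)
        selected_per_round.append([mz for mz, _ in picked])
        # Identified precursors are excluded from all following rounds.
        exclusion.extend(mz for mz, _ in picked if identify(mz))
    return selected_per_round
```

In each round the instrument time freed by the exclusion list is spent on precursors of lower intensity, which is where rare modified peptides are expected to be found.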
Another interesting LC-MS identification workflow was
recently published by Carapito et al. [88] where spectra are
acquired under different collision conditions and a peptide
mass inclusion list is compiled based on the detection of
modification specific neutral loss fragments and reporter
ions in combination with ion signals corresponding to the
modified and unmodified peptide masses. In a final step the
peptides on the mass inclusion list are sequenced in a
directed MS/MS mode.
Figure 5. The vast majority of the modifications identified in a
typical OMS are in fact not post-translational but rather modifi-
cations induced by the sample preparation such as Cysteine
Carboxyamidomethylation, N-terminus and Lysine Carbamyla-
tion, Oxidation of Methionine and Sodium and Potassium
adducts. PTMs, especially those previously unknown, can be
expected to be of low abundance. The histogram displays the
confident modified PSMs returned when screening a human
blood plasma sample data set produced on an Orbitrap instru-
ment, and analysed with a novel library search based OMS tool,
QuickMod (manuscript in preparation).
Barsnes et al. [89] developed a software tool, Mass-
ShiftFinder, tackling the detection problem by screening
MALDI-TOF data for potentially modified peptides that then
can be selected for subsequent TOF-TOF analysis. Their
algorithm performs a blind search for modifications using
peptide mass fingerprints from two proteases with different
cleavage specificities. If the same mass shift relative to the
unmodified theoretical values is observed for both proteases,
and the peptides are overlapping, the mass shift can corre-
spond to a modification or a substitution.
Working with more than one protease is in general a
good idea in order to increase the sequence coverage of PTM
sites in the results of analysis [10]. Strong b- or y-ion peaks
on either side of a modified residue are the best evidence for
site specificity. Digesting the samples with multiple
proteases also improves the chances of producing such a
spectrum, facilitating the modification localisation problem.
Furthermore, the confident identification of low-abundance
peptides generally requires multiple replicate LC-MS/MS
analyses of the same or similar samples [65].
As mentioned earlier in this review, an additional
problem that makes the identification of real PTMs parti-
cularly tricky is the fact that some modified peptides frag-
ment poorly in the mass spectrometer. Low-energy CAD/
CID MS/MS has been, by far, the most common method
used to dissociate peptide ions for subsequent sequence
analysis. Ideally, the peptide is cleaved randomly at the
amide bonds along its backbone to produce a homologous
series of b and y-type fragment ions. The presence of
multiple basic residues prevents full fragmentation upon
collision activation/induction and directs the backbone bond
dissociation to specific sites and therefore inhibits the
production of a sufficiently diverse set of sequence ions.
Further, PTMs such as phosphorylation, sulfonation, nitro-
sylation and O- and N-linked glycosylation may similarly
redirect the sites of preferred cleavage. Often the modified
moiety is cleaved off and the peptide backbone is left more
or less intact. The resulting spectra tend to contain little
peptide sequence information and may not allow for
successful identification. In this regard, CAD/CID is most
effective for short, low-charged unmodified peptides. New
instrumentation technologies support alternative solutions
for data generation that have the potential to improve
peptide and protein identification, in particular the identi-
fication of peptides carrying labile modifications.
As n-dimensional MS has become more practical new
techniques to identify modified peptides have been
developed making up for the limitations of CAD/CID
fragmentation. Newer ion-trap instruments provide the
option of collecting MS3 spectra of abundant MS2 peaks.
Peptides carrying labile modifications have been analysed by
automated data-dependent triggering of MS3 acquisition
whenever the dominant neutral loss ion of the appropriate
mass is detected in an MS2 spectrum [8, 90, 91]. By sepa-
rately fragmenting the neutral loss ion a sequence infor-
mation-rich MS3 spectrum can be produced. Different
approaches have been tested to combine MS2 and MS3
spectra from the same peptide to improve peptide identifi-
cation [92–94].
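The data-dependent trigger described above can be sketched as a simple test on the MS2 spectrum. This is a minimal illustration rather than any vendor's acquisition logic: the function name, the requirement that the neutral-loss ion be the base peak, the 0.5 tolerance and the peak-list format are assumptions of this example; the mass of phosphoric acid is the standard monoisotopic value.

```python
PHOSPHORIC_ACID = 97.9769  # monoisotopic mass of H3PO4 (Da)

def neutral_loss_trigger(peaks, precursor_mz, charge, loss=PHOSPHORIC_ACID, tol=0.5):
    """Return the m/z to select for MS3 if the base peak of the MS2 spectrum
    is the neutral-loss ion of the precursor, otherwise None.

    peaks: list of (mz, intensity) from the MS2 spectrum.
    """
    if not peaks:
        return None
    base_mz, _ = max(peaks, key=lambda p: p[1])
    # The neutral-loss ion keeps the precursor charge, so the shift is loss/charge.
    expected = precursor_mz - loss / charge
    return base_mz if abs(base_mz - expected) <= tol else None
```

For a doubly charged precursor at m/z 700, for example, the neutral-loss ion is expected near m/z 651, and an MS3 scan would be triggered only if that peak dominates the spectrum.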
Other methods to generate higher quality spectra of
peptides carrying labile modifications rely on new frag-
mentation techniques altogether. Electron capture dissocia-
tion (ECD) is a method for peptide dissociation, which is
relatively indifferent to peptide sequence and length while
avoiding the loss of labile modifications during fragmentation
[95]. However, ECD requires an FT-ICR mass spectrometer.
Syka et al. [96] introduced electron transfer
dissociation (ETD), which has proven useful for the identification
of modified peptides and peptides with basic residues.
ETD fragments peptides at the Cα-N bond by
transferring an electron from a radical anion to a protonated
peptide inducing similar fragmentation patterns to ECD,
but can be used on more widely accessible ion-trap or
Orbitrap mass spectrometers [97]. Olsen et al. [98] demonstrated
a third new PTM-friendly fragmentation technology
that takes advantage of the Orbitrap’s architecture: higher
energy C-trap dissociation (HCD). HCD spectra show richer
fragmentation than typical CAD/CID spectra especially in
the low-mass region of the spectrum including a2, b2, y1, y2
ions and immonium ions of histidine and modified residues
such as the immonium ion of phosphotyrosine.
Many more experimental protocols have been described
in the literature aiming to increase identification of PTMs.
For details we refer to an excellent review on this topic [3].
6 OMS studies
Typical PFF tools successfully explore the unmodified frac-
tion of the experimental data but often fail to identify an
important part of the fragmentation spectra. The use of
recently developed OMS software enables a more complete
annotation of MS/MS data sets and refines our under-
standing of the biological system under investigation.
PTMFinder was evaluated on an impressively large data
set of 18 million spectra from a whole-lysate extract of
HEK293 human embryonic kidney cells. Hundreds of
previously uncharacterised modification sites were found,
most of them were phosphorylations, acetylations or
methylations, in addition to more than 900 already docu-
mented modification sites [73]. In the same publication the
authors reported the discovery of several modification sites
conserved between protein orthologues in humans and
protists based on the additional analysis of Dictyostelium
discoideum samples.
The ocular lens is another suitable testing ground as it is
a particularly rich source of PTMs. Since the proteins in the
mature lens fibre cells do not turn over during its long lifetime,
it is expected that a wide variety of PTMs will accumulate
on a large number of residues. This makes lens
samples excellent candidates for exploring and evaluating
the ability of OMS tools to detect protein modifications.
Several of the recently developed OMS tools have been
tested on such data sets including InsPecT [56], PTM-Explorer
[71], SIMS [57] and SwissPIT [64]. Wilmarth et al. [99]
identified a total of 155 modification sites in crystallins
analysing a human lens data set produced on two different
instruments, an LCQ Classic ion trap and a Q-TOF hybrid
mass spectrometer, using the InsPecT software suite. Of
these, 77 were previously reported sites and 78 newly
detected, including carboxymethyl lysine (+58 Da),
carboxyethyl lysine (+72 Da) and an arginine modification of
+55 Da. PTM-Explorer was tested on lens protein samples
from a 100-wk-old mouse [71]. Approximately 30% of all
identified peptides were found to carry modifications other
than the common sample preparation artefacts propiona-
mide on cysteines and methionine oxidation, mainly phos-
phorylation, acetylation and sodium adducts. The developers
of SIMS benchmarked their tool against InsPecT on a small
human lens protein test data set of 243 high-resolution
spectra, of modified peptides, generated on a QTOF
instrument [57]. The two algorithms returned identical
results for 80% of the spectra. For a further 17% of the
spectra, the two tools agreed on the peptide and modification
mass but disagreed on the modification site.
Other benchmarking studies also show that different
OMS tools often agree on peptide identification and modification
mass but the site assignment may differ. ProteinProspector
[72] was compared with InsPecT on a publicly
available data set (regis-web.systemsbiology.net/PublicDatasets/mix2)
of 3734 spectra produced on a QSTAR
instrument, from a protein mixture of 18 standard proteins.
The two tools reported the same peptides with the same
modification mass for 1102 spectra, but approximately half
of the modification sites did not align.
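Agreement figures of this kind can be computed with a few lines of Python once the output of two tools has been brought into a common form. The sketch below is an assumption-laden illustration: the dictionary format (spectrum identifier mapped to peptide, modification mass and site index), the 0.5 Da mass tolerance and the function name are choices made for this example and do not reflect the output format of any cited tool.

```python
def compare_tools(results_a, results_b, mass_tol=0.5):
    """Compare two OMS outputs on the spectra identified by both tools.

    Each result maps spectrum_id -> (peptide, mod_mass, site_index).
    Returns counts of spectra where both tools agree fully, agree on peptide
    and modification mass but not on the site, or disagree.
    """
    counts = {"identical": 0, "site_differs": 0, "different": 0}
    for spec_id in results_a.keys() & results_b.keys():
        pep_a, mass_a, site_a = results_a[spec_id]
        pep_b, mass_b, site_b = results_b[spec_id]
        if pep_a == pep_b and abs(mass_a - mass_b) <= mass_tol:
            counts["identical" if site_a == site_b else "site_differs"] += 1
        else:
            counts["different"] += 1
    return counts
```

Reporting the three categories separately makes it clear whether two tools disagree on the peptide itself or only on the localisation of the modification.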
Modificomb [61] was evaluated on high-accuracy FT data
where two complementary fragmentation techniques were
used, CAD and ECD, revealing several previously unknown
modifications, later confirmed by MASCOT in targeted
searches, including a frequent 12 Da proline modification
detected in human saliva and a 98 Da modification on
histidine found in an E. coli sample.
7 Concluding remarks
As shown in this review bioinformatics for PTM discovery is
an active area of research. A wide range of OMS software
has been developed in recent years and the results from
various studies described in the previous section demon-
strate their capacity to analyse large data sets from complex
protein samples. OMS tools provide efficient means to
evaluate the quality of a sample by revealing modifications
induced by sample handling and preparation such as
oxidation, pyro-Glu and salt adducts. More importantly,
these tools are capable of identifying known and previously
unknown PTMs. Studies where unrestricted modification
searches were included in the data analysis pipeline show
that there is an important discrepancy between what is
documented in public databases and what remains to be
found about protein modifications.
In order to fully benefit from modification tolerant soft-
ware, the use of these analysis tools should be combined
with appropriate experimental set-ups allowing for the
fragmentation of low-abundance peptides. Employing
complementary peptide fragmentation techniques to CAD/
CID such as HCD and ETD, is desirable, as higher quality
spectra of peptides carrying labile modifications are
produced. Furthermore it is meaningful to combine
unrestricted searches for modifications with targeted studies
such as multiple reaction monitoring [100] for confirmation
and quantification of interesting modifications. Carefully
designed experimental protocols and unsupervised data
analysis in combination with verification experiments pave
the way for the study of modified protein forms as
biomarkers of disease.
Further improvements of OMS studies can be envisioned
in the years to come. Users will benefit from enhanced
computational resources enabling faster and larger scale
analysis. Instruments with parts-per-million mass accuracy
are becoming increasingly common in proteomics laboratories
and OMS tools can be further refined by taking full advan-
tage of precise peptide precursor and fragment mass
measurements. A better understanding of the effects of
modifications on peptide fragmentation should lead to more
accurate identification as better theoretical models of
modified spectra can be used for peptide matching. Inte-
grating algorithms for sequence-based prediction of modi-
fication sites such as the AutoMotif Server (AMS 2.0) [101],
as a part of an OMS workflow, may reinforce the positioning
accuracy of modified residues. Combining the results of
multiple classical PFF tools has proved fruitful for
maximising peptide discovery [102]. Therefore, parallel
searches with two or more OMS tools could similarly
improve the identification rates of modified peptides, but
strategies for unifying the search results of multiple tools
need to be evaluated. Effective filtering reducing the number
of candidate peptides per spectrum is an important part of
an OMS workflow. As discussed earlier, a thorough inves-
tigation of appropriate filtering criteria may contribute to a
better trade-off between sensitivity and error rate in OMS
searches.
In order to provide valuable guidance to investigators
interested in modification tolerant data analysis, we encou-
rage software developers to benchmark new OMS tools on
standard data sets. The Peptide Atlas Data Repository
(http://www.peptideatlas.org/repository/) provides many
high quality data sets produced on different instrument
types. A modification-rich human lens data set (PAe000316,
Wilmarth_human_lens) should be a good candidate.
Another useful resource for testing purposes is a large
collection of annotated ion-trap spectra of modified lens
peptides, downloadable at http://bioinfo2.ucsd.edu/
ModdedSpectra.html.
By improving the protein modification discovery in large
proteomics data sets OMS tools should prove valuable in the
quest for mapping the regulation and dynamics of proteomes.
The authors’ related work is part of a collaborative project
supported by Microsoft Research.
The authors have declared no conflict of interest.
8 References
[1] Mann, M., Jensen, O. N., Proteomic analysis of post-
translational modifications. Nat. Biotechnol. 2003, 21,
255–261.
[2] Jensen, O. N., Modification-specific Proteomics: char-
acterization of post-translational modifications by mass
spectrometry. Curr. Opin. Chem. Biol. 2004, 8, 33–41.
[3] Witze, E. S., Old, W. M., Resing, K. A., Ahn, N. G. et al.
Mapping protein post-translational modifications with
mass spectrometry. Nat. Methods 2007, 4, 798–806.
[4] Pang, C. N. I., Hayen, A., Wilkins, M. R., Surface acce-
ssibility of protein post-translational modifications.
J. Proteome Res. 2007, 6, 1833–1845.
[5] Aebersold, R., Mann, M., Mass spectrometry-based
Proteomics. Nature 2003, 422, 198–207.
[6] Ficarro, S. B., McCleland, M. L., Stukenberg, P. T., Burke,
D. J. et al. Phosphoproteome analysis by mass spectro-
metry and its application to Saccharomyces cerevisiae.
Nat. Biotechnol. 2002, 20, 301–305.
[7] Steen, H., Kuster, B., Fernandez, M., Pandey, A. et al.
Tyrosine phosphorylation mapping of the epidermal
growth factor receptor signaling pathway. J. Biol. Chem.
2002, 277, 1031–1039.
[8] Beausoleil, S. A., Jedrychowski, M., Schwartz, D., Elias,
J. E. et al. Large-scale characterization of HeLa cell nuclear
phosphoproteins. Proc. Natl. Acad. Sci. USA 2004, 101,
12130–12135.
[9] Tissot, B., North, S. J., Ceroni, A., Pang, P. et al. Glyco-
Proteomics: past, present and future. FEBS Lett. 2009, 583,
1728–1735.
[10] MacCoss, M. J., McDonald, W. H., Saraf, A., Sadygov, R.
et al. Shotgun identification of protein modifications from
protein complexes and lens tissue. Proc. Natl. Acad. Sci.
USA 2002, 99, 7900–7905.
[11] Nielsen, M. L., Savitski, M. M., Zubarev, R. A., Extent of
modifications in human proteome samples and their effect
on dynamic range of analysis in shotgun proteomics. Mol.
Cell. Proteomics 2006, 5, 2384–2391.
[12] Wu, C. H., Apweiler, R., Bairoch, A., Natale, D. A. et al. The
Universal Protein Resource (UniProt): an expanding
universe of protein information. Nucleic Acids Res. 2006,
34, D187–D191.
[13] Creasy, D. M., Cottrell, J. S., Unimod: Protein modifications
for mass spectrometry. Proteomics 2004, 4, 1534–1536.
[14] Garavelli, J. S., The RESID Database of Protein Modifica-
tions as a resource and annotation tool. Proteomics 2004,
4, 1527–1533.
[15] Eng, J. K., McCormack, A. L., Yates, J. R., An approach to
correlate tandem mass spectral data of peptides with
amino acid sequences in a protein database. J. Am. Soc.
Mass Spectrom. 1994, 5, 976–989.
[16] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S.,
Probability-based protein identification by searching
sequence databases using mass spectrometry data. Elec-
trophoresis 1999, 20, 3551–3567.
[17] Colinge, J., Masselot, A., Giron, M., Dessingy, T. et al.
OLAV: towards high-throughput tandem mass spectro-
metry data identification. Proteomics 2003, 3, 1454–1463.
[18] Craig, R., Beavis, R. C., TANDEM: matching proteins with
tandem mass spectra. Bioinformatics 2004, 20, 1466–1467.
[19] Field, H. I., Fenyo, D., Beavis, R. C., RADARS, a bioinfor-
matics solution that automates proteome mass spectral
analysis, optimises protein identification, and archives
data in a relational database. Proteomics 2002, 2, 36–47.
[20] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L. et al.
Open mass spectrometry search algorithm. J. Proteome
Res. 2004, 3, 958–964.
[21] Matthiesen, R., Trelle, M. B., Hojrup, P., Bunkenborg, J.
et al. VEMS 3.0: algorithms and computational tools for
tandem mass spectrometry based identification of post-
translational modifications in proteins. J. Proteome Res.
2005, 4, 2338–2347.
[22] Corthals, G. L., Wasinger, V. C., Hochstrasser, D. F.,
Sanchez, J. C. et al. The dynamic range of protein
expression: a challenge for proteomic research. Electro-
phoresis 2000, 21, 1104–1115.
[23] Leitner, A., Foettinger, A., Lindner, W., Improving frag-
mentation of poorly fragmenting peptides and phospho-
peptides during collision-induced dissociation by
malondialdehyde modification of arginine residues.
J. Mass Spectrom. 2007, 42, 950–959.
[24] Ghesquiere, B., Damme, J. V., Martens, L., Vandekerc-
khove, J. et al. Proteome-wide characterization of
N-glycosylation events by diagonal chromatography.
J. Proteome Res. 2006, 5, 2438–2447.
[25] MacCoss, M. J., Wu, C. C., Liu, H., Sadygov, R. et al.
A correlation algorithm for the automated quantitative
analysis of shotgun Proteomics data. Anal. Chem. 2003,
75, 6912–6921.
[26] Nesvizhskii, A. I., Keller, A., Kolker, E., Aebersold, R. et al.
A statistical model for identifying proteins by tandem
mass spectrometry. Anal. Chem. 2003, 75, 4646–4658.
[27] Nesvizhskii, A. I., Aebersold, R., Interpretation of shotgun
proteomic data: the protein inference problem. Mol. Cell.
Proteomics 2005, 4, 1419–1440.
[28] Nesvizhskii, A. I., Roos, F. F., Grossmann, J., Vogelzang, M.
et al. Dynamic spectrum quality assessment and iterative
computational analysis of shotgun proteomic data: toward
more efficient identification of post-translational modifi-
cations, sequence polymorphisms, and novel peptides.
Mol. Cell. Proteomics 2006, 5, 652–670.
[29] Sadygov, R. G., Yates, J. R., A hypergeometric probability
model for protein identification and validation using
tandem mass spectral data and protein sequence data-
bases. Anal. Chem. 2003, 75, 3792–3798.
[30] Sadygov, R. G., Liu, H., Yates, J. R., Statistical models for
protein validation using tandem mass spectral data and
protein amino acid sequence databases. Anal. Chem. 2004,
76, 1664–1671.
[31] Sadygov, R., Wohlschlegel, J., Park, S. K., Xu, T. et al.
Central limit theorem as an approximation for intensity-
based scoring function. Anal. Chem. 2006, 78, 89–95.
[32] Keller, A., Nesvizhskii, A. I., Kolker, E., Aebersold, R. et al.
Empirical statistical model to estimate the accuracy of
peptide identifications made by MS/MS and database
search. Anal. Chem. 2002, 74, 5383–5392.
[33] Colinge, J., Masselot, A., Cusin, I., Mahe, E. et al. High-
performance peptide identification by tandem mass spec-
trometry allows reliable automatic data processing in
Proteomics. Proteomics 2004, 4, 1977–1984.
[34] Wan, Y., Yang, A., Chen, T., PepHMM: a hidden Markov
model based scoring function for mass spectrometry
database search. Anal. Chem. 2006, 78, 432–437.
[35] Ong, S., Mittler, G., Mann, M., Identifying and quantifying
in vivo methylation sites by heavy methyl SILAC. Nat.
Methods 2004, 1, 119–126.
[36] Shevchenko, A., Loboda, A., Shevchenko, A., Ens, W. et al.
MALDI quadrupole time-of-flight mass spectrometry: a
powerful tool for proteomic research. Anal. Chem. 2000,
72, 2132–2141.
[37] Savitski, M. M., Nielsen, M. L., Zubarev, R. A., New data
base-independent, sequence tag-based scoring of peptide
MS/MS data validates Mowse scores, recovers below
threshold data, singles out modified peptides, and asses-
ses the quality of MS/MS techniques. Mol. Cell. Proteomics
2005, 4, 1180–1188.
[38] MacCoss, M. J., Computational analysis of shotgun
proteomics data. Curr. Opin. Chem. Biol. 2005, 9, 88–94.
[39] Käll, L., Storey, J. D., MacCoss, M. J., Noble, W. S. et al.
Assigning significance to peptides identified by tandem
mass spectrometry using decoy databases. J. Proteome
Res. 2008, 7, 29–34.
[40] Elias, J. E., Gygi, S. P., Target-decoy search strategy for
increased confidence in large-scale protein identifications
by mass spectrometry. Nat. Methods 2007, 4, 207–214.
[41] Chalkley, R. J., Baker, P. R., Hansen, K. C., Medzihradszky,
K. F. et al. Comprehensive analysis of a multidimensional
liquid chromatography mass spectrometry dataset
acquired on a quadrupole selecting, quadrupole collision
cell, time-of-flight mass spectrometer: I. How much of the
data is theoretically interpretable by search engines? Mol.
Cell. Proteomics 2005, 4, 1189–1193.
[42] Mann, M., Wilm, M., Error-tolerant identification of
peptides in sequence databases by peptide sequence tags.
Anal. Chem. 1994, 66, 4390–4399.
[43] Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E. et al. De
novo peptide sequencing via tandem mass spectrometry.
J. Comput. Biol. 1999, 6, 327–342.
[44] Fernandez-de-Cossio, J., Gonzalez, J., Satomi, Y., Shima,
T. et al. Automated interpretation of low-energy collision-
induced dissociation spectra by SeqMS, a software aid for
de novo sequencing by tandem mass spectrometry. Elec-
trophoresis 2000, 21, 1694–1699.
[45] Ma, B., Zhang, K., Hendrie, C., Liang, C. et al. PEAKS:
powerful software for peptide de novo sequencing by
tandem mass spectrometry. Rapid Commun. Mass Spec-
trom. 2003, 17, 2337–2342.
[46] Johnson, R. S., Taylor, J. A., Searching sequence data-
bases via de novo peptide sequencing by tandem mass
spectrometry. Mol. Biotechnol. 2002, 22, 301–315.
[47] Frank, A., Pevzner, P., PepNovo: de novo peptide sequen-
cing via probabilistic network modeling. Anal. Chem. 2005,
77, 964–973.
[48] Searle, B. C., Dasari, S., Turner, M., Reddy, A. P. et al. High-
throughput identification of proteins and unanticipated
sequence modifications using a mass-based alignment
algorithm for MS/MS de novo sequencing results. Anal.
Chem. 2004, 76, 2220–2230.
[49] Savitski, M. M., Nielsen, M. L., Kjeldsen, F., Zubarev, R. A.
et al. Proteomics-grade de novo sequencing approach.
J. Proteome Res. 2005, 4, 2348–2354.
[50] Sunyaev, S., Liska, A. J., Golod, A., Shevchenko, A. et al.
MultiTag: multiple error-tolerant sequence tag search for
the sequence-similarity identification of proteins by mass
spectrometry. Anal. Chem. 2003, 75, 1307–1315.
[51] Tabb, D. L., Saraf, A., Yates, J. R., GutenTag: high-
throughput sequence tagging via an empirically derived
fragmentation model. Anal. Chem. 2003, 75, 6415–6421.
[52] Tabb, D. L., Ma, Z., Martin, D. B., Ham, A. L. et al. DirecTag:
accurate sequence tags from peptide MS/MS through
statistical scoring. J. Proteome Res. 2008, 7, 3838–3846.
[53] Hernandez, P., Gras, R., Frey, J., Appel, R. D. et al. Popitam:
towards new heuristic strategies to improve protein iden-
tification from tandem mass spectrometry data. Proteo-
mics 2003, 3, 870–878.
[54] Searle, B. C., Dasari, S., Wilmarth, P. A., Turner, M. et al.
Identification of protein modifications using MS/MS de
novo sequencing and the OpenSea alignment algorithm.
J. Proteome Res. 2005, 4, 546–554.
[55] Na, S., Jeong, J., Park, H., Lee, K. et al. Unrestrictive
identification of multiple post-translational modifications
from tandem mass spectrometry using an error-tolerant
algorithm based on an extended sequence tag approach.
Mol. Cell. Proteomics 2008, 7, 2452–2463.
[56] Tanner, S., Shu, H., Frank, A., Wang, L. et al. InsPecT:
identification of posttranslationally modified peptides from
tandem mass spectra. Anal. Chem. 2005, 77, 4626–4639.
[57] Liu, J., Erassov, A., Halina, P., Canete, M. et al. Sequential
interval motif search: unrestricted database surveys of global
MS/MS data sets for detection of putative post-translational
modifications. Anal. Chem. 2008, 80, 7846–7854.
[58] Pevzner, P. A., Mulyukov, Z., Dancik, V., Tang, C. L. et al.
Efficiency of database search for identification of mutated
and modified proteins via mass spectrometry. Genome
Res. 2001, 11, 290–299.
[59] Creasy, D. M., Cottrell, J. S., Error tolerant searching of
uninterpreted tandem mass spectrometry data. Proteo-
mics 2002, 2, 1426–1434.
[60] Craig, R., Beavis, R. C., A method for reducing the time
required to match protein sequences with tandem mass
spectra. Rapid Commun. Mass Spectrom. 2003, 17,
2310–2316.
[61] Savitski, M. M., Nielsen, M. L., Zubarev, R. A., ModifiComb,
a new proteomic tool for mapping substoichiometric post-
translational modifications, finding novel types of modifi-
cations, and fingerprinting complex protein mixtures. Mol.
Cell. Proteomics 2006, 5, 935–948.
[62] Falkner, J. A., Falkner, J. W., Yocum, A. K., Andrews, P. C.
et al. A spectral clustering approach to MS/MS identifica-
tion of post-translational modifications. J. Proteome Res.
2008, 7, 4614–4622.
[63] Ahrne, E., Masselot, A., Binz, P., Müller, M. et al. A simple
workflow to increase MS2 identification rate by subse-
quent spectral library search. Proteomics 2009, 9,
1731–1736.
[64] Quandt, A., Masselot, A., Hernandez, P., Hernandez, C.
et al. SwissPIT: An workflow-based platform for analyzing
tandem-MS spectra using the Grid. Proteomics 2009, 9,
2648–2655.
[65] Hansen, B. T., Davey, S. W., Ham, A. L., Liebler, D. C. et al.
P-Mod: an algorithm and software to map modifications to
peptide sequences using tandem MS data. J. Proteome
Res. 2005, 4, 358–368.
[66] Tsur, D., Tanner, S., Zandi, E., Bafna, V. et al. Identification
of post-translational modifications by blind search of mass
spectra. Nat. Biotechnol. 2005, 23, 1562–1567.
[67] Pevzner, P. A., Dancik, V., Tang, C. L., Mutation-tolerant
protein identification by mass spectrometry. J. Comput.
Biol. 2000, 7, 777–787.
[68] Havilio, M., Wool, A., Large-scale unrestricted identifica-
tion of post-translation modifications using tandem mass
spectrometry. Anal. Chem. 2007, 79, 1362–1368.
[69] Baumgartner, C., Rejtar, T., Kullolli, M., Akella, L. M. et al.
SeMoP: a new computational strategy for the unrestricted
search for modified peptides using LC-MS/MS data.
J. Proteome Res. 2008, 7, 4199–4208.
[70] Tang, W. H., Halpern, B. R., Shilov, I. V., Seymour, S. L.
et al. Discovering known and unanticipated protein modi-
fications using MS/MS database searching. Anal. Chem.
2005, 77, 3931–3946.
[71] Chamrad, D. C., Korting, G., Schäfer, H., Stephan, C.
et al. Gaining knowledge from previously unexplained
spectra-application of the PTM-Explorer software to detect
PTM in HUPO BPP MS/MS data. Proteomics 2006, 6,
5048–5058.
[72] Chalkley, R. J., Baker, P. R., Medzihradszky, K. F., Lynn,
A. J. et al. In-depth analysis of tandem mass spectrometry
data from disparate instrument types. Mol. Cell. Proteo-
mics 2008, 7, 2386–2398.
[73] Tanner, S., Payne, S. H., Dasari, S., Shen, Z. et al. Accurate
annotation of peptide modifications through unrestrictive
database search. J. Proteome Res. 2008, 7, 170–181.
[74] Bern, M., Goldberg, D., Improved ranking functions for
protein and modification-site identifications. J. Comput.
Biol. 2008, 15, 705–719.
[75] Feng, J., Naiman, D. Q., Cooper, B., Probability model for
assessing proteins assembled from peptide sequences
inferred from tandem mass spectrometry data. Anal.
Chem. 2007, 79, 3901–3911.
[76] Haas, W., Faherty, B. K., Gerber, S. A., Elias, J. E. et al.
Optimization and use of peptide mass measurement
accuracy in shotgun proteomics. Mol. Cell. Proteomics
2006, 5, 1326–1337.
[77] Spengler, B., De novo sequencing, peptide composition
analysis, and composition-based sequencing: a new
strategy employing accurate mass determination by Fourier
transform ion cyclotron resonance mass spectrometry.
J. Am. Soc. Mass Spectrom. 2004, 15, 703–714.
[78] Scherl, A., Shaffer, S. A., Taylor, G. K., Hernandez, P. et al.
On the benefits of acquiring peptide fragment ions at high
measured mass accuracy. J. Am. Soc. Mass Spectrom.
2008, 19, 891–901.
[79] Liu, T., Belov, M. E., Jaitly, N., Qian, W. et al. Accurate
mass measurements in proteomics. Chem. Rev. 2007, 107,
3621–3653.
[80] Frank, A. M., Bandeira, N., Shen, Z., Tanner, S. et al.
Clustering millions of tandem mass spectra. J. Proteome
Res. 2008, 7, 113–122.
[81] Bern, M., Goldberg, D., McDonald, W. H., Yates, J. R. et al.
Automatic quality assessment of peptide tandem mass
spectra. Bioinformatics 2004, 20, i49–i54.
[82] Tanner, S., Pevzner, P. A., Bafna, V., Unrestrictive identifi-
cation of post-translational modifications through peptide
mass spectrometry. Nat. Protoc. 2006, 1, 67–72.
[83] Pandey, A., Podtelejnikov, A. V., Blagoev, B., Bustelo, X. R.
et al. Analysis of receptor signaling pathways by mass
spectrometry: identification of vav-2 as a substrate of the
epidermal and platelet-derived growth factor receptors.
Proc. Natl. Acad. Sci. USA 2000, 97, 179–184.
[84] Peng, J., Gygi, S. P., Proteomics: the move to mixtures.
J. Mass Spectrom. 2001, 36, 1083–1091.
[85] Bodenmiller, B., Mueller, L. N., Mueller, M., Domon, B.
et al. Reproducible isolation of distinct, overlapping
segments of the phosphoproteome. Nat. Methods 2007, 4,
231–237.
[86] Seo, J., Jeong, J., Kim, Y. M., Hwang, N. et al. Strategy
for comprehensive identification of post-translational
modifications in cellular proteins, including low
abundant modifications: application to glyceraldehyde-
3-phosphate dehydrogenase. J. Proteome Res. 2008, 7,
587–602.
[87] Schmidt, A., Gehlenborg, N., Bodenmiller, B., Mueller,
L. N. et al. An integrated, directed mass spectrometric
approach for in-depth characterization of complex peptide
mixtures. Mol. Cell. Proteomics 2008, 7, 2138–2150.
[88] Carapito, C., Klemm, C., Aebersold, R., Domon, B. et al.
Systematic LC-MS analysis of labile post-translational
modifications in complex mixtures. J. Proteome Res. 2009,
8, 2608–2614.
[89] Barsnes, H., Mikalsen, S. O., Eidhammer, I., Blind search for
post-translational modifications and amino acid substitu-
tions using peptide mass fingerprints from two proteases.
BMC Res. Notes 2008, 1, 130.
[90] Bodenmiller, B., Mueller, L. N., Pedrioli, P. G. A., Pflieger,
D. et al. An integrated chemical, mass spectrometric and
computational strategy for (quantitative) phosphoproteomics:
application to Drosophila melanogaster Kc167
cells. Mol. Biosyst. 2007, 3, 275–286.
[91] Gruhler, A., Olsen, J. V., Mohammed, S., Mortensen, P.
et al. Quantitative phosphoproteomics applied to the yeast
pheromone signaling pathway. Mol. Cell. Proteomics
2005, 4, 310–327.
[92] Zhang, Z., McElvain, J. S., De novo peptide sequencing by
two-dimensional fragment correlation mass spectrometry.
Anal. Chem. 2000, 72, 2337–2350.
[93] Olsen, J. V., Mann, M., Improved peptide identification in
proteomics by two consecutive stages of mass spectrometric
fragmentation. Proc. Natl. Acad. Sci. USA 2004, 101,
13417–13422.
[94] Ulintz, P. J., Bodenmiller, B., Andrews, P. C., Aebersold, R.
et al. Investigating MS2/MS3 matching statistics: a model
for coupling consecutive stage mass spectrometry data for
increased peptide identification confidence. Mol. Cell.
Proteomics 2008, 7, 71–87.
[95] Kelleher, N. L., Zubarev, R. A., Bush, K., Furie, B. et al.
Localization of labile posttranslational modifications by
electron capture dissociation: the case of gamma-carbox-
yglutamic acid. Anal. Chem. 1999, 71, 4250–4253.
[96] Syka, J. E. P., Coon, J. J., Schroeder, M. J., Shabanowitz, J.
et al. Peptide and protein sequence analysis by electron
transfer dissociation mass spectrometry. Proc. Natl. Acad.
Sci. USA 2004, 101, 9528–9533.
[97] Mikesh, L. M., Ueberheide, B., Chi, A., Coon, J. J. et al. The
utility of ETD mass spectrometry in proteomic analysis.
Biochim. Biophys. Acta 2006, 1764, 1811–1822.
[98] Olsen, J. V., Macek, B., Lange, O., Makarov, A. et al. Higher-
energy C-trap dissociation for peptide modification analy-
sis. Nat. Methods 2007, 4, 709–712.
[99] Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla, S. R. et al.
Age-related changes in human crystallins determined from
comparative analysis of post-translational modifications in
young and aged lens: does deamidation contribute to
crystallin insolubility? J. Proteome Res. 2006, 5,
2554–2566.
[100] Anderson, L., Hunter, C. L., Quantitative mass spec-
trometric multiple reaction monitoring assays for
major plasma proteins. Mol. Cell. Proteomics 2006, 5,
573–588.
[101] Plewczynski, D., Tkacz, A., Wyrwicz, L. S., Rychlewski, L.
et al. AutoMotif Server for prediction of phosphorylation
sites in proteins using support vector machine: 2007
update. J. Mol. Model. 2008, 14, 69–76.
[102] Kapp, E. A., Schütz, F., Connolly, L. M., Chakel, J. A. et al.
An evaluation, comparison, and accurate benchmarking of
several publicly available MS/MS search algorithms:
sensitivity and specificity analysis. Proteomics 2005, 5,
3475–3490.