-
Computational Methods For Identification Of Cyclic Peptides
Using Mass Spectrometry
Julio Ng
Bioinformatics Program, UCSD
March 26th, 2010
-
Outline
• Importance of natural products
• Mass spectrometry on cyclic peptides
• Computational methods to analyze MS data
• Demo
-
Natural Products
• In 1928, A. Fleming discovered antibiotic activity of penicillin
• The beginning of the modern era of drug discovery
Alexander Fleming
-
Natural Products
• Chemical compounds with biological activity
• Antibiotics (colistin)
• Immunosuppressors (cyclosporin)
• Antiviral agents (luzopeptin A)
• Antitumor agents (phakellistatin)
• Toxins (amanitin)
-
Natural Products
Natural Products as Sources of New Drugs over the Last 25 Years
David J. Newman* and Gordon M. Cragg
Natural Products Branch, Developmental Therapeutics Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute-Frederick, P.O. Box B, Frederick, Maryland 21702
Received October 10, 2006
This review is an updated and expanded version of two prior reviews that were published in this journal in 1997 and 2003. In the case of all approved agents the time frame has been extended to include the 25½ years from 01/1981 to 06/2006 for all diseases worldwide and from 1950 (earliest so far identified) to 06/2006 for all approved antitumor drugs worldwide. We have continued to utilize our secondary subdivision of a “natural product mimic” or “NM” to join the original primary divisions. From the data presented, the utility of natural products as sources of novel structures, but not necessarily the final drug entity, is still alive and well. Thus, in the area of cancer, over the time frame from around the 1940s to date, of the 155 small molecules, 73% are other than “S” (synthetic), with 47% actually being either natural products or directly derived therefrom. In other areas, the influence of natural product structures is quite marked, with, as expected from prior information, the antiinfective area being dependent on natural products and their structures. Although combinatorial chemistry techniques have succeeded as methods of optimizing structures and have, in fact, been used in the optimization of many recently approved agents, we are able to identify only one de novo combinatorial compound approved as a drug in this 25 plus year time frame. We wish to draw the attention of readers to the rapidly evolving recognition that a significant number of natural product drugs/leads are actually produced by microbes and/or microbial interactions with the “host from whence it was isolated”, and therefore we consider that this area of natural product research should be expanded significantly.
It is over nine years since the publication of our first,1 and three
years since the second,2 analysis of the sources of new and approved
drugs for the treatment of human diseases, both of which indicated
that natural products continued to play a highly significant role in
the drug discovery and development process.
That this influence of Nature in one guise or another has
continued is shown by inspection of the information given below,
where with the advantage of now over 25 years of data, we have
been able to refine the system, eliminating a few duplicative entries
that crept into the original data sets. In particular, as behooves
authors from the National Cancer Institute (NCI), in the specific
case of cancer treatments, we have gone back to consult the records
of the FDA and added to these, comments from investigators who
have informed us over the past two years of compounds that may
have been approved in other countries and that were not captured
in our earlier searches. These cancer data will be presented as a
stand-alone section as well as including the last 25 years of data in
the overall discussion.
As we mentioned in our 2003 review,2 the development of high-throughput
screens based on molecular targets had led to a demand
for the generation of large libraries of compounds to satisfy the
enormous capacities of these screens. As we mentioned at that time,
the shift away from large combinatorial libraries has continued,
with the emphasis now being on small, focused (100 to ∼3000)
collections that contain much of the “structural aspects” of natural
products. Various names have been given to this process, including
“Diversity Oriented Syntheses”,3-6 but we prefer to simply say
“more natural product-like”, in terms of their combinations of
heteroatoms and significant numbers of chiral centers within a single
molecule,7 or even “natural product mimics” if they happen to be
direct competitive inhibitors of the natural substrate. It should also
be pointed out that Lipinski’s fifth rule effectively states that the
first four rules do not apply to natural products or to any molecule
that is recognized by an active transport system when considering
“druggable chemical entities”.8-10
Although combinatorial chemistry in one or more of its
manifestations has now been used as a discovery source for
approximately 70% of the time covered by this review, to date, we
can find only one de novo new chemical entity (NCE) reported in
the public domain as resulting from this method of chemical
discovery and approved for drug use anywhere. This is the antitumor
compound known as sorafenib (Nexavar, 1) from Bayer, approved
by the FDA in 2005. It was known during development as BAY-43-9006
and is a multikinase inhibitor, targeting several serine/threonine
and receptor tyrosine kinases (RAF kinase, VEGFR-2,
VEGFR-3, PDGFR-beta, KIT, and FLT-3) and is in multiple clinical
trials as both combination and single-agent therapies at the present
time, a common practice once approved for one class of cancer
treatment.
As mentioned by the authors in prior reviews on this topic and
others, the developmental capability of combinatorial chemistry as
a means for structural optimization once an active skeleton has been
identified is without par. The expected surge in productivity,
however, has not materialized; thus, the number of new active
substances (NASs), also known as New Chemical Entities (NCEs),
which we consider to encompass all molecules, including biologics
and vaccines, from our data set hit a 24-year low of 25 in 2004
(though 28% of these were assigned to the ND category), with a
rebound to 54 in 2005, with 24% being N or ND and 37% being
biologics (B) or vaccines (V). Fortunately, however, research being
conducted by groups such as Danishefsky’s, Ganesan’s, Nicolaou’s,
Porco’s, Quinn’s, Schreiber’s, Shair’s, Waldmann’s, and Wipf’s is
continuing the modification of active natural product skeletons as
leads to novel agents, so in due course, the numbers of materials
developed by linking Mother Nature to combinatorial synthetic
techniques should increase. This aspect, plus the potential contributions
from the utilization of genetic analyses of microbes, will be
discussed at the end of this review.
Against this backdrop, we now present an updated analysis of
the role of natural products in the drug discovery and development
process, dating from 01/1981 through 06/2006. As in our earlier
Dedicated to the late Dr. Kenneth L. Rinehart of the University of Illinois at Urbana-Champaign for his pioneering work on bioactive natural products.
* To whom correspondence should be addressed. Tel: (301) 846-5387. Fax: (301) 846-6178.
J. Nat. Prod. 2007, 70, 461-477
10.1021/np068054v This article not subject to U.S. Copyright. Published 2007 by the Am. Chem. Soc. and the Am. Soc. of Pharmacogn. Published on Web 02/20/2007
-
Natural Products
• Searching for natural products
• Plants
• Micro-organisms
• Marine organisms
• Animals
• A large subclass of natural products are nonribosomal peptides
-
Central Dogma of Biology
NRP
-
Nonribosomal Peptide Synthetase (NRPS)
Sieber and Marahiel 2005
2. NRPS Factory
Although structurally diverse, most biologically produced peptides share a common mode of synthesis, the multienzyme thiotemplate mechanism.2,6,40 According to this model peptide bond formation takes place on large multienzyme complexes, which simultaneously represent template and biosynthetic machinery. Sequencing of genes encoding NRPSs of bacterial and fungal origin provided insights into molecular architecture and revealed a modular organization.6 A module is a distinct section of the multienzyme that is responsible for the incorporation of one specific amino acid into the final product.3,6,41 It is further subdivided into a catalytically independent set of domains responsible for substrate recognition, activation, binding, modification, elongation, and release. Domains can be identified at the protein level by characteristic highly conserved sequence motifs. Thus far, 10 different domains are known within NRPS templates which catalyze independent chemical reactions and will be introduced in more detail in the following sections. As an example to illustrate basic principles, Figure 2 shows a prototype NRPS assembly line for the cyclic lipoheptapeptide surfactin.42
The carboxy group of amino acid building blocks is first activated by ATP hydrolysis to afford the corresponding aminoacyl-adenylate. This reactive intermediate is transferred onto the free thiol group of an enzyme-bound 4′-phosphopantetheinyl cofactor (ppan), establishing a covalent linkage between enzyme and substrate. At this stage the substrate can undergo modifications such as epimerization or N-methylation. Assembly of the final product then occurs by a series of peptide bond formation steps (elongation) between the downstream building block with its free amine and the carboxy-thioester of the upstream substrate. The ppan cofactor facilitates the ordered transfer of thioester substrates between catalytically active units with all intermediates covalently tethered to the multienzyme until the product is released by the action of the C-terminal thioesterase (TE) domain (termination). This strategy minimizes side reactions as well as diffusion times. Type I polyketide synthases (PKS) and fatty acid synthases (FAS) similarly display a multienzymatic …
Figure 2. Surfactin assembly line. The multienzyme complex consists of seven modules (grey and red) which are specific for the incorporation of seven amino acids. Twenty-four domains of five different types (C, A, PCP, E, and TE) are responsible for the catalysis of 24 chemical reactions. Twenty-three reactions are required for peptide elongation, while the last domain is unique and required for peptide release by cyclization.
Sieber and Marahiel, Chemical Reviews, 2005, Vol. 105, No. 2, p. 718
-
Nonribosomal Peptide Synthetase (NRPS)
Sieber and Marahiel 2005
-
Thioesterase Domain
Sieber and Marahiel 2005
-
Cyclosporin is highly lipophilic, and 7 of its 11 amino acids are N-methylated. This high degree of methylation protects the peptide from proteolytic digestion but complicates chemical synthesis due to low coupling yields and side reactions.35 In an iron-deficient environment some bacteria such as E. coli, B. subtilis, and Vibrio cholerae synthesize and secrete iron-chelating molecules known as siderophores that scavenge Fe3+ with picomolar affinity, important for host survival.36,37 Three catechol ligands derived from 2,3-dihydroxybenzoyl (DHB) building blocks in bacillibactin, enterobactin, and vibriobactin complex iron by forming intramolecular octahedra.
Many nonribosomal peptide products presented here show distinct chemical modifications, important to specifically interact and inhibit certain cellular functions, which are essential for survival. The high toxicity of the peptide products could therefore also become a problem for the producer organism unless strategies for its own protection and immunity have been coevolved with antibiotic biosynthesis. This immunity is achieved by several strategies including efflux pumps, temporary product inactivation, and modifications of the target in the producer strain.3 The latter strategy is used by vancomycin-producing Streptomycetes by changing the D-Ala-D-Ala terminus of the peptidoglycan pentapeptide precursor to a D-Ala-D-lactate terminus, which reduces binding affinity to vancomycin 1000-fold.12
Due to their exceptional pharmacological activities, many compounds such as cyclosporin and vancomycin have been synthesized nonenzymatically.38,39 Regio- and stereoselective reactions require the use of protecting groups as well as chiral catalysts. Moreover, macrocyclization and coupling of N-methylated peptide bonds are difficult to achieve in satisfying yields, indicating an advantage of natural vs synthetic strategies. Structural peculiarities of these complex peptide products suggested early on a nucleic-acid-independent biosynthesis facilitated by multiple catalytic domains expressed as a single multidomain protein. The diverse chemical reactions mediated by distinct enzymatic units will be the focus of the following sections.
Figure 1. Natural peptidic products. A selection of nonribosomally synthesized peptides. Characteristic structural features are highlighted.
Approaches to New Antibiotics, Chemical Reviews, 2005, Vol. 105, No. 2, p. 717
Special Characteristics
• Heterocyclic elements
• D-amino acids
• Glycosylated residues
• N-methylated residues
• Non-standard amino acids
• Cyclic backbone
-
Cyclic Peptides
http://bioinfo.lifl.fr/norine/ *
out of 1122 entries in the database
*Caboche et al, 2008
-
Mass Spectrometer
Measures m/z
-
Sample Preparation (Protein Analysis)
Enzymatic Digestion and Fractionation
-
Multi-Stage Mass Spectrometry
Secondary Fragmentation
Ionized parent peptide
Mass Spectrometer
-
Fragmentation
N-terminus H...-HN-CH(R_i-1)-CO-NH-CH(R_i)-CO-NH-CH(R_i+1)-CO-…OH C-terminus
AA residue i-1, AA residue i, AA residue i+1
H+
-
Identification of Linear Mass Spectra
MS/MS spectrum
Peak labels: b, y, PM
Database of known peptides:
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, LARGE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
Database search
De novo sequencing
LARGE
-
Challenges in Identification of Cyclic Peptide MS
• Extensively modified amino acids
• Non-standard amino acids
• Cyclic backbone
• Databases cannot be readily derived from genomic data
-
Cyclic Peptide Mass Spectrum
MS1 – Mass of the intact cyclic peptide
MS2 – Mass of the intact linear peptides
MS3 – Masses of the peptide fragments
-
MS Mixture
Seglitide: somatostatin receptor antagonist, used experimentally to treat Alzheimer’s disease
-
Cyc(A+14YWKV)
Cyclic Mass Spectrum
NRP-Dereplication
NRP-Tagging
NRP-Sequencing
Identification of Cyclic Mass Spectra
-
NRP-Dereplication
• Case 1:
• There is a peptide in the database that matches the precursor mass of the spectrum.
• Is this peptide a good match for the spectrum?
• Case 2:
• No peptide in the database matches the precursor mass of the spectrum.
• Can we change a peptide in the database so it becomes a good match for the given spectrum?
-
Simplified Dereplication Problem Formulation
• Input: MS3 spectrum, a peptide sequence, and a parameter k
• Output: A new peptide sequence, k mutations away from the original peptide sequence, that best explains the experimental spectrum.
• In reality there are many peptides in the database, so the dereplication needs to be done for each peptide
PEPTIDE
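The simplified formulation above, with k = 1, can be sketched as follows. This is a minimal illustration and not the NRP-Dereplication implementation: it assumes integer residue masses, exact peak matching, and a plain shared-peak count as the score. The names `cyclic_spectrum` and `dereplicate_k1` are hypothetical, and the spectrum is simulated from the tyrocidine C1 masses rather than measured.

```python
def cyclic_spectrum(masses):
    """Masses of all proper contiguous sub-peptides of a cyclic peptide."""
    n = len(masses)
    out = set()
    for start in range(n):              # every residue can start a fragment
        total = 0
        for length in range(1, n):      # fragments of 1 .. n-1 residues
            total += masses[(start + length - 1) % n]
            out.add(total)
    return out


def dereplicate_k1(peaks, parent_mass, known):
    """Place the parent-mass offset on each residue of `known` in turn and
    keep the candidate that shares the most peaks with the spectrum."""
    delta = parent_mass - sum(known)    # mass of the single mutation
    observed = set(peaks)
    best = None
    for pos in range(len(known)):
        candidate = list(known)
        candidate[pos] += delta
        score = len(observed & cyclic_spectrum(candidate))
        if best is None or score > best[0]:
            best = (score, pos, candidate)
    return best


tyrocidine_c = [99, 114, 113, 147, 97, 186, 186, 114, 128, 163]
tyrocidine_c1 = [99, 128, 113, 147, 97, 186, 186, 114, 128, 163]
# Assumption: a noise-free simulated spectrum stands in for experimental data.
simulated_peaks = cyclic_spectrum(tyrocidine_c1)
score, position, mutated = dereplicate_k1(
    simulated_peaks, sum(tyrocidine_c1), tyrocidine_c
)
```

With noise-free input the offset of 14 Da lands on the second residue (index 1), turning 114 into 128. Real spectra would need a mass tolerance and an intensity-aware score.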
-
Tyrocidine A 99, 114, 113, 147, 97, 147, 147, 114, 128, 163
Tyrocidine A1 99, 128, 113, 147, 97, 147, 147, 114, 128, 163
Tyrocidine B 99, 114, 113, 147, 97, 186, 147, 114, 128, 163
Tyrocidine B1 99, 128, 113, 147, 97, 186, 147, 114, 128, 163
Tyrocidine C 99, 114, 113, 147, 97, 186, 186, 114, 128, 163
Tyrocidine C1 99, 128, 113, 147, 97, 186, 186, 114, 128, 163
Tyrocidine Family (Bacillus brevis)
-
Dereplication (k = 1)
A B C D E F
-
A B C D E F
A AB
NRP-Dereplication (k = 1)
-
A B C D E F
A AB
Δ
ABC-Δ ABCD-Δ ABCDE-Δ
NRP-Dereplication (k = 1)
-
A B C D E F
A AB
Δ
~DEF ~EF ~F
NRP-Dereplication (k = 1)
-
A B C D E F
A AB
Δ
~DEF ~EF ~F
FA E FA B D E F
A B C D E F
NRP-Dereplication (k = 1)
-
Dereplicating tyrocidine C and C1
• Experimental spectrum:
• Tyrocidine C1
• Sequence:
• Tyrocidine C VOLFPWWNQY
• Offset:
• 14 Daltons (O -> K)
Panels a) and b): coverage per amino acid (number shown in the center of the graph: 32)
Amino acid   V-1   O-2   L-3   F-4   P-5   W-6   W-7   N-8   Q-9   Y-10
Coverage     9.0   2.5   7.0   9.0   13.0  16.0  19.0  19.5  18.0  15.0

Panel c): coverage histograms for three further cases
Peptide        Tyrocidine A    Tyrocidine B    Tyrocidine B
Sequence       VOLFPFFNQY      VOLFPWFNQY      VOLFPWFNQY
Dereplicated   Tyrocidine A1   Tyrocidine B1   Tyrocidine C
Sequence       VKLFPFFNQY      VKLFPWFNQY      VOLFPWWNQY

Figure A-4: Dereplication results. a) NRP-Dereplication output for experimental spectrum of tyrocidine C1 (VKLFPWWNQY) given peptide sequence of tyrocidine C (VOLFPWWNQY). Concentric red-gray circles represent 0-correlated subpeptides (with peptide shown red and its complement shown gray) and δ-correlated subpeptides (with peptide shown gray and its complement shown red). Given this coloring convention, the amino acid coverage (number of red arcs covering an amino acid) represents supporting evidence that an amino acid did not change from the known to the unknown compound. The thick black circle separates 0-correlated subpeptides (shown inside) from δ-correlated subpeptides (shown outside). The outer counts represent the coverage for a given amino acid by red arcs and reveal the differing amino acid (O) as the amino acid with minimum coverage (2.5 vs. 7 for the next runner-up). The counts are normalized by the number of subpeptides per peak. For example, if a peak has two alternative subpeptide annotations, it will contribute 1/2 to the coverage. The width of the arcs is proportional to this weighting factor. The number in the center of the graphs is the total number of correlated subpeptides. b) Alternative representation of a) as a histogram that reveals the changed amino acid O. c) Additional dereplication results for the tyrocidine family.
-
Dereplication Results
-
NRP-Dereplication Results on NORINE
Compound | Top Match(es) | Dereplicated Compound | Score
Destruxin A | Destruxin A[+14] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:1(3)-OH(2)[+14] | 0.45
Destruxin A | HydroxyDestruxin B[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.3)[-18] | 0.45
Destruxin A | Destruxin D[-32] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2)-CA(4)[-32] | 0.45
Destruxin A | Destruxin E diol[-20] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3.4)[-20] | 0.45
Destruxin A | Destruxin C[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.4)[-18] | 0.45
Destruxin A | Destruxin F[-4] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)[-4] | 0.45
Destruxin A | Destruxin B[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, Hiv[-2] | 0.45
Destruxin A | Destruxin E[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2)-Ep(3)[-2] | 0.45
Destruxin A | Destruxin E chlorohydrin[-38] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)-Cl(4)[-38] | 0.45
Tyrocidine C | Tyrocidine C | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn, Leu | 0.45
Tyrocidine C | Tyrocidine B[+39] | D-Phe, Pro, Trp, D-Phe[+39], Asn, Gln, Tyr, Val, Orn, Leu | 0.45
Tyrocidine C | Tyrocidine D[-23] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Trp[-23], Val, Orn, Leu | 0.45
Tyrocidine B1 | Tyrocidine B[+14] | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.44
Tyrocidine C1 | Tyrocidine C[+14] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.40
Tyrocidine A1 | Tyrocidine A[+14] | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.37
Tyrocidine B | Tyrocidine B | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
Tyrocidine B | Tyrocidine A[+39] | D-Phe, Pro, Phe[+39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
Tyrocidine B | Tyrocidine C[-39] | D-Phe, Pro, Trp, D-Trp[-39], Asn, Gln, Tyr, Val, Orn, Leu | 0.37
Tyrocidine A | Tyrocidine A | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
Tyrocidine A | Tyrocidine B[-39] | D-Phe, Pro, Trp[-39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
Compound 879 | Neoviridogrisein | (Thr+Hpa), NMe-Ph-Gly, Ala, NMe-bMe-Leu, NMe-Gly, D-4OH-Pro, D-Leu | 0.28
H8405 | Beauverolide Ka[-18] | C10:0-Me(4)-OH(3), Trp, Phe[-18], D-aIle | 0.27
BQ123 | Halipeptin B[-20] | C10:0-Me(2.2.4)-OH(3.7), Ala, aMe-Cys[-20], NMe-OH-Ile, Ala | 0.26
H3526 | hymenistatin I | Pro, Tyr, Val, Pro, Leu, Ile, Ile, Pro | 0.25
H3526 | hymenamide G | Pro, Tyr, Val, Pro, Leu, Ile, Leu, Pro | 0.25
Cyanopeptide X | Majusculamide C[-30] | Map, Ala, Ibu, NMe-OMe-Tyr[-30], NMe-Val, Gly, NMe-Ile, Gly, Hmp | 0.23
Cyanopeptide X | Dolastatin 11[-30] | Gly, NMe-Val, NMe-OMe-Tyr[-30], Ibu, Ala, Map, Hmp, Gly, NMe-Leu | 0.23
Microcystin LR | Microcystin LR | D-Ala, Leu, D-bMe-Asp, Arg, Adda, D-Glu, NMe-Dha | 0.20
Microcystin LR | [Dha7]microcystin-LR[+14] | D-Ala[+14], Leu, D-bMe-Asp, Arg, Adda, D-Glu, dh-Ala | 0.20
Microcystin LR | Microcystin LAib[+71] | D-Ala, Leu, D-bMe-Asp[+71], Aib, Adda, D-Glu, NMe-Dha | 0.19
Seglitide | Microsclerodermin F[-3] | C12:3(7.9.11)-Me(6)-OH(2.4.5)-NH2(3)-Ph(12), Pyr[-3], NMe-Gly, D-Trp, Gly, OH-4Abu | 0.13
Cyclomarin C | Aureobasin C[-60] | D-Hmp, NMe-Val, Phe, NMe-Phe, Pro, Val, NMe-Val, Leu, bOH-NMe-Val[-60] | 0.13
Cyclomarin A | Aureobasidin F[-44] | D-Hmp, NMe-Val[-44], Phe, NMe-Phe, Pro, aIle, Val, Leu, bOH-NMe-Val | 0.12
Dehydrocyclomarin A | Hymenamide J[-74] | Pro, Tyr, Asp, Phe, Trp[-74], Lys, Val, Tyr | 0.12
Dehydrocyclomarin C | PF1022E[+44] | D-Lac, NMe-Leu, 4OH-D-Ph-Lac, NMe-Leu, D-Lac, NMe-Leu[+44], D-Ph-Lac, NMe-Leu | 0.11

Table 1: NRP-Dereplication results. The Score is defined as the product of the fraction of explained intensity and the fraction of explained fragment masses of a dereplicated peptide. Dereplicated matches have monomers (shown in red) where the candidate mutation is placed with the integer mass of the offset enclosed in square brackets (Dereplicated Compound column). See Table A-3 for the complete list of monomers. Compounds that are in the database (tyrocidine A, B, C, H3526, microcystin LR and compound 879) or have a closely related compound (tyrocidines A1, B1, C1, cyanopeptide X, destruxin A) have higher scores than compounds that are not in the database (seglitide, cyclomarin A, C and dehydrocyclomarin A, C). Dereplicated compounds have the mass difference of the experimental spectrum and the mass of the peptide enclosed in square brackets next to their name (Top Match(es) column). The compounds are sorted by score and the double horizontal line separates compounds in the database (or that have a close match) from the compounds that are not in the database (lower part of the table). Compounds H8405 and BQ123 (representing the shortest peptides in the sample) returned incorrect matches (false positives). However, a close examination of the results revealed that these false positives are nevertheless correlated with the correct peptide sequences. For H8405, the correct sequence is [113, 71, 129, 186, 113], while the database match is [184, 186, 129, 113]. For BQ123, the correct masses are [113, 186, 115, 97, 99], while the database match is [71, 228, 71, 97, 143].
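The Score defined in the Table 1 caption (fraction of explained intensity times fraction of explained fragment masses) could be computed along these lines. This is a sketch under assumptions: the function name `dereplication_score`, the 0.5 Da tolerance, and the reading of "explained fragment masses" as theoretical masses that have a matching peak are all illustrative choices, not taken from the slides.

```python
def dereplication_score(peaks, theoretical_masses, tol=0.5):
    """peaks: list of (mass, intensity) pairs from the experimental spectrum.
    theoretical_masses: fragment masses predicted for the candidate peptide.
    tol: matching tolerance in Da (assumed value)."""
    theoretical = list(theoretical_masses)
    total_intensity = sum(intensity for _, intensity in peaks)
    if not theoretical or total_intensity == 0:
        return 0.0

    def close(a, b):
        return abs(a - b) <= tol

    # Intensity carried by peaks that match some theoretical fragment.
    explained_intensity = sum(
        intensity for mass, intensity in peaks
        if any(close(mass, t) for t in theoretical)
    )
    # Number of theoretical fragments that are matched by some peak.
    explained_masses = sum(
        1 for t in theoretical if any(close(mass, t) for mass, _ in peaks)
    )
    return (explained_intensity / total_intensity) * (
        explained_masses / len(theoretical)
    )


# Toy numbers: 70% of the intensity and 2 of 4 fragments are explained.
example_peaks = [(100.0, 10.0), (200.0, 30.0), (300.0, 60.0)]
example_score = dereplication_score(example_peaks, [100.0, 300.0, 400.0, 500.0])
```

Multiplying the two fractions penalizes candidates that explain a few intense peaks but leave most predicted fragments unobserved, and vice versa.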
-
NRP-Dereplication
• Compound 879 was thought to be novel, but the compound neoviridogrisein was in NORINE*
• Cyanopeptide X was unknown in 2007, but majusculamide C was in NORINE*. The compound was desmethoxymajusculamide C
*Caboche et al, 2008
-
Cyclic Peptide Identification Problem (De novo reconstruction)
• Input: MS3 spectrum of a cyclic peptide
• Output: A ranked list of peptide reconstructions sorted by a scoring function
Similar to the Partial Digest Problem described by Skiena et al 1990. Shown to be NP-Hard for noisy inputs (Cielebak et al 2005)
Similar to the problem of sequencing linear peptides with internal fragments. Shown to be NP-Hard (Xu and Ma 2006)
-
Tag Generation Problem
NRP-Tagging
• Input: MS3 spectrum of a cyclic peptide
• Output: A ranked list of gapped sequences that explain the MS3 spectrum, sorted by a scoring function
99, 114, 113, 147, 97, 147, 147, 114, 128, 163
99, 114, [113+147], [97+147], 147, 114, 128, 163
99, 114, 260, 244, 147, 114, 128, 163
-
NRP-Tagging
-
A B C D E
A B C DF
A B C DF
F
E
E
NRP-Tagging
-
A B C D E
A B C DF
A B C DF
F
E
E
NRP-Tagging
-
Tag Generation
A B C D E
A B C DF
A B C DF
F
E
F
E
E
NRP-Tagging
-
A B C D E
A B C DF
A B C DF
F
E
F
E
E
A B C D E F
NRP-Tagging
-
bins = []
For each peak Pi
  For each peak Pj (i < j)
    peak_diff = Pj - Pi
    bins[peak_diff]++
Input: A mass spectrum
Output: A histogram of mass difference counts for a range of masses
Pevzner et al 2001
Single Self-Convolution
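A runnable version of the pseudocode above might look like this. It is a minimal sketch assuming integer peak masses, so that differences can be used directly as histogram keys; real data would need binning with a mass tolerance. The function name `self_convolution` is illustrative.

```python
from collections import Counter


def self_convolution(peaks):
    """Histogram of pairwise mass differences Pj - Pi over all i < j."""
    peaks = sorted(peaks)
    bins = Counter()
    for i, p_i in enumerate(peaks):
        for p_j in peaks[i + 1:]:
            bins[p_j - p_i] += 1
    return bins


# Toy spectrum: the differences 85 and 248 each occur twice.
toy_bins = self_convolution([0, 85, 248, 333])
```

Differences that recur often are candidates for residue masses, or for sums of neighboring residue masses.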
-
• Input: A mass spectrum
• Output: A histogram of counts of pairs of consecutive mass differences for a range of masses
bins = []
For each peak Pi
For each peak Pj (i < j)
For each peak Pk (j < k)
peak_diff_1 = Pj - Pi
peak_diff_2 = Pk - Pj
bins[peak_diff_1, peak_diff_2]++
Double Self-Convolution
-
• Double self-convolution, keeping track of the starting peak of each peak triplet
A B C D E
A B C DF
A B C DF
F
E
F
E
E
bins[B, C] = 3
NRP-Tagging
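The double self-convolution with start tracking can be sketched as below. Assumptions, none of which come from the slides: integer peak masses, both differences restricted to 57 to 200 Da (roughly one residue), and a dictionary named `tags` that maps each pair of differences to the list of starting peaks.

```python
from collections import defaultdict


def double_convolution(peaks, min_mass=57, max_mass=200):
    """Map (diff_1, diff_2) -> starting peaks of all triplets Pi < Pj < Pk
    with diff_1 = Pj - Pi and diff_2 = Pk - Pj, both within one residue mass."""
    peaks = sorted(peaks)
    tags = defaultdict(list)
    n = len(peaks)
    for i in range(n):
        for j in range(i + 1, n):
            diff_1 = peaks[j] - peaks[i]
            if not (min_mass <= diff_1 <= max_mass):
                continue
            for k in range(j + 1, n):
                diff_2 = peaks[k] - peaks[j]
                if min_mass <= diff_2 <= max_mass:
                    tags[(diff_1, diff_2)].append(peaks[i])
    return tags


# Toy spectrum: the tag (99, 114) starts at peak 0 and again at peak 1000.
toy_tags = double_convolution([0, 99, 213, 326, 1000, 1099, 1213])
```

The length of each list plays the role of the bin count on the slide (for example bins[B, C] = 3), while the stored starting peaks are what the later gapped-peptide step consumes.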
-
bins = double_convolution(S)
for m_a, m_b in bins
  starts = starting positions of bin[m_a, m_b]
  for all combinations c_1, ..., c_n that are a subset of starts
    m_1 = c_1
    m_i = c_i - c_j (j = i - 1)
    r = parent - c_n - m_a - m_b
    tag = [m_1, ... m_n, m_a, m_b, r]
    score(tag), store(tag)
A B C DE F
NRP-Tagging
-
bins = double_convolution(S)
for m_a, m_b in bins
  starts = starting positions of bin[m_a, m_b]
  for all combinations c_1, ..., c_n that are a subset of starts
    m_1 = c_1
    m_i = c_i - c_j (j = i - 1)
    r = parent - c_n - m_a - m_b
    tag = [m_1, ... m_n, m_a, m_b, r]
    score(tag), store(tag)
Masses: m_1 m_2 m_3 m_a m_b r
Positions: c_1 c_2 c_3 parent
NRP-Tagging
-
A B C D E
A B C D
A B CD
E
E
A B1 CD E B2
Gap Closing
-
Input: MS3 spectrum S of an (unknown) cyclic peptide, a minimum tag frequency, a recursion depth, and a scoring function score(S, peptide).
Output: Ranked list of candidate gapped peptides

1. Find all tags in S:
  tags(x, y) = {} for all 0 ≺ x, y ≺ 200
  for all s, s′, s″ ∈ S such that s ≺ s′ ≺ s″ do
    mass1 = s′ − s
    mass2 = s″ − s′
    add s to tags(mass1, mass2)
  end for

2. Generate gapped peptides from frequent tags:
  gappedPeptides = {}
  for all mass1, mass2 with |tags(mass1, mass2)| > frequency do
    for all {0 ≺ s_1 ≺ . . . ≺ s_n ≺ mass(S) − mass1 − mass2} ⊆ tags(mass1, mass2) do
      gappedPeptide = [m_1, ..., m_n, mass1, mass2, m_n+1] where m_i = s_i − s_i−1 for 2 ≤ i ≤ n, m_1 = s_1 and m_n+1 = mass(S) − mass1 − mass2 − s_n
      add gappedPeptide to gappedPeptides
    end for
  end for

3. Iteratively attempt to split masses larger than 200 Da:
  results = depth top-scoring peptides from gappedPeptides
  candidates = results
  repeat
    sequences = {}
    for all gappedPeptide in candidates do
      intermediates = {}
      for all mass > 200 Da in gappedPeptide do
        for all mass1 such that 0 ≺ mass1 ≺ 200 Da do
          split mass in gappedPeptide into (mass1, mass − mass1) and add the resulting peptide to intermediates
        end for
      end for
      add depth top-scoring peptides from intermediates to sequences
    end for
    candidates = sequences
    add sequences to results
  until sequences is empty
  return results

Figure A-3: NRP-Tagging algorithm. tags(mass1, mass2) contains the starting positions of all tags formed by amino acids with masses mass1 and mass2. The notation |tags(mass1, mass2)| refers to the number of locations of a 2-amino acid tag with masses (mass1, mass2). The notation x ≺ y denotes that y − x ≥ 57 (57 Da represents the mass of the smallest amino acid Gly). For a given set of starting positions in tags(mass1, mass2), all possible combinations ({s_1 ≺ . . . ≺ s_n} ⊆ tags(mass1, mass2)) of starting positions of tags are considered during the gapped peptide reconstruction. The precursor mass of S is denoted as mass(S). While the pseudocode above attempts to split each mass > 200 Da into all possible pairs (mass1, mass − mass1) with 0 ≺ mass1 ≺ 200, the real implementation only considers mass1 as a splitting mass if it is supported by some peaks in S. There are 2 threshold parameters, frequency (minimum number of occurrences of a tag in S), and depth (limits the number of high scoring gapped peptides per iteration of the mass splitting). The scoring function score(S, peptide) is used to rank the intermediate peptides and select those for the next iteration.
NRP-Tagging
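Step 2 of the algorithm in Figure A-3 (generating gapped peptides from frequent tags) can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: masses are integers, `tags` is a plain dictionary from (mass1, mass2) to starting positions, scoring and the mass-splitting step 3 are left out, and the function name `gapped_peptides` is hypothetical.

```python
from itertools import combinations


def gapped_peptides(tags, parent_mass, frequency=1, min_gap=57):
    """Enumerate gapped peptides [m_1, ..., m_n, mass1, mass2, m_n+1]."""
    results = []
    for (mass1, mass2), starts in tags.items():
        if len(starts) <= frequency:        # keep only frequent tags
            continue
        limit = parent_mass - mass1 - mass2
        # Starting positions must leave room for at least one residue
        # before the first start and after the tag.
        usable = sorted(
            s for s in set(starts) if s >= min_gap and limit - s >= min_gap
        )
        for size in range(1, len(usable) + 1):
            for combo in combinations(usable, size):
                gaps = [combo[0]] + [b - a for a, b in zip(combo, combo[1:])]
                if any(g < min_gap for g in gaps):
                    continue                # consecutive starts too close
                results.append(gaps + [mass1, mass2, limit - combo[-1]])
    return results


# Toy input: one tag (100, 110) seen at three starting positions.
toy_results = gapped_peptides({(100, 110): [60, 200, 400]}, parent_mass=1000,
                              frequency=2)
```

Every gapped peptide produced this way sums to the parent mass. The enumeration is exponential in the number of starting positions per tag, which is one reason the frequency and depth thresholds matter in practice.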
-
Compound Best reconstruction RankTyrocidine A 99, 114, 113, 147, 97, 147, 147, 114, 128, 163 3
Tyrocidine A1 99, 128, 113, 147, 97, 147, 147, 114, 128, 163 16
Tyrocidine B 99, 114, 113, 147, 97, 186, 147, 114, 128, 163 4
Tyrocidine B1 99, 128, 113, 147, 97, 186, 147, 114, 128, 163 1
Tyrocidine C 99, 114, 113, 147, 97, 186, 186, 114, 128, 163 4
Tyrocidine C1 99, 128, 113, 147, 97, 186, 186, 114, 128, 163 1
Seglitide 85, 163, 186, 128, 99, 147 1
Cyanopeptide X 57, 113, 161, 141, 71, 113, [114+57], 127 1
BQ123 113, 186, 115, 97, 99 2
Destruxin A 113, 113, 85, 71, [98+97] 2
H3526 97, 97, 163, 99, {97+1}, 113, {113-1}, 113 10H8405 129, 71, 113, 113, 186 2
Microcystin LR {[83+71]+1}, {113-1}, {129-1}, {156+1}, 313, 129 27Compound 879 113, 113, , {147+18}, 71, 141, 71 7Cyclomarin A 127, 139, , 143, 71, [177+99] 10
Dehydrocyclomarin A 127, 139, 268, 143, 71, 177, 99 27
Cyclomarin C 127, 139, 270, {143+32}, {[71+177]-32}, 99 >40Dehydrocyclomarin C Not generated -
Table 2: NRP-Tagging results. The reconstructed NRPs are represented as sequences of masses. For the
sake of brevity, masses are rounded to integers; e.g., the NRP-Tagging reconstruction for Tyrocidine A is 99.06,
114.07, 113.07, 147.06, 97.05, 147.05, 147.05, 114.06, 128.03, 163.06, which is more accurate than the integer
representation given in the first row of the table. Composite masses (2 or more amino acids) are enclosed
in square brackets. For example, [114+57] in cyanopeptide X means that NRP-Tagging returned 171 as
the mass of an amino acid instead of the correct masses 114 and 57 (Hmp and Gly). Incorrect masses
are enclosed in curly brackets and expressed in terms of their offsets from the correct masses. For example,
{97+1} in H3526 means that NRP-Tagging returned 98 while the correct mass is 97 (Pro). In this case the
isotopic peak (rather than a b-ion) was chosen as the best spectral interpretation. Lastly, cases in which the
algorithm splits a mass are enclosed in angle brackets, with the correct mass followed by the masses returned
by the algorithm. A single mass 286 in cyclomarin A is split as 129, 157. A single mass 222-18 (water loss)
in compound 879 is split into 100 and 104. The reconstructions given in the table represent a complete
reconstruction of the compound, or a reconstruction with composite masses and/or masses with a known
offset. The “Best reconstruction” column presents the high-scoring peptide with the specified rank (“Rank”
column) that is selected from the list of all top-scoring peptides as the most similar to the correct peptide.
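The bracket notation in the caption can be read as plain arithmetic. The snippet below is a minimal sketch of that reading, not part of NRP-Tagging: the helper name `returned_mass` is made up for illustration, and it handles only the square- and curly-bracket entries (the angle-bracket split notation is left out).

```python
import re

def returned_mass(entry: str) -> int:
    """Evaluate one table entry to the integer mass the algorithm reported.

    Square brackets mark a composite mass and curly brackets mark an offset
    from the correct mass; in both cases the signed sum of the numbers inside
    is the mass that was actually returned.
    """
    stripped = re.sub(r"[\[\]{}]", "", entry)      # drop the annotation brackets
    terms = re.findall(r"[+-]?\d+", stripped)      # signed integer terms
    return sum(int(term) for term in terms)

print(returned_mass("[114+57]"))     # composite of two residues
print(returned_mass("{97+1}"))       # correct mass 97 reported one unit too high
print(returned_mass("{[83+71]+1}"))  # composite mass with an offset
```

With this reading, [114+57] evaluates to 171 and {97+1} to 98, matching the two examples given in the caption.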
NRP-Tagging Results
-
NRP-Sequencing
• De novo sequencing of cyclic peptide spectra using self-alignment
-
[Figure: 6 linear theoretical spectra of seglitide (residues A+14, Y, W, K, V, F)]
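A cyclic peptide with six residues can be opened at six positions, which is where the six linear theoretical spectra come from. The sketch below illustrates this under simplifying assumptions: it uses the integer residue masses listed for seglitide in the results tables (85, 163, 186, 128, 99, 147), lists only cumulative prefix masses, and ignores ion-type offsets such as the added proton. The function names are illustrative.

```python
from itertools import accumulate

# Integer residue masses for seglitide as listed in the results tables.
SEGLITIDE = [("A+14", 85), ("Y", 163), ("W", 186), ("K", 128), ("V", 99), ("F", 147)]

def rotations(residues):
    """All linearizations of a cyclic peptide, one per starting residue."""
    return [residues[i:] + residues[:i] for i in range(len(residues))]

def prefix_masses(linear):
    """Cumulative residue masses of a linear peptide, excluding the full mass."""
    masses = [mass for _, mass in linear]
    return list(accumulate(masses))[:-1]

for linear in rotations(SEGLITIDE):
    names = " ".join(name for name, _ in linear)
    print(f"{names:20s} {prefix_masses(linear)}")
```

The union of the six prefix lists is the set of sub-peptide masses expected from the cyclic peptide, which is why a cyclic spectrum looks like several linear spectra laid on top of each other.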
-
[Figure: 6 linear theoretical spectra of seglitide (residues A+14, Y, W, K, V, F). Prefixes are horizontal lines; suffixes are vertical lines.]
-
[Figure: theoretical spectrum of seglitide (residues A+14, Y, W, K, V, F) without annotations]
-
[Figure: theoretical spectra of seglitide (residues A+14, Y, W, K, V, F), showing the sequence Y W K V F (YWKVF). Offset: 85]
-
De novo sequence (anti-symmetric path; Chen et al. 2001)
-
NRP-Sequencing
• Self-alignment of spectrum using the highest scoring self-convolution value
• Use standard de novo reconstruction algorithms for linear peptide sequencing
• Rescore candidate reconstructions using MSn data
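The self-alignment step can be illustrated on an idealized spectrum. The sketch below is a toy under stated assumptions, not the NRP-Sequencing implementation: the spectrum is the noise-free set of integer sub-peptide masses of seglitide, Conv(S, x) is taken to be the number of peaks s for which s + x is also a peak, the auto-alignment spectrum is taken to be those peaks s, and `min_offset=57` (the glycine residue mass) is an assumed lower bound on useful offsets.

```python
SEGLITIDE = [85, 163, 186, 128, 99, 147]  # integer residue masses from the tables

def cyclic_spectrum(masses):
    """Masses of all contiguous sub-peptides (lengths 1..n-1) of a cyclic peptide."""
    n = len(masses)
    doubled = masses + masses  # lets a slice wrap around the ring
    return {sum(doubled[i:i + k]) for i in range(n) for k in range(1, n)}

def conv(spectrum, x):
    """Self-convolution value: number of peaks s for which s + x is also a peak."""
    return sum(1 for s in spectrum if s + x in spectrum)

def auto_alignment(spectrum, x):
    """Peaks that have a partner peak shifted by x."""
    return {s for s in spectrum if s + x in spectrum}

def top_offsets(spectrum, t, min_offset=57):
    """The t offsets with the highest self-convolution value."""
    offsets = range(min_offset, max(spectrum))
    return sorted(offsets, key=lambda x: conv(spectrum, x), reverse=True)[:t]

spectrum = cyclic_spectrum(SEGLITIDE)
print(conv(spectrum, 85))
print(sorted(auto_alignment(spectrum, 85)))
print(top_offsets(spectrum, t=2))
```

On this idealized input, the peaks aligned at offset 85 include the prefix masses 163, 349, 477, 576 and the suffix masses 147, 246, 374, 560 of YWKVF. Other residue masses also score well here because the toy spectrum is complete and unweighted; in an experimental spectrum peak intensities and missing peaks make some offsets stand out more than others.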
-
[Figure: histogram of Count (y-axis) against Scores (x-axis)]
Figure 4: NRP-Dereplication score distribution (search of compound 879 against NORINE) features excellent
separation between correct (score 0.28) and false (scores below 0.05) hits.
Compound | Best reconstruction | Rank
Tyrocidine A | [163+99], 114, [113+147], [147+147], 147, [114+128] | 1
Tyrocidine A1 | [163+99], 128, [113+147], [147+147], 147, [114+128] | 1
Tyrocidine B | [163+99], 114, [113+147], 97, [186+147], 114, 128 | 14
Tyrocidine B1 | 99, 128, [113+147], [97+186], 147, [114+128] | 1
Tyrocidine C | 113, 147, 97, 186, 186, 114, [128+163], [99+114] | 125
Tyrocidine C1 | [163+99], [128+113], 147, [97+186], 186, [114+128] | 1
Seglitide | 85, [163+186], 128, 99, 147 | 1
Cyanopeptide X | 57, 113, 161, 141, 71, [113+114+57], 127 | 1
BQ123 | 113, 186, 115, [97+99] | 1
H3526 | 97, [97+163], 99, [97+113], 113, 113 | 2
H8405 | 129, 71, 113, 113, 186 | 1
Table 2: NRP-Sequencing results. The reconstructed NRPs are represented as sequences of masses. For the
sake of brevity, masses are rounded to integers. Composite masses (2 or more amino acids) are enclosed in square
brackets. For example, [163+99] in tyrocidine A means that NRP-Sequencing returned 262, the composite mass
of 163 and 99 (Tyr and Val). The best reconstruction is the highest-scoring completely correct (i.e., no incorrect
b-ions) de novo sequence returned by NRP-Sequencing.
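One way to read "completely correct" for a reconstruction with composite masses is that every cumulative (prefix) mass of the reconstruction is also a prefix mass of the true peptide opened at some position. The sketch below checks that reading on the seglitide row; it is an interpretation for illustration only, with a hypothetical function name, integer masses, and no ion-type offsets.

```python
from itertools import accumulate

def has_no_incorrect_prefixes(reconstruction, correct):
    """True if all prefix masses of `reconstruction` occur among the prefix
    masses of some rotation of the cyclic peptide `correct`."""
    recon_prefixes = set(accumulate(reconstruction))
    for i in range(len(correct)):
        rotation = correct[i:] + correct[:i]
        if recon_prefixes <= set(accumulate(rotation)):
            return True
    return False

correct = [85, 163, 186, 128, 99, 147]          # seglitide, from the NRP-Tagging table
composite = [85, 163 + 186, 128, 99, 147]       # 85, [163+186], 128, 99, 147
shifted = [85, 164, 185, 128, 99, 147]          # made-up wrong reconstruction

print(has_no_incorrect_prefixes(composite, correct))
print(has_no_incorrect_prefixes(shifted, correct))
```

A composite mass only drops a prefix mass from the reconstruction, so it passes the check, whereas a shifted mass introduces a prefix mass that the true peptide does not have.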
masses. For the experimental spectrum of seglitide, the auto-alignment spectrum S_85 contains all prefix and suffix (b/y) ions for the peptide YWKVF (x = 85 corresponds to the most prominent peak in the auto-convolution Conv(S, x)).
• De novo peptide sequencing. We solve the de novo peptide sequencing problem for the auto-alignment spectrum using the anti-symmetric path algorithm [4]. NRP-Sequencing generates all de novo peptide reconstructions of S_x (for each of the top t auto-convolution masses x, where t is a parameter) with scores above p · Score(P), where p is a parameter and P is the highest-scoring de novo reconstruction of S_x. We observed that t = 2 works well in most cases.
• Re-ranking candidate peptides using MSn spectra. NRP-Sequencing further scores each candidate peptide by matching all MSn spectra against it and re-ranks the candidates according to
their matches to the MSn spectra. Peaks in de novo reconstructions were scored against MSn spectra
using a likelihood scoring scheme as described in [5]. De novo sequences derived from TOF MS3 spectra
were also cyclized and scored against the MS3 spectrum; MS3/MSn match scores and matched peak
intensities were combined using linear discriminant analysis.
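The candidate-selection rule in the steps above (keep reconstructions scoring above p · Score(P), then re-rank by a second score) can be sketched as follows. This is only an illustration of the thresholding and re-ranking logic: the candidate labels and all score values are made-up toy numbers, the function names are hypothetical, and the real scoring (likelihood scoring and linear discriminant analysis) is not reproduced.

```python
def keep_candidates(scored, p):
    """Keep (candidate, score) pairs whose score is above p times the best score.

    Assumes a positive best score and 0 < p < 1, so the best candidate is kept.
    """
    best = max(score for _, score in scored)
    return [(cand, score) for cand, score in scored if score > p * best]

def rerank(candidates, msn_score):
    """Order candidates by a second, MSn-based score (highest first)."""
    return sorted(candidates, key=lambda item: msn_score[item[0]], reverse=True)

# Toy de novo scores for three hypothetical reconstructions of one S_x.
de_novo = [("cand_a", 10.0), ("cand_b", 8.0), ("cand_c", 4.0)]
# Toy MSn match scores for the same candidates.
msn = {"cand_a": 0.2, "cand_b": 0.7, "cand_c": 0.9}

kept = keep_candidates(de_novo, p=0.75)
print(kept)
print(rerank(kept, msn))
```

In this toy run the weakest de novo candidate is dropped before re-ranking, so a high MSn score cannot rescue it; lowering p trades a longer candidate list for a better chance of retaining the correct peptide.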
The pseudocode for NRP-Sequencing is presented in Figure 6. Results of NRP-Sequencing are in Table 2.
NRP-Sequencing Results
-
Conclusions
De novo Reconstructions
-
Conclusions
A de novo Reconstruction
-
Conclusions
Combining Reconstructions
-
Acknowledgments
• Computer Science Department, UCSD: Nuno Bandeira and Pavel Pevzner
• Department of Chemistry and Biochemistry, UCSD: Wei-Ting Liu, Dario Meluzzi, Majid Ghassemian and Pieter Dorrestein
• Scripps Institution of Oceanography, UCSD: Marcelino Gutierrez, Thomas Simmons, Andrew Schultz, Bradley Moore, William Gerwick, William Fenical and Katherine Maloney.
• Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD: Bradley Moore, William Gerwick and Pieter Dorrestein.
• Department of Chemistry, UCSC: Roger Linington
• Computer Science Laboratory of Lille, USTL: Gregory Kucherov and the NORINE team
-
Demo
• http://lol.ucsd.edu/ms-cpa_v1/Input.py (annotation only)
• http://rofl.ucsd.edu/nrp (annotation and identification)
• http://lmao.ucsd.edu/nrp (alpha site)