
Computational Methods For Identification Of Cyclic Peptides

Using Mass Spectrometry

Julio Ng
Bioinformatics Program, UCSD

March 26th, 2010

Outline

• Importance of natural products
• Mass spectrometry on cyclic peptides
• Computational methods to analyze MS data
• Demo

Natural Products

• In 1928, A. Fleming discovered antibiotic activity of penicillin

• The beginning of the modern era of drug discovery

Alexander Fleming

Natural Products

• Chemical compounds with biological activity
• Antibiotics (colistin)
• Immunosuppressors (cyclosporin)
• Antiviral agents (luzopeptin A)
• Antitumor agents (phakellistatin)
• Toxins (amanitin)

Natural Products

Natural Products as Sources of New Drugs over the Last 25 Years

David J. Newman* and Gordon M. Cragg

Natural Products Branch, Developmental Therapeutics Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute-Frederick, P.O. Box B, Frederick, Maryland 21702

Received October 10, 2006

This review is an updated and expanded version of two prior reviews that were published in this journal in 1997 and 2003. In the case of all approved agents the time frame has been extended to include the 25½ years from 01/1981 to 06/2006 for all diseases worldwide and from 1950 (earliest so far identified) to 06/2006 for all approved antitumor drugs worldwide. We have continued to utilize our secondary subdivision of a “natural product mimic” or “NM” to join the original primary divisions. From the data presented, the utility of natural products as sources of novel structures, but not necessarily the final drug entity, is still alive and well. Thus, in the area of cancer, over the time frame from around the 1940s to date, of the 155 small molecules, 73% are other than “S” (synthetic), with 47% actually being either natural products or directly derived therefrom. In other areas, the influence of natural product structures is quite marked, with, as expected from prior information, the antiinfective area being dependent on natural products and their structures. Although combinatorial chemistry techniques have succeeded as methods of optimizing structures and have, in fact, been used in the optimization of many recently approved agents, we are able to identify only one de novo combinatorial compound approved as a drug in this 25 plus year time frame. We wish to draw the attention of readers to the rapidly evolving recognition that a significant number of natural product drugs/leads are actually produced by microbes and/or microbial interactions with the “host from whence it was isolated”, and therefore we consider that this area of natural product research should be expanded significantly.

It is over nine years since the publication of our first,1 and three years since the second,2 analysis of the sources of new and approved drugs for the treatment of human diseases, both of which indicated that natural products continued to play a highly significant role in the drug discovery and development process.

That this influence of Nature in one guise or another has continued is shown by inspection of the information given below, where with the advantage of now over 25 years of data, we have been able to refine the system, eliminating a few duplicative entries that crept into the original data sets. In particular, as behooves authors from the National Cancer Institute (NCI), in the specific case of cancer treatments, we have gone back to consult the records of the FDA and added to these, comments from investigators who have informed us over the past two years of compounds that may have been approved in other countries and that were not captured in our earlier searches. These cancer data will be presented as a stand-alone section as well as including the last 25 years of data in the overall discussion.

As we mentioned in our 2003 review,2 the development of high-throughput screens based on molecular targets had led to a demand for the generation of large libraries of compounds to satisfy the enormous capacities of these screens. As we mentioned at that time, the shift away from large combinatorial libraries has continued, with the emphasis now being on small, focused (100 to ∼3000) collections that contain much of the “structural aspects” of natural products. Various names have been given to this process, including “Diversity Oriented Syntheses”,3-6 but we prefer to simply say “more natural product-like”, in terms of their combinations of heteroatoms and significant numbers of chiral centers within a single molecule,7 or even “natural product mimics” if they happen to be direct competitive inhibitors of the natural substrate. It should also be pointed out that Lipinski’s fifth rule effectively states that the first four rules do not apply to natural products or to any molecule that is recognized by an active transport system when considering “druggable chemical entities”.8-10

Although combinatorial chemistry in one or more of its manifestations has now been used as a discovery source for approximately 70% of the time covered by this review, to date, we can find only one de novo new chemical entity (NCE) reported in the public domain as resulting from this method of chemical discovery and approved for drug use anywhere. This is the antitumor compound known as sorafenib (Nexavar, 1) from Bayer, approved by the FDA in 2005. It was known during development as BAY-43-9006 and is a multikinase inhibitor, targeting several serine/threonine and receptor tyrosine kinases (RAF kinase, VEGFR-2, VEGFR-3, PDGFR-beta, KIT, and FLT-3) and is in multiple clinical trials as both combination and single-agent therapies at the present time, a common practice once approved for one class of cancer treatment.

As mentioned by the authors in prior reviews on this topic and others, the developmental capability of combinatorial chemistry as a means for structural optimization once an active skeleton has been identified is without par. The expected surge in productivity, however, has not materialized; thus, the number of new active substances (NASs), also known as New Chemical Entities (NCEs), which we consider to encompass all molecules, including biologics and vaccines, from our data set hit a 24-year low of 25 in 2004 (though 28% of these were assigned to the ND category), with a rebound to 54 in 2005, with 24% being N or ND and 37% being biologics (B) or vaccines (V). Fortunately, however, research being conducted by groups such as Danishefsky’s, Ganesan’s, Nicolaou’s, Porco’s, Quinn’s, Schreiber’s, Shair’s, Waldmann’s, and Wipf’s is continuing the modification of active natural product skeletons as leads to novel agents, so in due course, the numbers of materials developed by linking Mother Nature to combinatorial synthetic techniques should increase. This aspect, plus the potential contributions from the utilization of genetic analyses of microbes, will be discussed at the end of this review.

Against this backdrop, we now present an updated analysis of the role of natural products in the drug discovery and development process, dating from 01/1981 through 06/2006. As in our earlier

Dedicated to the late Dr. Kenneth L. Rinehart of the University of Illinois at Urbana-Champaign for his pioneering work on bioactive natural products.

* To whom correspondence should be addressed. Tel: (301) 846-5387. Fax: (301) 846-6178. E-mail: [email protected].

J. Nat. Prod. 2007, 70, 461-477

10.1021/np068054v. This article not subject to U.S. Copyright. Published 2007 by the Am. Chem. Soc. and the Am. Soc. of Pharmacogn. Published on Web 02/20/2007

Natural Products

• Searching for natural products
  • Plants
  • Micro-organisms
  • Marine organisms
  • Animals

• A large subclass of natural products are nonribosomal peptides

Central Dogma of Biology

NRP

Non-ribosomal Peptide Synthetase (NRPS)

Sieber and Marahiel 2005

2. NRPS Factory

Although structurally diverse, most biologically produced peptides share a common mode of synthesis, the multienzyme thiotemplate mechanism.2,6,40 According to this model peptide bond formation takes place on large multienzyme complexes, which simultaneously represent template and biosynthetic machinery. Sequencing of genes encoding NRPSs of bacterial and fungal origin provided insights into molecular architecture and revealed a modular organization.6 A module is a distinct section of the multienzyme that is responsible for the incorporation of one specific amino acid into the final product.3,6,41 It is further subdivided into a catalytically independent set of domains responsible for substrate recognition, activation, binding, modification, elongation, and release. Domains can be identified at the protein level by characteristic highly conserved sequence motifs. Thus far, 10 different domains are known within NRPS templates which catalyze independent chemical reactions and will be introduced in more detail in the following sections. As an example to illustrate basic principles, Figure 2 shows a prototype NRPS assembly line for the cyclic lipoheptapeptide surfactin.42

The carboxy group of amino acid building blocks is first activated by ATP hydrolysis to afford the corresponding aminoacyl-adenylate. This reactive intermediate is transferred onto the free thiol group of an enzyme-bound 4′-phosphopantetheinyl cofactor (ppan), establishing a covalent linkage between enzyme and substrate. At this stage the substrate can undergo modifications such as epimerization or N-methylation. Assembly of the final product then occurs by a series of peptide bond formation steps (elongation) between the downstream building block with its free amine and the carboxy-thioester of the upstream substrate. The ppan cofactor facilitates the ordered transfer of thioester substrates between catalytically active units with all intermediates covalently tethered to the multienzyme until the product is released by the action of the C-terminal thioesterase (TE) domain (termination). This strategy minimizes side reactions as well as diffusion times. Type I polyketide synthases (PKS) and fatty acid synthases (FAS) similarly display a multienzymatic

Figure 2. Surfactin assembly line. The multienzyme complex consists of seven modules (grey and red) which are specific for the incorporation of seven amino acids. Twenty-four domains of five different types (C, A, PCP, E, and TE) are responsible for the catalysis of 24 chemical reactions. Twenty-three reactions are required for peptide elongation, while the last domain is unique and required for peptide release by cyclization.

Chemical Reviews, 2005, Vol. 105, No. 2, p. 718, Sieber and Marahiel

Non-ribosomal Peptide Synthetase (NRPS)


Thioesterase Domain


Cyclosporin is highly lipophilic, and 7 of its 11 amino acids are N-methylated. This high degree of methylation protects the peptide from proteolytic digestion but complicates chemical synthesis due to low coupling yields and side reactions.35 In an iron-deficient environment some bacteria such as E. coli, B. subtilis, and Vibrio cholerae synthesize and secrete iron-chelating molecules known as siderophores that scavenge Fe3+ with picomolar affinity, important for host survival.36,37 Three catechol ligands derived from 2,3-dihydroxybenzoyl (DHB) building blocks in bacillibactin, enterobactin, and vibriobactin complex iron by forming intramolecular octahedra.

Many nonribosomal peptide products presented here show distinct chemical modifications, important to specifically interact and inhibit certain cellular functions, which are essential for survival. The high toxicity of the peptide products could therefore also become a problem for the producer organism unless strategies for its own protection and immunity have been coevolved with antibiotic biosynthesis. This immunity is achieved by several strategies including efflux pumps, temporary product inactivation, and modifications of the target in the producer strain.3 The latter strategy is used by vancomycin-producing Streptomycetes by changing the D-Ala-D-Ala terminus of the peptidoglycan pentapeptide precursor to a D-Ala-D-lactate terminus, which reduces binding affinity to vancomycin 1000-fold.12

Due to their exceptional pharmacological activities, many compounds such as cyclosporin and vancomycin have been synthesized nonenzymatically.38,39 Regio- and stereoselective reactions require the use of protecting groups as well as chiral catalysts. Moreover, macrocyclization and coupling of N-methylated peptide bonds are difficult to achieve in satisfying yields, indicating an advantage of natural vs synthetic strategies. Structural peculiarities of these complex peptide products suggested early on a nucleic-acid-independent biosynthesis facilitated by multiple catalytic domains expressed as a single multidomain protein. The diverse chemical reactions mediated by distinct enzymatic units will be the focus of the following sections.

Figure 1. Natural peptidic products. A selection of nonribosomally synthesized peptides. Characteristic structural features are highlighted.

Approaches to New Antibiotics. Chemical Reviews, 2005, Vol. 105, No. 2, p. 717

Special Characteristics

• Heterocyclic elements
• D-amino acids
• Glycosylated residues
• N-methylated residues
• Non-standard amino acids
• Cyclic backbone

Cyclic Peptides

http://bioinfo.lifl.fr/norine/ *

out of 1122 entries in the database

*Caboche et al, 2008

Mass Spectrometer

Measures m/z

Sample Preparation (Protein Analysis)

Enzymatic Digestion and Fractionation

Multi-Stage Mass Spectrometry

Secondary Fragmentation

Ionized parent peptide

Mass Spectrometer

Fragmentation

H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH

AA residue i-1, AA residue i, AA residue i+1

N-terminus (left), C-terminus (right)

H+

Identification of Linear Mass Spectra

MS/MS spectrum (legend: b ions, y ions, PM)

Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, LARGE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

[Figure: graph with amino acid labels on its edges; layout lost in extraction]

Database search

De novo sequencing

LARGE
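Both database search and de novo sequencing compare the peaks of a spectrum against the b and y ion ladders of a candidate peptide. The following is a minimal sketch of those ladders for the slide's example peptide LARGE, assuming singly protonated ions and standard monoisotopic residue masses; the function names are hypothetical and are not taken from any tool mentioned in the talk.

```python
# Monoisotopic residue masses (Da) for the residues of the example peptide only.
RESIDUE_MASS = {
    "L": 113.08406,
    "A": 71.03711,
    "R": 156.10111,
    "G": 57.02146,
    "E": 129.04259,
}
PROTON = 1.00728   # mass of the ionizing proton
WATER = 18.01056   # H2O added to the C-terminal (y) fragments and the intact peptide


def b_ions(peptide):
    """Singly charged b ions: N-terminal prefixes of length 1..n-1."""
    total, ions = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(round(total + PROTON, 3))
    return ions


def y_ions(peptide):
    """Singly charged y ions: C-terminal suffixes of length 1..n-1."""
    total, ions = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(round(total + WATER + PROTON, 3))
    return ions


def parent_mass(peptide):
    """Neutral mass of the intact linear peptide."""
    return round(sum(RESIDUE_MASS[aa] for aa in peptide) + WATER, 3)


if __name__ == "__main__":
    print("b:", b_ions("LARGE"))
    print("y:", y_ions("LARGE"))
    print("PM:", parent_mass("LARGE"))
```

Each b ion and its complementary y ion add up to the parent mass plus two protons, which is the property linear de novo algorithms exploit.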

Challenges in Identification of Cyclic Peptide MS

• Extensively modified amino acids
• Non-standard amino acids
• Cyclic backbone
• Databases cannot be readily derived from genomic data

Cyclic Peptide Mass Spectrum

MS1 – Mass of the intact cyclic peptide

MS2 – Mass of the intact linear peptides

MS3 – Masses of the peptide fragments
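The three stages above can be sketched in a few lines: a cyclic peptide of n residues can open at any of its n bonds, giving n linear peptides with the same mass, and each of those fragments further into prefixes. This is an illustrative sketch only; it assumes integer residue masses, ignores proton and water offsets, and uses the integer masses listed for seglitide later in the talk. The function names are hypothetical.

```python
# Integer residue masses of seglitide as listed in the NRP-Tagging results.
SEGLITIDE = [85, 163, 186, 128, 99, 147]


def linearizations(masses):
    """All n linear peptides obtained by opening the ring at each bond."""
    n = len(masses)
    return [masses[i:] + masses[:i] for i in range(n)]


def cyclic_spectrum(masses):
    """Sorted masses of every contiguous subpeptide of the ring, plus the intact mass."""
    n = len(masses)
    doubled = masses + masses          # lets a fragment wrap around the ring
    fragments = set()
    for start in range(n):
        running = 0
        for length in range(1, n):     # fragments of 1..n-1 residues
            running += doubled[start + length - 1]
            fragments.add(running)
    fragments.add(sum(masses))         # the intact peptide
    return sorted(fragments)


if __name__ == "__main__":
    for linear in linearizations(SEGLITIDE):
        print(linear)
    print(cyclic_spectrum(SEGLITIDE))
```

Because every rotation contributes its own prefix ladder, a cyclic spectrum is a mixture of up to n linear spectra, which is why linear identification tools cannot be applied directly.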

[Figure: MS mixture; symbol labels not recoverable from the extraction]

Seglitide: somatostatin receptor antagonist, used experimentally to treat Alzheimer’s disease

Cyc(A+14 Y W K V F)


Cyclic Mass Spectrum

NRP-Dereplication

NRP-Tagging

NRP-Sequencing

Identification of Cyclic Mass Spectra

NRP-Dereplication

• Case 1:
  • There is a peptide in the database that matches the precursor mass of the spectrum.
  • Is this peptide a good match for the spectrum?

• Case 2:
  • No peptide in the database matches the precursor mass of the spectrum.
  • Can we change a peptide in the database so it becomes a good match for the given spectrum?

Simplified Dereplication Problem Formulation

• Input: MS3 spectrum, a Peptide Sequence and parameter k
• Output: A new Peptide Sequence k mutations away from the original Peptide Sequence such that the new peptide best explains the experimental spectrum.

• In reality there are many peptides in the database, so the dereplication needs to be done for each peptide

PEPTIDE

Tyrocidine A 99, 114, 113, 147, 97, 147, 147, 114, 128, 163

Tyrocidine A1 99, 128, 113, 147, 97, 147, 147, 114, 128, 163

Tyrocidine B 99, 114, 113, 147, 97, 186, 147, 114, 128, 163

Tyrocidine B1 99, 128, 113, 147, 97, 186, 147, 114, 128, 163

Tyrocidine C 99, 114, 113, 147, 97, 186, 186, 114, 128, 163

Tyrocidine C1 99, 128, 113, 147, 97, 186, 186, 114, 128, 163

Tyrocidine Family (Bacillus brevis)

Dereplication (k = 1)

A B C D E F

A B C D E F

A AB

NRP-Dereplication (k = 1)

A B C D E F

A AB

Δ

ABC-Δ ABCD-Δ ABCDE-Δ


A B C D E F

A AB

Δ

~DEF ~EF ~F


A B C D E F

A AB

Δ

~DEF ~EF ~F

FA E FA B D E F

A B C D E F


Dereplicating tyrocidine C and C1

• Experimental spectrum: Tyrocidine C1

• Sequence: Tyrocidine C VOLFPWWNQY

• Offset: 14 Daltons (O -> K)
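The k = 1 case can be illustrated with a brute-force sketch: place the precursor mass offset on each residue of the known peptide in turn and keep the variant that explains the most peaks. This is not the correlated-subpeptide method used by NRP-Dereplication; it is a simplified stand-in that assumes integer masses, an idealized noise-free spectrum, and hypothetical function names. The tyrocidine masses are the ones listed for the family above.

```python
TYROCIDINE_C = [99, 114, 113, 147, 97, 186, 186, 114, 128, 163]
TYROCIDINE_C1 = [99, 128, 113, 147, 97, 186, 186, 114, 128, 163]


def cyclic_fragments(masses):
    """Masses of all contiguous subpeptides (1..n-1 residues) of a cyclic peptide."""
    n = len(masses)
    doubled = masses + masses
    fragments = set()
    for start in range(n):
        running = 0
        for length in range(1, n):
            running += doubled[start + length - 1]
            fragments.add(running)
    return fragments


def explained_peaks(spectrum, masses, tol=0.5):
    """Number of spectrum peaks within `tol` of some theoretical fragment."""
    fragments = cyclic_fragments(masses)
    return sum(1 for p in spectrum if any(abs(p - f) <= tol for f in fragments))


def dereplicate_k1(spectrum, precursor_mass, known_masses, tol=0.5):
    """Try the mass offset at every position; return (offset, ranked candidates)."""
    delta = precursor_mass - sum(known_masses)
    candidates = []
    for i in range(len(known_masses)):
        mutated = list(known_masses)
        mutated[i] += delta
        candidates.append((explained_peaks(spectrum, mutated, tol), i, mutated))
    candidates.sort(key=lambda c: -c[0])   # best-explaining variant first
    return delta, candidates


if __name__ == "__main__":
    # Idealized "experimental" spectrum: every theoretical fragment of tyrocidine C1.
    spectrum = sorted(cyclic_fragments(TYROCIDINE_C1))
    delta, ranked = dereplicate_k1(spectrum, sum(TYROCIDINE_C1), TYROCIDINE_C)
    print("offset:", delta)
    print("best position (0-based):", ranked[0][1], "->", ranked[0][2])
```

On this idealized input the offset of 14 Da lands on the second residue (114 to 128, i.e. O to K), matching the slide. With real spectra, missing and noise peaks make the ranking far less clear-cut.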

[Figure A-4, panels a) and b): coverage per amino acid for tyrocidine C (V-1 O-2 L-3 F-4 P-5 W-6 W-7 N-8 Q-9 Y-10); histogram axis: Coverage, 0 to 20]

Coverage values: V 9.0, O 2.5, L 7.0, F 9.0, P 13.0, W 16.0, W 19.0, N 19.5, Q 18.0, Y 15.0. Number at the center of the graph: 32.

[Figure A-4, panel c): coverage histograms for VOLFPFFNQY, VOLFPWFNQY and VOLFPWFNQY]

Peptide        Tyrocidine A    Tyrocidine B    Tyrocidine B
Sequence       VOLFPFFNQY      VOLFPWFNQY      VOLFPWFNQY
Dereplicated   Tyrocidine A1   Tyrocidine B1   Tyrocidine C
Sequence       VKLFPFFNQY      VKLFPWFNQY      VOLFPWWNQY

Figure A-4: Dereplication results. a) NRP-Dereplication output for experimental spectrum of tyrocidine C1 (VKLFPWWNQY) given peptide sequence of tyrocidine C (VOLFPWWNQY). Concentric red-gray circles represent 0-correlated subpeptides (with peptide shown red and its complement shown gray) and δ-correlated subpeptides (with peptide shown gray and its complement shown red). Given this coloring convention, the amino acid coverage (number of red arcs covering an amino acid) represents supporting evidence that an amino acid did not change from the known to the unknown compound. The thick black circle separates 0-correlated subpeptides (shown inside) from δ-correlated subpeptides (shown outside). The outer counts represent the coverage for a given amino acid by red arcs and reveal the differing amino acid (O) as the amino acid with minimum coverage (2.5 vs. 7 for the next runner-up). The counts are normalized by the number of subpeptides per peak. For example, if a peak has two alternative subpeptide annotations, it will contribute 1/2 to the coverage. The widths of the arcs are proportional to this weighting factor. The number in the center of the graphs is the total number of correlated subpeptides. b) Alternative representation of a) as a histogram that reveals the changed amino acid O. c) Additional dereplication results for the tyrocidine family.


Dereplication Results

NRP-Dereplication Results on NORINE

Compound | Top Match(es) | Dereplicated Compound | Score
Destruxin A | Destruxin A[+14] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:1(3)-OH(2)[+14] | 0.45
 | HydroxyDestruxin B[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.3)[-18] | 0.45
 | Destruxin D[-32] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2)-CA(4)[-32] | 0.45
 | Destruxin E diol[-20] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3.4)[-20] | 0.45
 | Destruxin C[-18] | Pro, Ile, NMe-Val, NMe-Ala, bAla, iC5:0-OH(2.4)[-18] | 0.45
 | Destruxin F[-4] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)[-4] | 0.45
 | Destruxin B[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, Hiv[-2] | 0.45
 | Destruxin E[-2] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2)-Ep(3)[-2] | 0.45
 | Destruxin E chlorohydrin[-38] | Pro, Ile, NMe-Val, NMe-Ala, bAla, C4:0-OH(2.3)-Cl(4)[-38] | 0.45
Tyrocidine C | Tyrocidine C | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn, Leu | 0.45
 | Tyrocidine B[+39] | D-Phe, Pro, Trp, D-Phe[+39], Asn, Gln, Tyr, Val, Orn, Leu | 0.45
 | Tyrocidine D[-23] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Trp[-23], Val, Orn, Leu | 0.45
Tyrocidine B1 | Tyrocidine B[+14] | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.44
Tyrocidine C1 | Tyrocidine C[+14] | D-Phe, Pro, Trp, D-Trp, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.40
Tyrocidine A1 | Tyrocidine A[+14] | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn[+14], Leu | 0.37
Tyrocidine B | Tyrocidine B | D-Phe, Pro, Trp, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
 | Tyrocidine A[+39] | D-Phe, Pro, Phe[+39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.37
 | Tyrocidine C[-39] | D-Phe, Pro, Trp, D-Trp[-39], Asn, Gln, Tyr, Val, Orn, Leu | 0.37
Tyrocidine A | Tyrocidine A | D-Phe, Pro, Phe, D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
 | Tyrocidine B[-39] | D-Phe, Pro, Trp[-39], D-Phe, Asn, Gln, Tyr, Val, Orn, Leu | 0.33
Compound 879 | Neoviridogrisein | (Thr+Hpa), NMe-Ph-Gly, Ala, NMe-bMe-Leu, NMe-Gly, D-4OH-Pro, D-Leu | 0.28
H8405 | Beauverolide Ka[-18] | C10:0-Me(4)-OH(3), Trp, Phe[-18], D-aIle | 0.27
BQ123 | Halipeptin B[-20] | C10:0-Me(2.2.4)-OH(3.7), Ala, aMe-Cys[-20], NMe-OH-Ile, Ala | 0.26
H3526 | hymenistatin I | Pro, Tyr, Val, Pro, Leu, Ile, Ile, Pro | 0.25
 | hymenamide G | Pro, Tyr, Val, Pro, Leu, Ile, Leu, Pro | 0.25
Cyanopeptide X | Majusculamide C[-30] | Map, Ala, Ibu, NMe-OMe-Tyr[-30], NMe-Val, Gly, NMe-Ile, Gly, Hmp | 0.23
 | Dolastatin 11[-30] | Gly, NMe-Val, NMe-OMe-Tyr[-30], Ibu, Ala, Map, Hmp, Gly, NMe-Leu | 0.23
Microcystin LR | Microcystin LR | D-Ala, Leu, D-bMe-Asp, Arg, Adda, D-Glu, NMe-Dha | 0.20
 | [Dha7]microcystin-LR[+14] | D-Ala[+14], Leu, D-bMe-Asp, Arg, Adda, D-Glu, dh-Ala | 0.20
 | Microcystin LAib[+71] | D-Ala, Leu, D-bMe-Asp[+71], Aib, Adda, D-Glu, NMe-Dha | 0.19
Seglitide | Microsclerodermin F[-3] | C12:3(7.9.11)-Me(6)-OH(2.4.5)-NH2(3)-Ph(12), Pyr[-3], NMe-Gly, D-Trp, Gly, OH-4Abu | 0.13
Cyclomarin C | Aureobasin C[-60] | D-Hmp, NMe-Val, Phe, NMe-Phe, Pro, Val, NMe-Val, Leu, bOH-NMe-Val[-60] | 0.13
Cyclomarin A | Aureobasidin F[-44] | D-Hmp, NMe-Val[-44], Phe, NMe-Phe, Pro, aIle, Val, Leu, bOH-NMe-Val | 0.12
Dehydrocyclomarin A | Hymenamide J[-74] | Pro, Tyr, Asp, Phe, Trp[-74], Lys, Val, Tyr | 0.12
Dehydrocyclomarin C | PF1022E[+44] | D-Lac, NMe-Leu, 4OH-D-Ph-Lac, NMe-Leu, D-Lac, NMe-Leu[+44], D-Ph-Lac, NMe-Leu | 0.11

Table 1: NRP-Dereplication results. The Score is defined as the product of the fraction of explained intensity and the fraction of explained fragment masses of a dereplicated peptide. Dereplicated matches have monomers (shown in red) where the candidate mutation is placed with the integer mass of the offset enclosed in square brackets (Dereplicated Compounds column). See Table A-3 for the complete list of monomers. Compounds that are in the database (tyrocidine A, B, C, H3526, microcystin LR and compound 879) or have a closely related compound (tyrocidines A1, B1, C1, cyanopeptide X, destruxin A) have higher scores than compounds that are not in the database (seglitide, cyclomarin A, C and dehydrocyclomarin A, C). Dereplicated compounds have the mass difference of the experimental spectrum and the mass of the peptide enclosed in square brackets next to their name (Top Matches column). The compounds are sorted by score and the double horizontal line separates compounds in the database (or have a close match) from the compounds that are not in the database (lower part of the table). Compounds H8405 and BQ123 (representing the shortest peptides in the sample) returned incorrect matches (false positives). However, a close examination of the results revealed that these false positives are nevertheless correlated with the correct peptide sequences. For H8405, the correct sequence is [113, 71, 129, 186, 113], while the database match is [184, 186, 129, 113]. For BQ123, the correct masses are [113, 186, 115, 97, 99], while the database match is [71, 228, 71, 97, 143].
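The Score defined in the caption is a product of two fractions, which can be sketched as follows. The peak representation as (mass, intensity) pairs, the matching tolerance and the function name are assumptions made for illustration; the caption does not specify how peaks are matched to fragment masses.

```python
def dereplication_score(peaks, fragment_masses, tol=0.5):
    """Product of explained-intensity fraction and explained-fragment fraction.

    peaks: list of (mass, intensity) pairs from the experimental spectrum.
    fragment_masses: theoretical fragment masses of the candidate peptide.
    tol: assumed matching tolerance in Da.
    """
    total_intensity = sum(intensity for _, intensity in peaks)
    if total_intensity == 0 or not fragment_masses:
        return 0.0

    # Fraction of spectrum intensity that falls on some theoretical fragment.
    explained_intensity = sum(
        intensity
        for mass, intensity in peaks
        if any(abs(mass - f) <= tol for f in fragment_masses)
    )

    # Fraction of theoretical fragments that are observed in the spectrum.
    explained_fragments = sum(
        1
        for f in fragment_masses
        if any(abs(mass - f) <= tol for mass, _ in peaks)
    )

    return (explained_intensity / total_intensity) * (
        explained_fragments / len(fragment_masses)
    )


if __name__ == "__main__":
    peaks = [(100, 10), (200, 30), (300, 60)]
    fragments = [100, 300, 400, 500]
    print(dereplication_score(peaks, fragments))
```

Multiplying the two fractions penalizes both candidates that leave intense peaks unexplained and candidates that predict many fragments the spectrum does not contain.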


NRP-Dereplication

• Compound 879 was thought to be novel, but the compound neoviridogrisein was in NORINE*

• Cyanopeptide X was unknown in 2007, but majusculamide C was in NORINE*. The compound was desmethoxymajusculamide C

*Caboche et al, 2008

Cyclic Peptide Identification Problem (De novo reconstruction)

• Input: MS3 spectrum of a cyclic peptide
• Output: A ranked list of peptide reconstructions sorted by a scoring function

Similar to the Partial Digest Problem described by Skiena et al 1990. Shown to be NP-Hard for noisy inputs (Cielebak et al 2005)

Similar to the problem of sequencing linear peptides with internal fragments. Shown to be NP-Hard (Xu and Ma 2006)

Tag Generation Problem

NRP-Tagging

• Input: MS3 spectrum of a cyclic peptide
• Output: A ranked list of gapped sequences that explain the MS3 spectrum, sorted by a scoring function

99, 114, 113, 147, 97, 147, 147, 114, 128, 163

99, 114, [113+147], [97+147], 147, 114, 128, 163

99, 114, 260, 244, 147, 114, 128, 163
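A gapped sequence is the full sequence with some runs of adjacent masses summed. The check below is a small sketch of that relationship using the three lines above; it assumes integer masses and a fixed starting residue (it does not try rotations of the ring), and the function name is hypothetical.

```python
def is_gapped_version(gapped, full):
    """True if `gapped` is obtained from `full` by summing runs of adjacent masses."""
    i = 0
    for g in gapped:
        total = 0
        # Consume residues of the full sequence until the gap mass is reached.
        while i < len(full) and total < g:
            total += full[i]
            i += 1
        if total != g:
            return False
    return i == len(full)


if __name__ == "__main__":
    tyrocidine_a = [99, 114, 113, 147, 97, 147, 147, 114, 128, 163]
    gapped = [99, 114, 260, 244, 147, 114, 128, 163]
    print(is_gapped_version(gapped, tyrocidine_a))
```

Here 260 = 113 + 147 and 244 = 97 + 147, so the gapped peptide is consistent with tyrocidine A even though two residue boundaries are unresolved.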

NRP-Tagging

A B C D E

A B C DF

A B C DF

F

E

E

NRP-Tagging

Tag Generation

A B C D E

A B C DF

A B C DF

F

E

F

E

E

NRP-Tagging

A B C D E

A B C DF

A B C DF

F

E

F

E

E

A B C D E F

NRP-Tagging

bins = []
For each peak Pi
    For each peak Pj (i < j)
        peak_diff = Pj - Pi
        bins[peak_diff]++

Input: A mass spectrum

Output: A histogram of mass difference counts for a range of masses

Pevzner et al 2001

Single Self-Convolution
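The single self-convolution pseudocode translates almost directly into Python. This sketch assumes peaks are given as integer masses so that differences can be used as exact histogram keys; real spectra would need binning to a mass tolerance. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def self_convolution(peaks):
    """Histogram of mass differences Pj - Pi over all peak pairs with Pi < Pj."""
    ordered = sorted(peaks)
    return Counter(pj - pi for pi, pj in combinations(ordered, 2))


if __name__ == "__main__":
    bins = self_convolution([100, 200, 300])
    print(bins.most_common())
```

Mass differences that correspond to amino acids tend to recur across the many overlapping fragments of a cyclic peptide, so they show up as high counts in the histogram.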

• Input: A mass spectrum
• Output: A histogram of counts of 2 consecutive mass differences for a range of masses

bins = []
For each peak Pi
    For each peak Pj (i < j)
        For each peak Pk (j < k)
            peak_diff_1 = Pj - Pi
            peak_diff_2 = Pk - Pj
            bins[peak_diff_1, peak_diff_2]++

Double Self-Convolution

• Self Double Convolution keeping track of the starting peak of each peak triplet

A B C D E

A B C DF

A B C DF

F

E

F

E

E

bins[B, C] = 3
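A sketch of the double self-convolution that keeps the starting peak of each triplet, as the bullet above describes. It assumes integer peak masses and applies none of the mass-range constraints that a practical implementation would use to prune triplets; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def double_convolution(peaks):
    """Map (diff1, diff2) -> list of starting peaks Pi of triplets Pi < Pj < Pk."""
    ordered = sorted(peaks)
    bins = defaultdict(list)
    for pi, pj, pk in combinations(ordered, 3):
        bins[(pj - pi, pk - pj)].append(pi)
    return bins


if __name__ == "__main__":
    bins = double_convolution([0, 10, 30, 100, 110, 130])
    print(bins[(10, 20)])      # starting positions of the tag (10, 20)
    print(len(bins[(10, 20)])) # its count, as in bins[B, C] = 3 on the slide
```

The length of each list plays the role of the count in the slide (bins[B, C] = 3), while the stored starting peaks are what the tag generation step consumes.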

NRP-Tagging

bins = double_convolution(S)
for m_a, m_b in bins
    starts = starting positions of bins[m_a, m_b]
    for all combinations c_1 < ... < c_n that are a subset of starts
        m_1 = c_1
        m_i = c_i - c_j (j = i - 1)
        r = parent - c_n - m_a - m_b
        tag = [m_1, ..., m_n, m_a, m_b, r]
        score(tag), store(tag)

A B C DE F

NRP-Tagging

bins = double_convolution(S)
for m_a, m_b in bins
    starts = starting positions of bins[m_a, m_b]
    for all combinations c_1 < ... < c_n that are a subset of starts
        m_1 = c_1
        m_i = c_i - c_j (j = i - 1)
        r = parent - c_n - m_a - m_b
        tag = [m_1, ..., m_n, m_a, m_b, r]
        score(tag), store(tag)

[Figure labels: masses m_1, m_2, m_3, m_a, m_b, r; positions c_1, c_2, c_3, parent]
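The inner loop of the pseudocode, which turns the starting positions of one tag into gapped peptides, can be sketched as below. Scoring and storing are left out. The 57 Da minimum (the mass of Gly) is borrowed from the Figure A-3 notation, and the function and parameter names are hypothetical.

```python
from itertools import combinations


def gapped_peptides(starts, m_a, m_b, parent, min_mass=57):
    """Gapped peptides [m_1, ..., m_n, m_a, m_b, r] for every subset of tag starts.

    starts: starting positions (prefix masses) at which the tag (m_a, m_b) occurs.
    parent: precursor mass of the cyclic peptide.
    min_mass: smallest allowed mass of any element (assumed 57 Da, i.e. Gly).
    """
    ordered = sorted(starts)
    results = []
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            # m_1 = c_1 and m_i = c_i - c_(i-1)
            gaps = [subset[0]] + [b - a for a, b in zip(subset, subset[1:])]
            # r closes the ring: whatever mass is left after the tag
            rest = parent - subset[-1] - m_a - m_b
            masses = gaps + [m_a, m_b, rest]
            if all(m >= min_mass for m in masses):
                results.append(masses)
    return results


if __name__ == "__main__":
    # Tag (113, 147) seen after prefixes 99 and 213; parent mass of tyrocidine A.
    for peptide in gapped_peptides([99, 213], 113, 147, 1269):
        print(peptide)
```

The number of subsets grows exponentially with the number of starting positions, which is one reason the full algorithm restricts itself to frequent tags and keeps only top-scoring candidates.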

NRP-Tagging

A B C D E

A B C D

A B CD

E

E

A B1 CD E B2

Gap Closing

Input: MS3 spectrum S of an (unknown) cyclic peptide, a minimum tag frequency, a recursion depth, and a scoring function score(S, peptide).
Output: Ranked list of candidate gapped peptides

1. Find all tags in S:

    tags(x, y) = {} for all 0 ≺ x, y ≺ 200
    for all s, s′, s″ ∈ S such that s ≺ s′ ≺ s″ do
        mass1 = s′ − s
        mass2 = s″ − s′
        add s to tags(mass1, mass2)
    end for

2. Generate gapped peptides from frequent tags:

    gappedPeptides = {}
    for all mass1, mass2 with |tags(mass1, mass2)| > frequency do
        for all {0 ≺ s_1 ≺ . . . ≺ s_n ≺ mass(S) − mass1 − mass2} ⊆ tags(mass1, mass2) do
            gappedPeptide = [m_1, ..., m_n, mass1, mass2, m_{n+1}] where m_i = s_i − s_{i−1} for 2 ≤ i ≤ n, m_1 = s_1 and m_{n+1} = mass(S) − mass1 − mass2 − s_n
            add gappedPeptide to gappedPeptides
        end for
    end for

3. Iteratively attempt to split masses larger than 200 Da:

    results = depth top-scoring peptides from gappedPeptides
    candidates = results
    repeat
        sequences = {}
        for all gappedPeptide in candidates do
            intermediates = {}
            for all mass > 200 Da in gappedPeptide do
                for all mass1 such that 0 ≺ mass1 ≺ 200 Da do
                    split mass in gappedPeptide into (mass1, mass − mass1) and add the resulting peptide to intermediates
                end for
            end for
            add depth top-scoring peptides from intermediates to sequences
        end for
        candidates = sequences
        add sequences to results
    until sequences is empty
    return results

Figure A-3: NRP-Tagging algorithm. tags(mass1, mass2) contains the starting positions of all tags formed by amino acids with masses mass1 and mass2. The notation |tags(mass1, mass2)| refers to the number of locations of a 2-amino acid tag with masses (mass1, mass2). The notation x ≺ y denotes that y − x ≥ 57 (57 Da represents the mass of the smallest amino acid Gly). For a given set of starting positions in tags(mass1, mass2), all possible combinations ({s_1 ≺ . . . ≺ s_n} ⊆ tags(mass1, mass2)) of starting positions of tags are considered during the gapped peptide reconstruction. The precursor mass of S is denoted as mass(S). While the pseudocode above attempts to split each mass > 200 Da into all possible pairs (mass1, mass − mass1) with 0 ≺ mass1 ≺ 200, the real implementation only considers mass1 as a splitting mass if it is supported by some peaks in S. There are 2 threshold parameters, frequency (minimum number of occurrences of a tag in S), and depth (limits the number of high scoring gapped peptides per iteration of the mass splitting). The scoring function score(S, peptide) is used to rank the intermediate peptides and select those for the next iteration.


NRP-Tagging

Compound | Best reconstruction | Rank
Tyrocidine A | 99, 114, 113, 147, 97, 147, 147, 114, 128, 163 | 3
Tyrocidine A1 | 99, 128, 113, 147, 97, 147, 147, 114, 128, 163 | 16
Tyrocidine B | 99, 114, 113, 147, 97, 186, 147, 114, 128, 163 | 4
Tyrocidine B1 | 99, 128, 113, 147, 97, 186, 147, 114, 128, 163 | 1
Tyrocidine C | 99, 114, 113, 147, 97, 186, 186, 114, 128, 163 | 4
Tyrocidine C1 | 99, 128, 113, 147, 97, 186, 186, 114, 128, 163 | 1
Seglitide | 85, 163, 186, 128, 99, 147 | 1
Cyanopeptide X | 57, 113, 161, 141, 71, 113, [114+57], 127 | 1
BQ123 | 113, 186, 115, 97, 99 | 2
Destruxin A | 113, 113, 85, 71, [98+97] | 2
H3526 | 97, 97, 163, 99, {97+1}, 113, {113-1}, 113 | 10
H8405 | 129, 71, 113, 113, 186 | 2
Microcystin LR | {[83+71]+1}, {113-1}, {129-1}, {156+1}, 313, 129 | 27
Compound 879 | 113, 113, , {147+18}, 71, 141, 71 | 7
Cyclomarin A | 127, 139, , 143, 71, [177+99] | 10
Dehydrocyclomarin A | 127, 139, 268, 143, 71, 177, 99 | 27
Cyclomarin C | 127, 139, 270, {143+32}, {[71+177]-32}, 99 | >40
Dehydrocyclomarin C | Not generated | -

Table 2: NRP-Tagging results. The reconstructed NRPs are represented as sequences of masses. For the sake of brevity, masses are rounded to integers; e.g., the NRP-Tagging reconstruction for Tyrocidine A is 99.06, 114.07, 113.07, 147.06, 97.05, 147.05, 147.05, 114.06, 128.03, 163.06, which is more accurate than the integer representation given in the first row of the table. Composite masses (2 or more amino acids) are enclosed in square brackets. For example, [114+57] in cyanopeptide X means that NRP-Tagging returned 171 as the mass of an amino acid instead of the correct masses 114 and 57 (Hmp and Gly). Incorrect masses are enclosed in curly brackets and expressed in terms of their offsets from the correct masses. For example, {97+1} in H3526 means that NRP-Tagging returned 98 while the correct mass is 97 (Pro). In this case the isotopic peak (rather than a b-ion) was chosen as the best spectral interpretation. Lastly, cases in which the algorithm splits a mass are enclosed in angle brackets, with the correct mass followed by the masses returned by the algorithm. A single mass 286 in cyclomarin A is split as 129, 157. A single mass 222-18 (water loss) in compound 879 is split into 100 and 104. The reconstructions given in the table represent a complete reconstruction of the compound, or a reconstruction with composite masses and/or masses with a known offset. The “Best reconstruction” column presents the high-scoring peptide with a specified rank (“Rank” column) that is selected from the list of all top-scoring peptides as the most similar to the correct peptide.
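The bracket notation of the table reduces to simple arithmetic, which a short sketch can make concrete. The helpers `entry_mass` and `reconstruction_masses` are hypothetical names introduced here; the sketch handles square-bracket and curly-bracket entries only, not angle-bracket (split-mass) entries, whose internal commas it does not parse.

```python
import re


def entry_mass(entry):
    """Mass the algorithm actually returned for one table entry.

    '[114+57]' -> 171 (composite mass), '{97+1}' -> 98 (mass with offset).
    Brackets are ignored; the signed integers inside are summed.
    """
    return sum(int(term) for term in re.findall(r"[+-]?\d+", entry))


def reconstruction_masses(row):
    """Turn a comma-separated table row into the list of returned masses."""
    return [entry_mass(entry) for entry in row.split(",") if entry.strip()]


if __name__ == "__main__":
    print(entry_mass("[114+57]"))      # composite of Hmp and Gly
    print(entry_mass("{97+1}"))        # isotopic peak chosen instead of Pro
    print(entry_mass("{[83+71]+1}"))   # composite mass with an offset
    tyrocidine_a = "99, 114, 113, 147, 97, 147, 147, 114, 128, 163"
    print(sum(reconstruction_masses(tyrocidine_a)))  # nominal total mass
```

Summing a row gives the nominal mass of the whole reconstruction, a quick sanity check that a reconstruction with composite or offset entries still accounts for the full precursor mass.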


NRP-Tagging Results

• De novo sequencing of cyclic peptide spectra using self-alignment

NRP-Sequencing

[Figure sequence: self-alignment of seglitide, residues A+14 Y W K V F]

• 6 linear theoretical spectra of seglitide
• Prefixes are horizontal lines; suffixes are vertical lines
• Theoretical spectrum without annotations
• Aligned peptide: Y W K V F (offset: 85)
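The idea behind the six linear theoretical spectra can be sketched in a few lines. The assumptions are loud ones: nominal integer residue masses for seglitide are taken from the NRP-Tagging table (85, 163, 186, 128, 99, 147), ion-type offsets are ignored, and the function names are hypothetical.

```python
SEGLITIDE = [85, 163, 186, 128, 99, 147]  # A+14, Y, W, K, V, F (nominal masses)


def linearizations(masses):
    """All rotations of a cyclic peptide: one linear peptide per ring-opening site."""
    return [masses[i:] + masses[:i] for i in range(len(masses))]


def prefix_masses(linear):
    """Cumulative prefix masses of a linear peptide (offset-free b-ion ladder)."""
    total, ladder = 0, []
    for mass in linear:
        total += mass
        ladder.append(total)
    return ladder


def cyclic_theoretical_spectrum(masses):
    """Union of the prefix ladders of every linearization, sorted by mass."""
    peaks = set()
    for linear in linearizations(masses):
        peaks.update(prefix_masses(linear))
    return sorted(peaks)


if __name__ == "__main__":
    for linear in linearizations(SEGLITIDE):
        print(linear, prefix_masses(linear))
    print(cyclic_theoretical_spectrum(SEGLITIDE))
```

Each rotation corresponds to one row of the figure; merging the six ladders gives the unannotated theoretical spectrum of the cyclic peptide.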

De novo sequence (anti-symmetric path: Chen et al. 2001)

• Self-alignment of spectrum using the highest scoring self-convolution value

• Use standard de novo reconstruction algorithms for linear peptide sequencing

• Rescore candidate reconstructions using MSn data
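A minimal sketch of the self-convolution step, assuming Conv(S, x) counts the pairs of peaks in S separated by exactly x and that the self-aligned spectrum keeps the peaks that have a partner at offset x. Both definitions are assumptions made for illustration (a real implementation would use a mass tolerance and peak intensities), and the peak list is a made-up toy spectrum.

```python
from collections import Counter
from itertools import combinations


def auto_convolution(peaks):
    """Count, for every positive offset x, the peak pairs (a, b) with b - a == x."""
    conv = Counter()
    for a, b in combinations(sorted(set(peaks)), 2):
        conv[b - a] += 1
    return conv


def auto_alignment(peaks, x):
    """Peaks that have a partner exactly x higher in the same spectrum (assumed definition)."""
    present = set(peaks)
    return sorted(p for p in present if p + x in present)


if __name__ == "__main__":
    toy_peaks = [0, 85, 100, 185, 300, 385]  # hypothetical integer masses
    conv = auto_convolution(toy_peaks)
    best_offset, support = conv.most_common(1)[0]
    print(best_offset, support)
    print(auto_alignment(toy_peaks, best_offset))
```

In the toy spectrum three peak pairs are separated by 85, so 85 is the highest-scoring offset and the aligned spectrum keeps the lower peak of each pair.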

NRP-Sequencing

[Histogram: Count (y-axis, 0 to 200) versus Scores (x-axis, 0 to 28)]

Figure 4: NRP-Dereplication score distribution (search of compound 879 against NORINE) features excellent separation between correct (score 0.28) and false (scores below 0.05) hits.

Compound        Best reconstruction                                   Rank
Tyrocidine A    [163+99], 114, [113+147], [147+147], 147, [114+128]   1
Tyrocidine A1   [163+99], 128, [113+147], [147+147], 147, [114+128]   1
Tyrocidine B    [163+99], 114, [113+147], 97, [186+147], 114, 128     14
Tyrocidine B1   99, 128, [113+147], [97+186], 147, [114+128]          1
Tyrocidine C    113, 147, 97, 186, 186, 114, [128+163], [99+114]      125
Tyrocidine C1   [163+99], [128+113], 147, [97+186], 186, [114+128]    1
Seglitide       85, [163+186], 128, 99, 147                           1
Cyanopeptide X  57, 113, 161, 141, 71, [113+114+57], 127              1
BQ123           113, 186, 115, [97+99]                                1
H3526           97, [97+163], 99, [97+113], 113, 113                  2
H8405           129, 71, 113, 113, 186                                1

Table 2: NRP-Sequencing results. The reconstructed NRPs are represented as sequences of masses. For the sake of brevity, masses are rounded to integers. Composite masses (2 or more amino acids) are enclosed in square brackets. For example, [163+99] in tyrocidine A means that NRP-Sequencing returned 262, the composite mass of 163 and 99 (Tyr and Val). The best reconstruction is the highest-scoring completely correct (i.e., no incorrect b-ions) de novo sequence returned by NRP-Sequencing.

masses. For the experimental spectrum of seglitide, the auto-alignment spectrum S85 contains all prefix and suffix (b/y) ions for the peptide YWKVF (x = 85 corresponds to the most prominent peak in the auto-convolution Conv(S, x)).

• De novo peptide sequencing. We solve the de novo peptide sequencing problem for the auto-alignment spectrum using the anti-symmetric path algorithm [4]. NRP-Sequencing generates all de novo peptide reconstructions of Sx (for each of the top t auto-convolution masses x, where t is a parameter) with scores above p · Score(P), where p is a parameter and P is the highest-scoring de novo reconstruction of Sx. We observed that t = 2 works well in most cases.

• Re-ranking candidate peptides using MSn spectra. NRP-Sequencing further scores each candidate peptide by matching all MSn spectra against it and re-ranking candidate peptides according to their matches to the MSn spectra. Peaks in de novo reconstructions were scored against MSn spectra using a likelihood scoring scheme as described in [5]. De novo sequences derived from TOF MS3 spectra were also cyclized and scored against the MS3 spectrum; MS3/MSn match scores and matched peak intensities were combined using linear discriminant analysis.
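The candidate-selection rule (keep reconstructions scoring above p · Score(P)) can be illustrated with a small sketch. Everything here is hypothetical: the function name `select_candidates`, the toy scores, the inclusive comparison, and the assumption that scores are non-negative. The anti-symmetric path algorithm and the MSn re-ranking are not reproduced.

```python
def select_candidates(scored, p):
    """Keep reconstructions whose score reaches p * Score(P), best first.

    scored: dict mapping a reconstruction (tuple of masses) to its de novo score.
    p:      fraction of the best score a candidate must reach (0 < p <= 1).
    The comparison is inclusive so that the best reconstruction P is always
    kept; this is an assumption, as is the use of non-negative scores.
    """
    if not scored:
        return []
    best = max(scored.values())
    kept = [pep for pep, score in scored.items() if score >= p * best]
    return sorted(kept, key=lambda pep: scored[pep], reverse=True)


if __name__ == "__main__":
    # Toy scores, made up for the example.
    toy = {
        (163, 186, 128, 99, 147): 10.0,
        (349, 128, 99, 147): 8.0,
        (163, 186, 227, 147): 4.0,
    }
    for peptide in select_candidates(toy, p=0.75):
        print(peptide, toy[peptide])
```

With p = 0.75 the threshold is 7.5, so the first two toy reconstructions survive and would then be passed to the MSn re-ranking step.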

The pseudocode for NRP-Sequencing is presented in Figure 6. Results of NRP-Sequencing are in Table 2.


NRP-Sequencing Results

Conclusions

• De novo Reconstructions
• A de novo Reconstruction
• Combining Reconstructions

Acknowledgments

• Computer Science Department, UCSD: Nuno Bandeira and Pavel Pevzner

• Department of Chemistry and Biochemistry, UCSD: Wei-Ting Liu, Dario Meluzzi, Majid Ghassemian and Pieter Dorrestein

• Scripps Institution of Oceanography, UCSD: Marcelino Gutierrez, Thomas Simmons, Andrew Schultz, Bradley Moore, William Gerwick, William Fenical and Katherine Maloney.

• Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD: Bradley Moore, William Gerwick and Pieter Dorrestein.

• Department of Chemistry, UCSC: Roger Linington

• Computer Science Laboratory of Lille, USTL: Gregory Kucherov and the NORINE team

Demo

• http://lol.ucsd.edu/ms-cpa_v1/Input.py (annotation only)
• http://rofl.ucsd.edu/nrp (annotation and identification)
• http://lmao.ucsd.edu/nrp (alpha site)
