gnps: global natural products social molecular networking · spectral libraries coverage current...

GNPS: Global Natural Products Social Molecular Networking

Delivering data-enabled, community-driven research

Mingxun Wang1,2,4, Jeremy Carver1,4, Julie Wertz1,4, Laurence Bernstein1,4, Seungjin Na1,2,4, Nuno Bandeira1-4

http://gnps.ucsd.edu

1 Center for Computational Mass Spectrometry

2 Computer Science and Engineering

3 Skaggs School of Pharmacy and Pharmaceutical Sciences

4 University of California, San Diego

Center for Computational Mass Spectrometry

http://proteomics.ucsd.edu

Each gene may generate several proteins – the functional “workhorses” of the cell

– Combinatorial splicing

– Sequence variation

– Post-Translational Modifications

• Hundreds of types and sites

– Quantification and turnover

– Protein Structure and interactions

– Endogenous and Immune peptides

– Microbiome: 100-300x more genes

Proteins run life

How much is in the genes?

– Human: ~20,000-22,000

– Mouse: ~20,000-22,000

– Worm: ~19,000 (C. elegans)

– Rice/Corn: ~32,000 – 45,000

EMILG

EFILG


TRADITIONAL WORKFLOW

Design sample

MS Raw data

“De novo” interpretation

Reference lookup

http://media.gizmodo.co.uk/wp-content/uploads/2011/11/ScientistAtAComputer.jpg

http://planetorbitrap.com/data/fe/image/Fusion_Image.jpg

Mass spectrometry generates ‘fingerprints’ for biomolecules (10k-250k spectra, up to 100 GB per sample)

Identified peptide spectrum determines protein presence and abundance

PARADIGM SHIFT → NETWORKED RESEARCH

Design sample

MS Raw data

“De novo” interpretation

Reference lookup

Which spectra were identified before?

What are the similar datasets?

Who analyzed similar data?

How good is this identification?

What are the most important or common unidentified spectra?

“BIG” SOLUTIONS?

Big Data

Big Compute

Big Algorithms

Big Community

MassIVE: Mass Spectrometry Interactive Virtual Environment

ProteoSAFe: Proteomics Scalable, Accessible and Flexible environment

Thousands of datasets, hundreds of terabytes

50+ data analysis workflows scalable to thousands of cores

Designed to build on rather than just ‘tolerate’ big data

Empower and enable community-wide sharing of knowledge

[271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S

b-ions in each spectrum Mass difference between b-ions Oxidized Methionine

http://massive.ucsd.edu http://proteomics.ucsd.edu/software

http://proteomics.ucsd.edu/ProteoSAFe http://gnps.ucsd.edu



“KNOWING” THE PROTEOME

To “know” the proteome we should be able to:

Present the evidence for any identified peptide, protein or post-translational modification

Basic: tandem MS evidence for claimed observations

Provenance: experimental conditions of data acquisition

Structure the knowledge in a reusable format

Personal knowledge is not community knowledge unless it is reusable in some way

The simplest form of reuse is a database that can be consulted – useful but low throughput

Searchable spectral libraries enable high-throughput automated reutilization

PNNL microbial reference – largest dataset to date (12 TB) includes 100+ searchable libraries

How much do we “know”?


Spectral Libraries: curated collections of identified reference spectra



SPECTRAL LIBRARIES COVERAGE

Current coverage of 19,874 Human genes

12,773 with 1+ unique peptides (2,250 with shared peptides only)

4,851 with no peptides

Average protein sequence coverage is ~15%

Low coverage of active/functional sites

              NIST     SWATH Atlas
% Peptides    14.9%    12.0%
# Peptides    207K     152K
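The three gene counts on this slide add up to the stated total of 19,874 genes, so they can be read as shares of the human gene set. A minimal sketch of that arithmetic follows; the category labels and the function name are illustrative, only the counts come from the slide.

```python
# Gene counts as stated on the slide.
TOTAL_GENES = 19_874
counts = {
    "1+ unique peptides": 12_773,
    "shared peptides only": 2_250,
    "no peptides": 4_851,
}


def coverage_shares(counts, total):
    """Return each category's share of the total as a percentage."""
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = coverage_shares(counts, TOTAL_GENES)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Read this way, roughly a quarter of human genes have no peptide in the libraries at all, and about one in nine is represented only by peptides shared with other genes.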





BIG DATA TO THE RESCUE?

70 terabytes of Human mass spectrometry data available in 976 public ProteomeXchange datasets (~1,000 papers) containing 88,049 LCMS runs

How limited is the knowledge shared through ProteomeXchange?

86% - partial submissions, no identifications

6% - data format lost provenance metadata (e.g., only peptide identifications)

8% - have results files with provenance information

– No statistical controls – 6.7M IDs

– Report FDR > 1% - 4.5M IDs

– Report FDR ≤ 1% - 1.7M IDs + 14M IDs (CPTAC colorectal)

Final tally: ~4.7% of all data minimally ready to be reused (<1% without CPTAC colorectal)





MASSIVE REANALYSIS

Community knowledge requires reproducible, well-characterized results

MS-GF+ standard database search

Search workflow, source code and exact search parameters available at http://proteomics.ucsd.edu/ProteoSAFe

Everyone can reproduce the searches and test the search protocol

Reanalyzed 14 TB of Human data with ~200M MS/MS spectra

95 million new statistically controlled identifications (~8-50x more)

4.2 million modified versions of 3.1 million unique peptide sequences

Search results are attached to each dataset

Interactive visualization, available for download in open formats
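The slides do not say how the "statistically controlled" identifications were filtered. One widely used approach in database search is target-decoy FDR estimation, sketched below as an assumption rather than as the MassIVE workflow. The function name, the (score, is_decoy) input format and the simple decoys/targets estimator are all illustrative choices.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays <= max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score is a better match.
    The FDR at a cutoff is estimated as (#decoys) / (#targets) at or above it.
    Returns None if no cutoff satisfies the bound.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranking that still meets the bound.
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


# Toy data: 200 target matches and 3 decoy matches.
psms = [(s, False) for s in range(801, 1001)]
psms += [(900.5, True), (850.5, True), (849.5, True)]
cutoff = fdr_threshold(psms)
accepted = [s for s, is_decoy in psms if not is_decoy and s >= cutoff]
print(cutoff, len(accepted))
```

Reporting the cutoff alongside the accepted identifications is what lets a later reader judge "how good is this identification", which is the provenance the slides argue most public submissions lack.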

http://massive.ucsd.edu/ProteoSAFe/massive_search.jsp


BROADER LOOK: CPTAC COLON CANCER

125 colorectal samples: 95 TCGA cases, 30 controls

Imported CPTAC results

– 6.9M IDs

MS-GF+ database search

– 8.9M IDs, 70k variants (169k total)

Spectral library search (MSPLIT)

– 10M IDs, including 387K mixture spectra

Proteogenomics searches of TCGA transcriptomics sequences (Enosi)

– 6.8M total IDs, 19,728 proteogenomic events

Blind modification search (MODa)

– 7.8M IDs, 2.8M IDs for 221k variants (306k total)

203K new variants (unique modified peptides)

Enosi: K.K(VA)LGAFSDGLAHLDNLK.G

MODa: K.K(V,-28)LGAFSDGLAHLDNLK.G
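The two annotations of this peptide appear to describe the same event. Reading the Enosi notation (VA) as a valine-to-alanine substitution (an assumption; the slide does not spell it out), the expected mass change is about -28.03 Da, which agrees with the -28 offset that MODa places on V. A minimal sketch of that check, using standard monoisotopic residue masses; the function name and tolerance are illustrative:

```python
from itertools import permutations

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitutions_matching(delta, residue=None, tol=0.02):
    """List (from, to, shift) substitutions whose mass shift is within tol of delta.

    If residue is given, only substitutions starting from that residue are kept.
    """
    hits = []
    for a, b in permutations(RESIDUE_MASS, 2):
        if residue is not None and a != residue:
            continue
        shift = RESIDUE_MASS[b] - RESIDUE_MASS[a]
        if abs(shift - delta) <= tol:
            hits.append((a, b, round(shift, 4)))
    return hits


# Which substitution on V explains a shift of about -28.03 Da?
print(substitutions_matching(-28.03, residue="V"))
```

Without the residue filter the same shift has more than one candidate substitution, which is why the site reported by the blind search matters when reconciling it with a proteogenomic result.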




TRANSLATING GENOMICS ALGORITHMS

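[Figure: sequence [271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S, annotated with the b-ions in each spectrum, the mass differences between b-ions, and an oxidized methionine]

Spectral Networks

MassIVE

[Figure: spectral network with nodes 1-8, corresponding to the peptides listed below]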

1 KQGGTLDD LEE QAREL

2 KQGGTLDD LEE QARE

3 KQGGTLDD LEE QAR

4 KQGGTLDD LEE QA

5 KQGGTLDD LEE-18QAR

6 KQGGTLDD LEE-18Q

7 QGGTLDD LEE QAR

8 QGGTLDD-53LEE QAR

Spectrum Alignment and Assembly

Global alignment

Local alignment

Sequence Alignment and Assembly

SPECTRAL NETWORKS ANALYSIS

New information from algorithms that were not available at time of publication

(V,28)LGAFSDGLAHLDNLK

VLGAFSDGLAHLDNLK

K(V,-28)LGAFSDGLAHLDNLK

Ac-VLGAFSDGLAHLDNLK

SPECTRAL NETWORKS OF CATARACTOUS LENS

Tandem MS acquisition sensitivity is still very low for deep PTM analysis – we typically don’t see most peptides and most modifications for lysate experiments

Cataractous Lens data

Lens samples from 9 different patients

Spans stages of development, age and disease

Proteins do not turn over, so they accumulate modifications over time

Also technically useful

Sample complexity is low enough to allow for deep acquisition of low abundance peptide variants

Lens dataset is probably the most analyzed for discovery of modifications (original publication in 2005)


Searle et al, J Proteome Res. 4(2):546-54, 2005



RAVEN ANALYSIS OF CATARACTOUS LENS

Improved spectrum identification

Discovery of proteome diversity

~100 different variations, defined as (mass,site) pairs

Every variation is supported by at least one spectral pair with corroborating fragmentation patterns
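The slides write a variation inline as "(residue,mass)", for example VQ(D,14)DFVEIHGK. A small sketch of how such a string could be turned into the (mass, site) pairs mentioned above; the parser is hypothetical and assumes no flanking-residue dots, terminal prefixes such as "Ac-", or substitution notation such as (VA).

```python
import re

# One residue, optionally written as "(X,mass)" when it carries a mass offset.
TOKEN = re.compile(r"\(([A-Z]),([+-]?\d+(?:\.\d+)?)\)|([A-Z])")


def parse_variant(annotated):
    """Split an annotated peptide into its plain sequence and (mass, site) pairs.

    Sites are 1-based positions in the plain sequence.
    """
    sequence = []
    variations = []
    for match in TOKEN.finditer(annotated):
        mod_residue, mass, plain_residue = match.groups()
        sequence.append(mod_residue or plain_residue)
        if mod_residue:
            variations.append((float(mass), len(sequence)))
    return "".join(sequence), variations


print(parse_variant("VQ(D,14)DFVEIHGK"))
print(parse_variant("K(V,-28)LGAFSDGLAHLDNLK"))
```

Keeping the plain sequence separate from the (mass, site) list makes it straightforward to group all variants of one peptide or one protein region and count them.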

[Figure: Venn diagram of spectrum identifications shared among MODa, MS-GF+ and RaVen]

                   Spectral Networks    MS-GF+    MODa
Identifications    15,771               10,462    6,355
Gain                                    +51%      +148%

VQDDFVEIHGK VQ(D,14)DFVEIHGK

Identifications also in multiple charge states (z=1, z=3)

VQ(D,14)DFVEIHGK

REVEALING PROTEOME DIVERSITY

Proteome diversity is determined by variation within protein regions

Spectral networks significantly improve the detection of variants

Highly-variant regions with 70+ variants

78 regions in 14 proteins with 10+ variants

34 regions in 10 proteins with 20+ variants

[Figure: Venn diagram of variants detected by MODa, MS-GF+ and RaVen]

RYRLPSNVDQSALSCSLSADGMLTFCGPK

RYRLPSNVDQSALSCSLSADGMLTFCGPK

YRLPSNVDQSAL

YRLPSNVDQSALS

YRLPSNVDQSALSC

YRLPSNVDQSALSCS

YRLPSNVDQSALSCSL

YRLPSNVDQSALSCSLS

YRLPSNVDQSALSCSLSA

YRLPSNVDQSALSCSLSAD

YRLPSNVDQSALSCSLSADGML

YRLPSNVDQSALSCSLSADGMLTF



YRLPSNVDQSALSCSLSADGMLTFCGPK




xYRLPSNVDQSALSCSLSADGMLTFCGPK


xYRLPSNVDQSALSCSLSADGMLTFCGPK



LPSNVDQSALSC

LPSNVDQSALSC

LPSNVDQSALSCSL

LPSNVDQSALSCSLS

LPSNVDQSALSCSLSA

LPSNVDQSALSCSLSADGMLTF




LPSNVDQSALSCSLSADGMLTFCGP




LPSNVDQSALSCSLSADGMLTFCGPK















xLPSNVDQSALSCSLSADGMLTFCGPK


LPSNVDQSALSCSLSADGMLTFCGPKI

SNVDQSALSCSLSADGMLTFCGPK



NVDQSALSCSLSADGMLTFCGPK

NVDQSALSCSLSADGMLTFCGPK

VDQSALSCSLSADGMLTFCGPK

QSALSCSLSADGMLTFCGPK




SALSCSLSADGMLTFCGPK

SCSLSADGMLTFCGPK

SLSADGMLTFCGP

SLSADGMLTFCGPK

SLSADGMLTFCGPK

SLSADGMLTFCGPK

SLSADGMLTFCGPK

xSLSADGMLTFCGPK

SLSADGMLTFCGPK

LSADGMLTFCGPK

SADGMLTFCGPK

SADGMLTFCGPK

ADGMLTFCGPK

GMLTFCGPK

Alpha-crystallin A, aa 119-145: 78 variants



ALPHA-CRYSTALLIN A

Only present in cataracts group

VQDDFVEIHGK

VQ(D,14)DFVEIHGK

Cataract-specific modification (methylation) or polymorphism (D→E, bovine homolog) on Alpha-crystallin A


New discoveries in decade-old data


NATURAL PRODUCTS

DNA makes RNA makes PROTEIN makes … PEPTIDE

transcription translation non-ribosomal peptide synthesis

Mass spectrometry analysis of natural products differs from proteomics and ribosomal peptides in many respects:

– non-linear structures, e.g., cyclic and branch-cyclic peptides

– they often contain non-standard amino acids, increasing the number of possible building blocks from 20 to 100+

– they are often heavily modified with complex PTMs

– they often have a non-standard backbone

Each of these complications (let alone all of them) invalidates traditional algorithms for MS-based peptide sequencing and identification.

Natural products account for most antibiotics in clinical use and for 75% of antibacterials introduced in 1980-2000:

– antibiotics (penicillin, vancomycin, etc.)

– immunosuppressants (cyclosporin)

– antiviral agents (luzopeptin A)

– antitumor agents (bleomycin), …

MOLECULAR SPECTRAL NETWORKS

[Figure: two MS/MS spectra from file “subtilis ACN 0_1FA MS2_110124234050” (scans #676-1265, RT 9.89-15.89, and scans #28-138, RT 0.79-0.88); x-axis: m/z, y-axis: relative abundance]

Watrous et al., PNAS 2012

Crowdsourced curated libraries

Share data

Explore unknown molecules

Co-analyze private+public data

ProteoSAFe

gnps.ucsd.edu

First MassIVE Knowledge Base, open since March 2014

Wang M et al, Nature Biotech, 2016

Since March 2014

537 datasets

167,058 files

3.6 TB of data

100s of species

GNPS DATA SHARING

IDENTIFICATION: SPECTRAL LIBRARY SEARCH

GNPS Spectral Libraries

New public or private datasets

Dereplication spectral library search

– Exact match: compound identification using spectral matching

– Analog match: find putative related molecules using approximate spectral matching (spectral alignment)

One-step comparison against all accumulated knowledge of known molecules
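The difference between exact and analog matching can be illustrated with a toy scoring function. This is a simplified greedy sketch, not the GNPS implementation. It treats a spectrum as a list of (m/z, intensity) peaks and lets a peak match either at the same m/z or after shifting by the precursor mass difference. The function name, tolerance and toy peak lists are all assumptions.

```python
import math


def match_score(spec_a, spec_b, precursor_shift=0.0, tol=0.02):
    """Normalized dot product over greedily matched peaks.

    spec_a, spec_b: lists of (mz, intensity).
    A peak in spec_b matches a peak in spec_a if their m/z values agree within
    tol, either directly or after subtracting precursor_shift (B minus A).
    Each peak is used at most once; the largest intensity products go first.
    """
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            diff = mz_b - mz_a
            if abs(diff) <= tol or abs(diff - precursor_shift) <= tol:
                candidates.append((int_a * int_b, i, j))
    candidates.sort(reverse=True)

    used_a, used_b = set(), set()
    total = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product

    norm_a = math.sqrt(sum(x * x for _, x in spec_a))
    norm_b = math.sqrt(sum(x * x for _, x in spec_b))
    return total / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy spectra: the query looks like the library entry with two peaks shifted by +14.
library = [(100.0, 1.0), (200.0, 2.0), (300.0, 2.0)]
query = [(100.0, 1.0), (214.0, 2.0), (314.0, 2.0)]
exact = match_score(library, query)
analog = match_score(library, query, precursor_shift=14.0)
print(round(exact, 3), round(analog, 3))
```

With no shift allowed only the unshifted peak matches and the score stays low; allowing the shift recovers the remaining peaks, which is the intuition behind finding putative related molecules.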

ProteoSAFe

Open spectral libraries

Antibiotics and natural products

FDA approved drugs

10,454 spectra for 6,810 compounds

Over 500 library revisions by curators

Previously in antibiotics

Hardly any spectra publicly available

No natural products spectral libraries

GNPS SPECTRAL LIBRARIES

CREATING AND CURATING SPECTRAL LIBRARIES

Quality levels:

– Gold: approved users, synthetic compounds

– Silver: published compounds

– Bronze: everything else

I know what this is!

HOW DO GNPS LIBRARIES COMPARE?

FINDING NEW ANALOGS

DATASET UPDATES AND COMMENTS

Dataset MSV0001

Dataset MSV0002

Dataset MSV0003

Here is some new data

This data seems great, but could you add this

Add/Request Missing Metadata

Datasets are almost never complete at the time of initial submission

DATASET SUBSCRIPTIONS

Dataset MSV0001

Dataset MSV0002

Dataset MSV0003

Subscribe to this interesting dataset

What is a subscription?

• Digest of comments

• Digest of new annotations in dataset

All datasets are periodically re-searched against the whole knowledge base

As owner, you have metadata requests

‘Living’ data: promote and track evolution of public datasets

What’s new with my data?

SUBSCRIPTION EMAIL – YOUR DATA “CALLS HOME”

Datasets continuously reanalyzed + New results sent to subscribers

Public comments, metadata, etc

Datasets immediately available for reanalysis

Subscriptions available for new IDs, comments

‘LIVE’ DATASET VIEW

ENABLING LIVING DATA

We have shown that living data is a workable concept

Continuous reanalysis upgrades data to knowledge

– Automated reanalysis with new or updated algorithms

– Community-contributed reanalyses

Community-wide knowledge aggregated in crowdsourced spectral libraries

– Most identifications come from community reference spectra: people contribute what they see

– Hundreds of knowledge revisions improve reference metadata

New discoveries in published data

– Already over 20-fold increases since publication

– New algorithms being contributed by 3rd party computational groups

Buy-in from the community

– Thousands of users from over 100 countries, most joined before publication

– Data deposited at acquisition, not publication

– Subscriptions keep scientists engaged in the life cycle of published data

FUTURE OUTLOOK

What can the community do to help?

Scientists

– Data sharing will increase your productivity – avoid rediscovery, reuse knowledge

– Reutilization and reanalysis increase the impact of data that otherwise is lost

Journals

– Editors: publication guidelines should require proper deposition of supporting raw data

– Just raw data is not enough – complete submissions require linking claims to raw data

– Reviewers: confirm key claims and findings by checking supporting data at repository

– Data citations – the best datasets will support many publications whose citations may need to be indirectly accounted through the data (e.g., accession numbers only)

Academic institutions and funding agencies

– Understand that ‘data dumps’ are useless

– Published results are only as strong as the supporting data

– Link claims and conclusions from manuscript to the raw data

– Data is useful only in proportion to the knowledge it supports

ACKNOWLEDGEMENTS

Mingxun Wang, Julie Wertz, Laurence Bernstein, Seungjin Na, Jeremy Carver

GNPS, MassIVE, ProteoSAFe

Spectral networks, large-scale workflows
