TRANSCRIPT
GNPS: Global Natural Products Social Molecular Networking
Delivering data-enabled, community-driven research
Mingxun Wang1,2,4, Jeremy Carver1,4, Julie Wertz1,4, Laurence
Bernstein1,4, Seungjin Na1,2,4,
Nuno Bandeira1-4
http://gnps.ucsd.edu
1 Center for Computational Mass Spectrometry
2 Computer Science and Engineering
3 Skaggs School of Pharmacy and Pharmaceutical Sciences
4 University of California, San Diego
Center for Computational Mass Spectrometry
http://proteomics.ucsd.edu
Each gene may generate several proteins – the functional “workhorses” of the cell
– Combinatorial splicing
– Sequence variation
– Post-Translational Modifications
• Hundreds of types and sites
– Quantification and turnover
– Protein Structure and interactions
– Endogenous and Immune peptides
– Microbiome: 100-300x more genes
Proteins run life
How much is in the genes?
– Human: ~20,000-22,000
– Mouse: ~20,000-22,000
– Worm: ~19,000 (C. elegans)
– Rice/Corn: ~32,000 – 45,000
EMILG
EFILG
TRADITIONAL WORKFLOW
Design sample
MS Raw data
“De novo”
interpretation
Reference
lookup
http://media.gizmodo.co.uk/wp-content/uploads/2011/11/ScientistAtAComputer.jpg
http://planetorbitrap.com/data/fe/image/Fusion_Image.jpg
Mass spectrometry generates ‘fingerprints’ for biomolecules (10k-250k spectra, up to 100 GB per sample)
Identified peptide spectrum determines protein presence and abundance
PARADIGM SHIFT → NETWORKED RESEARCH
Design sample
MS Raw data
“De novo”
interpretation
Reference
lookup
Which spectra were identified before?
What are the similar datasets?
Who analyzed similar data?
How good is this identification?
What are the most important or common unidentified spectra?
“BIG” SOLUTIONS?
Big Data
Big Compute
Big Algorithms
Big Community
MassIVE: Mass Spectrometry Interactive Virtual Environment
ProteoSAFe: Proteomics Scalable, Accessible and Flexible environment
Thousands of datasets, hundreds of terabytes
50+ data analysis workflows scalable to thousands of cores
Designed to build on rather than just ‘tolerate’ big data
Empower and enable community-wide sharing of knowledge
[Figure: peptide sequence [271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S; labels: b-ions in each spectrum, mass difference between b-ions, oxidized methionine]
http://massive.ucsd.edu http://proteomics.ucsd.edu/software
http://proteomics.ucsd.edu/ProteoSAFe http://gnps.ucsd.edu
“KNOWING” THE PROTEOME
To “know” the proteome we should be able to:
Present the evidence for any identified peptide, protein or post-translational modification
Basic: tandem MS evidence for claimed observations
Provenance: experimental conditions of data acquisition
Structure the knowledge in a reusable format
Personal knowledge is not community knowledge unless it is reusable in some way
Simplest form of reutilization is database that can be consulted – useful but low throughput
Searchable spectral libraries enable high-throughput automated reutilization
PNNL microbial reference – largest dataset to date (12 TB) includes 100+ searchable libraries
How much do we “know”?
Spectral Libraries: curated collections of identified reference spectra
SPECTRAL LIBRARIES COVERAGE
Current coverage of 19,874 Human genes
12,773 with 1+ unique peptides
2,250 with shared peptides only
4,851 with no peptides
Average protein sequence coverage is ~15%
Low coverage of active/functional sites
              NIST     SWATH Atlas
% Peptides    14.9%    12.0%
# Peptides    207K     152K
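As a quick check of the gene-level numbers above: read as three disjoint categories, the counts add up to the 19,874 genes. The snippet below is only an illustrative sketch; the dictionary keys are labels introduced here, not terms used by the libraries.

```python
# Gene counts as quoted on the slide; the category labels are illustrative.
TOTAL_GENES = 19_874
coverage = {
    "unique_peptides": 12_773,  # genes with 1+ unique peptides
    "shared_only": 2_250,       # genes seen only through shared peptides
    "no_peptides": 4_851,       # genes with no peptides at all
}

# The three categories partition the full gene set.
assert sum(coverage.values()) == TOTAL_GENES

# Share of each category, in percent of all genes.
shares = {name: 100.0 * count / TOTAL_GENES for name, count in coverage.items()}
for name, pct in shares.items():
    print(f"{name:>16}: {pct:5.1f}%")
```

Roughly a quarter of the genes have no library peptide at all under this reading.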
BIG DATA TO THE RESCUE?
70 terabytes of Human mass spectrometry data available in 976 public ProteomeXchange datasets (~1,000 papers) containing 88,049 LCMS runs
How limited is the knowledge shared through ProteomeXchange?
86% - partial submissions, no identifications
6% - data format lost provenance metadata (e.g., only peptide identifications)
8% - have results files with provenance information
– No statistical controls: 6.7M IDs
– Report FDR > 1%: 4.5M IDs
– Report FDR ≤ 1%: 1.7M IDs + 14M IDs (CPTAC colorectal)
Final tally: ~4.7% of all data minimally ready to be reused (<1% without CPTAC colorectal)
MASSIVE REANALYSIS
Community knowledge requires reproducible, well-characterized results
MS-GF+ standard database search
Search workflow, source code and exact search parameters available at http://proteomics.ucsd.edu/ProteoSAFe
Everyone can reproduce the searches and test the search protocol
Reanalyzed 14 TB of Human data with ~200M MS/MS spectra
95 million new statistically controlled identifications (~8-50x more)
4.2 million modified versions of 3.1 million unique peptide sequences
Search results are attached to each dataset
Interactive visualization, available for download in open formats
http://massive.ucsd.edu/ProteoSAFe/massive_search.jsp
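“Statistically controlled” identifications are commonly obtained in proteomics with a target-decoy estimate of the false discovery rate (FDR). The sketch below assumes that approach and a simple (score, is_decoy) record per identification. It is not the MS-GF+ or ProteoSAFe implementation; the function name, record layout and toy scores are hypothetical.

```python
from typing import List, Tuple

def filter_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> List[float]:
    """Return the target scores accepted at the given FDR.

    psms: (score, is_decoy) pairs, where a higher score is a better match.
    The FDR at a score cutoff is estimated as #decoys / #targets above it.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0  # number of top-ranked identifications to keep
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = i  # deepest rank where the estimate is still acceptable
    return [score for score, is_decoy in ranked[:best_cut] if not is_decoy]

# Toy data: five target hits and three decoy hits.
psms = [(9.0, False), (8.0, False), (7.0, False), (6.0, False),
        (5.0, True), (4.0, False), (3.0, True), (2.0, True)]
strict = filter_at_fdr(psms)               # default 1% FDR
loose = filter_at_fdr(psms, max_fdr=0.25)  # relaxed threshold for illustration
```

With the toy data, the strict filter stops before the first decoy, while the relaxed one also admits the target ranked just below it.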
BROADER LOOK: CPTAC COLON CANCER
125 colorectal samples: 95 TCGA cases, 30 controls
Imported CPTAC results
– 6.9M IDs
MS-GF+ database search
– 8.9M IDs, 70k variants (169k total)
Spectral library search (MSPLIT)
– 10M IDs, including 387K mixture spectra
Proteogenomics searches of TCGA transcriptomics sequences (Enosi)
– 6.8M total IDs, 19,728 proteogenomic events
Blind modification search (MODa)
– 7.8M IDs, 2.8M IDs for 221k variants (306k total)
– 203K new variants (unique modified peptides)
Enosi: K.K(VA)LGAFSDGLAHLDNLK.G
MODa: K.K(V,-28)LGAFSDGLAHLDNLK.G
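The two annotations above appear to describe the same variant: a Val→Ala substitution (Enosi) and a −28 Da shift on Val (MODa). A minimal check using standard monoisotopic residue masses; the helper name is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the two amino acids involved.
RESIDUE_MASS = {"V": 99.06841, "A": 71.03711}

def substitution_delta(original: str, replacement: str) -> float:
    """Mass change (Da) when `replacement` takes the place of `original`."""
    return RESIDUE_MASS[replacement] - RESIDUE_MASS[original]

delta = substitution_delta("V", "A")
print(f"V->A mass shift: {delta:+.3f} Da")
```

The shift rounds to −28, so a blind modification search and a sequence-variant search can report the same event in two different notations.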
TRANSLATING GENOMICS ALGORITHMS
[Figure: spectral network with nodes 1-8 (labels: Spectral Networks, MassIVE); peptide sequence [271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S; b-ions in each spectrum, mass difference between b-ions, oxidized methionine]
1 KQGGTLDD LEE QAREL
2 KQGGTLDD LEE QARE
3 KQGGTLDD LEE QAR
4 KQGGTLDD LEE QA
5 KQGGTLDD LEE-18QAR
6 KQGGTLDD LEE-18Q
7 QGGTLDD LEE QAR
8 QGGTLDD-53LEE QAR
Spectrum Alignment
and Assembly
Global alignment
Local alignment
Sequence Alignment
and Assembly
SPECTRAL NETWORKS ANALYSIS
New information from algorithms that were not available at time of publication
(V,28)LGAFSDGLAHLDNLK
VLGAFSDGLAHLDNLK
K(V,-28)LGAFSDGLAHLDNLK
Ac-VLGAFSDGLAHLDNLK
SPECTRAL NETWORKS OF CATARACTOUS LENS
Tandem MS acquisition sensitivity is still very low for deep PTM analysis – we typically don’t see most peptides and most modifications for lysate experiments
Cataractous Lens data
Lens samples from 9 different patients
Spans stages of development, age and disease
Lens proteins do not turn over, so they accumulate modifications over time
Also technically useful
Sample complexity is low enough to allow for deep acquisition of low abundance peptide variants
Lens dataset is probably the most analyzed for discovery of modifications (original publication in 2005)
Searle et al, J Proteome Res. 4(2):546-54, 2005
RAVEN ANALYSIS OF CATARACTOUS LENS
Improved spectrum identification
Discovery of proteome diversity
~100 different variations, defined as (mass,site) pairs
Every variation is supported by at least one spectral pair with corroborating fragmentation patterns
[Venn diagram of identifications: MODa 5699, MS-GF+ 10462, RaVen 11753; region counts: 13032882, 3432, 4434, 1002, 3435, 512]
                  Spectral Networks   MS-GF+    MODa
Identifications   15,771              10,462    6,355
Gain                                  +51%      +148%
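The gains follow directly from the identification counts above. A short worked check, with illustrative variable names:

```python
# Identification counts as quoted on the slide.
spectral_networks = 15_771
baselines = {"MS-GF+": 10_462, "MODa": 6_355}

def percent_gain(new: int, old: int) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

gains = {tool: percent_gain(spectral_networks, n) for tool, n in baselines.items()}
for tool, gain in gains.items():
    print(f"vs {tool}: +{gain:.0f}%")
```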
VQDDFVEIHGK VQ(D,14)DFVEIHGK
Identifications also in
multiple charge states
z=1
z=3
VQ(D,14)DFVEIHGK
REVEALING PROTEOME DIVERSITY
Proteome diversity is determined by variation within protein regions
Spectral networks significantly improve the detection of variants
Highly-variant regions with 70+ variants
78 regions in 14 proteins with 10+ variants
34 regions in 10 proteins with 20+ variants
[Venn diagram of variants: MODa 1170, MS-GF+ 1327, RaVen 2275; region counts: 227, 323, 1251, 244, 475, 224, 305]
RYRLPSNVDQSALSCSLSADGMLTFCGPK
RYRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSAL
YRLPSNVDQSALS
YRLPSNVDQSALSC
YRLPSNVDQSALSCS
YRLPSNVDQSALSCSL
YRLPSNVDQSALSCSLS
YRLPSNVDQSALSCSLSA
YRLPSNVDQSALSCSLSAD
YRLPSNVDQSALSCSLSADGML
YRLPSNVDQSALSCSLSADGMLTF
YRLPSNVDQSALSCSLSADGMLTF
YRLPSNVDQSALSCSLSADGMLTF
YRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
xYRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
xYRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
YRLPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSC
LPSNVDQSALSC
LPSNVDQSALSCSL
LPSNVDQSALSCSLS
LPSNVDQSALSCSLSA
LPSNVDQSALSCSLSADGMLTF
LPSNVDQSALSCSLSADGMLTF
LPSNVDQSALSCSLSADGMLTF
LPSNVDQSALSCSLSADGMLTF
LPSNVDQSALSCSLSADGMLTFCGP
LPSNVDQSALSCSLSADGMLTFCGP
LPSNVDQSALSCSLSADGMLTFCGP
LPSNVDQSALSCSLSADGMLTFCGP
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
xLPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPK
LPSNVDQSALSCSLSADGMLTFCGPKI
SNVDQSALSCSLSADGMLTFCGPK
SNVDQSALSCSLSADGMLTFCGPK
SNVDQSALSCSLSADGMLTFCGPK
NVDQSALSCSLSADGMLTFCGPK
NVDQSALSCSLSADGMLTFCGPK
VDQSALSCSLSADGMLTFCGPK
QSALSCSLSADGMLTFCGPK
QSALSCSLSADGMLTFCGPK
QSALSCSLSADGMLTFCGPK
QSALSCSLSADGMLTFCGPK
SALSCSLSADGMLTFCGPK
SCSLSADGMLTFCGPK
SLSADGMLTFCGP
SLSADGMLTFCGPK
SLSADGMLTFCGPK
SLSADGMLTFCGPK
SLSADGMLTFCGPK
xSLSADGMLTFCGPK
SLSADGMLTFCGPK
LSADGMLTFCGPK
SADGMLTFCGPK
SADGMLTFCGPK
ADGMLTFCGPK
GMLTFCGPK
Alpha-crystallin A
aa 119-145
78 variants
ALPHA-CRYSTALLIN A
Only present in cataracts group
VQDDFVEIHGK
VQ(D,14)DFVEIHGK
Cataract-specific modification (methylation) or polymorphism (D→E, bovine homolog) on Alpha-crystallin A
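The +14 Da shift on Asp cannot by itself separate the two explanations: methylation (a net gain of CH2) and an Asp→Glu substitution change the mass by the same amount. A minimal sketch with standard monoisotopic masses; the constant names are illustrative.

```python
# Standard monoisotopic masses (Da).
CARBON, HYDROGEN = 12.0, 1.00782503
ASP_RESIDUE, GLU_RESIDUE = 115.02694, 129.04259

methylation = CARBON + 2 * HYDROGEN     # net gain of CH2
asp_to_glu = GLU_RESIDUE - ASP_RESIDUE  # D -> E substitution

print(f"methylation: +{methylation:.4f} Da")
print(f"D->E:        +{asp_to_glu:.4f} Da")
# Glu is Asp plus one CH2, so the two shifts coincide and the mass
# difference alone cannot tell them apart.
```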
New discoveries in
decade-old data
NATURAL PRODUCTS
DNA makes RNA makes PROTEIN makes … PEPTIDE
transcription translation non-ribosomal peptide synthesis
Mass spectrometry analysis of natural products differs from proteomics and ribosomal peptides in many respects:
– non-linear structures, e.g., cyclic and branch-cyclic peptides
– they often contain non-standard amino acids, increasing the number of possible building blocks from 20 to 100+
– they are often heavily modified with complex PTMs
– they often have a non-standard backbone
Each of these complications (let alone all of them) invalidates traditional algorithms for MS-based peptide sequencing and identification.
Natural products account for most antibiotics in clinical use and for 75% of antibacterials introduced in 1980-2000:
– antibiotics (penicillin, vancomycin, etc.)
– immunosuppressants (cyclosporin)
– antiviral agents (luzopeptin A)
– antitumor agents (bleomycin), …
MOLECULAR SPECTRAL NETWORKS
[Figure: two ion-trap MS2 spectra from file "subtilis ACN 0_1FA MS2_110124234050" (scans #676-1265, RT 9.89-15.89, and scans #28-138, RT 0.79-0.88); x-axis m/z 300-1100, y-axis relative abundance 0-100]
Watrous et al., PNAS 2012
Crowdsourced curated libraries
Share data
Explore unknown molecules
Co-analyze private+public data
ProteoSAFe
MassIVE
gnps.ucsd.edu
First MassIVE Knowledge Base, open March 2014
Wang M et al, Nature Biotech, 2016
GNPS DATA SHARING
Since March 2014
537 datasets
167,058 files
3.6 TB of data
100s of species
IDENTIFICATION: SPECTRAL LIBRARY SEARCH
GNPS Spectral Libraries
New public or private datasets
Dereplication: spectral library search
– Exact match: compound identification using spectral matching
– Analog match: find putative related molecules using approximate spectral matching (spectral alignment)
One-step comparison against all accumulated knowledge of known molecules
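One common way to realize exact and analog matching is a cosine score over matched fragment peaks, where the analog variant also accepts peaks displaced by the precursor mass difference. The sketch below is a simplified greedy version under that assumption. It is not the GNPS implementation; the function name, tolerance and toy spectra are hypothetical.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

def match_score(query: List[Peak], ref: List[Peak],
                shift: float = 0.0, tol: float = 0.02) -> float:
    """Greedy cosine-style score between two peak lists.

    shift = 0.0 gives an exact-match score. A non-zero shift (query precursor
    minus reference precursor) additionally lets a reference peak match a
    query peak displaced by that mass, as in an analog search.
    """
    def norm(peaks: List[Peak]) -> float:
        return math.sqrt(sum(i * i for _, i in peaks)) or 1.0

    # Candidate pairs: same m/z, or m/z offset by the precursor difference.
    pairs = []
    for qi, (qmz, qint) in enumerate(query):
        for ri, (rmz, rint) in enumerate(ref):
            if abs(qmz - rmz) <= tol or (shift and abs(qmz - rmz - shift) <= tol):
                pairs.append((qint * rint, qi, ri))

    # Greedily take the highest-product pairs, using each peak at most once.
    used_q, used_r, dot = set(), set(), 0.0
    for product, qi, ri in sorted(pairs, reverse=True):
        if qi not in used_q and ri not in used_r:
            used_q.add(qi)
            used_r.add(ri)
            dot += product
    return dot / (norm(query) * norm(ref))

# Toy spectra: the analog shares one peak and has two peaks shifted by +14.
reference = [(100.0, 1.0), (200.0, 2.0), (300.0, 1.0)]
analog = [(100.0, 1.0), (214.0, 2.0), (314.0, 1.0)]

exact = match_score(analog, reference)                # only the 100.0 peak matches
shifted = match_score(analog, reference, shift=14.0)  # all three peaks match
```

In the toy case the exact score is low because only one peak aligns, while allowing the +14 shift recovers a near-perfect score. That is the idea behind finding putative related molecules.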
ProteoSAFe
GNPS SPECTRAL LIBRARIES
Open spectral libraries
Antibiotics and natural products
FDA approved drugs
10,454 spectra for 6,810 compounds
Over 500 library revisions by curators
Previously in antibiotics
Hardly any spectra publicly available
No natural products spectral libraries
CREATING AND CURATING SPECTRAL LIBRARIES
Quality levels:
– Gold: approved users, synthetic compounds
– Silver: published compounds
– Bronze: everything else
I know what this is!
HOW DO GNPS LIBRARIES COMPARE?
FINDING NEW ANALOGS
DATASET UPDATES AND COMMENTS
Dataset MSV0001
Dataset MSV0002
Dataset MSV0003
Here is some new data
This data seems great, but could you add this
Add/Request Missing Metadata
Datasets are almost never complete at the time of initial submission
DATASET SUBSCRIPTIONS
Dataset MSV0001
Dataset MSV0002
Dataset MSV0003
Subscribe to this interesting dataset
What is a subscription?
• Digest of comments
• Digest of new annotations in dataset
All datasets are periodically re-searched against the whole knowledge base
As owner, you have metadata requests
‘Living’ data: promote and track evolution of public datasets
What’s new with my data?
SUBSCRIPTION EMAIL – YOUR DATA “CALLS HOME”
Datasets continuously reanalyzed + New results sent to subscribers
Public comments, metadata, etc
Datasets immediately available for reanalysis
Subscriptions available for new IDs, comments
‘LIVE’ DATASET VIEW
ENABLING LIVING DATA
We have shown that living data is a workable concept
Continuous reanalysis upgrades data to knowledge
– Automated reanalysis with new or updated algorithms
– Community-contributed reanalyses
Community-wide knowledge aggregated in crowdsourced spectral libraries
– Most identifications come from community reference spectra: people contribute what they see
– Hundreds of knowledge revisions improve reference metadata
New discoveries in published data
– Already over 20-fold increases since publication
– New algorithms being contributed by 3rd party computational groups
Buy-in from the community
– Thousands of users from over 100 countries, most joined before publication
– Data deposited at acquisition, not publication
– Subscriptions keep scientists engaged in the life cycle of published data
FUTURE OUTLOOK
What can the community do to help?
Scientists
– Data sharing will increase your productivity: avoid rediscovery, reuse knowledge
– Reutilization and reanalysis increase the impact of data that is otherwise lost
Journals
– Editors: publication guidelines should require proper deposition of supporting raw data
– Just raw data is not enough: complete submissions require linking claims to raw data
– Reviewers: confirm key claims and findings by checking supporting data at repository
– Data citations: the best datasets will support many publications whose citations may need to be indirectly accounted through the data (e.g., accession numbers only)
Academic institutions and funding agencies
– Understand that ‘data dumps’ are useless
– Published results are only as strong as the supporting data
– Link claims and conclusions from manuscript to the raw data
– Data is useful only in proportion to the knowledge it supports
ACKNOWLEDGEMENTS
Mingxun Wang, Julie Wertz, Laurence Bernstein, Seungjin Na, Jeremy Carver
GNPS, MassIVE, ProteoSAFe
Spectral networks, large-scale workflows