Spectrum clustering of PRIDE MS/MS data

Dr. Juan Antonio Vizcaíno

Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK

Juan A. Vizcaíno ([email protected])

Seminar, 20 June 2016

Overview

• Brief introduction to PRIDE

• PRIDE Cluster: motivation and concept, first implementation

• PRIDE Cluster second implementation

PRIDE (PRoteomics IDEntifications) database

http://www.ebi.ac.uk/pride

• PRIDE stores mass spectrometry (MS)-based proteomics data:

  • Peptide and protein expression data (identification and quantification)

  • Post-translational modifications

  • Mass spectra (raw data and peak lists)

  • Technical and biological metadata

  • Any other related information

• Full support for tandem MS approaches

Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016

ProteomeXchange: a global, distributed proteomics database

PASSEL (SRM data)

PRIDE (MS/MS data)

MassIVE (MS/MS data)

[Figure labels: Raw, ID/Q, Meta]

155 datasets/month since July 2015

Mandatory raw data deposition since July 2015

Motivation

• Data is stored in PRIDE as originally analysed by the submitters (no data reprocessing is done)

• Heterogeneous quality, difficult to make the data comparable

• Enable assessment of (published) proteomics data

• Pre-requisite for data reuse (e.g. in other bioinformatics resources such as UniProt)

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2013

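[Figure: originally submitted identified spectra, labelled NMMAACDPR and PPECPDFDPPR, grouped into clusters that are each summarised by a consensus spectrum]

Threshold: at least 10 spectra in a cluster and a ratio >70%, where the ratio is the share of spectra in the cluster assigned to the most common peptide sequence.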
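The threshold can be written as a small check. This is a minimal sketch and not the PRIDE Cluster code: it assumes the ratio is the share of identified spectra in a cluster that carry the most common peptide sequence, and the function names and the `min_size` / `min_ratio` parameters are hypothetical.

```python
from collections import Counter

def max_sequence_ratio(sequences):
    """Return (most common sequence, its share of the cluster's spectra)."""
    if not sequences:
        return None, 0.0
    sequence, count = Counter(sequences).most_common(1)[0]
    return sequence, count / len(sequences)

def is_reliable(sequences, min_size=10, min_ratio=0.7):
    """Apply the slide's threshold: enough spectra and one dominant sequence."""
    _, ratio = max_sequence_ratio(sequences)
    return len(sequences) >= min_size and ratio > min_ratio

# Toy cluster: 8 spectra identified as NMMAACDPR, 2 as PPECPDFDPPR.
cluster = ["NMMAACDPR"] * 8 + ["PPECPDFDPPR"] * 2
print(max_sequence_ratio(cluster))
print(is_reliable(cluster))
```

Keeping the size and ratio cut-offs as parameters makes it easy to try stricter or looser definitions of a reliable cluster on the same data.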

PRIDE Cluster: Implementation

• Griss et al., Nat. Methods, 2013

• Clustered all public, identified spectra in PRIDE

• EBI compute farm, LSF

• 20.7 M identified spectra

• 610 CPU days, two calendar weeks

• Validation, calibration

• Feedback into PRIDE datasets

PRIDE Archive

• World-leading repository for MS/MS-based proteomics data

• Founding member of ProteomeXchange

PRIDE Cluster

Sequence-based search engines

Spectrum clustering

Incorrectly identified or unidentified spectra

PRIDE Cluster: Second Implementation

First implementation (Griss et al., Nat. Methods, 2013):

• Clustered all public, identified spectra in PRIDE

• EBI compute farm, LSF

• 20.7 M identified spectra

• 610 CPU days, two calendar weeks

• Validation, calibration

• Feedback into PRIDE datasets

Second implementation (Griss et al., Nat. Methods, 2016, in press):

• Clustered all public spectra in PRIDE by April 2015

• Apache Hadoop

• Starting with 256 M spectra

• 190 M unidentified spectra (filtered to 111 M spectra that are likely to represent a peptide)

• 66 M identified spectra

• Result: 28 M clusters

• 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
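A rough back-of-the-envelope comparison of the two runs, using only the figures on this slide. It is a sketch under two loudly stated assumptions: the clustered input of the second run is taken as the 111 M filtered unidentified plus the 66 M identified spectra, and all 340 cores are treated as busy for the full 5 days, which gives an upper bound on core days and therefore a lower bound on throughput. CPU days on the LSF farm and core days on the Hadoop cluster are also not strictly comparable.

```python
# Figures from the slide (spectra counts in millions).
first_spectra_m = 20.7
first_cpu_days = 610

unidentified_m = 190
filtered_unidentified_m = 111
identified_m = 66

# Consistency check: 190 M + 66 M should match the 256 M starting spectra.
total_m = unidentified_m + identified_m

# Assumption: spectra actually clustered = filtered unidentified + identified.
second_spectra_m = filtered_unidentified_m + identified_m

# Assumption: every core busy for the whole run (upper bound on core days).
second_core_days = 5 * 340

first_rate = first_spectra_m * 1e6 / first_cpu_days
second_rate = second_spectra_m * 1e6 / second_core_days

print(f"total spectra: {total_m} M")
print(f"first run:  {first_rate:,.0f} spectra per CPU day")
print(f"second run: {second_rate:,.0f} spectra per core day (lower bound)")
```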

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2016

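[Figure: originally submitted identified spectra, labelled NMMAACDPR and PPECPDFDPPR, grouped into clusters that are each summarised by a consensus spectrum]

Threshold: at least 3 spectra in a cluster and a ratio >70%, where the ratio is the share of spectra in the cluster assigned to the most common peptide sequence.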

Output of the analysis

• 1. Inconsistent spectrum clusters

• 2. Clusters including identified and unidentified spectra

• 3. Clusters containing only unidentified spectra
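One way to sort clusters into these three output groups is sketched below. It is illustrative only: a cluster is represented as a list with one entry per spectrum (a peptide sequence string, or `None` for an unidentified spectrum), the 50% cut-off for an inconsistent cluster is the one quoted later in the talk, and the ratio is assumed to be computed over identified spectra only. The function name and labels are hypothetical.

```python
from collections import Counter

def classify_cluster(spectra, inconsistent_below=0.5):
    """Label a cluster; `spectra` holds a peptide sequence or None per spectrum."""
    identified = [s for s in spectra if s is not None]
    if not identified:
        return ["unidentified_only"]

    labels = []
    # Share of identified spectra carrying the most common sequence.
    top_count = Counter(identified).most_common(1)[0][1]
    if top_count / len(identified) <= inconsistent_below:
        labels.append("inconsistent")
    if len(identified) < len(spectra):
        labels.append("identified_and_unidentified")
    return labels or ["consistent_identified"]

print(classify_cluster([None, None]))
print(classify_cluster(["NMMAACDPR"] * 3 + [None]))
print(classify_cluster(["NMMAACDPR", "IGGIGTVPVGR", "PPECPDFDPPR"]))
```

A list of labels is returned because the groups are not mutually exclusive: a cluster can be inconsistent and also contain unidentified spectra.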

1. Re-analysis of inconsistent clusters

[Figure: a cluster of originally submitted identified spectra assigned to different peptides (NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK), summarised by one consensus spectrum]

No sequence has a proportion in the cluster >50%.

1. Re-analysis of inconsistent clusters

• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem.

• 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin.

• In these cases, it is likely that a contaminant database was not used in the original search.
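A simple way to spot such clusters after re-analysis is to flag those whose new protein assignment belongs to one of the contaminant families named above. The sketch below is hypothetical throughout: the `reanalysed` dictionary, its cluster ids and the keyword matching are illustrative assumptions, not the procedure used in the study.

```python
# Contaminant families named on the slide; keyword matching is a simplification.
CONTAMINANT_KEYWORDS = ("keratin", "trypsin", "albumin", "hemoglobin")

def is_contaminant(protein_description):
    """True if the protein description mentions a listed contaminant family."""
    text = protein_description.lower()
    return any(keyword in text for keyword in CONTAMINANT_KEYWORDS)

# Hypothetical re-analysis output: cluster id -> protein description.
reanalysed = {
    "cluster_1": "Keratin, type II cytoskeletal 1",
    "cluster_2": "Serum albumin",
    "cluster_3": "Elongation factor 1-alpha 1",
}
flagged = sorted(c for c, desc in reanalysed.items() if is_contaminant(desc))
print(flagged)

# Share reported on the slide: 453 of 3,997 re-analysed clusters.
share = 453 / 3997
print(f"{share:.1%}")
```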

Validation

2. Inferring identifications for originally unidentified spectra

• 9.1 M unidentified spectra were contained in clusters with a reliable identification.

• These are candidate new identifications (that need to be confirmed), often missed because of the search engine settings used originally.

• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
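The inference step can be sketched as follows: within a reliable cluster, the dominant peptide sequence is proposed as a candidate identification for every unidentified spectrum. This is a minimal illustration, not the PRIDE Cluster implementation. It assumes a cluster is a list of `(spectrum_id, sequence_or_None)` pairs, that cluster size counts all spectra, and that the ratio is computed over identified spectra only; the default thresholds follow the values quoted earlier in the talk (at least 3 spectra, ratio >70%).

```python
from collections import Counter

def candidate_identifications(cluster, min_size=3, min_ratio=0.7):
    """Map unidentified spectrum ids to the cluster's dominant peptide sequence.

    `cluster` is a list of (spectrum_id, sequence_or_None) pairs. An empty
    dict is returned unless the cluster passes the reliability thresholds.
    """
    identified = [seq for _, seq in cluster if seq is not None]
    if len(cluster) < min_size or not identified:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: sequence for sid, seq in cluster if seq is None}

cluster = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", None),
    ("s4", "NMMAACDPR"),
]
print(candidate_identifications(cluster))
```

The returned assignments are candidates only and would still need to be confirmed, for example by re-searching the spectra with adjusted settings.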

3. Consistently unidentified clusters

• 19 M clusters contain only unidentified spectra.

• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).

• Most are likely to be derived from peptides.

• They could correspond to PTMs or variant peptides.

• With various methods, we found likely identifications for about 20%.

• A vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource

• http://www.ebi.ac.uk/pride/cluster

• Spectral libraries for 16 species.

• All clustering results, as well as specific subsets of interest, are available.

• Source code (open source) and Java API

Applications of spectrum clustering…

• In individual datasets or small groups of “similar” datasets:

  • Can be used to target spectra that are “consistently” unidentified.

  • Unidentified spectra could represent PTMs or sequence variants.

  • Try “more-expensive” computational analysis methods (e.g. spectral searches, de novo).

• When mixing identified and unidentified spectra from different experiments, if PTMs that were not found initially are identified, one could modify the initial search parameters.

• For quantification purposes, the alignment of different runs could be improved by clustering the spectra first?

Acknowledgements

Johannes Griss

Rui Wang

Yasset Perez-Riverol

Steve Lewis

Henning Hermjakob

OpenMS team (led by O. Kohlbacher)

David Tabb

The rest of the PRIDE team, especially Noemi del Toro and Jose A. Dianes

Questions?