Data standards from the Proteomics Standards Initiative
Andy Jones, [email protected], University of Liverpool
Overview
• HUPO-PSI background
• Data formats
  – Protein and peptide separations
    • GelML
    • spML
  – Mass spectrometry and proteomics informatics
    • mzML
    • mzIdentML
    • mzQuantML
HUPO-PSI background
• HUPO was founded in 2001 with several objectives:
  – Consolidate worldwide proteome organisations
  – Assist in the coordination of public proteome initiatives
  – Engage in scientific and educational activities
• Tissue proteome projects and other initiatives:
  – Plasma, Liver, Brain, Glyco and Antibody initiatives
  – Proteomics Standards Initiative (PSI)
• HUPO-PSI: “The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.”
• Main outputs are:
  – Minimum reporting guidelines (MIAPE modules)
  – Data exchange formats (usually in XML)
  – Ontologies or controlled vocabularies
PSI main outputs
• MIAPE: minimum information about a proteomics experiment
  – Information that should be recorded about a proteomics experiment (Taylor et al. Nature Biotechnology 25, 887-893; 2007)
  – Modules: gel electrophoresis, gel image informatics, capillary electrophoresis, column chromatography, mass spectrometry, mass spectrometry informatics and molecular interactions
• Data formats for:
  – molecular interactions
  – mass spectrometry
  – protein identifications
  – gel electrophoresis and other separation methods
• Plus supporting controlled vocabularies for each format
• All outputs must pass a stringent standardisation process
  – Specifications are reviewed by public comment and anonymous review
  – The PSI editor will not sign off a specification until the reviewers’ comments have been satisfied
PSI data formats
• Protein separation
  – GelML: GelML 1.0 released 2007-01-18; current version is GelML 1.1 (no formal release yet)
  – spML: milestone 2 in 2007; no active development
• Mass spectrometry
  – mzML (mass spec): mzML 1.0.0 released 2008-06-01; mzML 1.1.0 released 2009-06-01
  – Previous / related standards: mzData v1.0.5 (PSI), mzXML (from ISB)
• Proteomics informatics
  – mzIdentML (protein identifications): mzIdentML 1.0.0, 2009-08-20
  – mzQuantML (protein quantifications): early drafting only
• MI (molecular interactions): version 2.5
GelML
Data format for exchanging protocols and image data resulting from gel electrophoresis; an extension of FuGE
• Contents:
  – Models of 1D and 2D separation, electrophoresis protocol and detection, including DIGE
• Status:
  – v1.0 was built by extending the complete FuGE model; v1.1 extends from “FuGElight”
  – v1.1 simplifies protocols, e.g. for electrophoresis (free text, not parameterized)
  – v1.1 shares the same CV structure as mzML and mzIdentML
  – v1.1 is implemented in the ProteoRed MIAPE database, with a beta implementation in MIAPEGelDB (SIB)
spML
Data exchange format for non-gel based separations; an extension of FuGE
• Contents:
  – Multi-dimensional chromatography, plus a generic model for other types of separation (capillary electrophoresis, rotofors, centrifugation etc.)
• Status:
  – Milestone 2 extended from FuGE
  – Some work has been done to convert this to the same structure as GelML v1.1
  – No active development for some time; a decision is to be taken at the next PSI meeting about the community requirement for the format
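mzML History
[Timeline figure; labels recovered from the slide]
• Predecessor formats: mzData 1.05, mzXML 3.0
• Meetings and milestones: SFO 2006-05; DC 2006-09; ISB 2006-11; Lyon 2007-04; EBI 2007-06; PSI Doc Proc 2007-11; Toledo 2008-04; Turku 2009-04
• Versions: dataXML 0.6; mzML 0.90; mzML 0.91; mzML 0.99 RC; mzML 1.0.0 (release, 2008-06); mzML 1.1.0 RC5; mzML 1.1.0 (release, 2009-06)
• Phases: early development, final development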
mzML
[Schema figure; element names recovered from the slide]
• Top-level elements: cvList, referenceableParamGroupList, sampleList, instrumentConfigurationList, softwareList, dataProcessingList, acquisitionSettingsList, run
• run: spectrumList (spectrum, spectrum, ...) and chromatogramList (chromatogram, chromatogram, ...)
• Elements shown for a spectrum: spectrumDescription, scan, precursorList, binaryDataArray (two or more)
• Elements shown for a chromatogram: binaryDataArray (two)
Each spectrum contains a header with scan information and, optionally, precursor information, followed by two or more base64-encoded binary data arrays.
Chromatograms may be encoded in mzML in a dedicated element that contains cvParams describing the type of chromatogram, followed by two base64-encoded binary data arrays.
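The base64-encoded binary data arrays mentioned above can be illustrated with a short Python sketch. It assumes little-endian 64-bit floats and optional zlib compression; in a real mzML file the precision and compression of each array are declared through cvParams, and the helper names and peak values here are hypothetical, not part of any PSI library.

```python
import base64
import struct
import zlib


def encode_array(values, compress=False):
    """Pack floats as little-endian 64-bit doubles, then base64-encode."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=False):
    """Reverse of encode_array: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack("<%dd" % count, raw))


# Hypothetical peak list: one m/z array and one intensity array,
# mirroring the "two binary data arrays" of a spectrum.
mz = [445.12, 519.14, 593.16]
intensity = [1200.0, 830.5, 96.25]

encoded_mz = encode_array(mz)
encoded_intensity = encode_array(intensity, compress=True)

print(decode_array(encoded_mz))
print(decode_array(encoded_intensity, compressed=True))
```

A reader would normally look up the precision and compression cvParams of each binaryDataArray first and choose the decoding path from them, rather than passing a flag by hand as this sketch does.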
mzML implementations
mzIdentML overview
• Various software packages for searching:
  – MASCOT, SEQUEST, X!Tandem, Omssa, Inspect...
  – Each piece of software has its own output format
  – The user interacts with results formatted as web pages
  – Not easy to submit results to databases or to re-analyse them
• mzIdentML
  – Standard format for the results of searches with mass spec data
  – Can capture results from PMF and tandem MS
  – Flexible model of peptide and protein identifications
  – Captures search engine parameters, scores and modifications using controlled vocabulary terms, for example:
<Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
  <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
</Modification>
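As a rough illustration of how the Modification fragment above can be read by software, the following Python sketch parses it with the standard library. It assumes the fragment is closed with `</Modification>` and carries no XML namespace; a full mzIdentML file declares a namespace, so element lookups there would need it. The function name `read_modification` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, closed so that it is well-formed XML.
snippet = """
<Modification location="7" residues="M" monoisotopicMassDelta="15.994919">
  <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD" />
</Modification>
"""


def read_modification(xml_text):
    """Return the position, residue, mass shift and CV term of a modification."""
    element = ET.fromstring(xml_text.strip())
    cv = element.find("cvParam")
    return {
        "location": int(element.get("location")),
        "residues": element.get("residues"),
        "mass_delta": float(element.get("monoisotopicMassDelta")),
        "accession": cv.get("accession"),
        "name": cv.get("name"),
    }


mod = read_modification(snippet)
print(mod)
```

Because the modification is identified by a controlled vocabulary accession (UNIMOD:35) rather than by a free-text name, two search engines that label it differently can still be compared on the accession.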
mzIdentML: schema overview
• cvList
• AnalysisSoftwareList: software packages
• AnalysisSampleCollection: biological samples
• SequenceCollection: DB entries of protein / peptide sequences
• AnalysisCollection
  – SpectrumIdentification: inputs = external spectra (1..n); output = SpectrumIdentificationList (1)
  – ProteinDetection: inputs = SpectrumIdentificationLists; output = ProteinDetectionList
• AnalysisProtocolCollection
  – SpectrumIdentificationProtocol: AdditionalSearchParams, ModificationParams, Enzymes, DatabaseFilters
  – ProteinDetectionProtocol
• DataCollection
  – Inputs: the database searched and the input file converted to mzIdentML
  – AnalysisData
    • SpectrumIdentificationList
      – SpectrumIdentificationResult: all identifications made from searching one spectrum
      – SpectrumIdentificationItem: one (poly)peptide-spectrum match
    • ProteinDetectionList
      – ProteinAmbiguityGroup: a set of related protein identifications, e.g. conflicting peptide-protein assignments
      – ProteinDetectionHypothesis: a single protein identification
mzIdentML: peptide identifications
[Example figure; values recovered from the slide]
• SequenceCollection
  – DBSequence: Accession = "HSP7D_MANSE", Seq = "MAKAPAVGIDLGTTYSCVGVF..."
  – DBSequence: Accession = "HSP70_ECHGR", Seq = "MMSKGPAVGIDLGTTFSCVGV..."
  – Peptide: Seq = "DAGMISGLNVLR", Mod = methionine oxidation (pos 4)
• SpectrumIdentificationList 1 (spectra are referenced as external data)
  – SpectrumIdentificationResult 1
    • SpectrumIdentificationItem 1_1: Score = 67.2, E-value = 0.000867, Rank = 1
      – PeptideEvidence 1_1_A: start=161 end=172 pre=K post=I
      – PeptideEvidence 1_1_B: start=160 end=171 pre=K post=L
    • SpectrumIdentificationItem 1_2: Score = 54.4, E-value = 0.026, Rank = 2
      – PeptideEvidence 1_2_A: start=54 end=65 pre=K post=T
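The relationship between a peptide and its PeptideEvidence coordinates can be checked with simple arithmetic. The sketch below assumes that start and end are 1-based, inclusive positions in the protein sequence (which is consistent with the numbers on the slide) and that evidence 1_1_A and 1_1_B both refer to the peptide DAGMISGLNVLR; the `PeptideEvidence` class is a hypothetical stand-in, not the schema definition.

```python
from dataclasses import dataclass


@dataclass
class PeptideEvidence:
    evidence_id: str
    start: int  # assumed 1-based position of the first residue in the protein
    end: int    # assumed 1-based position of the last residue, inclusive
    pre: str    # residue before the peptide in the protein
    post: str   # residue after the peptide in the protein

    def length(self):
        # Inclusive coordinates, so add one.
        return self.end - self.start + 1


peptide = "DAGMISGLNVLR"
evidence = [
    PeptideEvidence("1_1_A", 161, 172, "K", "I"),
    PeptideEvidence("1_1_B", 160, 171, "K", "L"),
]

for ev in evidence:
    # The span on the protein should match the peptide length.
    print(ev.evidence_id, ev.length() == len(peptide))

# 1-based position of the oxidised methionine within the peptide.
mod_position = peptide.index("M") + 1
print(mod_position)  # 4
```

The same peptide maps to two database entries at slightly different offsets, which is why one SpectrumIdentificationItem can carry more than one PeptideEvidence.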
mzIdentML: protein identifications
• Protein ambiguity group: groups proteins that share the same set of peptides (protein inference problem)
• Protein detection hypothesis: one potential protein hit supported by peptide evidence
[Example figure; values recovered from the slide]
• SequenceCollection
  – DBSequence: Accession = "HSP7D_MANSE", Seq = "MAKAPAVGIDLGTTYSCVGVF..."
  – DBSequence: Accession = "HSP70_ECHGR", Seq = "MMSKGPAVGIDLGTTFSCVGV..."
• SpectrumIdentificationList
  – SpectrumIdentificationResult 1
    • SpectrumIdentificationItem 1_1: PeptideEvidence 1_1_A, PeptideEvidence 1_1_B
  – SpectrumIdentificationResult 2
    • SpectrumIdentificationItem 2_1: PeptideEvidence 2_1_A
  – SpectrumIdentificationResult 3
    • SpectrumIdentificationItem 3_1: PeptideEvidence 3_1_A, PeptideEvidence 3_1_B
• ProteinDetectionList
  – ProteinAmbiguityGroup 1
    • ProteinDetectionHypothesis 1_1: PeptideHypothesis (1_1_A), PeptideHypothesis (2_1_A), PeptideHypothesis (3_1_A); Score = 141, Peptide coverage = 17%, E-value = 0.0034
    • ProteinDetectionHypothesis 1_2: PeptideHypothesis (1_1_B), PeptideHypothesis (3_1_B); Score = 85, Peptide coverage = 12%, E-value = 0.055
ProteinDetectionHypothesis 1_1 has 3 peptides: ESTLHLVLR, TLSDYNIQK, TITLEVEPSDTIENVK
ProteinDetectionHypothesis 1_2 has 2 peptides: ESTLHLVLR, TLSDYNIQK
There is stronger evidence supporting hypothesis 1_1, but both hypotheses are placed within the same ambiguity group.
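The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea that hypotheses sharing peptides fall into one ambiguity group; it is not the inference method of any particular search engine. The peptide strings are taken from the slide, assuming they split at the tryptic K/R sites, and the function name is hypothetical.

```python
# Peptides supporting each protein detection hypothesis (from the slide).
hypotheses = {
    "1_1": {"ESTLHLVLR", "TLSDYNIQK", "TITLEVEPSDTIENVK"},
    "1_2": {"ESTLHLVLR", "TLSDYNIQK"},
}


def group_by_shared_peptides(peptides_by_protein):
    """Put hypotheses that share at least one peptide into the same group."""
    groups = []  # list of (hypothesis names, union of their peptides)
    for protein, peptides in peptides_by_protein.items():
        merged_names = {protein}
        merged_peptides = set(peptides)
        remaining = []
        for names, group_peptides in groups:
            if group_peptides & merged_peptides:
                # Overlap: fold the existing group into the new one.
                merged_names |= names
                merged_peptides |= group_peptides
            else:
                remaining.append((names, group_peptides))
        remaining.append((merged_names, merged_peptides))
        groups = remaining
    return groups


groups = group_by_shared_peptides(hypotheses)
for names, peptides in groups:
    print(sorted(names), len(peptides))

# The peptides of 1_2 are a strict subset of those of 1_1.
print(hypotheses["1_2"] < hypotheses["1_1"])
```

Both hypotheses end up in a single group because every peptide of 1_2 also supports 1_1; the extra peptide is what gives 1_1 the stronger evidence.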
• mzIdentML export will be available from Mascot in the next release
• Sequest converter produced by MPC (Germany) as part of the ProDac consortium: http://www.medizinisches-proteom-center.de
• Thermo is also working on an “official” exporter
• Basic scripts are available for converting other search engine formats (X!Tandem, Omssa, pepXML)
• Export will be available in the next version of Scaffold
• Database implementation in PRIDE is coming...
mzQuantML
• Format to capture proteins quantified from MS data
  – Very early drafting
• Many methods of quantification
  – Label/tag based
    • Stable isotopes (SILAC)
    • Tags: ICAT / iTRAQ
  – Label-free
    • Extracted ion chromatogram: align parallel runs
    • Spectral counting
• Methods still in flux
  – New methods are reported frequently in the literature
• Will need to reference back to spectra (+ chromatograms) and identifications
  – Needs more community input: please offer to help!
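Spectral counting, listed above among the label-free methods, amounts to tallying the peptide-spectrum matches assigned to each protein. The Python sketch below shows only that tally; the spectrum identifiers and accessions are made up for illustration, and real workflows typically add normalisation steps that are not shown here.

```python
from collections import Counter

# Hypothetical peptide-spectrum matches: (spectrum id, protein accession).
psms = [
    ("scan=101", "PROT_A"),
    ("scan=102", "PROT_A"),
    ("scan=103", "PROT_B"),
    ("scan=104", "PROT_A"),
    ("scan=105", "PROT_B"),
]

# Spectral count = number of identified spectra per protein.
spectral_counts = Counter(accession for _, accession in psms)

for accession, count in sorted(spectral_counts.items()):
    print(accession, count)
```

Each count depends on both the spectra and the identifications, which is why a quantification format needs to reference back to both.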
Acknowledgements
• PSI workgroups:
  – Protein separation
    • Chair: Juan-Pablo Albar (ProteoRed)
  – Mass spectrometry
    • Chair: Eric Deutsch (ISB)
  – Proteomics Informatics
    • Chair: Andy Jones (Liverpool)
    • Co-Chair: David Creasy (Matrix Science)
  – Molecular interactions
    • Chair: Henning Hermjakob (and chair of PSI)
• and many developers worldwide...
See: http://www.psidev.info/