Processing protein identification data for publication
Ana Varela Coelho & Isabel Marcelino
RNEM Course on Protein identification by Mass Spectrometry, 2-4 November 2011, Oeiras, PT
Large-scale proteomics studies create huge amounts of data. It is impossible/impractical to present all the results.
Different proteomics journals use different guidelines.
Here we present the guidelines recommended by the journal Molecular & Cellular Proteomics (http://mcponline.org)
The function of the guidelines is to:
Provide enough information to be able to explain the experiment.
Provide an assessment of the reliability of the results.
Provide the data that supports the results, particularly those that have the greatest potential for misinterpretation, so that readers can manually assess the results that are important to them.
Adequate description of experiments in the M&M section,
…
Proteomics experiments are carried out by many different methods, using a variety of instrument types and employing different analysis tools.
Pre-processing
Involves the conversion of MS raw data into a centroid m/z peak list file
Includes removing m/z peaks:
of low intensity and resolution
above a defined maximum number
with predefined charge states
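The filtering steps listed above can be sketched roughly as follows. This is only an illustration, not the behaviour of any particular peak-list software: the function name `filter_peaks` and the parameters `min_intensity` and `max_peaks` are hypothetical, the example peaks are made up, and the resolution and charge-state filters are left out for brevity.

```python
def filter_peaks(peaks, min_intensity, max_peaks):
    """Filter centroided (m/z, intensity) peaks; hypothetical helper."""
    # Drop peaks below the intensity threshold.
    kept = [(mz, inten) for mz, inten in peaks if inten >= min_intensity]
    # Keep only the `max_peaks` most intense peaks.
    kept.sort(key=lambda peak: peak[1], reverse=True)
    kept = kept[:max_peaks]
    # Return the survivors in ascending m/z order, as in a peak list.
    return sorted(kept)


# Made-up peaks, for illustration only.
example_peaks = [
    (300.1, 50.0),
    (450.2, 5.0),
    (512.3, 900.0),
    (620.4, 120.0),
    (701.5, 75.0),
]
filtered = filter_peaks(example_peaks, min_intensity=10.0, max_peaks=3)
print(filtered)
```

Whatever thresholds are actually used, these are the values to report under "Parameters used" in the checklist.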
Checklist:
1. Name of peaklist-generating software and release version (number or date)
2. Parameters used – default vs altered
3. Name of the search engine and release version (number or date)
4. Enzyme specificity considered
5. # of missed cleavages permitted
6. Fixed modification(s) (including residue specificity)
7. Variable modification(s) (including residue specificity)
8. Mass tolerance for precursor ions
9. Mass tolerance for fragment ions
10. Name of database searched and release version/date
11. Species restriction and justification for searching only a subset of a database
12. Number of protein entries in the database actually searched
13. Threshold score/E-value for accepting individual MS/MS spectra and justification
14. For large datasets – estimation of false positive rate and how this was calculated
15. If PTMs are being reported, software/method used to evaluate site assignment
Database search
Search engines work by:
MS information: determining a list of potential peptides from a database that can be formed by the specified parameters and have the correct precursor mass
MS/MS information: determining scores for matches of the fragment peaks to each of these peptides
Changing parameters may change:
Number of hits
Identification scores
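The first step (MS information) can be illustrated with a small sketch of how enzyme specificity, missed cleavages and precursor mass tolerance shape the candidate list. It assumes trypsin (cleavage after K or R, but not before P), ignores fixed and variable modifications, and uses a made-up protein sequence; the function names are hypothetical and do not correspond to any search engine's API.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_peptides(protein, missed_cleavages=0):
    """In silico tryptic digest allowing up to `missed_cleavages`."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = set()
    for i in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if i + extra < len(pieces):
                peptides.add("".join(pieces[i:i + extra + 1]))
    return peptides


def candidate_peptides(protein, precursor_mass, tolerance_da, missed_cleavages=0):
    """Peptides whose mass lies within the precursor tolerance."""
    return sorted(
        pep for pep in tryptic_peptides(protein, missed_cleavages)
        if abs(peptide_mass(pep) - precursor_mass) <= tolerance_da
    )


# Made-up sequence, for illustration only.
print(candidate_peptides("AAKCCRPDDKEE", 288.18, 0.01, missed_cleavages=1))
```

Widening the tolerance or allowing more missed cleavages enlarges the candidate list, which is one reason why changing parameters changes both the number of hits and their scores.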
Database identification and release version/date
Different databases have different depositions
Entries in databases change over time:
new entries are submitted
old entries may be removed
multiple entries can be merged
Examples:
NCBI nr 2006.07.18; 3794285 sequences
IPI human database version 3.16, 62322 entries
Number of entries in database and taxonomic restriction
The number of entries influences the identification score and the reliability of results
Scores are calculated on the basis that all possible identifications of peptides are in the database, and that matching any peptide in the database is equally likely
If the DB contains protein sequences that are not in the sample, the confidence in the matches increases
If the DB is too small, containing only the proteins that are present in the sample, the assessment will be over-confident
Searches against very small databases can be inaccurate and unreliable
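Since the number of entries actually searched has to be reported, it can help to count them directly from the database file. The sketch below assumes a FASTA-formatted database whose header lines carry a UniProt-style `OS=<species>` field; the function name `count_fasta_entries` and the three entries are made up for illustration.

```python
def count_fasta_entries(fasta_text, species=None):
    """Count FASTA entries, optionally only those naming `species`."""
    count = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumes a UniProt-style "OS=<species>" field in the header.
            if species is None or f"OS={species}" in line:
                count += 1
    return count


# Made-up entries, for illustration only.
example_db = """>sp|X00001|FAKE1_MOUSE Example protein 1 OS=Mus musculus
MAAKCCRPDDKEE
>sp|X00002|FAKE2_MOUSE Example protein 2 OS=Mus musculus
MKLVDDR
>sp|X00003|FAKE3_HUMAN Example protein 3 OS=Homo sapiens
MSTNPKPQR
"""

print(count_fasta_entries(example_db))
print(count_fasta_entries(example_db, species="Mus musculus"))
```

The first count is the size of the whole database; the second is the number of entries left after a species restriction, which is the figure the checklist asks for.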
Score for accepting individual MS/MS spectra sequence assignment
The cut-off score has to be statistically justified
e.g. MASCOT uses a default threshold of a 5% probability that a sequence match is incorrect
Calculation of false discovery rate (FDR) for large data sets
Performing searches using a combination of normal and reversed databases is an effective way of estimating the number of incorrect peptide/protein assignments in the dataset as a whole.
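The normal/reversed database approach can be sketched as below. Two assumptions are made here that the slides do not specify: decoy entries are marked with a hypothetical `REV_` prefix, and the FDR is estimated as decoy hits divided by target hits above the score threshold, which is one common estimator among several. The scores are made up.

```python
def reversed_database(proteins):
    """Build decoy entries by reversing each target sequence."""
    # The "REV_" prefix is a hypothetical naming convention.
    return {f"REV_{acc}": seq[::-1] for acc, seq in proteins.items()}


def estimate_fdr(hits, threshold):
    """Estimate the FDR among hits scoring at or above `threshold`.

    `hits` is a list of (score, is_decoy) pairs from a search against
    the combined normal + reversed database.
    """
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    # Assumed estimator: decoy hits / target hits.
    return decoys / targets


# Made-up scores, for illustration only.
hits = [
    (60, False), (55, False), (50, False), (45, True),
    (40, False), (35, False), (30, True), (25, False),
]
print(estimate_fdr(hits, threshold=40))
print(estimate_fdr(hits, threshold=30))
```

Whichever estimator is used, the manuscript should state how the rate was calculated, as the checklist requires.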
Checklist (cont.):
15. If PTMs are being reported, software/method used to evaluate site assignment and presentation of annotated spectra
16. Peptide Mass Fingerprint Data: name of software used for peak-picking and its release version
17. PMF Peak Picking Parameters and thresholds (e.g. intensity or S/N threshold, resolution, means of calibrating each spectrum, list of excluded contaminant ions and justification)
18. PMF Acceptance Criteria/threshold used for acceptance of PMF-based identifications.
19. Combining Peptides into “Proteins Identified”: if peptides match to multiple members of a protein family, criteria used for selecting which member to report, i.e. how the redundancy was eliminated/handled (this is an issue for all protein databases); how isoforms/individual members of a protein family were unambiguously identified.
20. Quantitative studies:
How the quantitation was performed (number of peaks, peak intensity, peak area, extracted ion chromatogram).
Minimum thresholds required for data to be used for quantitation.
Whether outlier data points were removed and, if so, the justification.
Explanation of statistics used to assess accuracy and significance of measurements.
Indicate how biological and analytical reproducibility were addressed by experimental design.
If data match multiple members of a protein family...
DBs often contain entries with similar sequences:
Proteins from the same species sharing sequence stretches
Equivalent proteins from different species
Multiple entries for one gene product
What to do?
Report all proteins that the data support equally well
Report only one accession when this can be justified (even using results from other assays), e.g. one protein has a more descriptive name or the DB entry is the best annotated
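One simple way to find the proteins that the data support equally well is to group accessions that are matched by exactly the same set of peptides. The sketch below does only that: it does not handle proteins whose peptides are a subset of another protein's, and the accession labels and peptide sequences are made up.

```python
from collections import defaultdict


def group_indistinguishable(protein_peptides):
    """Group accessions supported by exactly the same peptide set."""
    groups = defaultdict(list)
    for accession, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(accession)
    # Sort for a stable, readable report.
    return sorted(sorted(accessions) for accessions in groups.values())


# Made-up accessions and peptides, for illustration only.
evidence = {
    "PROT_A": {"AAK", "DDK"},
    "PROT_B": {"AAK", "DDK"},
    "PROT_C": {"LLR"},
}
print(group_indistinguishable(evidence))
```

Each group with more than one accession is a case where either all members are reported or the choice of a single member needs to be justified.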
Now that the experiment(s) have been adequately described in the M&M section, it is time to present the results…
Checklist for results to be reported in a table:
1. Protein accession number and respective DB
Whenever possible do not report protein constructs or hypothetical proteins
2. Number of unique peptides identified
Unique peptides are those differing in at least one amino acid
Covalently modified peptides count as different, but different charge states or multiple fragmentation spectra do not count
3. Protein identification score calculated by the search engine
4. Sequence coverage
Calculated by dividing the number of observed amino acids by the total number of amino acids in the protein sequence, x 100
Additional information such as protein name, function, GO terms, MW, pI, peptide sequences, etc. may be included
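The two calculated columns, unique peptides and sequence coverage, can be sketched as follows. The function names and the input layout (tuples of sequence, modification string and charge) are assumptions made for this illustration, and the sequences are made up.

```python
def count_unique_peptides(identifications):
    """Count unique peptides from (sequence, modifications, charge) tuples.

    Different modification states count as different peptides; different
    charge states or repeated spectra of the same peptide do not.
    """
    return len({(seq, mods) for seq, mods, _charge in identifications})


def sequence_coverage(protein, peptides):
    """Percentage of residues in `protein` covered by observed peptides."""
    if not protein:
        raise ValueError("protein sequence must not be empty")
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Made-up identifications, for illustration only.
ids = [
    ("AAK", "", 2),
    ("AAK", "", 3),                 # same peptide, different charge
    ("AAK", "Acetyl (N-term)", 2),  # modified form counts separately
    ("DDK", "", 2),
]
print(count_unique_peptides(ids))
print(sequence_coverage("AAKCCRPDDKEE", ["AAK", "DDK"]))
```

Marking covered residues, rather than summing peptide lengths, avoids counting overlapping peptides twice.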
Protein identification table: an example

Accession number*   Unique peptides detected*   Sequence coverage %*   Expectation value*
ADA22_MOUSE         4                           17                     1.3E-05
Q5SV72_MOUSE        6                           10                     9.0E-04
Q9Z0P5_MOUSE        2                           10                     7.5E-06
Q7TNV2_MOUSE        4                           9                      8.4E-05
ABI1_MOUSE          11                          46                     1.6E-05
ABI2_MOUSE          6                           39                     4.8E-05
Q921H8_MOUSE        3                           20                     9.0E-03

* Required information
If only one peptide is identified, additional information should be provided
Create a table and include:
Sequence identified
Precursor m/z and charge
Score/E-value for this peptide
Example

Acc. no.*   m/z*        z*   Error [Da]   Peptide*                    Score*   Expect.*
P04264      979.4174    3    1.2          FLEQQNQVLQTKWELLQQVDTSTR    38.4     1.4e-6
P35527      919.4144    2    -0.072       HGVQELEIELQSQLSK            48.7     2.6e-8
P13645      998.8564    2    -0.13        ELTTEIDNNIEQISSYK           37.8     6.4e-5
P02768      722.2113    2    -0.11        YICENQDSISSK                33.3     9.6e-6
P48666      724.2895    2    -0.10        AIGGGLSSVGGGSSTIK           39.4     2.1e-5
Q16195      1044.3822   2    -0.10        GQTGGDVNVEMDAAPGVDLSR       38.2     4.5e-6
P13646      651.2297    2    -0.10        ALEEANADLEVK                36.8     0.0012

* Required information
If only one peptide is identified additional information should be provided
Present an MS/MS spectrum appropriately labeled with the masses detected and the fragments assigned
If a manuscript is accepted for publication, all the mass spectra contributing to the described work must be deposited in electronic form by the time of publication at a publicly accessible site that is independent of the authors' control
Depositing MS data in the original instrument vendor file format or in an open format such as mzML is encouraged (some problems still persist!)
Upon acceptance of the manuscript and by the time of publication, authors should provide a URL and password for accessing the data
These guidelines will be continuously adjusted to the ever-changing needs of the proteomics field.
As software improves and makes it easier to output results in compliant formats, it will become easier to properly present proteomics data in any forum.
As common formats, e.g. mzML and AnalysisXML, become universally available, tools will be produced that can automatically extract the relevant information from raw data and search results.
Once these formatting problems are solved, journal submission guidelines for proteomics data are expected to become similar for all journals
Making raw data publicly available is going to become more common
Scaffold viewer
http://www.proteomesoftware.com/Scaffold/Scaffold_viewer.htm
Thank you!
Modified from Science 291: 1221. 2001