Processing protein identification data for publication

Ana Varela Coelho & Isabel Marcelino

RNEM Course on Protein identification by Mass Spectrometry, 2-4 November 2011, Oeiras, PT

Large-scale proteomics studies create huge amounts of data. It is impossible/impractical to present all the results.

Different proteomics journals use different guidelines.
Here we will present the guidelines recommended by the journal Molecular & Cellular Proteomics (http://mcponline.org)

The function of the guidelines is to:
- Provide enough information to be able to explain the experiment.
- Provide an assessment of the reliability of the results.
- Provide the data that supports the results, particularly those that have the greatest potential for misinterpretation, so that readers can manually assess the results that are important to them.

Adequate description of experiments in the M&M section

Proteomics experiments are carried out by many different methods, using a variety of instrument types and employing different analysis tools.

Pre-processing

Involves the conversion of MS raw data into a centroid m/z peak list file

Includes removing m/z peaks:
- of low intensity and resolution
- above a defined maximum number
- with predefined charge states
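
The peak-removal steps above can be sketched in a few lines. This is only an illustration of the idea, not the behaviour of any particular peak-list software: the function name `filter_peaks`, the relative-intensity cut-off and the default values are assumptions made for this example, and filtering by resolution or charge state is left out because it needs information a plain (m/z, intensity) list does not carry.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity) of one centroided peak


def filter_peaks(peaks: List[Peak],
                 min_rel_intensity: float = 0.01,
                 max_peaks: int = 100) -> List[Peak]:
    """Drop low-intensity peaks, then keep at most `max_peaks` peaks.

    Hypothetical helper: both thresholds are illustrative defaults,
    not values recommended by the slides or by any journal.
    """
    if not peaks:
        return []
    base_intensity = max(intensity for _, intensity in peaks)
    # 1) remove peaks below a fraction of the most intense (base) peak
    kept = [(mz, i) for mz, i in peaks if i >= min_rel_intensity * base_intensity]
    # 2) keep only the most intense peaks, up to the defined maximum number
    kept.sort(key=lambda peak: peak[1], reverse=True)
    kept = kept[:max_peaks]
    # return the surviving peaks in ascending m/z order, as in a peak list file
    return sorted(kept)
```

Whatever values are actually used for such thresholds are the "parameters used" that the checklist asks authors to report.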

Checklist:
1. Name of peaklist-generating software and release version (number or date)
2. Parameters used – default vs altered
3. Name of the search engine and release version (number or date)
4. Enzyme specificity considered
5. # of missed cleavages permitted
6. Fixed modification(s) (including residue specificity)
7. Variable modification(s) (including residue specificity)
8. Mass tolerance for precursor ions
9. Mass tolerance for fragment ions
10. Name of database searched and release version/date
11. Species restriction and justification for searching only a subset of a database
12. Number of protein entries in the database actually searched
13. Threshold score/E-value for accepting individual MS/MS spectra and justification
14. For large datasets – estimation of false positive rate and how this was calculated
15. If PTMs are being reported, software/method used to evaluate site assignment

Database search

Search engines work by:
- MS information: determining a list of potential peptides from a database that can be formed by the specified parameters and have the correct precursor mass
- MS/MS information: determining scores for matches of the fragment peaks to each of these peptides

Changing parameters may change:
- Number of hits
- Identification scores
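
The first step (MS information) can be illustrated with a simplified sketch: digest each database sequence in silico and keep the peptides whose calculated precursor m/z falls within the tolerance of the observed value. Everything here is an assumption made for illustration, not how any specific search engine is implemented: trypsin is modelled as cleaving after K or R unless followed by P, no fixed or variable modifications are applied, and the function names are hypothetical. The residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da); no modifications are applied.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the sum of residues
PROTON = 1.007276   # mass of the charge carrier


def digest(protein, missed_cleavages=1):
    """Return the set of tryptic peptides (cleave after K/R, not before P)."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = set()
    for i in range(len(pieces)):
        # join up to `missed_cleavages` additional neighbouring pieces
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def precursor_mz(seq, z):
    """Calculated m/z of the peptide carrying z protons."""
    return (peptide_mass(seq) + z * PROTON) / z


def candidates(proteins, observed_mz, z, tol_da, missed_cleavages=1):
    """List (accession, peptide) pairs whose m/z lies within tol_da."""
    hits = []
    for accession, sequence in proteins.items():
        for pep in digest(sequence, missed_cleavages):
            if abs(precursor_mz(pep, z) - observed_mz) <= tol_da:
                hits.append((accession, pep))
    return sorted(hits)
```

In this sketch, widening `tol_da` or raising `missed_cleavages` enlarges the candidate list, which is one way changed parameters end up changing the number of hits and the scores.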

Checklist:
1. Name of peaklist-generating software and release version (number or date)
2. Parameters used – default vs altered
3. Name of the search engine and release version (number or date)
4. Enzyme specificity considered
5. # of missed cleavages permitted
6. Fixed modification(s) (including residue specificity)
7. Variable modification(s) (including residue specificity)
8. Mass tolerance for precursor ions
9. Mass tolerance for fragment ions
10. Name of database searched and release version/date
11. Species restriction and justification for searching only a subset of a database
12. Number of protein entries in the database actually searched
13. Threshold score/E-value for accepting individual MS/MS spectra and justification
14. For large datasets – estimation of false positive rate and how this was calculated
15. If PTMs are being reported, software/method used to evaluate site assignment

Database identification and release version/date

- Different databases have different depositions
- Entries in databases change over time:
  - new entries are submitted
  - old entries may be removed
  - multiple entries can be merged

Examples:
- NCBI nr 2006.07.18; 3794285 sequences
- IPI human database version 3.16, 62322 entries

Number of entries in database and taxonomic restriction

- The number of entries influences the identification score and the reliability of results
- Scores are calculated on the basis that all possible identifications of peptides are in the database, and that matching any peptide in the database is equally likely
- If the DB contains protein sequences that are not in the sample, then the confidence of the matches increases
- If the DB is too small, containing only the proteins that are present in the sample, there will be an over-confident assessment
- Searches against very small databases can be inaccurate and unreliable
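
A generic way to see why database size matters is the relation between a per-comparison p-value and an expectation value: the same p-value corresponds to more expected chance matches when more candidates are tested. The sketch below assumes that simple relation (E = p × N) and a score reported as -10·log10(p). Both are illustrative assumptions, and neither function reproduces the actual scoring of any search engine.

```python
import math


def expectation_value(p_value, n_candidates):
    """Expected number of chance matches at least this good among
    n_candidates comparisons (simple E = p * N relation)."""
    return p_value * n_candidates


def score_threshold(alpha, n_candidates):
    """Score needed so that the expected number of chance matches is
    at most alpha, assuming score = -10 * log10(p) (an assumption)."""
    return -10.0 * math.log10(alpha / n_candidates)
```

With these assumptions, a match tested against 10 candidates passes at a much lower score than one tested against 1000, which is how a very small database can make weak matches look significant.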

Checklist:
1. Name of peaklist-generating software and release version (number or date)
2. Parameters used – default vs altered
3. Name of the search engine and release version (number or date)
4. Enzyme specificity considered
5. # of missed cleavages permitted
6. Fixed modification(s) (including residue specificity)
7. Variable modification(s) (including residue specificity)
8. Mass tolerance for precursor ions
9. Mass tolerance for fragment ions
10. Name of database searched and release version/date
11. Species restriction and justification for searching only a subset of a database
12. Number of protein entries in the database actually searched
13. Threshold score/E-value for accepting individual MS/MS spectra and justification
14. For large datasets – estimation of false positive rate and how this was calculated

Score for accepting individual MS/MS spectra sequence assignment

The cut-off score has to be statistically justified
e.g. MASCOT uses a default threshold of 5% probability that a sequence match is incorrect

Calculation of false discovery rate (FDR) for large data sets
Performing searches using a combination of normal and reversed databases is an effective way of estimating the number of incorrect peptide/protein assignments in the dataset as a whole.
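
The reversed-database approach can be sketched as follows, under stated assumptions: decoy entries are made by reversing each sequence, each peptide-spectrum match is reduced to a (score, is_decoy) pair, and the FDR at a given cut-off is estimated as decoy hits divided by target hits. That ratio is one commonly used estimator among several, and the `REV_` prefix and the function names are hypothetical choices for this example.

```python
def reversed_decoys(proteins):
    """Build decoy entries by reversing every target sequence."""
    return {"REV_" + accession: sequence[::-1]
            for accession, sequence in proteins.items()}


def estimate_fdr(matches, score_cutoff):
    """Estimate FDR among matches scoring at or above score_cutoff.

    `matches` is an iterable of (score, is_decoy) pairs obtained from a
    search against the normal and reversed databases together.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= score_cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= score_cutoff and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return decoys / targets
```

Whichever estimator is used, the checklist asks authors to state how the value was calculated.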

Checklist (cont.):

15. If PTMs are being reported: software/method used to evaluate site assignment and presentation of annotated spectra
16. Peptide Mass Fingerprint (PMF) data: name of software used for peak-picking and its release version
17. PMF peak picking parameters and thresholds (e.g. intensity or S/N threshold, resolution, means of calibrating each spectrum, list of excluded contaminant ions and justification)
18. PMF acceptance criteria/threshold used for acceptance of PMF-based identifications
19. Combining peptides into “Proteins Identified”: if peptides match to multiple members of a protein family, the criteria used for selecting which member to report, i.e. how the redundancy was eliminated/handled (this is an issue for all protein databases), and how isoforms/individual members of a protein family were unambiguously identified
20. Quantitative studies:
    - How the quantitation was performed (number of peaks, peak intensity, peak area, extracted ion chromatogram)
    - Minimum thresholds required for data to be used for quantitation
    - Whether outlier data points were removed; if so, give justification
    - Explanation of statistics used to assess accuracy and significance of measurements
    - How biological and analytical reproducibility were addressed by experimental design

If data match multiple members of a protein family...

DBs often contain entries with similar sequences:
- Proteins from the same species sharing sequence stretches
- Equivalent proteins from different species
- Multiple entries for one gene product

What to do?
- Report all proteins that the data support equally well
- Choose to report only one accession when it is possible to justify it (even using results from other assays), e.g. one protein has a more descriptive name, or the DB entry is the best annotated
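
One simple way to find the proteins that the data support equally well is to group accessions whose sets of identified peptides are identical. The sketch below shows only that grouping step. The function name and input layout are assumptions for illustration, and real protein inference tools also handle subset relationships, which are ignored here.

```python
from collections import defaultdict


def group_indistinguishable(protein_peptides):
    """Group accessions supported by exactly the same set of peptides.

    `protein_peptides` maps accession -> iterable of peptide sequences.
    Returns a sorted list of sorted accession lists, one per group.
    """
    groups = defaultdict(list)
    for accession, peptides in protein_peptides.items():
        # identical peptide evidence -> same key -> same group
        groups[frozenset(peptides)].append(accession)
    return sorted(sorted(accessions) for accessions in groups.values())
```

A group with more than one accession is a case where either all members are reported or the choice of a single member has to be justified.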

Now that the experiment/s have been adequately described in the M&M section, it is time to present the results…

Checklist for results to be reported in a table:

1. Protein accession number and respective DB
   Whenever possible, do not report protein constructs or hypothetical proteins

2. Number of unique peptides identified
   Unique peptides are those differing in at least one amino acid
   Covalently modified peptides count as different, but different charge states or multiple fragmentation spectra do not count

3. Protein identification score calculated by the search engine

4. Sequence coverage
   Calculated by dividing the number of observed amino acids by the total number of amino acids in the protein sequence, × 100

Additional information such as protein name, function, GO terms, MW, pI, peptide sequences, etc. may be included
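
The sequence coverage calculation in item 4 can be written out as a short sketch. It assumes peptides are matched to the protein by exact substring search and counts each residue once even when peptides overlap. The function name is hypothetical.

```python
def sequence_coverage(protein, peptides):
    """Percentage of residues in `protein` covered by identified peptides."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue  # ignore empty strings
        start = protein.find(pep)
        while start != -1:
            # mark every residue of this occurrence as observed
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)
```

For a 10-residue sequence `MKTAYIAKQR` with peptides `TAYIAK` and `QR`, 8 residues are observed, so the coverage is 80%.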

Protein identification table: an example

Accession number*   Unique peptides detected*   Sequence coverage %*   Expectation-value*
ADA22_MOUSE         4                           17                     1.3E-05
Q5SV72_MOUSE        6                           10                     9.0E-04
Q9Z0P5_MOUSE        2                           10                     7.5E-06
Q7TNV2_MOUSE        4                           9                      8.4E-05
ABI1_MOUSE          11                          46                     1.6E-05
ABI2_MOUSE          6                           39                     4.8E-05
Q921H8_MOUSE        3                           20                     9.0E-03

* Required information

If only one peptide is identified, additional information should be provided

Create a table and include:
- Sequence identified
- Precursor m/z and charge
- Score/E-value for this peptide

Example

Acc. nb*  m/z*       z*  Error [Da]  Peptide*                  Score*  Expect.*
P04264    979.4174   3   1.2         FLEQQNQVLQTKWELLQQVDTSTR  38.4    1.4e-6
P35527    919.4144   2   -0.072      HGVQELEIELQSQLSK          48.7    2.6e-8
P13645    998.8564   2   -0.13       ELTTEIDNNIEQISSYK         37.8    6.4e-5
P02768    722.2113   2   -0.11       YICENQDSISSK              33.3    9.6e-6
P48666    724.2895   2   -0.10       AIGGGLSSVGGGSSTIK         39.4    2.1e-5
Q16195    1044.3822  2   -0.10       GQTGGDVNVEMDAAPGVDLSR     38.2    4.5e-6
P13646    651.2297   2   -0.10       ALEEANADLEVK              36.8    0.0012

* Required information

Example

If only one peptide is identified additional information should be provided

Present an MS/MS spectrum appropriately labeled with masses detected and fragments assigned

If a manuscript is accepted for publication, all the mass spectra contributing to the described work must be deposited in electronic form by the time of publication at a publicly accessible site that is independent of the authors' control

Depositing MS data in the original instrument vendor file format or in an open format, such as mzML, is encouraged (some problems still persist!!)

Upon acceptance of the manuscript and by the time of publication, authors should provide a URL and password for accessing the data

These guidelines will be continuously adjusted to the ever-changing needs of the proteomics field.

As software improves and makes it easier to output results in compliant formats, it will become easier to properly present proteomics data in any forum.

As common formats, e.g. mzML and AnalysisXML, become universally available, tools will be produced that can automatically extract the relevant information from raw data and search results.

Once these formatting problems are solved, journal submission guidelines for proteomics data are expected to become similar for all journals.

Making raw data publicly available is going to become more common.

Scaffold viewer

http://www.proteomesoftware.com/Scaffold/Scaffold_viewer.htm

Thank you!

Modified from Science 291: 1221. 2001