Overview
•We have developed a complete, end-to-end data analysis pipeline that provides an automated, reliable, consistent, and objective analysis of high-throughput quantitative LC-MS/MS data from multiple data sources using multiple search engines.
•The Trans-Proteomic Pipeline (TPP) is a complete, mature suite of software tools for MS data representation, MS data visualization, peptide identification and validation, protein identification, quantification, and annotation, data storage and mining, and biological inference.
•The TPP has been adopted throughout the international proteomics community and is in use at many prominent academic and corporate labs.
•We present an overview of the TPP and describe newly available functionality. All software tools are freely available under an open-source software license at tools.proteomecenter.org
Introduction
High-throughput LC-MS/MS is capable of simultaneously identifying and quantifying thousands of proteins in a complex sample. The consistent and objective analysis of the large amounts of data obtained is challenging and time-consuming. Over the past 5 years, we have developed and refined a data analysis pipeline that facilitates and standardizes such analysis.
The Trans-Proteomic Pipeline (TPP) is an open-source software package with well-established community acceptance. The TPP provides a completely free, open-source proteomics analysis solution, spanning: conversion of raw MS/MS data to open formats and standards; support for searching MS/MS spectra with various search engines, including the bundled X!Tandem engine (www.thegpm.org) as well as Sequest, Mascot, Phenyx, OMSSA, and others; conversion of search engine results to a uniform open format; statistical validation of peptide identifications with PeptideProphet; statistically validated protein identification with ProteinProphet; quantitative proteomics (SILAC, ICAT, iTRAQ, etc.) with XPRESS, ASAPRatio, and Libra; and tools for visualizing and interacting with results. Here we present recent updates to the software tools that improve analysis functionality and user experience.
New Developments for Open-Source Shotgun Proteomics Analysis with the Trans-Proteomic Pipeline
Joshua Tasman1, Luis Mendoza1, David Shteynberg1, James Eddes2, Ning Zhang1, Nichole King1, Chee-Hong Wong3, Brian Pratt4, Patrick Pedrioli2, Henry Lam1, Eric Deutsch1, Jimmy Eng5, Xiao-jun Li6, Alexey Nesvizhskii7, Andrew Keller8, and Ruedi Aebersold2
1Institute for Systems Biology, Seattle, WA; 2Institute for Molecular Systems Biology (ETH), Zurich, Switzerland; 3Bioinformatics Institute, Singapore; 4Insilicos LLC, Seattle, WA; 5University of Washington; 6Homestead Clinical; 7Department of Pathology, University of Michigan, Ann Arbor, MI; 8Rosetta Biosoftware, Seattle, WA
Methods
Improvements from the Insilicos TPP version (IPP) have been merged into the TPP, and the build system has been improved to allow native Windows deployment. Significant speed improvements have already been seen from moving away from a distribution based on a Unix emulation layer (Cygwin). Discrimination between true and false-positive peptide IDs has been improved through the addition of the decoy, retention time, and high-mass-accuracy PeptideProphet models, as well as through the use of a semi-parametric distribution to describe the peptide populations. The open-source search engine X!Tandem is now bundled with the TPP, allowing us to provide a complete and free solution for proteomics analysis. Work has also begun on integrating the open-source OMSSA engine.
Results and Conclusions
This work is performed under the Seattle Proteome Center, supported by NHLBI contract No. N01-HV-28179. We would also like to thank all Aebersold Lab and external developers who have contributed to this project.
End-to-End MS/MS Proteomics Analysis with the TPP
[Workflow diagram labels: mzXML document; mzML document; pepXML document; protXML document; spectral search engine results file]
MS/MS data: Conversion from proprietary (vendor) to open formats
•Choice of common open formats: mzXML (SPC/ISB) or mzML (HUPO PSI, SPC/ISB, and others; see flagship poster 001)
•Converters for Thermo Xcalibur (.raw), Waters MassLynx (.raw directory), ABI/MDS Analyst (.wiff), Agilent MassHunter (.d directory), and others
Spectral search engine output: Conversion to open formats
•Supports most common commercial and open-source data formats: Sequest, Mascot, X!Tandem, SpectraST, Phenyx, and others
Peptide ID Validation with PeptideProphet
•Majority of peptide assignments by search engines are incorrect
•Manual validation is time-consuming and subjective, and produces results that are impossible to compare
•Applies statistical principles to automate peptide validation
•Validates peptide assignments by Sequest, Mascot, X!Tandem, Phenyx, SpectraST, and others
•Robust: learns distributions of search scores and peptide properties among correct and incorrect results
•Accurate: probabilities are true measures of confidence
[Figure: PeptideProphet model results. Number of spectra (y-axis, 0 to 1600) versus discriminant score (x-axis, -4.9 to 9.1), shown for incorrect and correct assignments.]
PeptideProphet performs post-search processing to compute probabilities that peptide assignments from MS/MS spectra are correct.
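The idea of learning score distributions for correct and incorrect assignments, then turning them into probabilities, can be illustrated with a small mixture-model sketch. This is not the PeptideProphet implementation: it assumes, purely for illustration, that both populations are Gaussian in a single discriminant score and fits them with expectation-maximization. The function names and the synthetic data generator are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(x, prior_cor, inc, cor):
    """Bayes rule: probability that a score x belongs to the 'correct' population."""
    c = prior_cor * normal_pdf(x, cor[0], cor[1])
    i = (1.0 - prior_cor) * normal_pdf(x, inc[0], inc[1])
    total = c + i
    return c / total if total > 0.0 else 0.5


def fit_mixture(scores, iterations=100):
    """Fit a two-Gaussian mixture by EM (illustrative assumption, see lead-in).

    Returns (prior_correct, (mean_incorrect, sd_incorrect), (mean_correct, sd_correct)).
    """
    ordered = sorted(scores)
    n = len(ordered)
    # Crude starting point: low scores are mostly incorrect, high scores mostly correct.
    mean_inc, mean_cor = ordered[n // 4], ordered[(9 * n) // 10]
    sd_inc = sd_cor = max((ordered[-1] - ordered[0]) / 4.0, 1e-3)
    prior_cor = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the 'correct' component for each score.
        resp = [posterior_correct(x, prior_cor, (mean_inc, sd_inc), (mean_cor, sd_cor))
                for x in scores]
        w_cor = sum(resp)
        w_inc = n - w_cor
        if w_cor < 1e-9 or w_inc < 1e-9:
            break
        # M-step: re-estimate means, standard deviations, and the mixing proportion.
        mean_cor = sum(r * x for r, x in zip(resp, scores)) / w_cor
        mean_inc = sum((1.0 - r) * x for r, x in zip(resp, scores)) / w_inc
        var_cor = sum(r * (x - mean_cor) ** 2 for r, x in zip(resp, scores)) / w_cor
        var_inc = sum((1.0 - r) * (x - mean_inc) ** 2 for r, x in zip(resp, scores)) / w_inc
        sd_cor = max(math.sqrt(var_cor), 1e-3)
        sd_inc = max(math.sqrt(var_inc), 1e-3)
        prior_cor = w_cor / n
    return prior_cor, (mean_inc, sd_inc), (mean_cor, sd_cor)


def simulate_scores(seed=0, n_incorrect=700, n_correct=300):
    """Synthetic discriminant scores for demonstration only (not real data)."""
    rng = random.Random(seed)
    scores = [rng.gauss(-1.0, 1.0) for _ in range(n_incorrect)]
    scores += [rng.gauss(4.0, 1.2) for _ in range(n_correct)]
    return scores
```

Calling `fit_mixture(simulate_scores())` and then `posterior_correct(score, *model)` gives a per-spectrum probability under these simplified assumptions; a real analysis would use the distributions and additional peptide properties modeled by PeptideProphet itself.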
[Workflow diagram label: raw MS/MS data file]
Quantitation
ASAPRatio and XPRESS calculate relative abundance of proteins labeled with heavy and light (2-channel) isotope tags:
•Compute protein ratios automatically and accurately
•Evaluate peptide ratio from multiple charge states (ASAPRatio)
•Apply statistical methods to evaluate protein ratios and standard deviations
•Quantify ICAT, SILAC, and many other samples
Libra performs quantification on MS/MS spectra that have multi-reagent labeled (4- or 8-channel) peptides, such as iTRAQ-labeled samples.
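As a rough illustration of how heavy/light peptide ratios can be rolled up into a protein ratio with a spread estimate, the sketch below averages peptide ratios on a log scale. This is an assumption made for illustration, not the method implemented in ASAPRatio or XPRESS, and the function names and inputs (integrated light and heavy peak areas) are hypothetical.

```python
import math
import statistics


def peptide_ratio(light_area, heavy_area):
    """Light-to-heavy abundance ratio from two integrated peak areas."""
    if light_area <= 0 or heavy_area <= 0:
        raise ValueError("peak areas must be positive")
    return light_area / heavy_area


def protein_ratio(peptide_ratios):
    """Combine peptide ratios into a protein ratio.

    Returns (geometric_mean_ratio, standard_deviation_of_log_ratios).
    Averaging in log space treats a 2-fold increase and a 2-fold
    decrease symmetrically.
    """
    if not peptide_ratios:
        raise ValueError("at least one peptide ratio is required")
    logs = [math.log(r) for r in peptide_ratios]
    mean_log = statistics.mean(logs)
    sd_log = statistics.stdev(logs) if len(logs) > 1 else 0.0
    return math.exp(mean_log), sd_log
```

For example, `protein_ratio([peptide_ratio(200.0, 100.0), peptide_ratio(90.0, 50.0)])` combines two peptide measurements into one protein-level estimate with its log-scale standard deviation.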
[Workflow diagram label: spectra information (mzXML/mzML/mzData document)]
Downstream analysis with other TPP-compatible SPC tools
Data storage and mining with PeptideAtlas and SBEAMS (Systems Biology Experiment Analysis Management System):
•Data products of the TPP analysis pipeline are imported into the database
•Data exploration, annotation, and correlation with other experiments can all be managed
• Interface allows flexible analysis of the data: analysis across multiple experiments
Additional visualization, statistical analysis, and exploration tools enabling investigation of biological meaning and significance with Gaggle-compatible tools such as Cytoscape (network visualization), the stats package R, and the PIPE (Protein Inference and Property Explorer)
The TPP is constantly improved with new functionality. Highlights of major recent developments include:
•Build system improvements and native Windows distribution: Insilicos had previously released its own version of the TPP (the "IPP"). To combine efforts more efficiently, Insilicos has integrated its customizations into the main TPP distribution. The TPP build system has been improved to allow a native Windows distribution, yielding significant performance improvements as well as easier installation;
•Implementation of vendor raw-to-mzML data converters and full support for parsing mzML throughout the TPP;
•PeptideProphet, the TPP module for peptide ID validation, has been updated with a retention-time model that compares observed retention time with calculated peptide hydrophobicity. Additionally, a high-mass-accuracy model improves discrimination of IDs with data from newer instruments. Decoy database entries can now be used in distribution modeling. A semi-parametric distribution model allows better discrimination of true and false-positive results;
•Inclusion of X!Tandem (from the GPM project) for a complete, end-to-end MS/MS searching and validation solution;
•Upcoming multi-experiment data integration with iProphet (see Poster TPU 669);
•Spectral library searching with SpectraST
Protein Identification
ProteinProphet takes as input a list of peptides and probabilities and infers the proteins in the sample:
•Groups peptides according to their corresponding protein
•Adjusts individual peptide probabilities to account for new protein grouping information
•Finds simplest list of proteins sufficient to explain all observed peptides (Occam’s razor approach)
•Computes accurate protein probabilities
•Integrates with protein-level quantitation results
• Allows meaningful comparison between results of different experiments
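Two of the steps listed above can be sketched in a few lines, under strong simplifying assumptions. The first function treats peptide identifications as independent evidence, so a protein is present unless every one of its peptide assignments is wrong. The second uses a greedy set cover as a stand-in for the "simplest list of proteins" idea. Neither is the ProteinProphet implementation, which also adjusts peptide probabilities and handles shared peptides; the function names and the example mapping are hypothetical.

```python
def protein_probability(peptide_probs):
    """Probability that at least one peptide assignment for a protein is correct,
    assuming the peptide identifications are independent."""
    p_all_wrong = 1.0
    for p in peptide_probs:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong


def minimal_protein_list(protein_to_peptides):
    """Greedy approximation of a smallest protein list explaining all peptides.

    protein_to_peptides maps a protein name to the set of peptides it could
    have produced. Ties are broken alphabetically so the result is deterministic.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(sorted(protein_to_peptides),
                   key=lambda name: len(protein_to_peptides[name] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen
```

With a hypothetical mapping such as `{"A": {"p1", "p2", "p3"}, "B": {"p1"}, "C": {"p4"}}`, protein B is dropped because every peptide it explains is already explained by A.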
[Figure: sensitivity and error rate (y-axis, percentage) versus minimum probability threshold (x-axis, 0 to 1.0).]
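When probabilities are accurate, sensitivity and error rate at a given minimum probability threshold can be estimated directly from the probabilities themselves, without a separate reference set. The sketch below shows one common way to do this; it is an illustration under that accuracy assumption, and the function name is hypothetical.

```python
def sensitivity_and_error(probabilities, min_probability):
    """Estimate sensitivity and error rate for results with p >= min_probability.

    Each probability is read as the expected fraction of a correct result, so:
      sensitivity = expected correct results accepted / expected correct results overall
      error rate  = expected incorrect results accepted / number of results accepted
    """
    total_correct = sum(probabilities)
    accepted = [p for p in probabilities if p >= min_probability]
    if not accepted or total_correct == 0:
        return 0.0, 0.0
    expected_correct = sum(accepted)
    sensitivity = expected_correct / total_correct
    error_rate = (len(accepted) - expected_correct) / len(accepted)
    return sensitivity, error_rate
```

Evaluating this function over a range of thresholds traces out the kind of sensitivity and error-rate curves plotted against minimum probability threshold.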