TRANSCRIPT
A bioinformatic pipeline for the analysis of proteomics data
Patrick Pedrioli, Institute for Systems Biology
Overview
[Diagram: Biological sample -> ? Common Data Analysis Pipeline -> Data storage, publication]
Tasks for a proteomic analysis pipeline
Standard input
Peptide assignment
Validation
Protein assignment
Quantitation
Interpretation
A diverse set of Mass Spectrometers allows for more flexibility! Or ...
Different MS manufacturers store data in different proprietary formats.
This limits:
✗ Data analysis
✗ Exchange of RAW datasets and creation of public repositories
✗ Software development and distribution
Workarounds
✗ Exporting to other formats
✗ Pure text (ASCII): loss of precision <-> file size; speed of access
✗ ANDI/NetCDF: complex dictionaries of tags required for data completeness
✗ Obtaining libraries and documentation to directly access the RAW data in exchange for a non-disclosure agreement
mzXML
[Diagram: Thermo Finnigan Ion Trap, Micromass Q-TOF, ABI MALDI TOF-TOF, Sciex Q-TOF and ABI ESI TOF data converted to mzXML, which feeds specialised input formats (dta, mg, pkl, ...), the Common Data Analysis Pipeline and specialized applications]
In the ideal world ...
Common file format would be:
✗ Compact
✗ Precise
✗ Fast
✗ Machine independent
✗ Easy to expand
✗ Public domain
[Diagram labels: Binary, XML]
Speed issue can be addressed with an index for non-sequential access
[Diagram: mzXML document (main XML section, scans' offsets index); SAX parser, 1st parsing -> copy of scans' offsets index -> subsequent parsing operations]
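The idea of an offsets index can be sketched in a few lines of Python. This is only an illustration on a toy document: the tag layout, `build_offset_index` and `read_scan` are hypothetical names, and the index is rebuilt here by scanning the bytes once, whereas the slide describes a file that already carries its own index.

```python
import io
import re

def build_offset_index(stream):
    """First pass: record the byte offset of every <scan num="N"> start tag."""
    stream.seek(0)
    data = stream.read()
    return {int(m.group(1)): m.start()
            for m in re.finditer(rb'<scan num="(\d+)"', data)}

def read_scan(stream, index, scan_num):
    """Later passes: seek straight to one scan instead of re-reading the file."""
    stream.seek(index[scan_num])
    buf = b""
    # Read small blocks until the closing tag shows up (toy format, no nested scans).
    while b"</scan>" not in buf:
        block = stream.read(64)
        if not block:
            break
        buf += block
    end = buf.index(b"</scan>") + len(b"</scan>")
    return buf[:end].decode("ascii")

# Toy document, not a valid mzXML file.
DOC = (b'<msRun>'
       b'<scan num="1"><peaks>AAAA</peaks></scan>'
       b'<scan num="2"><peaks>BBBB</peaks></scan>'
       b'<scan num="3"><peaks>CCCC</peaks></scan>'
       b'</msRun>')

stream = io.BytesIO(DOC)
index = build_offset_index(stream)
scan2 = read_scan(stream, index, 2)
print(scan2)
```

After the first pass, any scan can be fetched with one seek, which is what makes non-sequential access cheap.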
Data analysis pipeline
[Diagram: raw to XML conversion (ReAdW, MassWolf; base64, htonl, sha1-sum, XML formatting) -> mzXML instance doc; integrity check (ValidateXML against the mzXML schema); XML to ... conversion (XML2Other -> dta, pkl, mg, txt; RAP)]
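The base64 and htonl labels in the conversion step suggest that peak lists are written as network-byte-order binary and then base64-encoded. A minimal Python sketch of that round trip follows; the 32-bit float width and the interleaved (m/z, intensity) layout are assumptions made for the example, and `encode_peaks` / `decode_peaks` are hypothetical helper names, not functions from the converters.

```python
import base64
import struct

def encode_peaks(peaks):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64."""
    flat = [value for pair in peaks for value in pair]
    raw = struct.pack(f"!{len(flat)}f", *flat)  # "!" selects network byte order
    return base64.b64encode(raw).decode("ascii")

def decode_peaks(text):
    """Reverse of encode_peaks: base64 -> bytes -> list of (m/z, intensity)."""
    raw = base64.b64decode(text)
    flat = struct.unpack(f"!{len(raw) // 4}f", raw)
    return list(zip(flat[0::2], flat[1::2]))

# Values chosen to be exactly representable as 32-bit floats.
peaks = [(100.5, 2000.0), (250.25, 1500.0)]
encoded = encode_peaks(peaks)
decoded = decode_peaks(encoded)
print(encoded)
print(decoded)
```

Fixing the byte order at write time is what lets the same file be read on machines with different native endianness.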
XML2Other
Export to external formats for programs that cannot be ported to read mzXML
[Diagram: mzXML instance doc -> XML2Other -> dta, pkl, mg, txt]
RAP: Random Access Parser
Library for directly extracting information from mzXML files
Available in 3 flavors: C, C++ and Java
Uses index for fast random data access
Multiple functions for fine tuned data retrieval
[Diagram: mzXML instance doc -> RAP -> scan header, peaks list, ...]
XMLicit ?
Integrity checks on the XML instance document
✗ XML Schema validation
✗ sha1-sum validation
Schema is hosted on SF -> easy way to test for data corruption from any computer connected to the Net (exchange of data with external collaborators)
[Diagram: mzXML instance doc + mzXML schema -> ValidateXML]
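The sha1-sum check can be illustrated with Python's standard `hashlib`. This is a simplified sketch: the digest is kept next to the document rather than embedded in it, the document bytes are a toy example, and `sha1_of` / `is_intact` are hypothetical names. Schema validation is not shown.

```python
import hashlib

def sha1_of(data: bytes) -> str:
    """Hex SHA-1 digest of the document bytes."""
    return hashlib.sha1(data).hexdigest()

def is_intact(data: bytes, expected_digest: str) -> bool:
    """True when the bytes still hash to the digest recorded at write time."""
    return sha1_of(data) == expected_digest

# Toy document and the digest recorded when it was written.
document = b'<msRun><scan num="1"/></msRun>'
digest = sha1_of(document)

# Simulate corruption during transfer: a single character changes.
corrupted = document.replace(b'num="1"', b'num="7"')

print(is_intact(document, digest))   # unchanged copy
print(is_intact(corrupted, digest))  # corrupted copy
```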
mzXML viewer
Offers a common way to display data generated on different instruments, therefore facilitating the exchange of datasets
http://sashimi.sourceforge.net/
In conclusion
An instrument-independent file format for MS data allows us to:
✗ Easily integrate a new instrument in a pre-existing analysis framework
✗ Develop and distribute new software
✗ Exchange data with external collaborators
✗ Create public data repositories
Assigning a Peptide to a Spectrum
Find a sequence of amino acids that can generate the list of masses seen in the tandem MS scan
Many different strategies:
✗ Searching MS/MS spectra against a sequence database
✗ De novo sequencing
✗ Combination of the two
Uninterpreted MS/MS Database Search
[Diagram: MS/MS data (m/z – Int, precursor mass) and sequence database -> theoretical spectra for given precursor mass]
✗ Assign scores to overlaps
✗ (Normalize scores)
✗ Keep best match
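The search steps above can be sketched in Python as a toy shared-peak-count search. This is not the scoring function of any particular search engine: the singly charged b/y ion model, the tolerances, the candidate list and all function names are assumptions made for the illustration, and score normalization is left out.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056
PROTON = 1.00728

def peptide_mass(seq):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[a] for a in seq) + WATER

def fragment_mz(seq):
    """Theoretical spectrum: singly charged b- and y-ion m/z values."""
    ions = []
    for i in range(1, len(seq)):
        ions.append(sum(MONO[a] for a in seq[:i]) + PROTON)          # b ion
        ions.append(sum(MONO[a] for a in seq[i:]) + WATER + PROTON)  # y ion
    return sorted(ions)

def shared_peaks(observed, theoretical, tol=0.5):
    """Score = number of theoretical ions matched by an observed peak."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)

def search(observed, precursor_mass, candidates, mass_tol=2.0):
    """Keep the best-scoring candidate whose mass matches the precursor."""
    best = None
    for seq in candidates:
        if abs(peptide_mass(seq) - precursor_mass) > mass_tol:
            continue  # only compare spectra for the given precursor mass
        score = shared_peaks(observed, fragment_mz(seq))
        if best is None or score > best[1]:
            best = (seq, score)
    return best

# Simulated "observed" spectrum: the ideal fragments of one peptide.
candidates = ["TPEVDDEALEK", "EPTVDDEALEK", "LSFNPTQLEEQCHI"]
observed = fragment_mz("TPEVDDEALEK")
best_match = search(observed, peptide_mass("TPEVDDEALEK"), candidates)
print(best_match)
```

The second candidate is a rearrangement with the same precursor mass, so it passes the mass filter but matches fewer fragment ions.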
Scoring Peptide Assignments
Multiple mature search engines are available, but ...
✗ They use different scoring algorithms
✗ Search outputs are not comparable
✗ Search outputs require expert validation
Peptide Validation
Validate peptide assignments made during the database search step.
Validation method should be standardized and independent from the experimental and computational methods used
Manual Validation
✗ Filtering by:
  ✗ Database search scores
  ✗ NTT (Number of Tryptic Termini)
Problem: filtering criteria vary among researchers and error rates are unknown
Peptide Identification Effort
[Chart: effort split across sample prep, instrument time, database search and manual validation]
✗ Manual verification of the assignments
Problem: possible only on very small datasets
PeptideProphet
Compute probabilities that peptides assigned to MS/MS spectra are correct
Spectrum     Peptide   Score   Probability
Spectrum 1   LGEYGH    4.5     1.0
Spectrum 2   FQSEEQ    3.4     0.97
Spectrum 3   FLYQE     1.3     0.01
...          ...       ...     ...
Spectrum N   EIQKKF    2.2     0.3
Learns distributions of search scores and peptide properties among correct and incorrect results
[Plot: score distributions for correct results, incorrect results and the entire dataset; p=0.5 marked]
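One way to picture how distributions of correct and incorrect scores turn into a probability is a two-component mixture fitted by expectation maximization, followed by Bayes' rule. The sketch below is an illustration only: it assumes both score distributions are Gaussian (chosen for brevity, not taken from the slides), it uses simulated scores, and `fit_two_gaussians` / `probability_correct` are hypothetical names.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_gaussians(scores, iterations=100):
    """EM for a 1-D mixture; component 0 = incorrect, component 1 = correct."""
    mu = [min(scores), max(scores)]
    mean = sum(scores) / len(scores)
    spread = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of the "correct" component for each score.
        resp = []
        for s in scores:
            a = weight[0] * normal_pdf(s, mu[0], sigma[0])
            b = weight[1] * normal_pdf(s, mu[1], sigma[1])
            resp.append(b / (a + b))
        # M-step: re-estimate mean, spread and mixing weight of each component.
        for k, r in enumerate(([1 - x for x in resp], resp)):
            total = sum(r)
            mu[k] = sum(ri * s for ri, s in zip(r, scores)) / total
            var = sum(ri * (s - mu[k]) ** 2 for ri, s in zip(r, scores)) / total
            sigma[k] = max(math.sqrt(var), 1e-3)
            weight[k] = total / len(scores)
    return mu, sigma, weight

def probability_correct(score, mu, sigma, weight):
    """Bayes' rule: P(correct | score) under the fitted mixture."""
    a = weight[0] * normal_pdf(score, mu[0], sigma[0])
    b = weight[1] * normal_pdf(score, mu[1], sigma[1])
    return b / (a + b)

# Simulated search scores: many incorrect (low) and fewer correct (high) results.
rng = random.Random(0)
scores = ([rng.gauss(1.0, 0.5) for _ in range(300)]
          + [rng.gauss(4.0, 0.7) for _ in range(100)])
mu, sigma, weight = fit_two_gaussians(scores)
print(probability_correct(4.5, mu, sigma, weight))
print(probability_correct(1.0, mu, sigma, weight))
```

No labels are given to the fit; the model learns both distributions from the whole dataset and then scores each assignment individually.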
The computed probabilities are a true measure of the confidence
PeptideProphet
Sensitivity: fraction of all correct results passing filter
Error rate: fraction of all results passing filter that are incorrect
Improved discrimination: more identifications (for the same error rate)
[Plot: sensitivity vs. error rate for the probability model and for SEQUEST thresholds (from literature); ideal spot marked]
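If the computed probabilities are taken at face value, the two quantities defined above can be estimated for any probability cutoff without knowing which assignments are truly correct. A small Python sketch, using the four probabilities from the example table shown earlier; `sensitivity_and_error` is a hypothetical name and the calculation assumes the probabilities are accurate.

```python
def sensitivity_and_error(probabilities, threshold):
    """Expected sensitivity and error rate when keeping results with p >= threshold."""
    passing = [p for p in probabilities if p >= threshold]
    # Expected number of correct results overall and among those passing.
    expected_correct_total = sum(probabilities)
    expected_correct_passing = sum(passing)
    sensitivity = expected_correct_passing / expected_correct_total
    # Expected fraction of passing results that are incorrect.
    error_rate = sum(1 - p for p in passing) / len(passing) if passing else 0.0
    return sensitivity, error_rate

probs = [1.0, 0.97, 0.01, 0.3]
sens, err = sensitivity_and_error(probs, 0.5)
print(f"sensitivity={sens:.3f} error_rate={err:.3f}")
```

Sweeping the threshold from 0 to 1 traces out a sensitivity versus error rate curve of the kind compared against fixed score thresholds.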
PeptideProphet
“Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search”
A. Keller, A. Nesvizhskii, E. Kolker, R. Aebersold, Anal. Chem. 2002, 74, 5383-5392
>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
TPEVDDEALEK : p = 0.96
TPEVDDEALEKFDK : p = 0.96
KPTPEGDLEILLQK : p = 0.83
LSFNPTQLEEQCHI : p = 0.65
LSFNPTQLEEQCHI : p = 0.76
sp|P02754|LACB_BOVIN Prob = ???
ProteinProphet
Allows standardized publishing of large-scale datasets in the literature and meaningful comparison between results of different experiments
Combines probabilities of peptides assigned to MS/MS spectra to compute probability that corresponding proteins are present in the sample
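A simple way to combine peptide probabilities is to ask how likely it is that every peptide assigned to the protein is wrong, and take the complement. The Python sketch below applies this to the beta-lactoglobulin peptides listed above. It is a simplification under stated assumptions: peptides are treated as independent evidence, a peptide identified more than once counts only with its highest probability, and no adjustment for sibling peptides or shared peptides is made. `protein_probability` is a hypothetical name.

```python
def protein_probability(peptide_hits):
    """P(protein present) = 1 - product over distinct peptides of (1 - p)."""
    best = {}
    for sequence, p in peptide_hits:
        # Repeated identifications of one peptide: keep the highest probability.
        best[sequence] = max(p, best.get(sequence, 0.0))
    all_wrong = 1.0
    for p in best.values():
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong

hits = [("TPEVDDEALEK", 0.96),
        ("TPEVDDEALEKFDK", 0.96),
        ("KPTPEGDLEILLQK", 0.83),
        ("LSFNPTQLEEQCHI", 0.65),
        ("LSFNPTQLEEQCHI", 0.76)]
prob = protein_probability(hits)
print(f"{prob:.6f}")
```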
ProteinProphet
[Diagram: Peptides 1, 2 and 3 mapped to Prot A and Prot B]
✗ Correctly assigned peptides tend to have siblings
✗ In case of redundancy use expectation maximization to find a minimal list of proteins
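The redundancy case can be illustrated with a small EM-like iteration in Python: a peptide shared by several proteins is split among them in proportion to the current protein probabilities, the protein probabilities are recomputed, and the two steps repeat. This is a simplified sketch under stated assumptions, not the published model: the sibling adjustment is ignored, and the peptide-to-protein mapping, the probabilities and the name `apportion` are hypothetical.

```python
def apportion(peptides, iterations=200):
    """peptides: list of (peptide probability, [names of proteins containing it])."""
    proteins = sorted({name for _, names in peptides for name in names})
    prob = {name: 0.5 for name in proteins}  # neutral starting point
    for _ in range(iterations):
        all_wrong = {name: 1.0 for name in proteins}
        for p, names in peptides:
            total = sum(prob[n] for n in names)
            for n in names:
                # Weight of this peptide for protein n: its share of the evidence.
                w = prob[n] / total if total > 0 else 1.0 / len(names)
                all_wrong[n] *= 1.0 - w * p
        prob = {n: 1.0 - all_wrong[n] for n in proteins}
    return prob

# Hypothetical case: one peptide unique to Prot A, one shared by Prot A and Prot B.
peptides = [(0.9, ["A"]), (0.8, ["A", "B"])]
result = apportion(peptides)
print(result)
```

Because Prot A already explains the shared peptide, the weight drifts toward it and Prot B drops out, which is the minimal-list behaviour the bullet describes.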
ProteinProphet
“A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry”
A. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, Anal. Chem., in press
Quantitation
Compute relative quantitation of isotopically labeled peptides
Derive protein ratios with a corresponding p value
Quantitation
XPRESS and ASAPRatio
✗ Identify labeled peptides (MS/MS)
✗ Reconstruct single ion chromatograms for the 2 species (MS)
✗ Calculate peptide ratios and confidence intervals
✗ Interface with ProteinProphet and derive protein ratios with confidence intervals
✗ Error module determines significant ratio changes in large datasets
[Figure labels: light, heavy]
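The ratio step can be pictured as integrating the two reconstructed chromatograms and dividing the areas. A minimal Python sketch follows, assuming trapezoidal integration over a shared time axis; background subtraction, peak boundary detection and confidence intervals are left out, the intensity values are made up, and the function names are hypothetical rather than taken from XPRESS or ASAPRatio.

```python
def trapezoid_area(times, intensities):
    """Area under a chromatogram by the trapezoid rule."""
    return sum((t1 - t0) * (y0 + y1) / 2.0
               for t0, t1, y0, y1 in zip(times, times[1:],
                                         intensities, intensities[1:]))

def light_heavy_ratio(times, light, heavy):
    """Relative abundance of the light form versus the heavy form of one peptide."""
    return trapezoid_area(times, light) / trapezoid_area(times, heavy)

# Made-up single ion chromatograms sampled at the same retention times.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [0.0, 10.0, 20.0, 10.0, 0.0]
heavy = [0.0, 5.0, 10.0, 5.0, 0.0]
ratio = light_heavy_ratio(times, light, heavy)
print(ratio)
```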
Quantitation
“Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry”
D. K. Han, J. Eng, H. Zhou, and R. Aebersold, Nature Biotechnology, 2001, 19, 946-951
“Automated Statistical Analysis of Protein Abundance Ratios from Data Generated by Stable-Isotope Dilution and Tandem Mass Spectrometry”
X. Li, H. Zhang, J. A. Ranish, R. Aebersold, Submitted
Interpretation
Assign a biological meaning to the output of the pipeline
SBEAMS
✗ SBEAMS (Systems Biology Experiment Analysis Management System) project is a framework for collecting, storing, and accessing data produced by a variety of different experiments; these experiments can be managed separately but then correlated later under the same framework.
✗ It combines a relational database, a web front end for querying the database and providing integrated access to remote data sources, and an API for further software development.
✗ The data products of the analysis pipeline are then ingested into the database
http://www.sbeams.org/
Cytoscape
Visualize molecular interaction networks and integrate these interactions with gene expression profiles and other state data
● Filtering the network to select subsets of nodes and/or interactions. For instance, users may select nodes involved in a threshold number of interactions, or nodes that share a particular GO annotation.
http://www.cytoscape.org/
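The two filtering examples mentioned for Cytoscape (nodes with a threshold number of interactions, nodes sharing an annotation) can be pictured with plain Python sets. This sketch does not use the Cytoscape API; the network, the annotation labels and the function names are all hypothetical.

```python
from collections import Counter

def nodes_with_min_interactions(edges, minimum):
    """Nodes taking part in at least `minimum` interactions."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {node for node, count in degree.items() if count >= minimum}

def nodes_with_annotation(annotations, term):
    """Nodes whose annotation set contains `term`."""
    return {node for node, terms in annotations.items() if term in terms}

# Hypothetical interaction network and annotation labels.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
annotations = {"P1": {"term_a"}, "P2": {"term_a", "term_b"},
               "P3": {"term_b"}, "P4": set()}

hubs = nodes_with_min_interactions(edges, 3)
shared = nodes_with_annotation(annotations, "term_a")
print(hubs, shared)
```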
Acknowledgments
Ruedi Aebersold
Jimmy Eng
Xiao-jun Li
Alexey Nesvizhskii
Andy Keller
Robert Hubley
Eric Deutsch
Paul Shannon
Ning Zhang
Erik Nillson
Brian Pratt