
A bioinformatic pipeline for the analysis of proteomics data

Patrick Pedrioli, Institute for Systems Biology

[Figure: from biological sample through the common data analysis pipeline to data storage and publication]

Overview

? Common Data Analysis Pipeline

Standard input -> Peptide assignment -> Validation -> Protein assignment -> Quantitation -> Interpretation

Tasks for a proteomic analysis pipeline


A diverse set of Mass Spectrometers allows for more flexibility! Or ...

Different MS manufacturers store data in different proprietary formats.

This limits:

✗ Data analysis

✗ Exchange of RAW datasets and creation of public repositories

✗ Software development and distribution


Workarounds

✗ Exporting to other formats

✗ Pure text (ASCII): loss of precision <-> file size, speed of access

✗ ANDI/NetCDF: complex dictionaries of tags required for data completeness
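The trade-off behind the pure-text workaround can be illustrated with a minimal sketch. The m/z value and the comparison against an 8-byte binary double are illustrative assumptions and are not taken from any vendor format.

```python
import struct

mz = 445.12057  # hypothetical m/z value, for illustration only

# Text export with two decimals: short, but the value is rounded.
short_text = f"{mz:.2f}"
rounding_error = abs(float(short_text) - mz)

# Text export with ten decimals: more digits, but longer than a binary double.
long_text = f"{mz:.10f}"

# Binary export: a fixed 8 bytes per value, recovered exactly.
binary = struct.pack("!d", mz)
recovered = struct.unpack("!d", binary)[0]

print(len(short_text), rounding_error)
print(len(long_text), len(binary), recovered == mz)
```

Fewer decimals keep the text file small at the cost of precision, while more decimals make each value longer than its binary form.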


✗ Obtaining libraries and documentation to directly access the RAW data in exchange for a non-disclosure agreement



Workarounds

[Figure: mzXML sits between the instruments (Thermo Finnigan Ion Trap, Micromass Q-TOF, ABI MALDI TOF-TOF, Sciex Q-TOF, ABI ESI TOF) and the Common Data Analysis Pipeline; specialised input formats (dta, mg, pkl, ...) serve specialized applications]

In the ideal world ...

Common file format would be:

✗ Compact
✗ Precise
✗ Fast
✗ Machine independent
✗ Easy to expand
✗ Public domain

Binary

XML



Speed issue can be addressed with an index for non-sequential access

[Figure: an mzXML document holds a main XML section followed by a scans' offsets index. The SAX parser reads the index on the 1st parsing and keeps a copy of the scans' offsets index for subsequent parsing operations]
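The idea of an offsets index for non-sequential access can be sketched as follows. This is a toy illustration only: the one-scan-per-line layout, the `<scan ` marker and the function names are assumptions made for the example and do not reproduce the real mzXML schema or parser.

```python
import io


def build_offset_index(stream, marker=b"<scan "):
    """Read the stream once and record the byte offset of every scan line."""
    index = []
    stream.seek(0)
    while True:
        position = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.lstrip().startswith(marker):
            index.append(position)
    return index


def read_scan(stream, index, scan_number):
    """Jump straight to a scan (1-based) using the stored offset."""
    stream.seek(index[scan_number - 1])
    return stream.readline().decode("ascii").strip()


# Hypothetical document with one scan element per line.
toy_document = io.BytesIO(
    b"<run>\n"
    b'  <scan num="1">peaks of scan 1</scan>\n'
    b'  <scan num="2">peaks of scan 2</scan>\n'
    b'  <scan num="3">peaks of scan 3</scan>\n'
    b"</run>\n"
)

scan_index = build_offset_index(toy_document)
print(read_scan(toy_document, scan_index, 3))
```

Once the index exists, retrieving any scan costs one seek instead of a full pass over the file.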



Data analysis pipeline

[Figure: tools around the mzXML instance doc]

✗ raw to XML conversion: ReAdW, MassWolf (base64, htonl, sha1-sum, XML formatting)

✗ Integrity check: ValidateXML (uses the mzXML schema)

✗ XML to ... conversion: XML2Other (dta, pkl, mg, txt)

✗ RAP
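The base64 and htonl steps named in the conversion stage can be illustrated with a short sketch. It assumes, for the sake of the example, that a peak list is stored as 32-bit floats in network (big-endian) byte order, flattened as m/z, intensity pairs; the function names are hypothetical.

```python
import base64
import struct


def encode_peaks(peaks):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64."""
    flat = [value for pair in peaks for value in pair]
    raw = struct.pack(f"!{len(flat)}f", *flat)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text):
    """Reverse of encode_peaks: base64 text back to (m/z, intensity) pairs."""
    raw = base64.b64decode(text)
    values = struct.unpack(f"!{len(raw) // 4}f", raw)
    return list(zip(values[0::2], values[1::2]))


# Hypothetical peaks, chosen to be exactly representable as 32-bit floats.
peaks = [(100.5, 2000.0), (250.25, 15.0)]
encoded = encode_peaks(peaks)
print(encoded)
print(decode_peaks(encoded))
```

Fixing the byte order makes the binary payload machine independent, and base64 lets it travel inside an XML text node.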



XML2Other

Export to external formats for programs that cannot be ported to read mzXML

[Figure: mzXML instance doc -> XML2Other -> dta, pkl, mg, txt]



RAP: Random Access Parser

Library for directly extracting information from mzXML files

Available in 3 flavors: C, C++ and Java

Uses index for fast random data access

Multiple functions for fine tuned data retrieval

[Figure: mzXML instance doc -> RAP -> scan header, peaks list, ...]



XMLicit ?

[Figure: ValidateXML checks the mzXML instance doc against the mzXML schema]

Integrity checks on the XML instance document

✗ XML Schema validation
✗ sha1-sum validation

Schema is hosted on SF (SourceForge) -> easy way to test for data corruption from any computer connected to the Net (exchange of data with external collaborators)
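A checksum-based integrity check can be sketched in a few lines. This only shows the general sha1-sum idea; how the digest is stored inside an mzXML file is not described here, so the `recorded` value below is simply computed on the spot for a hypothetical payload.

```python
import hashlib


def sha1_hex(payload: bytes) -> str:
    """Return the SHA-1 digest of the payload as a hex string."""
    return hashlib.sha1(payload).hexdigest()


def is_intact(payload: bytes, recorded_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded earlier."""
    return sha1_hex(payload) == recorded_digest.lower()


document = b"<run>toy document</run>"  # hypothetical payload
recorded = sha1_hex(document)          # digest stored at conversion time

print(is_intact(document, recorded))         # unchanged data
print(is_intact(document + b" ", recorded))  # a single extra byte
```

Any change to the bytes, even a single trailing space, produces a different digest and is therefore detected.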



mzXML viewer



Offers a common way to display data generated on different instruments, therefore facilitating the exchange of datasets


http://sashimi.sourceforge.net/

In conclusion

An instrument-independent file format for MS data allows us to:

✗ Easily integrate a new instrument in a pre-existing analysis framework
✗ Develop and distribute new software
✗ Exchange data with external collaborators
✗ Create public data repositories


Assigning a Peptide to a Spectrum


Find a sequence of amino acids that can generate the list of masses seen in the tandem MS scan

Assigning a Peptide to a Spectrum

Many different strategies:

✗ Searching MS/MS spectra against a sequence database

✗ De novo sequencing

✗ Combination of the two


Uninterpreted MS/MS Database Search

[Figure: MS/MS data (m/z – intensity pairs, precursor mass) are compared with theoretical spectra generated from the sequence database for the given precursor mass]

✗ Assign scores to overlaps
✗ (Normalize scores)
✗ Keep best match
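The search steps above can be sketched with a deliberately simple scoring scheme. This is an illustration under stated assumptions, not the algorithm of any particular search engine: fragments are limited to singly charged b and y ions, the score is a plain shared-peak count, and the tolerances and the scrambled decoy peptide are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def fragment_mz(sequence):
    """Theoretical singly charged b and y ions for every cleavage site."""
    ions = []
    for i in range(1, len(sequence)):
        ions.append(sum(MONO[aa] for aa in sequence[:i]) + PROTON)          # b ion
        ions.append(sum(MONO[aa] for aa in sequence[i:]) + WATER + PROTON)  # y ion
    return ions


def shared_peaks(observed, theoretical, tolerance=0.5):
    """Count theoretical fragments that have an observed peak within tolerance."""
    return sum(any(abs(o - t) <= tolerance for o in observed) for t in theoretical)


def search(observed, precursor_mass, database, mass_tolerance=2.0):
    """Score candidates matching the precursor mass and keep the best match."""
    candidates = [p for p in database
                  if abs(peptide_mass(p) - precursor_mass) <= mass_tolerance]
    scored = [(shared_peaks(observed, fragment_mz(p)), p) for p in candidates]
    return max(scored) if scored else None


# Toy database: three peptides plus a scrambled, same-mass decoy.
database = ["TPEVDDEALEK", "TPEVDDEALKE", "KPTPEGDLEILLQK", "LSFNPTQLEEQCHI"]
observed = fragment_mz("TPEVDDEALEK")  # idealised, noise-free spectrum
best_score, best_peptide = search(observed, peptide_mass("TPEVDDEALEK"), database)
print(best_score, best_peptide)
```

Real engines replace the shared-peak count with their own scoring functions, which is why their outputs are not directly comparable.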



Scoring Peptide Assignments

Multiple mature search engines are available, but ...

✗ They use different scoring algorithms
✗ Search outputs are not comparable
✗ Search outputs require expert validation



Peptide Validation


Validate peptide assignments made during the database search step.

Validation method should be standardized and independent from the experimental and computational methods used

Manual Validation

Filtering by:

✗ Database search scores
✗ NTT (Number of Tryptic Termini)

Problems: Filtering criteria vary among researchers and error rates are unknown
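The kind of filter described above can be sketched as follows. The sketch assumes a simplified trypsin rule (cleavage after K or R, ignoring the usual proline exception), counts a protein terminus as a valid terminus, and uses a hypothetical score threshold; the short protein fragment is taken from the beta-lactoglobulin sequence shown later in this talk.

```python
def count_tryptic_termini(protein, peptide):
    """Return the NTT (0, 1 or 2) of a peptide within its protein."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminus is tryptic if the preceding residue is K or R.
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminus is tryptic if the peptide itself ends in K or R.
    c_term_ok = end == len(protein) or peptide[-1] in "KR"
    return int(n_term_ok) + int(c_term_ok)


def passes_filter(score, ntt, min_score=2.0, min_ntt=1):
    """Hypothetical manual filter on search score and NTT."""
    return score >= min_score and ntt >= min_ntt


fragment = "LVRTPEVDDEALEKFDKALK"
for peptide in ("TPEVDDEALEK", "PEVDDEALEK", "EVDDEALE"):
    print(peptide, count_tryptic_termini(fragment, peptide))
```

Because `min_score` and `min_ntt` are chosen by each researcher, two labs can report different identifications from the same data, and neither knows the resulting error rate.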



Peptide Identification Effort

[Figure: effort split across sample prep, instrument time, database search and manual validation]

✗ Manual verification of the assignments

Problem: possible only on very small datasets

PeptideProphet

Compute probabilities that peptides assigned to MS/MS spectra are correct

Spectrum      Peptide   Score   Probability
Spectrum 1    LGEYGH    4.5     1.0
Spectrum 2    FQSEEQ    3.4     0.97
Spectrum 3    FLYQE     1.3     0.01
...           ...       ...     ...
Spectrum N    EIQKKF    2.2     0.3

Learns distributions of search scores and peptide properties among correct and incorrect results

[Figure: score distributions of correct and incorrect results and of the entire dataset, with p=0.5 marked]
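Learning the two score distributions and turning them into probabilities can be sketched with a two-component mixture fitted by expectation maximization. This is a simplified stand-in under stated assumptions: both components are Gaussian, only a single score is modelled (no peptide properties), and the scores are synthetic; it is not the published PeptideProphet model.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(scores, iterations=50):
    """EM for a mixture of 'correct' and 'incorrect' Gaussian score distributions."""
    n = len(scores)
    overall_mean = sum(scores) / n
    sd = math.sqrt(sum((s - overall_mean) ** 2 for s in scores) / n)
    mean_bad, mean_good = min(scores), max(scores)
    sd_bad = sd_good = sd
    prior_good = 0.5
    for _ in range(iterations):
        # E-step: probability that each score belongs to the 'correct' component.
        resp = []
        for s in scores:
            good = prior_good * normal_pdf(s, mean_good, sd_good)
            bad = (1.0 - prior_good) * normal_pdf(s, mean_bad, sd_bad)
            resp.append(good / (good + bad))
        # M-step: re-estimate the parameters from the weighted scores.
        total_good = sum(resp)
        total_bad = n - total_good
        mean_good = sum(r * s for r, s in zip(resp, scores)) / total_good
        mean_bad = sum((1.0 - r) * s for r, s in zip(resp, scores)) / total_bad
        sd_good = max(1e-6, math.sqrt(
            sum(r * (s - mean_good) ** 2 for r, s in zip(resp, scores)) / total_good))
        sd_bad = max(1e-6, math.sqrt(
            sum((1.0 - r) * (s - mean_bad) ** 2 for r, s in zip(resp, scores)) / total_bad))
        prior_good = total_good / n
    return prior_good, (mean_good, sd_good), (mean_bad, sd_bad)


def probability_correct(score, model):
    """Bayes' rule: p(correct | score) from the fitted mixture."""
    prior_good, (mean_good, sd_good), (mean_bad, sd_bad) = model
    good = prior_good * normal_pdf(score, mean_good, sd_good)
    bad = (1.0 - prior_good) * normal_pdf(score, mean_bad, sd_bad)
    return good / (good + bad)


# Synthetic scores: 300 incorrect around 1.0 and 100 correct around 4.0.
rng = random.Random(0)
scores = [rng.gauss(1.0, 0.5) for _ in range(300)] + \
         [rng.gauss(4.0, 0.7) for _ in range(100)]
model = fit_mixture(scores)
print(probability_correct(4.5, model), probability_correct(1.0, model))
```

The point where the two weighted densities are equal corresponds to p = 0.5.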



The computed probabilities are a true measure of the confidence

PeptideProphet

Sensitivity: fraction of all correct results passing filter

Error Rate: fraction of all results passing filter that are incorrect

Improved discrimination: more identifications (for the same error rate)

[Figure: sensitivity against error rate for the probability model and for SEQUEST thresholds (from literature), with the ideal spot marked]
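Given the two definitions above, both quantities can be estimated directly from the computed probabilities. The sketch assumes the probabilities are accurate, so that their sum equals the expected number of correct assignments; it reuses the four example probabilities from the PeptideProphet table, and the 0.5 threshold is an arbitrary choice.

```python
def error_rate_and_sensitivity(probabilities, threshold):
    """Estimate error rate and sensitivity of a probability cut-off."""
    passing = [p for p in probabilities if p >= threshold]
    expected_correct_total = sum(probabilities)
    expected_correct_passing = sum(passing)
    # Error rate: expected incorrect results among those passing the filter.
    error_rate = ((len(passing) - expected_correct_passing) / len(passing)
                  if passing else 0.0)
    # Sensitivity: expected correct results passing, out of all expected correct.
    sensitivity = (expected_correct_passing / expected_correct_total
                   if expected_correct_total else 0.0)
    return error_rate, sensitivity


probabilities = [1.0, 0.97, 0.01, 0.3]
error_rate, sensitivity = error_rate_and_sensitivity(probabilities, 0.5)
print(round(error_rate, 3), round(sensitivity, 3))
```

Sweeping the threshold from 0 to 1 traces out the sensitivity and error rate trade-off for a dataset.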



PeptideProphet



“Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search”

A. Keller, A. Nesvizhskii, E. Kolker, R. Aebersold, Anal. Chem. 2002, 74, 5383-5392


>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).

MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

TPEVDDEALEK : p = 0.96
TPEVDDEALEKFDK : p = 0.96
KPTPEGDLEILLQK : p = 0.83
LSFNPTQLEEQCHI : p = 0.65
LSFNPTQLEEQCHI : p = 0.76

sp|P02754|LACB_BOVIN Prob = ???


ProteinProphet

Allows standardized publishing of large-scale datasets in the literature and meaningful comparison between results of different experiments

Combines probabilities of peptides assigned to MS/MS spectra to compute probability that corresponding proteins are present in the sample
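One simple way to combine peptide probabilities into a protein probability is sketched below. It assumes that peptide identifications are independent and that only the best probability per distinct peptide is kept, so the protein probability is one minus the chance that every peptide is wrong; the sibling and shared-peptide adjustments of the full model are left out. The input values are the beta-lactoglobulin peptide probabilities listed earlier.

```python
def protein_probability(peptide_hits):
    """Combine peptide probabilities: P = 1 - prod(1 - p_i) over distinct peptides."""
    best = {}
    for sequence, probability in peptide_hits:
        # Keep only the highest probability seen for each distinct peptide.
        best[sequence] = max(probability, best.get(sequence, 0.0))
    all_wrong = 1.0
    for probability in best.values():
        all_wrong *= 1.0 - probability
    return 1.0 - all_wrong


hits = [
    ("TPEVDDEALEK", 0.96),
    ("TPEVDDEALEKFDK", 0.96),
    ("KPTPEGDLEILLQK", 0.83),
    ("LSFNPTQLEEQCHI", 0.65),
    ("LSFNPTQLEEQCHI", 0.76),
]
print(protein_probability(hits))
```

Under these assumptions several moderately confident peptides add up to a very confident protein identification.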


ProteinProphet

[Figure: Peptide 1, Peptide 2 and Peptide 3 linked to the proteins Prot A and Prot B they map to]

✗ Correctly assigned peptides tend to have siblings

✗ In case of redundancy use expectation maximization to find a minimal list of proteins
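How an iterative scheme can shrink the protein list in the presence of shared peptides is sketched below. This is an EM-like illustration under assumptions, not the published ProteinProphet algorithm: each shared peptide is apportioned among its proteins in proportion to the current protein probabilities, and the peptide-to-protein mapping and probabilities are hypothetical.

```python
def protein_probabilities(peptides, iterations=100):
    """peptides maps a peptide name to (probability, list of proteins)."""
    proteins = sorted({name for _, names in peptides.values() for name in names})
    # Start by splitting every peptide evenly among the proteins it maps to.
    weights = {(pep, prot): 1.0 / len(names)
               for pep, (_, names) in peptides.items() for prot in names}
    probabilities = {}
    for _ in range(iterations):
        # Protein probability from its weighted peptide evidence.
        for prot in proteins:
            all_wrong = 1.0
            for pep, (p, names) in peptides.items():
                if prot in names:
                    all_wrong *= 1.0 - weights[(pep, prot)] * p
            probabilities[prot] = 1.0 - all_wrong
        # Re-apportion shared peptides according to the protein probabilities.
        for pep, (_, names) in peptides.items():
            total = sum(probabilities[name] for name in names)
            if total > 0.0:
                for name in names:
                    weights[(pep, name)] = probabilities[name] / total
    return probabilities


# Hypothetical mapping: Prot B is supported only by a peptide shared with Prot A.
peptides = {
    "Peptide 1": (0.9, ["Prot A"]),
    "Peptide 2": (0.8, ["Prot A", "Prot B"]),
    "Peptide 3": (0.9, ["Prot A"]),
}
result = protein_probabilities(peptides)
print(result)
```

In this toy case the shared peptide is progressively credited to Prot A, which has independent evidence, so Prot B drops out of the minimal list.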

ProteinProphet


“A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry”

A. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold, Anal. Chem., in press


Quantitation


Compute relative quantitation of isotopically labeled peptides

Derive protein ratios with a corresponding p value

Quantitation

XPRESS and ASAPRatio

[Figure: heavy and light peaks]

✗ Identify labeled peptides (MS/MS)

✗ Reconstruct single ion chromatograms for the 2 species (MS)

✗ Calculate peptide ratios and confidence intervals

✗ Interface with ProteinProphet and derive protein ratios with confidence intervals

✗ Error module determines significant ratio changes in large datasets
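The ratio step can be sketched by integrating the two reconstructed ion chromatograms and dividing the areas. The trapezoid integration, the function names and the chromatogram values are illustrative assumptions; the smoothing, background handling and confidence intervals of XPRESS and ASAPRatio are not reproduced.

```python
def chromatogram_area(times, intensities):
    """Area under a single ion chromatogram using the trapezoid rule."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area


def light_to_heavy_ratio(times, light, heavy):
    """Relative abundance of the light form with respect to the heavy form."""
    heavy_area = chromatogram_area(times, heavy)
    if heavy_area == 0.0:
        raise ValueError("heavy chromatogram has zero area")
    return chromatogram_area(times, light) / heavy_area


# Hypothetical elution profiles sampled at the same retention times.
times = [0.0, 1.0, 2.0]
light = [0.0, 10.0, 0.0]
heavy = [0.0, 5.0, 0.0]
print(light_to_heavy_ratio(times, light, heavy))
```

Using areas rather than single peak heights makes the ratio less sensitive to noise in any one MS scan.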


Quantitation


“Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry”

D. K. Han, J. Eng, H. Zhou, and R. Aebersold, Nature Biotechnology, 2001, 19, 946-951

“Automated Statistical Analysis of Protein Abundance Ratios from Data Generated by Stable-Isotope Dilution and Tandem Mass Spectrometry”

X. Li, H. Zhang, J. A. Ranish, R. Aebersold, Submitted


Interpretation


Assign a biological meaning to the output of the pipeline

SBEAMS

✗ The SBEAMS (Systems Biology Experiment Analysis Management System) project is a framework for collecting, storing, and accessing data produced by a variety of different experiments; these experiments can be managed separately but then correlated later under the same framework.

✗ It combines a relational database, a web front end for querying the database and providing integrated access to remote data sources, and an API for further software development.

✗ The data products of the analysis pipeline are then ingested into the database


http://www.sbeams.org/

Visualize molecular interaction networks and integrate these interactions with gene expression profiles and other state data

Cytoscape

● Filtering the network to select subsets of nodes and/or interactions. For instance, users may select nodes involved in a threshold number of interactions, or nodes that share a particular GO annotation.
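The two filters mentioned above can be sketched on a plain edge list. This is generic Python, not the Cytoscape API; the node names, interactions and annotation labels are hypothetical.

```python
from collections import Counter


def nodes_with_min_interactions(interactions, minimum):
    """Select nodes that take part in at least `minimum` interactions."""
    degree = Counter()
    for node_a, node_b in interactions:
        degree[node_a] += 1
        degree[node_b] += 1
    return sorted(node for node, count in degree.items() if count >= minimum)


def nodes_with_annotation(annotations, term):
    """Select nodes that carry a given annotation term."""
    return sorted(node for node, terms in annotations.items() if term in terms)


# Hypothetical interaction network and annotations.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
annotations = {
    "P1": {"transport"},
    "P2": {"transport", "binding"},
    "P3": {"binding"},
    "P4": set(),
}
print(nodes_with_min_interactions(interactions, 2))
print(nodes_with_annotation(annotations, "transport"))
```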


http://www.cytoscape.org/

Acknowledgments

Ruedi Aebersold

Jimmy Eng

Xiao-jun Li

Alexey Nesvizhskii

Andy Keller

Robert Hubley

Eric Deutsch

Paul Shannon

Ning Zhang

Erik Nillson

Brian Pratt
