Page 1: A bioinformatic pipeline for the analysis of …sashimi.sourceforge.net/extra/oral.pdf · A bioinformatic pipeline for the analysis of proteomics data ... Software development and

A bioinformatic pipeline for the analysis of proteomics data

Patrick Pedrioli
Institute for Systems Biology

Page 2

Overview

Biological sample

? Common Data Analysis Pipeline

Data storage

Publication

Page 3

Tasks for a proteomic analysis pipeline

? Common Data Analysis Pipeline

Standard input
Peptide assignment
Validation
Protein assignment
Quantitation
Interpretation

Page 5

A diverse set of Mass Spectrometers allows for more flexibility! Or ...

Different MS manufacturers store data in different proprietary formats.

This limits:

✗ Data analysis
✗ Exchange of RAW datasets and creation of public repositories
✗ Software development and distribution

Page 6

Workarounds

✗ Exporting to other formats
✗ Pure text (ASCII)
✗ ANDI/NetCDF

Page 7

Workarounds

✗ Exporting to other formats
✗ Pure text (ASCII): loss of precision <-> file size, speed of access
✗ ANDI/NetCDF: complex dictionaries of tags required for data completeness

Page 8

Workarounds

✗ Obtaining libraries and documentation to directly access the RAW data in exchange for a non-disclosure agreement

Page 9

Thermo Finnigan Ion Trap
Micromass Q-TOF
ABI MALDI TOF-TOF
Sciex Q-TOF
ABI ESI TOF

mzXML

Common Data Analysis Pipeline

Specialised input formats (dta, mg, pkl, ...)

Specialized applications

Page 10

In the ideal world ...

Common file format would be:

✗ Compact
✗ Precise
✗ Fast
✗ Machine independent
✗ Easy to expand
✗ Public domain

Binary

XML

Page 11

Speed issue can be addressed with an index for non-sequential access

mzXML document
Main XML section
Scans' offsets index

SAX parser
1st parsing
Copy of scans' offsets index
Subsequent parsing operations
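The index idea on this slide can be sketched in a few lines of Python. This is only an illustration on a toy, flat document: the `<scan num="...">` layout, the names `build_scan_index` and `read_scan`, and the 64-byte read size are assumptions made for the example, not the actual mzXML index layout or parser API.

```python
import io
import re

# Toy stand-in for an mzXML document (assumed layout, flat <scan> elements).
TOY_DOC = (b'<run><scan num="1">aaa</scan>'
           b'<scan num="2">bbbb</scan><scan num="3">cc</scan></run>')


def build_scan_index(data: bytes) -> dict:
    """First pass: map each scan number to the byte offset of its <scan tag."""
    return {int(m.group(1)): m.start()
            for m in re.finditer(rb'<scan\s+num="(\d+)"', data)}


def read_scan(stream, index: dict, scan_num: int) -> bytes:
    """Later passes: seek straight to one scan instead of re-reading the file."""
    stream.seek(index[scan_num])
    end_tag = b"</scan>"
    chunk = bytearray()
    while True:
        block = stream.read(64)
        if not block:
            raise ValueError(f"scan {scan_num} is not terminated")
        chunk += block
        pos = chunk.find(end_tag)
        if pos != -1:
            return bytes(chunk[:pos + len(end_tag)])


index = build_scan_index(TOY_DOC)
scan_two = read_scan(io.BytesIO(TOY_DOC), index, 2)
```

The first pass costs one full read; every later lookup is a seek plus a short read, which is the non-sequential access the slide refers to.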

Page 12

Data analysis pipeline

raw to XML conversion: ReAdW, MassWolf (base64, sha1-sum, htonl, XML formatting)

mzXML instance doc

Integrity check: ValidateXML, mzXML schema

XML to ... conversion: XML2Other (dta, pkl, mg, txt)

RAP
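The base64 and htonl steps named on this slide can be illustrated with a small round trip. This is a sketch under assumptions: peaks are taken to be (m/z, intensity) pairs stored as 32-bit floats in network (big-endian) byte order, and `encode_peaks` / `decode_peaks` are hypothetical helper names, not functions from ReAdW or MassWolf.

```python
import base64
import struct


def encode_peaks(peaks):
    """Pack (m/z, intensity) pairs as big-endian float32, then base64-encode."""
    flat = [value for pair in peaks for value in pair]
    raw = struct.pack(f">{len(flat)}f", *flat)  # ">" = network byte order
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text):
    """Reverse of encode_peaks: base64-decode, then unpack float32 pairs."""
    raw = base64.b64decode(text)
    flat = struct.unpack(f">{len(raw) // 4}f", raw)
    return list(zip(flat[0::2], flat[1::2]))


# Values chosen to be exactly representable as float32.
peaks = [(100.5, 10.0), (200.25, 3.5)]
encoded = encode_peaks(peaks)
decoded = decode_peaks(encoded)
```

A fixed byte order keeps the binary payload machine independent, and base64 lets it sit inside a text-based XML document.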

Page 13

XML2Other

Export to external formats for programs that cannot be ported to read mzXML

mzXML instance doc -> XML2Other -> dta, pkl, mg, txt

Page 14

RAP

Random Access Parser

Library for directly extracting information from mzXML files

Available in 3 flavors: C, C++ and Java

Uses index for fast random data access

Multiple functions for fine tuned data retrieval

mzXML instance doc -> RAP -> Scan header, Peaks list, ...

Page 15

XMLicit ?

Integrity checks on the XML instance document

✗ XML Schema validation
✗ sha1-sum validation

Schema is hosted on SF -> easy way to test for data corruption from any computer connected to the Net (exchange of data with external collaborators)

mzXML instance doc -> ValidateXML <- mzXML schema
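The sha1-sum check can be sketched as follows. The slide does not say which bytes of the file the checksum covers or where it is stored, so this example simply compares the SHA-1 digest of a byte payload with a recorded value; `sha1_matches` is a hypothetical helper name.

```python
import hashlib


def sha1_matches(payload: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-1 digest of payload equals the recorded digest."""
    return hashlib.sha1(payload).hexdigest() == expected_hex.lower()


# The writer records the digest when the file is created ...
original = b"<mzXML>toy content</mzXML>"
recorded = hashlib.sha1(original).hexdigest()

# ... and the receiver recomputes it to detect corruption in transit.
intact = sha1_matches(original, recorded)
corrupted = sha1_matches(original.replace(b"toy", b"tay"), recorded)
```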

Page 16

mzXML viewer

Offers a common way to display data generated on different instruments, therefore facilitating the exchange of datasets

Page 17

mzXML viewer

Page 18

In conclusion

An instrument-independent file format for MS data allows us to:

✗ Easily integrate a new instrument in a pre-existing analysis framework
✗ Develop and distribute new software
✗ Exchange data with external collaborators
✗ Create public data repositories

http://sashimi.sourceforge.net/

Page 20

Assigning a Peptide to a Spectrum

Find a sequence of amino acids that can generate the list of masses seen in the tandem MS scan

Page 21

Assigning a Peptide to a Spectrum

Many different strategies:

✗ Searching MS/MS spectra against a sequence database
✗ De novo sequencing
✗ Combination of the two

Page 22

Uninterpreted MS/MS Database Search

Sequence Database

MS/MS Data: m/z – Int, precursor mass

Theoretical spectra for given precursor mass

✗ Assign scores to overlaps
✗ (Normalize scores)
✗ Keep best match
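The three steps on this slide (select candidates by precursor mass, score the overlap with theoretical spectra, keep the best match) can be sketched as below. Everything here is a simplification chosen for illustration: only singly charged b and y ions, monoisotopic residue masses for a handful of amino acids, and a plain shared-peak count as the score, which is a stand-in and not the scoring function of any particular search engine. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "D": 115.02694, "K": 128.09496,
    "E": 129.04259, "H": 137.05891, "Y": 163.06333,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def theoretical_ions(seq):
    """Singly charged b and y fragment m/z values for every cleavage site."""
    ions = []
    for i in range(1, len(seq)):
        b_ion = sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON
        y_ion = sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON
        ions.extend([b_ion, y_ion])
    return ions


def shared_peak_count(observed, theoretical, tol=0.5):
    """Score = number of theoretical ions with an observed peak within tol."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def search(observed, precursor_mass, database, precursor_tol=1.0):
    """Score candidates whose mass fits the precursor; keep the best match."""
    best = None
    for peptide in database:
        if abs(peptide_mass(peptide) - precursor_mass) > precursor_tol:
            continue
        score = shared_peak_count(observed, theoretical_ions(peptide))
        if best is None or score > best[1]:
            best = (peptide, score)
    return best


# Toy database: the true peptide, its reversed sequence (same mass), and an
# unrelated peptide that the precursor-mass filter rejects.
database = ["TPEVDDEALEK", "KELAEDDVEPT", "LGEYGH"]
observed = theoretical_ions("TPEVDDEALEK")  # idealized, noise-free spectrum
best_match = search(observed, peptide_mass("TPEVDDEALEK"), database)
```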

Page 23

Scoring Peptide Assignments

Multiple mature search engines are available, but ...

✗ They use different scoring algorithms
✗ Search outputs are not comparable
✗ Search outputs require expert validation

Page 25

Peptide Validation

Validate peptide assignments made during the database search step.

Validation method should be standardized and independent from the experimental and computational methods used

Page 26

Manual Validation

✗ Filtering by:
  ✗ Database search scores
  ✗ NTT (Number of Tryptic Termini)
  Problem: filtering criteria vary among researchers and error rates are unknown
✗ Manual verification of the assignments
  Problem: possible only on very small datasets

Peptide Identification Effort: Sample prep, Instrument time, Database search, Manual validation
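As an illustration of the NTT criterion listed on this slide, the sketch below counts how many ends of a peptide are consistent with tryptic cleavage (after K or R, or at a protein terminus). It ignores the usual exception for cleavage before proline, and `count_tryptic_termini` is a hypothetical helper name. The protein fragment is taken from the beta-lactoglobulin sequence shown later in the deck.

```python
def count_tryptic_termini(protein: str, peptide: str) -> int:
    """Return 0, 1 or 2: the number of peptide ends produced by tryptic cleavage."""
    start = protein.find(peptide)
    if start == -1:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminus is tryptic if the preceding residue is K/R or the protein starts here.
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminus is tryptic if the peptide ends in K/R or at the protein end.
    c_term_ok = end == len(protein) or peptide[-1] in "KR"
    return int(n_term_ok) + int(c_term_ok)


fragment = "LVRTPEVDDEALEKFDKALK"
fully_tryptic = count_tryptic_termini(fragment, "TPEVDDEALEK")
semi_tryptic = count_tryptic_termini(fragment, "PEVDDEALEK")
non_tryptic = count_tryptic_termini(fragment, "EVDDEALE")
```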

Page 27

PeptideProphet

Compute probabilities that peptides assigned to MS/MS spectra are correct

Spectrum      Peptide   Score   Probability
Spectrum 1    LGEYGH    4.5     1.0
Spectrum 2    FQSEEQ    3.4     0.97
Spectrum 3    FLYQE     1.3     0.01
...           ...       ...     ...
Spectrum N    EIQKKF    2.2     0.3

Learns distributions of search scores and peptide properties among correct and incorrect results

[Figure: score distribution of the entire dataset with the correct and incorrect components; p=0.5 marked]
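Once the two score distributions are known, the probability that an assignment is correct follows from Bayes' rule. The sketch below shows only that final step, with both distributions modeled as Gaussians with made-up parameters; the distribution shapes, the parameter values, and the function names are assumptions for illustration, and the learning of the distributions from the data is not shown.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(score, prior_correct, correct, incorrect):
    """Posterior P(correct | score) from the two class-conditional densities.

    correct and incorrect are (mean, sd) tuples of the assumed Gaussians.
    """
    pos = prior_correct * gaussian_pdf(score, *correct)
    neg = (1.0 - prior_correct) * gaussian_pdf(score, *incorrect)
    return pos / (pos + neg)


# Hypothetical model: correct scores centered at 4, incorrect at 1.
CORRECT = (4.0, 1.0)
INCORRECT = (1.0, 1.0)

p_high = probability_correct(4.5, 0.5, CORRECT, INCORRECT)
p_mid = probability_correct(2.5, 0.5, CORRECT, INCORRECT)
p_low = probability_correct(1.3, 0.5, CORRECT, INCORRECT)
```

With equal priors and equal spreads, the score halfway between the two means gets a probability of 0.5, which corresponds to the crossing point marked in the figure.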

Page 28

PeptideProphet

The computed probabilities are a true measure of the confidence

Sensitivity: fraction of all correct results passing filter

Error Rate: fraction of all results passing filter that are incorrect

Improved discrimination: more identifications (for the same error rate)

[Figure: sensitivity and error rate for the probability model and for SEQUEST thresholds (from literature), with the ideal spot marked]
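If the probabilities are accurate, the two quantities defined on this slide can be estimated directly from them for any probability cutoff: the expected number of correct results is the sum of the probabilities. The sketch below applies this to the four example probabilities from the previous slide; `filter_stats` is a hypothetical helper name.

```python
def filter_stats(probabilities, threshold):
    """Estimate (sensitivity, error rate) for results with p >= threshold."""
    passing = [p for p in probabilities if p >= threshold]
    expected_correct_total = sum(probabilities)
    expected_correct_passing = sum(passing)
    # Sensitivity: share of all (expected) correct results that pass the filter.
    sensitivity = (expected_correct_passing / expected_correct_total
                   if expected_correct_total else 0.0)
    # Error rate: share of passing results that are (expected to be) incorrect.
    error_rate = ((len(passing) - expected_correct_passing) / len(passing)
                  if passing else 0.0)
    return sensitivity, error_rate


example_probabilities = [1.0, 0.97, 0.01, 0.3]
sensitivity, error_rate = filter_stats(example_probabilities, 0.5)
```

Sweeping the threshold from 0 to 1 traces out the trade-off between sensitivity and error rate.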

Page 29

PeptideProphet

“Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search”

A. Keller, A. Nesvizhskii, E. Kolker, R. Aebersold
Anal. Chem. 2002, 74, 5383-5392

Page 31

>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).

MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

TPEVDDEALEK : p = 0.96
TPEVDDEALEKFDK : p = 0.96
KPTPEGDLEILLQK : p = 0.83
LSFNPTQLEEQCHI : p = 0.65
LSFNPTQLEEQCHI : p = 0.76

sp|P02754|LACB_BOVIN Prob = ???

Page 32

ProteinProphet

Combines probabilities of peptides assigned to MS/MS spectra to compute probability that corresponding proteins are present in the sample

Allows standardized publishing of large-scale datasets in the literature and meaningful comparison between results of different experiments
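A simple way to combine peptide probabilities is to ask for the probability that at least one assignment is correct: 1 minus the product of (1 - p) over the distinct peptides, taking the highest probability when a peptide was identified more than once. The sketch below applies this to the beta-lactoglobulin peptides from the previous slide. It is a simplified illustration only: it treats the peptides as independent and leaves out the sibling adjustment and the handling of shared peptides described on the next slide, so the number should not be read as the tool's actual output. `protein_probability` is a hypothetical helper name.

```python
from math import prod


def protein_probability(assignments):
    """P(protein present) = 1 - product of (1 - p) over distinct peptides."""
    best = {}
    for peptide, p in assignments:
        # Keep only the best-scoring identification of each distinct peptide.
        best[peptide] = max(p, best.get(peptide, 0.0))
    return 1.0 - prod(1.0 - p for p in best.values())


lacb_assignments = [
    ("TPEVDDEALEK", 0.96),
    ("TPEVDDEALEKFDK", 0.96),
    ("KPTPEGDLEILLQK", 0.83),
    ("LSFNPTQLEEQCHI", 0.65),
    ("LSFNPTQLEEQCHI", 0.76),
]
lacb_probability = protein_probability(lacb_assignments)
```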

Page 33

ProteinProphet

✗ Correctly assigned peptides tend to have siblings
✗ In case of redundancy use expectation maximization to find a minimal list of proteins

[Figure: Peptide 1, Peptide 2 and Peptide 3 mapped to Prot A and Prot B]

Page 34

ProteinProphet

“A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry”

A. Nesvizhskii, A. Keller, E. Kolker, R. Aebersold
Anal. Chem., in press

Page 36

Quantitation

Compute relative quantitation of isotopically labeled peptides

Derive protein ratios with a corresponding p value

Page 37

Quantitation

XPRESS and ASAPRatio

✗ Identify labeled peptides (MS/MS)
✗ Reconstruct single ion chromatograms for the 2 species (MS)
✗ Calculate peptide ratios and confidence intervals
✗ Interface with ProteinProphet and derive protein ratios with confidence intervals
✗ Error module determines significant ratio changes in large datasets

[Figure labels: light, heavy]
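The core of the peptide ratio step can be sketched as the ratio of the areas under the two single ion chromatograms. This is a bare-bones illustration using trapezoidal integration on made-up data; it omits smoothing, background subtraction, and confidence intervals, and is not the algorithm of either tool named on the slide. The function names are hypothetical.

```python
def trapezoid_area(times, intensities):
    """Area under a chromatogram trace by the trapezoidal rule."""
    return sum((t1 - t0) * (i0 + i1) / 2.0
               for t0, t1, i0, i1 in zip(times, times[1:],
                                         intensities, intensities[1:]))


def light_to_heavy_ratio(times, light, heavy):
    """Abundance ratio of the light and heavy forms of one peptide."""
    heavy_area = trapezoid_area(times, heavy)
    if heavy_area == 0:
        raise ValueError("heavy trace has zero area")
    return trapezoid_area(times, light) / heavy_area


# Made-up elution profiles sampled at the same retention times.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
light = [0.0, 2.0, 4.0, 2.0, 0.0]
heavy = [0.0, 1.0, 2.0, 1.0, 0.0]
ratio = light_to_heavy_ratio(times, light, heavy)
```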

Page 38

Quantitation

“Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry”

D. K. Han, J. Eng, H. Zhou, and R. Aebersold
Nature Biotechnology, 2001, 19, 946-951

“Automated Statistical Analysis of Protein Abundance Ratios from Data Generated by Stable-Isotope Dilution and Tandem Mass Spectrometry”

X. Li, H. Zhang, J. A. Ranish, R. Aebersold
Submitted

Page 40

Interpretation

Assign a biological meaning to the output of the pipeline

Page 41

SBEAMS

✗ SBEAMS (Systems Biology Experiment Analysis Management System) project is a framework for collecting, storing, and accessing data produced by a variety of different experiments; these experiments can be managed separately but then correlated later under the same framework.

✗ It combines a relational database, a web front end for querying the database and providing integrated access to remote data sources, and an API for further software development.

✗ The data products of the analysis pipeline are then ingested into the database

http://www.sbeams.org/

Page 42

Cytoscape

Visualize molecular interaction networks and integrate these interactions with gene expression profiles and other state data

● Filtering the network to select subsets of nodes and/or interactions. For instance, users may select nodes involved in a threshold number of interactions, or nodes that share a particular GO annotation.

http://www.cytoscape.org/
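The first filtering example on this slide, selecting nodes involved in a threshold number of interactions, amounts to a degree filter on the network. The sketch below shows the idea on a plain edge list in Python; it does not use Cytoscape's own interface, and the function name and the toy network are made up for the example.

```python
def nodes_with_min_interactions(edges, threshold):
    """Return the nodes that take part in at least `threshold` interactions."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return {node for node, count in degree.items() if count >= threshold}


# Toy interaction network: A interacts with B, C and D; B also with C.
interactions = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
hubs = nodes_with_min_interactions(interactions, 2)
```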

Page 43

Acknowledgments

Ruedi Aebersold
Jimmy Eng
Xiao-jun Li
Alexey Nesvizhskii
Andy Keller
Robert Hubley
Eric Deutsch
Paul Shannon
Ning Zhang
Erik Nillson
Brian Pratt