Page 1: Improving the Sensitivity of Peptide Identification

Improving the Sensitivity of Peptide Identification

by Meta-Search, Grid-Computing, and Machine-Learning

Nathan Edwards
Georgetown University Medical Center

Page 2: Improving the Sensitivity of Peptide Identification

Searching under the street-light…

Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!

Page 3: Improving the Sensitivity of Peptide Identification

Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Page 4: Improving the Sensitivity of Peptide Identification

Lost peptide identifications

Missing from the sequence database
  Build exhaustive peptide sequence databases
  Build evidence for unannotated proteins and protein isoforms

Search engine strengths, weaknesses, quirks
  Use multiple search engines and combine results

Poor score or statistical significance
  Use search-engine consensus to boost confidence
  Use machine-learning to distinguish true from false

Thorough search takes too long
  Harness the power of heterogeneous computational grids

Page 5: Improving the Sensitivity of Peptide Identification

Unannotated Splice Isoform

Human Jurkat leukemia cell-line
  Lipid-raft extraction protocol, targeting T cells
  von Haller, et al. MCP 2003.
  Peptide Atlas raftflow, raftapr, raftaug

LIME1 gene: LCK interacting transmembrane adaptor 1

LCK gene: Leukocyte-specific protein tyrosine kinase
  Proto-oncogene
  Chromosomal aberration involving LCK in leukemias.

Multiple significant peptide identifications

Page 6: Improving the Sensitivity of Peptide Identification

Unannotated Splice Isoform

Page 8: Improving the Sensitivity of Peptide Identification

Splice Isoform Anomaly

Human erythroleukemia K562 cell-line
  Depth of coverage study
  Resing et al. Anal. Chem. 2004.
  Peptide Atlas A8_IP

SULT1A2 gene: Sulfotransferase family, cytosolic, 1A
  2 ESTs, 1 mRNA
  mRNA from lung, small-cell carcinoma sample

Single (significant) peptide identification
  Five agreeing search engines
  PepArML FDR < 1%.
  All source engines have non-significant E-values

Page 9: Improving the Sensitivity of Peptide Identification

Splice Isoform Anomaly

Page 11: Improving the Sensitivity of Peptide Identification

Peptide Sequence Databases

All amino-acid sequences of at most 30 amino-acids from:
  IPI and all IPI constituent protein sequences
    IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
  SwissProt variants, conflicts, splices, and annotated signal peptide truncations.
  GenBank and RefSeq mRNA sequence
    3 frame translation
  GenBank EST and HTC sequences
    6 frame translation and found in at least 2 sequences

Grouped by Gene/UniGene cluster and compressed.
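The "at most 30 amino-acids" and "found in at least 2 sequences" rules can be illustrated with a minimal Python sketch. It is not the actual database build: the function name, the accession-to-frames input layout, and the use of `*` for stop codons are assumptions made for illustration, and a real build would use a compressed representation rather than an explicit set of substrings.

```python
from collections import defaultdict

MAX_LEN = 30      # "at most 30 amino-acids" (from the slide)
MIN_SUPPORT = 2   # "found in at least 2 sequences" (from the slide)


def supported_peptides(translations, max_len=MAX_LEN, min_support=MIN_SUPPORT):
    """Return amino-acid substrings of length <= max_len that occur in
    at least min_support distinct source sequences.

    translations: dict mapping a (hypothetical) accession to a list of
    frame translations, with '*' marking stop codons.
    """
    support = defaultdict(set)
    for accession, frames in translations.items():
        for frame in frames:
            # Split on stops so that no peptide spans a stop codon.
            for orf in frame.split("*"):
                for start in range(len(orf)):
                    stop = min(len(orf), start + max_len)
                    for end in range(start + 1, stop + 1):
                        support[orf[start:end]].add(accession)
    return {pep for pep, accs in support.items() if len(accs) >= min_support}
```

Called with a dictionary of EST translations, the function keeps only peptides corroborated by two or more ESTs; a peptide seen in a single EST, or one that would cross a stop codon, is dropped.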

Page 12: Improving the Sensitivity of Peptide Identification

Peptide Sequence Databases

Formatted as a FASTA sequence database
  Easy integration with search engines.
  One entry per gene/cluster.

Automated rebuild every few months.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490

Page 13: Improving the Sensitivity of Peptide Identification

Peptide evidence, in context

Statistically significant identified peptides can be misleading…
  Isobaric amino-acid/PTM substitutions
  Unsubstantiated peptide termini
  Few b-ions or y-ions suggest “random” mass match
  Single amino-acids on upstream or downstream exons
  Peptides in 5’ UTR with no upstream Met

Need tools to quickly check the corroborating (genomic, transcript, SNP) evidence
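To make the "isobaric amino-acid substitutions" point concrete, here is a small Python sketch using standard monoisotopic residue masses. The function names and the tolerance values are illustrative assumptions; only a handful of residues are included, and PTM masses are left out.

```python
# Standard monoisotopic residue masses in Da (subset, for illustration).
MONO = {
    "G": 57.02146,
    "A": 71.03711,
    "L": 113.08406,
    "I": 113.08406,
    "N": 114.04293,
    "Q": 128.05858,
    "K": 128.09496,
}


def residue_mass(seq):
    """Sum of residue masses for a short amino-acid string."""
    return sum(MONO[aa] for aa in seq)


def indistinguishable(a, b, tol_da):
    """True if the two residue strings differ by at most tol_da in mass."""
    return abs(residue_mass(a) - residue_mass(b)) <= tol_da
```

Under these masses, L and I are identical, N matches GG, and Q matches AG, so a mass match alone cannot tell them apart. Q and K differ by about 0.036 Da, which a low-resolution fragment tolerance such as 0.5 Da would not resolve, while a 0.01 Da tolerance would.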

Page 14: Improving the Sensitivity of Peptide Identification

PeptideMapper Web Service

Counts:
  by gene and evidence
  EST, mRNA, Protein

Sequences:
  accessions by gene
  UniProt variants
  nucleotide sequence & link to BLAT alignment

Genomic Loci:
  one-click projection onto the UCSC genome browser
  peptides with cSNPs too!

Page 15: Improving the Sensitivity of Peptide Identification

PeptideMapper Web Service

I’m Feeling Lucky

Page 17: Improving the Sensitivity of Peptide Identification

Combining search engine results – harder than it looks!

Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!

How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?

We apply unsupervised machine-learning...
  Lots of related work unified in a single framework.
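The "consensus vs disagreement vs abstention" distinction can be made explicit per spectrum. The following Python sketch simply tallies the three cases from each engine's top peptide; it is an illustration, not PepArML's implementation, and the function name, input layout, and output keys are assumptions. Counts like these could plausibly serve as input features for a classifier.

```python
from collections import Counter


def vote_summary(engine_calls):
    """Summarize how the engines voted on one spectrum.

    engine_calls: dict mapping engine name -> top peptide string,
    or None when the engine reported nothing for this spectrum.
    """
    votes = Counter(p for p in engine_calls.values() if p is not None)
    abstain = sum(1 for p in engine_calls.values() if p is None)
    if not votes:
        return {"top": None, "agree": 0, "disagree": 0, "abstain": abstain}
    top, agree = votes.most_common(1)[0]
    # Engines that reported a peptide other than the most popular one.
    disagree = sum(votes.values()) - agree
    return {"top": top, "agree": agree, "disagree": disagree, "abstain": abstain}
```

For example, if three engines report PEPTIDEK, one reports PEPTLDEK, and one reports nothing, the summary has agree = 3, disagree = 1, and abstain = 1.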

Page 18: Improving the Sensitivity of Peptide Identification

PepArML – Peptide identification Arbiter by Machine-Learning

Page 19: Improving the Sensitivity of Peptide Identification

Peptide Atlas A8_IP LTQ Dataset

Page 20: Improving the Sensitivity of Peptide Identification

Peptide Atlas Halobacterium Dataset

Page 21: Improving the Sensitivity of Peptide Identification

Running many search engines

Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases.
  Tracking spectrum identifiers
  Extracting peptide identifications, especially modifications and protein identifiers

Page 22: Improving the Sensitivity of Peptide Identification

Peptide Identification Meta-Search Parameters

Instrument
  Precursor Tolerance
  Fragment Tolerance
  Max. Charge

Sequence Database
  Target and # of Decoys

Modification
  Fixed/Variable
  Amino-Acids
  Position
  Delta

Proteolytic Agent
  Motif

Peptide Candidates
  Termini Specificity
  Precursor Tolerance
  Missed cleavages
  Charge State Handling
  # 13C Peaks

Search Engines
  Mascot, X!Tandem, K-Score, OMSSA, MyriMatch
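One way to picture a single, engine-independent parameter set is a small record type. The Python sketch below mirrors some of the parameter groups listed on the slide; every class name, field name, and unit is a hypothetical choice for illustration and does not reflect the actual meta-search interface.

```python
from dataclasses import dataclass, field


@dataclass
class Modification:
    name: str
    fixed: bool          # fixed (True) or variable (False)
    amino_acids: str     # residues the modification applies to
    position: str        # e.g. "any" (hypothetical vocabulary)
    delta: float         # mass shift in Da


@dataclass
class MetaSearchRequest:
    precursor_tolerance: float   # units are an assumption (Da here)
    fragment_tolerance: float
    max_charge: int
    target_database: str
    num_decoys: int
    proteolytic_agent: str
    missed_cleavages: int
    modifications: list = field(default_factory=list)
    engines: tuple = ("Mascot", "X!Tandem", "K-Score", "OMSSA", "MyriMatch")

    def search_count(self):
        """Database-by-engine combinations: (target + decoys) x engines.

        Ignores any further splitting, e.g. by charge state.
        """
        return (1 + self.num_decoys) * len(self.engines)
```

With a target database, 2 decoys, and the five engines named on the slide, `search_count()` gives 15 database-by-engine combinations, each of which would otherwise need its own engine-specific parameter file.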

Page 23: Improving the Sensitivity of Peptide Identification

Peptide Identification Meta-Search

Simple unified search interface for:
  Mascot, X!Tandem, K-Score, OMSSA, MyriMatch

Automatic decoy searches

Automatic spectrum file "chunking"

Automatic scheduling
  Serial, Multi-Processor, Cluster, Grid
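Spectrum file "chunking" and job scheduling can be sketched as follows: split the spectra into fixed-size chunks, then create one independent job per chunk, engine, and database so that jobs can run anywhere. This is a minimal Python illustration under assumed names (`chunk_spectra`, `make_jobs`, the job dictionary keys); the real tool's chunk sizes and job format are not given in the slides.

```python
def chunk_spectra(spectra, chunk_size):
    """Split a list of spectra into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [spectra[i:i + chunk_size] for i in range(0, len(spectra), chunk_size)]


def make_jobs(spectra, chunk_size, engines, databases):
    """Build one independent search job per (chunk, engine, database)."""
    jobs = []
    for chunk_id, chunk in enumerate(chunk_spectra(spectra, chunk_size)):
        for engine in engines:
            for database in databases:
                jobs.append({
                    "chunk": chunk_id,
                    "n_spectra": len(chunk),
                    "engine": engine,
                    "database": database,
                })
    return jobs
```

Because each job carries its own chunk, engine, and database, a scheduler can hand jobs to a single processor, a cluster, or a grid without the jobs needing to know about one another.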

Page 24: Improving the Sensitivity of Peptide Identification

PepArML Meta-Search Engine

Single, simple search request

NSF TeraGrid
  1000+ CPUs

UMIACS
  250+ CPUs

Edwards Lab
  Scheduler & 48+ CPUs

Secure communication

Heterogeneous compute resources

Scales easily to 250+ simultaneous searches

Search engines (diagram labels):
  X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
  X!Tandem, KScore, OMSSA, MyriMatch.
  X!Tandem, KScore, OMSSA.

Page 26: Improving the Sensitivity of Peptide Identification

PepArML Meta-Search Engine

Simple search request

NSF TeraGrid
  1000+ CPUs

UMIACS
  250+ CPUs

Edwards Lab
  Scheduler & 48+ CPUs

Secure communication

Heterogeneous compute resources

Page 28: Improving the Sensitivity of Peptide Identification

Peptide Atlas A8_IP LTQ Dataset

Tryptic search of Human ESTs using PepSeqDB

107084 spectra (145 files) searched ~26 times:
  Target + 2 decoys, 5 engines, 1+ vs 2+/3+ charge

8685 search jobs
  25.7 days of CPU time.
  5211 TeraGrid TKO jobs < 2 hours
  Using 143 different machines

Total elapsed time < 26 hours
  Bottleneck: Mascot license (1 core, 4 CPUs)
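A quick back-of-the-envelope check of these figures, written as a short Python sketch. The inputs are the numbers quoted on the slide; the variable names are illustrative, and the per-job average assumes CPU time is spread evenly across jobs, which the slide does not claim.

```python
# Figures quoted on the slide.
cpu_days = 25.7
elapsed_hours_max = 26      # "Total elapsed time < 26 hours"
search_jobs = 8685
spectra = 107084
searches_per_spectrum = 26  # "searched ~26 times"

cpu_hours = cpu_days * 24

# Elapsed time is below 26 hours, so this is a lower bound on the speedup.
min_speedup = cpu_hours / elapsed_hours_max

# Average CPU time per job, assuming an even spread (an assumption).
mean_job_minutes = cpu_hours * 60 / search_jobs

# Approximate number of individual spectrum searches performed.
spectrum_searches = spectra * searches_per_spectrum

print(f"CPU hours: {cpu_hours:.1f}")
print(f"Speedup over one CPU: at least {min_speedup:.1f}x")
print(f"Mean CPU time per job: {mean_job_minutes:.1f} minutes")
print(f"Spectrum searches: about {spectrum_searches:,}")
```

The grid therefore delivered roughly a 24-fold or better reduction in wall-clock time relative to a single CPU, with jobs averaging a little over 4 CPU-minutes each.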

Page 29: Improving the Sensitivity of Peptide Identification

PepArML Meta-Search Engine

Access to high-performance computing resources for the proteomics community
  NSF TeraGrid Community Portal
  University/Institute HPC clusters
  Individual lab compute resources
  Contribute cycles to the community and get access to others’ cycles in return.

Centralized scheduler
  Compute capacity can still be exclusive, or prioritized.
  Compute client plays well with HPC grid schedulers.

Page 30: Improving the Sensitivity of Peptide Identification

Conclusions

Improve sensitivity of peptide identification, using
  Exhaustive peptide sequence databases
  Machine-learning for combining search engine results
  Meta-search tools to maximize consensus
  Grid-computing for thorough search

Tools & cycles available to the community...

http://edwardslab.bmcb.georgetown.edu

Page 31: Improving the Sensitivity of Peptide Identification

Acknowledgements

Dr. Catherine Fenselau
  University of Maryland Biochemistry

Dr. Rado Goldman
  Georgetown University Medical Center

Dr. Chau-Wen Tseng & Dr. Xue Wu
  University of Maryland Computer Science

Funding: NIH/NCI