![Page 1: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/1.jpg)
Meta-Search and Result Combining
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
![Page 2: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/2.jpg)
Peptide Identifications
Search engines provide an answer for every spectrum... Can we figure out which ones to believe?
Why is this hard?
- Hard to determine “good” scores
- Significance estimates are unreliable
- Need more ids from weak spectra
- Each search engine has its strengths... and weaknesses
- Search engines give different answers
![Page 3: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/3.jpg)
Mascot Search Results
![Page 4: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/4.jpg)
Translation start-site correction
- Halobacterium sp. NRC-1
  - Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  - Goo, et al. MCP 2003.
- GdhA1 gene: Glutamate dehydrogenase A1
  - Multiple significant peptide identifications
  - Observed start is consistent with Glimmer 3.0 prediction(s)
![Page 5: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/5.jpg)
Halobacterium sp. NRC-1 ORF: GdhA1
- K-score E-value vs PepArML @ 10% FDR
- Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: peptide identifications plotted along the GdhA1 ORF, axis from 0 to 440]
![Page 6: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/6.jpg)
Translation start-site correction
![Page 7: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/7.jpg)
Search engine scores are inconsistent!
[Figure: plot with axes Mascot and Tandem]
![Page 8: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/8.jpg)
Common Algorithmic Framework – Different Results
- Pre-process experimental spectra
  - Charge state, cleaning, binning
- Filter peptide candidates
  - Decide which PSMs to evaluate
- Score peptide-spectrum match
  - Fragmentation modeling, dot product
- Rank peptides per spectrum
  - Retain statistics per spectrum
- Estimate E-values
  - Apply empirical or theoretical model
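
The scoring and ranking steps of this framework can be illustrated with a small sketch. It assumes spectra are lists of (m/z, intensity) peaks and that each candidate peptide already has a predicted fragment spectrum. The function names, the 1 Da bin width, and the cosine normalization are illustrative assumptions, not the scoring function of any particular search engine.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def normalized_dot_product(observed, theoretical):
    """Cosine similarity between two binned spectra (0 = no shared bins)."""
    shared = set(observed) & set(theoretical)
    dot = sum(observed[b] * theoretical[b] for b in shared)
    norm_o = math.sqrt(sum(v * v for v in observed.values()))
    norm_t = math.sqrt(sum(v * v for v in theoretical.values()))
    if norm_o == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_o * norm_t)


def rank_candidates(peaks, candidates, bin_width=1.0):
    """Score each candidate against one spectrum; return (score, peptide), best first."""
    observed = bin_spectrum(peaks, bin_width)
    scored = [
        (normalized_dot_product(observed, bin_spectrum(fragments, bin_width)), peptide)
        for peptide, fragments in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical spectrum and two hypothetical candidates.
spectrum = [(100.2, 10.0), (200.7, 5.0)]
candidates = {
    "A": [(100.4, 1.0), (200.1, 0.5)],  # same bins, same relative intensities
    "B": [(300.0, 1.0)],                # no bins in common
}
ranked = rank_candidates(spectrum, candidates)
```

Changing the bin width or the normalization changes the ranking, which is one reason engines that share this framework can still return different answers.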
![Page 9: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/9.jpg)
Comparison of search engines
No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
[Figure: Venn diagram of OMSSA, X!Tandem, and Mascot results; region percentages 4%, 10%, 2%, 5%, 9%, 69%, 2%]
![Page 10: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/10.jpg)
Simple approaches (Union)
- Different search engines confidently identify different spectra:
  - Due to search space, spectral processing, scoring, significance estimation
- Filter each search engine's results and union
  - Union of results must be more complete
  - But how to estimate significance for the union?
  - What if the results for same spectra disagree?
- Need to compensate for reduced specificity
  - How much?
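
A toy example shows why the union needs its own significance estimate. The sketch below assumes each engine reports PSMs with a score where higher is better, and that decoy hits are flagged; the data, thresholds, and function names are invented for illustration.

```python
def filter_psms(psms, threshold):
    """Keep PSMs at or above an engine-specific score threshold (higher = better here)."""
    return {
        (p["spectrum"], p["peptide"], p["is_decoy"])
        for p in psms
        if p["score"] >= threshold
    }


def decoy_fdr(psm_set):
    """Estimate FDR as (# decoy PSMs) / (# target PSMs) in a filtered set."""
    decoys = sum(1 for _, _, is_decoy in psm_set if is_decoy)
    targets = len(psm_set) - decoys
    return decoys / targets if targets else 0.0


def psm(spectrum, peptide, score, is_decoy=False):
    """Build one hypothetical PSM record."""
    return {"spectrum": spectrum, "peptide": peptide, "score": score, "is_decoy": is_decoy}


# Two hypothetical engines with incomparable score scales.
engine_a = [
    psm("s1", "PEPA", 50), psm("s2", "PEPB", 45), psm("s3", "PEPC", 41),
    psm("s4", "PEPD", 33), psm("s5", "DECA", 31, True), psm("s6", "PEPF", 12),
]
engine_b = [
    psm("s1", "PEPA", 4.1), psm("s2", "PEPB", 3.8), psm("s3", "PEPC", 3.5),
    psm("s6", "PEPF", 2.9), psm("s7", "DECB", 2.6, True), psm("s4", "PEPD", 0.7),
]

kept_a = filter_psms(engine_a, threshold=30)
kept_b = filter_psms(engine_b, threshold=2.5)
union = kept_a | kept_b

print(decoy_fdr(kept_a), decoy_fdr(kept_b), decoy_fdr(union))
```

In this toy data each engine alone passes 4 targets and 1 decoy, but the union holds 5 targets and 2 decoys: the true hits overlap while the false hits do not, so the union is more complete and less specific than either input.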
![Page 11: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/11.jpg)
Union of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 12: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/12.jpg)
Union of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 13: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/13.jpg)
Union of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 14: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/14.jpg)
Simple approaches (Intersection)
- Different search engines agree on many spectra
  - Agreement is unexpected given differences
- Filter each search engine's results and take the intersection
  - Intersection of results must be more significant
  - But how to estimate significance for the intersection?
  - What about the borderline spectra?
- Need to compensate for reduced sensitivity
  - How and how much?
![Page 15: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/15.jpg)
Intersection of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 16: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/16.jpg)
Intersection of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 17: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/17.jpg)
Intersection of filtered peptide ids
[Figure: plot with axes Mascot and Tandem]
![Page 18: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/18.jpg)
Combine / Merge Results
- Threshold peptide-spectrum matches from each of two search engines
  - PSMs agree → boost specificity
  - PSMs from one → boost sensitivity
  - PSMs disagree → ?????
- Sometimes agreement is "lost" due to threshold...
  - How much should agreement increase our confidence?
- Scores easy to "understand"
  - Difficult to establish statistical significance
- How to generalize to more engines?
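
The three cases on this slide can be made concrete with a short sketch. It assumes each engine's thresholded output has been reduced to a mapping from spectrum identifier to peptide; the function name and the bucket names are hypothetical.

```python
def categorize(engine_a, engine_b):
    """Split spectra into agree / single / disagree buckets.

    engine_a, engine_b: dict of spectrum id -> peptide, already thresholded.
    """
    result = {"agree": {}, "single": {}, "disagree": {}}
    for spectrum in sorted(set(engine_a) | set(engine_b)):
        pep_a = engine_a.get(spectrum)
        pep_b = engine_b.get(spectrum)
        if pep_a is not None and pep_b is not None:
            if pep_a == pep_b:
                result["agree"][spectrum] = pep_a              # both engines, same peptide
            else:
                result["disagree"][spectrum] = (pep_a, pep_b)  # both engines, different peptides
        else:
            # Exactly one engine passed its threshold for this spectrum.
            result["single"][spectrum] = pep_a if pep_a is not None else pep_b
    return result


# Hypothetical thresholded results.
mascot = {"s1": "PEPA", "s2": "PEPB", "s3": "PEPC"}
tandem = {"s1": "PEPA", "s3": "PEPX", "s4": "PEPD"}
buckets = categorize(mascot, tandem)
```

Note that a spectrum lands in "single" whenever the other engine's match fell below threshold, even if that sub-threshold match was the same peptide. That is the "lost" agreement the slide refers to.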
![Page 19: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/19.jpg)
Consensus and Multi-Search
- Multiple witnesses increase confidence
  - As long as they are independent
  - Example: Getting the story straight
- Independent "random" hits unlikely to agree
  - Agreement is indication of biased sampling
  - Example: loaded dice
- Meta-search is relatively easy
  - Merging and re-ranking is hard
  - Example: Booking a flight to Boston!
- Scores and E-values are not comparable
  - How to choose the best answer?
  - Example: Best E-value favors Tandem!
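
The claim that independent "random" hits are unlikely to agree can be checked with simple arithmetic. The sketch below assumes an idealized model in which each engine, when wrong, picks uniformly and independently among the candidate peptides for a spectrum. Real engines are neither uniform nor independent, so this is only an upper bound on how surprising agreement is.

```python
from fractions import Fraction


def chance_agreement(num_candidates, num_engines):
    """Probability that independent, uniform random picks all land on the same candidate."""
    # The first engine can pick anything; each further engine must match it.
    return Fraction(1, num_candidates) ** (num_engines - 1)


two_engines = chance_agreement(1000, 2)    # 1/1000
three_engines = chance_agreement(1000, 3)  # 1/1000000
```

Under this model, agreement on a wrong peptide is rare, so observed agreement points to something other than chance: either the peptide is correct, or the engines share a bias.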
![Page 20: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/20.jpg)
Search for Consensus
- Running many search engines is hard!
- Identifications must have every opportunity to agree:
  - No failed searches, matched search parameters, sequence databases, spectra
- But the search engines all use:
  - Varying spectral file formats, different parameter specifications for mass tolerance, modifications, pre-processing for sequence databases, different charge-state handling, termini rules
- Decoy searches must also use identical parameters
![Page 21: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/21.jpg)
Searching for Consensus
- Initial methionine loss as tryptic peptide?
- Missing charge state handling?
- X!Tandem's refinement mode
- Pyro-Gln, Pyro-Glu modifications?
- Precursor mass tolerance (Da vs ppm)
- Semi-tryptic only (no fully-tryptic mode).
![Page 22: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/22.jpg)
Configuring for Consensus
- Search engine configuration can be difficult:
  - Correct spectral format
  - Search parameter files and command-line
  - Pre-processed sequence databases.
- Must strive to ensure that each search engine is presented with the same search criteria, despite different formats, syntax, and quirks.
- Search engine configuration must be automated.
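
One way to automate configuration is to hold a single engine-neutral specification and render it per engine. The sketch below is an assumption about how that might look: both "engines" and all of their parameter names are hypothetical and do not correspond to the real configuration syntax of any search engine. The ppm-to-Da conversion itself is standard (tolerance in Da = mass × ppm / 10^6), but converting at one fixed reference mass is only an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSpec:
    """Engine-neutral search criteria (hypothetical field names)."""
    precursor_tol_ppm: float
    fixed_mods: tuple
    missed_cleavages: int


def ppm_to_da(ppm, mass):
    """Convert a relative tolerance in ppm to an absolute tolerance in Da at a given mass."""
    return mass * ppm / 1e6


def render_ppm_engine(spec):
    """Settings for a hypothetical engine that accepts ppm tolerances directly."""
    return {
        "precursor_tolerance": spec.precursor_tol_ppm,
        "tolerance_unit": "ppm",
        "fixed_modifications": list(spec.fixed_mods),
        "max_missed_cleavages": spec.missed_cleavages,
    }


def render_da_engine(spec, reference_mass=2000.0):
    """Settings for a hypothetical engine that only accepts Da tolerances."""
    return {
        # Approximation: exact only for precursors at the reference mass.
        "parent_mass_error_da": ppm_to_da(spec.precursor_tol_ppm, reference_mass),
        "static_mods": ",".join(spec.fixed_mods),
        "missed": spec.missed_cleavages,
    }


spec = SearchSpec(precursor_tol_ppm=10.0, fixed_mods=("Carbamidomethyl (C)",), missed_cleavages=1)
ppm_settings = render_ppm_engine(spec)
da_settings = render_da_engine(spec)
```

Keeping the specification in one place means target and decoy searches, and every engine, are derived from the same criteria rather than from hand-edited parameter files.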
![Page 23: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/23.jpg)
Results Extraction for Consensus
- Must be able to unambiguously extract peptide identifications from results
  - Spectrum identifiers / scan numbers
  - Modification identifiers
  - Protein accessions
- How should we handle E-values vs. probabilities vs. FDR (partitioned)?
  - Cannot rely on these to be comparable
  - Must use consistent, external significance calibration
![Page 24: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/24.jpg)
Search Engine Independent FDR Estimation
- Comparing search engines is difficult due to different FDR estimation techniques
  - Implicit assumption: Spectra scores can be thresholded
- Competitive vs Global
  - Competitive controls some spectral variation
- Reversed vs Shuffled Decoy Sequence
  - Reversed models target redundancy accurately
- Charge-state partition or Unified
  - Mitigates effect of peptide length dependent scores
  - What about peptide property partitions?
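
The "Competitive vs Global" choice can be illustrated with a small decoy-based estimate. The sketch assumes every spectrum has one best target score and one best decoy score, with higher scores better. Reading "global" as thresholding the two score lists independently is an interpretation of the slide, and the scores are invented.

```python
def global_fdr(target_scores, decoy_scores, threshold):
    """Threshold target and decoy scores independently, then take decoys / targets."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def competitive_fdr(target_scores, decoy_scores, threshold):
    """Per spectrum, only the better of target and decoy is counted."""
    targets = decoys = 0
    for target, decoy in zip(target_scores, decoy_scores):
        if max(target, decoy) < threshold:
            continue  # neither match passes
        if target > decoy:
            targets += 1
        else:
            decoys += 1  # ties go to the decoy, the conservative choice
    return decoys / targets if targets else 0.0


# Hypothetical best target and best decoy score for six spectra.
target_scores = [50, 42, 38, 31, 25, 12]
decoy_scores = [10, 15, 36, 33, 28, 11]

fdr_global = global_fdr(target_scores, decoy_scores, threshold=30)
fdr_competitive = competitive_fdr(target_scores, decoy_scores, threshold=30)
```

On this toy data the global estimate is 2/4 and the competitive estimate is 1/3. The same search results give different FDR values depending only on the estimation procedure, which is why a single, engine-independent procedure is needed before engines can be compared.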
![Page 25: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/25.jpg)
Search Execution for Consensus
- Running many search engines takes time
  - 7 x 3 searches of the same spectra!
- Some search engines require licenses or specific operating systems
- How to use grid/cloud computing effectively?
  - Cannot assume a shared file-system
  - Search engines may crash or be preempted
  - Machine may "disappear"
  - Machine may consistently fail searches
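
The failure modes listed here suggest a scheduler that requeues failed searches and stops using machines that keep failing. The following is a minimal, single-process sketch of that idea, not the scheduler used by any tool named in this deck; `run_job`, the worker names, and the limits are all hypothetical.

```python
from collections import deque


def run_all(jobs, workers, run_job, max_worker_failures=2, max_attempts=5):
    """Run every job, requeueing failures and retiring workers that keep failing.

    run_job(worker, job) returns a result or raises an exception.
    """
    queue = deque((job, 0) for job in jobs)
    failures = {worker: 0 for worker in workers}  # consecutive failures per worker
    active = deque(workers)
    results = {}
    while queue:
        if not active:
            raise RuntimeError("no healthy workers left")
        job, attempts = queue.popleft()
        worker = active[0]
        active.rotate(-1)  # round-robin over the remaining workers
        try:
            results[job] = run_job(worker, job)
            failures[worker] = 0
        except Exception:
            failures[worker] += 1
            if failures[worker] >= max_worker_failures:
                active.remove(worker)  # machine consistently fails searches
            if attempts + 1 >= max_attempts:
                raise RuntimeError(f"job {job!r} failed {max_attempts} times")
            queue.append((job, attempts + 1))  # try again later, likely elsewhere
    return results


def flaky_run(worker, job):
    """Stand-in for launching a search; the worker named 'bad' always fails."""
    if worker == "bad":
        raise OSError("search crashed")
    return f"{job}-ok"


results = run_all(["a", "b", "c", "d"], ["good", "bad"], flaky_run)
```

A real deployment would also need to ship spectra and databases to each machine, since a shared file-system cannot be assumed.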
![Page 26: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/26.jpg)
Combining Multi-Search Results
- Treat search engines as black-boxes
  - Generate PSMs + scores, features
- Apply machine learning / statistical modeling to results
  - Use multiple match metrics
- Combine/refine using multiple search engines
  - Agreement suggests correctness
![Page 27: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/27.jpg)
Machine Learning / Statistical Modeling
- Use of multiple metrics of PSM quality:
  - Precursor delta, trypsin digest features, etc.
- Often requires "training" with examples
  - Different examples will change the result
  - Generalization is always the question
- Scores can be hard to "understand"
  - Difficult to establish statistical significance
- e.g. PeptideProphet/iProphet
  - Weighted linear combination of features
  - Number of sibling searches
![Page 28: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/28.jpg)
Available Tools
- PeptideProphet/iProphet
  - Part of the Trans-Proteomic Pipeline suite
- Scaffold
  - Commercial reimplementation of PP/iP
- PepArML
  - Publicly available from the Edwards lab
- Lots of in-house stuff…
  - Result combining mentioned in talks, lots of papers, etc. but no public tools
![Page 29: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/29.jpg)
[Figure: for each spectrum, get the Mascot identification, the SEQUEST identification, and the X!Tandem identification; candidate peptides are labeled Peptide 1 to Peptide 8, with example probabilities p=76%, p=81%, and p=56% feeding an agreement score]

Agreement score

$$\text{agreement score}_i = \sum_{j,\; j \ne i} p(+ \mid D_j)$$

$$p(+ \mid D, NSP) = \frac{p(+ \mid D)\, p(NSP \mid +)}{p(+ \mid D)\, p(NSP \mid +) + p(- \mid D)\, p(NSP \mid -)}$$

Using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made.

Brian Searle
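
The probability update on this slide has the form of Bayes' rule: an engine's probability p(+|D) is adjusted by how likely the extra evidence (here NSP) is under the "correct" and "incorrect" hypotheses. A minimal numeric sketch follows, using the slide's p=56% as the starting probability. The two likelihood values are invented for illustration; in practice they would have to be estimated from data.

```python
def adjusted_probability(p_correct, lik_if_correct, lik_if_incorrect):
    """Bayes update of p(+|D) given evidence with likelihoods p(E|+) and p(E|-)."""
    numerator = p_correct * lik_if_correct
    denominator = numerator + (1.0 - p_correct) * lik_if_incorrect
    return numerator / denominator if denominator else 0.0


# Hypothetical likelihoods: the evidence is 4x more likely for correct ids.
boosted = adjusted_probability(0.56, lik_if_correct=0.8, lik_if_incorrect=0.2)

# Evidence that is equally likely either way leaves the probability unchanged.
unchanged = adjusted_probability(0.56, lik_if_correct=0.5, lik_if_incorrect=0.5)
```

With these assumed likelihoods the 56% identification rises to roughly 84%, which is the sense in which agreement can rescue a borderline PSM.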
![Page 30: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/30.jpg)
PepArML Strategy
- Meta-Search for Multi-Search:
  - Automatic configuration of searches
  - Automatic preprocessing of sequence databases
  - Automatic spectral reformatting
  - Automatic execution of search on local or remote computing resources (AWS/grid/NFS).
- Result Combining:
  - Decoy-based FDR significance estimation
  - Unsupervised, model-free, machine-learning
![Page 31: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/31.jpg)
Peptide Identification Meta-Search
- Simple unified search interface for:
  - Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT+MSSGF
- Automatic decoy searches
- Automatic spectrum file "chunking"
- Automatic scheduling
  - Serial, Multi-Processor, Cluster, Grid, Cloud
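
Spectrum file "chunking" can be sketched for one common input format. The example assumes spectra arrive as Mascot Generic Format (MGF) text, in which each spectrum is delimited by `BEGIN IONS` and `END IONS` lines; the slides do not say which format is chunked or how, so the function below is a hypothetical illustration.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of lines, each holding at most spectra_per_chunk complete spectra."""
    chunk, count = [], 0
    for line in lines:
        chunk.append(line)
        if line.strip() == "END IONS":
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    # Emit a final partial chunk, ignoring trailing blank lines.
    if any(line.strip() for line in chunk):
        yield chunk


def fake_spectrum(index):
    """Build the lines of one tiny, made-up MGF spectrum."""
    return [
        "BEGIN IONS",
        f"TITLE=scan {index}",
        "PEPMASS=500.25",
        "100.1 10.0",
        "END IONS",
    ]


mgf_lines = [line for i in range(5) for line in fake_spectrum(i)]
chunks = list(chunk_mgf(mgf_lines, spectra_per_chunk=2))
```

Splitting on spectrum boundaries keeps every chunk a valid file on its own, so chunks can be searched in parallel on separate machines and the results merged afterwards.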
![Page 32: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/32.jpg)
Grid-Enabled Peptide Identification Meta-Search
- Amazon Web Services
- University Cluster
- Edwards Lab Scheduler & 80+ CPUs
- Secure communication
- Heterogeneous compute resources
- Single, simple search request
- Scales easily to 250+ simultaneous searches
![Page 33: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/33.jpg)
PepArML Combiner
Peptide identification arbiter by machine learning
Unifies these ideas within a model-free, combining machine learning framework
Unsupervised training procedure
![Page 34: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/34.jpg)
PepArML Overview
[Figure: workflow diagram with labels X!Tandem, Mascot, OMSSA, Other, Feature extraction, PepArML]
![Page 35: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/35.jpg)
Dataset Construction
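[Figure: dataset with one row per (spectrum, peptide) pair, feature columns from X!Tandem, Mascot, and OMSSA, and a T/F label]

- (S_1, P_1): T
- (S_1, P_2): F
- (S_2, P_1): T
- ……
- (S_n, P_m): T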
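
The dataset on this slide has one row per (spectrum, peptide) pair, with features from each engine and a true/false label. A minimal sketch of assembling such rows follows. It assumes each engine's output is reduced to a mapping from (spectrum, peptide) to a single score and that the correct peptide per spectrum is known, as it would be for a standard protein mix. All names and values are hypothetical, and a real feature vector would hold more than one number per engine.

```python
def build_dataset(engine_results, truth=None):
    """Assemble one row per (spectrum, peptide) pair reported by any engine.

    engine_results: dict engine -> dict (spectrum, peptide) -> score
    truth: optional dict spectrum -> correct peptide
    Returns (engine_names, rows); each row is (key, features, label).
    """
    engines = sorted(engine_results)
    keys = sorted(set().union(*(r.keys() for r in engine_results.values())))
    rows = []
    for key in keys:
        # None marks an engine that did not report this pair.
        features = [engine_results[e].get(key) for e in engines]
        label = None if truth is None else (truth.get(key[0]) == key[1])
        rows.append((key, features, label))
    return engines, rows


# Hypothetical results from two engines with different score types.
engine_results = {
    "mascot": {("s1", "PEPA"): 45.0, ("s2", "PEPB"): 30.0},
    "tandem": {("s1", "PEPA"): 1e-5, ("s1", "PEPX"): 0.2},
}
truth = {"s1": "PEPA", "s2": "PEPB"}
engines, rows = build_dataset(engine_results, truth)
```

Missing values are kept explicit because "this engine did not report the peptide" is itself informative to a downstream classifier.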
![Page 36: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/36.jpg)
Voting Heuristic Combiner
- Choose PSM with most votes
- Break ties using FDR
  - Select PSM with min. FDR of tied votes
- How to apply this to a decoy database?
- Lots of possibilities – all imperfect
  - Now using: 100*#votes – min. decoy hits
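
The voting rule can be written down in a few lines. The sketch below assumes that, for one spectrum, each engine contributes one (peptide, engine, FDR) triple. The `heuristic_score` function transcribes the slide's "100*#votes – min. decoy hits"; treating "min. decoy hits" as the smallest decoy-hit count among the agreeing engines is an assumed reading, since the slide does not define it.

```python
def vote(candidates):
    """Pick the peptide with the most votes; break ties by the lowest FDR.

    candidates: list of (peptide, engine, fdr) for a single spectrum.
    Returns (peptide, votes, min_fdr).
    """
    tally = {}
    for peptide, _engine, fdr in candidates:
        votes, best_fdr = tally.get(peptide, (0, float("inf")))
        tally[peptide] = (votes + 1, min(best_fdr, fdr))
    # Most votes first, then lowest FDR, then peptide string for determinism.
    winner = min(tally, key=lambda p: (-tally[p][0], tally[p][1], p))
    return (winner, *tally[winner])


def heuristic_score(votes, min_decoy_hits):
    """Ranking score from the slide: votes dominate, decoy hits break ties."""
    return 100 * votes - min_decoy_hits


# Hypothetical results for one spectrum.
majority = vote([
    ("PEPA", "mascot", 0.05),
    ("PEPA", "tandem", 0.02),
    ("PEPB", "omssa", 0.01),
])
tie_break = vote([
    ("PEPA", "mascot", 0.05),
    ("PEPB", "omssa", 0.01),
])
```

In the first case PEPA wins on votes even though PEPB has the better FDR; in the second, the votes are tied and the lower FDR decides. The factor of 100 in the score ensures that one extra vote outweighs any decoy-hit difference smaller than 100.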
![Page 37: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/37.jpg)
Supervised Learning
![Page 38: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/38.jpg)
Search Engine Info. Gain
![Page 39: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/39.jpg)
Precursor & Digest Info. Gain
![Page 40: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/40.jpg)
Retention Time & Proteotypic Peptide Properties Info. Gain
![Page 41: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/41.jpg)
Application to Real Data
- How well do these models generalize?
- Different instruments
  - Spectral characteristics change scores
- Search parameters
  - Different parameters change score values
- Supervised learning requires
  - (Synthetic) experimental data from every instrument
  - Search results from available search engines
  - Training/models for all parameters x search engine sets x instruments
![Page 42: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/42.jpg)
Model Generalization
![Page 43: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/43.jpg)
Unsupervised Learning
![Page 44: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/44.jpg)
Unsupervised Learning Performance
![Page 45: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/45.jpg)
Unsupervised Learning Convergence
![Page 46: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/46.jpg)
PepArML Performance
[Figure: panels for LCQ, QSTAR, and LTQ-FT]
Standard Protein Mix Database
18 Standard Proteins – Mix1
![Page 47: Meta-Search and Result Combining](https://reader036.vdocument.in/reader036/viewer/2022081603/56814d0d550346895dba488a/html5/thumbnails/47.jpg)
Conclusions
- Combining search results from multiple engines can be very powerful
  - Boost both sensitivity and specificity
- Running multiple search engines is hard
- Statistical significance is hard
  - Use empirical FDR estimates...but be careful...lots of subtleties
- Consensus is powerful, but fragile
  - Search engine quirks can destroy it
  - "Witnesses" are not independent