improving the reliability of peptide identification by tandem mass spectrometry nathan edwards...
TRANSCRIPT
Improving the Reliability of
Peptide Identification by
Tandem Mass Spectrometry
Improving the Reliability of
Peptide Identification by
Tandem Mass SpectrometryNathan EdwardsDepartment of Biochemistry and Molecular & Cellular BiologyGeorgetown University Medical Center
2
Mass Spectrometry for Proteomics
• Measure mass of many (bio)molecules simultaneously• High bandwidth
• Mass is an intrinsic property of all (bio)molecules• No prior knowledge required
3
Mass Spectrometer
Ionizer
Sample
+_
Mass Analyzer Detector
• MALDI• Electro-Spray
Ionization (ESI)
• Time-Of-Flight (TOF)• Quadrapole• Ion-Trap
• ElectronMultiplier(EM)
4
High Bandwidth
100
0250 500 750 1000
m/z
% I
nte
nsit
y
5
Mass is fundamental!
6
Mass Spectrometry for Proteomics
• Measure mass of many molecules simultaneously• ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules• ...but need a reference to compare to
7
Mass Spectrometry for Proteomics
• Mass spectrometry has been around since the turn of the century...• ...why is MS based Proteomics so new?
• Ionization methods• MALDI, Electrospray
• Protein chemistry & automation• Chromatography, Gels, Computers
• Protein sequence databases• A reference for comparison
8
Sample Preparation for Peptide Identification
Enzymatic Digestand
Fractionation
9
Single Stage MS
MS
m/z
10
Tandem Mass Spectrometry(MS/MS)
Precursor selection
m/z
m/z
11
Tandem Mass Spectrometry(MS/MS)
Precursor selection + collision induced dissociation
(CID)
MS/MS
m/z
m/z
12
The big picture...
• MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
• Key concepts:• Spectrum acquisition is unbiased• Direct observation of amino-acid sequence• Sensitive to minor sequence variation• Observed peptides represent folded proteins
13
Peptide Identification
• For each (likely) peptide sequence1. Compute fragment masses2. Compare with spectrum3. Retain those that match well
• Peptide sequences from protein sequence databases• Swiss-Prot, IPI, NCBI’s nr, ...
• Automated, high-throughput peptide identification in complex mixtures
14
Peptide Identification, but...
• What about novel peptides?• Search compressed ESTs (C3, PepSeqDB)
• What about peak intensity?• Spectral matching using HMMs (HMMatch)
• Which identifications are correct?• Unsupervised, model-free, result combiner
with false discovery rate estimation
15
Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
16
What goes missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
17
Why should we care?
• Alternative splicing is the norm!• Only 20-25K human genes• Each gene makes many proteins
• Proteins have clinical implications• Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription• Genomic assays, ESTs, mRNA sequence.• Little hard evidence for translation start site
18
Novel Splice Isoform
• Human Jurkat leukemia cell-line• Lipid-raft extraction protocol, targeting T cells• von Haller, et al. MCP 2003.
• LIME1 gene:• LCK interacting transmembrane adaptor 1
• LCK gene:• Leukocyte-specific protein tyrosine kinase• Proto-oncogene• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
19
Novel Splice Isoform
20
Novel Splice Isoform
21
Novel Mutation
• HUPO Plasma Proteome Project• Pooled samples from 10 male & 10 female
healthy Chinese subjects• Plasma/EDTA sample protocol• Li, et al. Proteomics 2005. (Lab 29)
• TTR gene• Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis.• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
22
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
23
Novel Mutation
24
Searching ESTs
• Proposed long ago:• Yates, Eng, and McCormack; Anal Chem, ’95.
• Now:• Protein sequences are sufficient for protein identification• Computationally expensive/infeasible• Difficult to interpret
• Make EST searching feasible for routine searching to discover novel peptides.
25
Searching Expressed Sequence Tags (ESTs)
Pros• No introns!• Primary splicing
evidence for annotation pipelines
• Evidence for dbSNP• Often derived from
clinical cancer samples
Cons• No frame• Large (8Gb)• “Untrusted” by
annotation pipelines• Highly redundant• Nucleotide error
rate ~ 1%
26
Compressed EST Peptide Sequence Database
• For all ESTs mapped to a UniGene gene:• Six-frame translation• Eliminate ORFs < 30 amino-acids• Eliminate amino-acid 30-mers observed once• Compress to C2 FASTA database
• Complete, Correct for amino-acid 30-mers
• Gene-centric peptide sequence database:• Size: < 3% of naïve enumeration, 20774 FASTA entries• Running time: ~ 1% of naïve enumeration search• E-values: ~ 2% of naïve enumeration search results
27
PepSeq FASTA Databases
• Organisms:• HUMAN, MOUSE, RAT, ZEBRA FISH
• Peptide Evidence:• Genbank mRNA, EST, HTC• RefSeq mRNA, Proteins• Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI
Proteins• Swiss-Prot variants• Swiss-Prot signal peptide & init. Met removal
• Singe FASTA entry per Gene
28
Spectral Matching for Peptide Identification
• Detection vs. identification• Increased sensitivity & specificity• No novel peptides!
• NIST GC/MS Spectral Library• Identifies small molecules, • 100,000’s of (consensus) spectra• Bundled/Sold with many instruments• “Dot-product” spectral comparison• Current project: Peptide MS/MS
29
NIST MS Search: Peptides
30
Peptide DLATVYVDVLK
31
Protein Families
32
Protein Families
33
Peptide DLATVYVDVLK
34
Hidden Markov Models for Spectral Matching
• Capture statistical variation and consensus in peak intensity• Only need 10 spectra to build a model
• Capture semantics of peaks• Extrapolate model to other peptides
• Good specificity with superior sensitivity for peptide detection• Assign 1000’s of additional spectra (p-value < 10-5)
35
Hidden Markov Model
Ion
Delete
Insert
(m/z,int) pair emitted by ion & insert states
36
The devil in the details
• Intensity normalization
• Discretize (m/z,int) pairs
• Viterbi distance as score
• Compute p-value using “random” spectra
37
Random Spectra
• Uniform sample of (m/z,int)• Permutation (m/z) of true spectra peaks• M/z distribution between true spectra and
uniform sample (parameter)
RandomTrue False
Viterbi Score
# of
spe
ctra
38
HMM Peptide Identification Results – DLATV
DLAT (viterbi)
0
20
40
60
80
100
120
140
160
180
200
220
240
0-10
20-3
0
40-5
0
60-7
0
80-9
0
100-
110
120-
130
140-
150
160-
170
180-
190
200-
210
220-
230
240-
250
260-
270
280-
290
Viterbi Distance
# o
f s
pe
ctr
a
True_test(0.0001) True_test(other) False_test(0.0001) False_test(other)
DLAT (-logP)
0
10
20
30
40
50
60
70
80
90
100
0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 9-10
10-11
11-12
12-13
13-14
14-15
15-16
16-17
17-18
18-19
inf
-log(p-value)
# o
f s
pe
ctr
a
True_test(0.0001) True_test(other) False_test(0.0001) False_test(other)
39
Spectral Matching of Peptide Variants
DFLAGGVAAAISK
DFLAGGIAAAISK
40
HMM model extrapolation
41
Mascot Search Results
42
Peptide Identification Results
• Search engines always provide an answer
• Current search engines:• Hard to determine “good” scores• Significance estimates are unreliable
• Need better methods!
43
Common Algorithmic Framework
• Pre-process experimental spectra
• Filter peptide candidates
• Score match between peptides and spectra
• Rank peptides and assign
44
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
4%
OMSSA10%
2%
5%9%
69%
2%
X!Tandem
Mascot
45
Lots of published solutions!
• Treat search engines as black-boxes
• Apply supervised machine learning to results• Use multiple match metrics
• Combine/refine using multiple search engines• Agreement suggests correctness
• Use empirical significance estimates• “Decoy” databases (FDR)
46
PepArML
• Peptide identification arbiter by machine learning
• Unifies these ideas within a model-free, combining machine learning framework
• Unsupervised training procedure
47
PepArML Overview
• Unify Tandem, Mascot, and OMSSA results
X!Tandem
Mascot
OMSSA
Other
PepArML
Identified
Unidentified
48
Voting Heuristic Combiner
• Choose peptide ID with the most votes• Use best FDR as confidence
• Break ties (single votes) using FDR
• Strawman for comparison
49
Dataset construction
Machine Learningx
Spectra compare
Matched Ions
Peak_intensity
Mass delta
# of missed cleavages
Peptide length
Tandem Score
Mascot Score
OMSSA Score
Extract Features
X!Tandem
Mascot
OMSSA
Other
Search Tools
50
Dataset construction
• Build feature vectors
T),( 11 PS
F),( 21 PS
T),( 12 PS
Tandem Mascot OMSSA
T),( mn PS
……
51
Dataset construction
• Synthetic protein mixtures provide ground truth
• C8 • 8 standard proteins (Calibrant Biosystems)• 4594 MS/MS spectra (LTQ)• 618 (11.2%) true positives
• S17• 17 standard proteins (Sashimi Repository)• 1389 MS/MS spectra (Q-TOF)• 354 (25.4%) true positives
• AURUM• 364 standard proteins (AURUM 1.0)• 7508 MS/MS spectra (MALDI-TOF-TOF)• 3775 (50.3%) true positives
52
Machine learning improves single search engines (S17)
53
Multiple search engines are better than single search engines (S17)
54
Feature Evaluation
55
Application to Real Data
• How well do these models generalize?
• Different instruments• Spectral characteristics change scores
• Search parameters• Different parameters change score values
• Supervised learning requires• (Synthetic) experimental data from every instrument• Search results from available search engines• Training/models for all
parameters x search engine sets x instruments
56
Model Generalization
57
Rescuing Machine Learning
• Train a new machine-learning model for every dataset!• Generalization not required• No predetermined search engines, parameters,
instruments, features
• Perhaps we can “guess” the true proteins• Most proteins not in doubt• Machine learning can tolerate imperfect labels
58
Unsupervised Learning
• Heuristic selection of “true” proteins• Train classifier, predict true peptide IDs
• Update “true” proteins• Heuristic selection of “true” proteins from
classifier predictions
• Iterate until convergence
59
Unsupervised Learning Performance
60
Unsupervised Learning Convergence
61
Conclusions
• Proteomics can inform genome annotation• Eukaryotic and prokaryotic • Functional vs silencing variants
• Peptides identify more than just proteins• Untapped source of disease biomarkers
• Computational inference can make a substantial impact in proteomics
62
Conclusions
• Compressed peptide sequence databases make routine EST searching feasible
• HMMatch spectral matching improves identification performance for familiar peptides
• Unsupervised, model-free, combining PepArML framework solves peptide identification interpretation problem
63
Acknowledgements
• Chau-Wen Tseng, Xue Wu• UMCP Computer Science
• Catherine Fenselau• UMCP Biochemistry
• Cheng Lee• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS