Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

3

Mass Spectrometer

Sample → Ionizer → Mass Analyzer → Detector

• Ionizer
  • MALDI
  • Electrospray Ionization (ESI)

• Mass Analyzer
  • Time-of-Flight (TOF)
  • Quadrupole
  • Ion Trap

• Detector
  • Electron Multiplier (EM)

4

High Bandwidth

[Figure: mass spectrum; x-axis m/z (0 to 1000), y-axis % intensity (0 to 100)]

5

Mass is fundamental!

6


• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

7


• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based Proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

8

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

9

Single Stage MS

[Figure: MS spectrum; x-axis m/z]

10

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: two spectra, each with x-axis m/z]

11

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

[Figure: MS/MS; two spectra, each with x-axis m/z]

12

The big picture...

• MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.

• Key concepts:
  • Spectrum acquisition is unbiased
  • Direct observation of amino-acid sequence
  • Sensitive to minor sequence variation
  • Observed peptides represent folded proteins

13

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
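The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, a fixed m/z tolerance, and a plain count of matched fragments as the score. The function names (`fragment_mz`, `count_matches`, `best_matches`) and the `tol` and `min_matched` parameters are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions + y_ions


def count_matches(peptide, peaks_mz, tol=0.5):
    """Step 2: number of theoretical fragments within tol of an observed peak."""
    return sum(
        any(abs(mz - obs) <= tol for obs in peaks_mz)
        for mz in fragment_mz(peptide)
    )


def best_matches(candidates, peaks_mz, tol=0.5, min_matched=3):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks_mz, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matched), reverse=True)
```

Real search engines also model peak intensity, multiple charge states, and modifications; the count of matched fragments here only stands in for those scores.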

14

Peptide Identification, but...

• What about novel peptides?
  • Search compressed ESTs (C3, PepSeqDB)

• What about peak intensity?
  • Spectral matching using HMMs (HMMatch)

• Which identifications are correct?
  • Unsupervised, model-free, result combiner with false discovery rate estimation

15

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

16

What goes missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

17

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

18

Novel Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

19


http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

20


http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr20:61839670-61839828&hgsid=68112320&est=pack

21

Novel Mutation

• HUPO Plasma Proteome Project
  • Pooled samples from 10 male & 10 female healthy Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)

• TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
    • late-onset, dominant inheritance

22

Novel Mutation

Ala2→Pro associated with familial amyloid polyneuropathy

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300002887.18.xml&uid=202568&label=AAAKEPZA&homolog=AAAKEPZA&id=1838.1.1&proex=-1

23

Novel Mutation

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr18:27426944-27426971&hgsid=68063647


24

Searching ESTs

• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.

• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret

• Make EST searching feasible for routine searching to discover novel peptides.

25

Searching Expressed Sequence Tags (ESTs)

Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples

Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%

26

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
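The first three construction steps (six-frame translation, dropping short ORFs, dropping 30-mers seen only once) can be sketched as follows. This is an illustrative sketch, not the actual pipeline: it assumes the standard genetic code, treats an ORF as any stop-free stretch of a translated frame, and counts a 30-mer once per EST. The final compression to a C2 FASTA database is not shown, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_orfs(dna, min_len=30):
    """Stop-free stretches of at least min_len residues from all six frames."""
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = "".join(
                CODON_TABLE.get(strand[i:i + 3], "X")  # X for ambiguous codons
                for i in range(frame, len(strand) - 2, 3)
            )
            orfs.extend(seg for seg in protein.split("*") if len(seg) >= min_len)
    return orfs


def repeated_kmers(orfs_per_est, k=30):
    """Amino-acid k-mers observed in at least two ESTs (assumed counting rule)."""
    counts = Counter()
    for orfs in orfs_per_est:
        seen = {orf[i:i + k] for orf in orfs for i in range(len(orf) - k + 1)}
        counts.update(seen)  # each EST contributes a k-mer at most once
    return {kmer for kmer, n in counts.items() if n >= 2}
```

For example, `six_frame_orfs(est)` would be called once per EST of a gene and the resulting lists passed to `repeated_kmers`; the surviving 30-mers are what the compressed database must represent.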

27

PepSeq FASTA Databases

• Organisms:
  • HUMAN, MOUSE, RAT, ZEBRA FISH

• Peptide Evidence:
  • Genbank mRNA, EST, HTC
  • RefSeq mRNA, Proteins
  • Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI Proteins
  • Swiss-Prot variants
  • Swiss-Prot signal peptide & init. Met removal

• Single FASTA entry per Gene

28

Spectral Matching for Peptide Identification

• Detection vs. identification
  • Increased sensitivity & specificity
  • No novel peptides!

• NIST GC/MS Spectral Library
  • Identifies small molecules
  • 100,000’s of (consensus) spectra
  • Bundled/Sold with many instruments
  • “Dot-product” spectral comparison
  • Current project: Peptide MS/MS
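A "dot-product" comparison between a query spectrum and a library spectrum can be sketched as below. This is a plain normalized dot product (cosine similarity) over fixed-width m/z bins, chosen for illustration; it is not the exact weighting used by the NIST search software, and the function names and `bin_width` parameter are assumptions.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two binned spectra, between 0 and 1."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Because the score depends on relative peak intensities, it can separate spectra that share fragment masses but differ in fragmentation pattern, which is the extra information spectral matching exploits.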

29

NIST MS Search: Peptides

30

Peptide DLATVYVDVLK

31

Protein Families

32

Protein Families

33

Peptide DLATVYVDVLK

34

Hidden Markov Models for Spectral Matching

• Capture statistical variation and consensus in peak intensity
  • Only need 10 spectra to build a model

• Capture semantics of peaks
  • Extrapolate model to other peptides

• Good specificity with superior sensitivity for peptide detection
  • Assign 1000’s of additional spectra (p-value < 10^-5)

35

Hidden Markov Model

[Figure: HMM with Ion, Delete, and Insert states]

(m/z,int) pair emitted by ion & insert states

36

The devil in the details

• Intensity normalization

• Discretize (m/z,int) pairs

• Viterbi distance as score

• Compute p-value using “random” spectra
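The non-HMM details above can be sketched in Python. This is an assumption-laden illustration: intensities are normalized to the base peak, (m/z,int) pairs are discretized into unit m/z bins and a handful of intensity levels, and the p-value is an empirical one computed from the scores of "random" spectra. The bin sizes, the number of levels, and the convention that a smaller Viterbi distance is better are all assumptions; the Viterbi scoring itself is not shown.

```python
def normalize(peaks):
    """Scale intensities so that the most intense peak is 1.0."""
    top = max(intensity for _, intensity in peaks)
    return [(mz, intensity / top) for mz, intensity in peaks]


def discretize(peaks, mz_bin=1.0, levels=4):
    """Map each (m/z, intensity) pair to a (m/z bin, intensity level) symbol."""
    symbols = []
    for mz, intensity in normalize(peaks):
        level = min(int(intensity * levels), levels - 1)
        symbols.append((int(mz // mz_bin), level))
    return symbols


def empirical_p_value(score, random_scores, lower_is_better=True):
    """Fraction of random-spectrum scores at least as good as the observed one."""
    if lower_is_better:
        as_good = sum(1 for s in random_scores if s <= score)
    else:
        as_good = sum(1 for s in random_scores if s >= score)
    # Add-one correction keeps the estimate strictly above zero.
    return (as_good + 1) / (len(random_scores) + 1)
```

With this estimator, resolving p-values near 10^-5 needs on the order of 10^5 random spectra, or a fitted distribution for the tail.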

37

Random Spectra

• Uniform sample of (m/z,int)
• Permutation (m/z) of true spectra peaks
• M/z distribution between true spectra and uniform sample (parameter)

[Figure: histograms for Random, True, and False spectra; x-axis Viterbi Score, y-axis # of spectra]
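One way to read the three options above is as a single generator with a mixing parameter: each peak keeps its intensity, and its m/z is taken either from a permutation of the true spectrum's m/z values or from a uniform draw. The sketch below follows that reading; the interpretation, the `mix` parameter, and the m/z range are assumptions, not a description of the actual procedure.

```python
import random


def random_spectrum(true_peaks, mix=0.5, mz_range=(0.0, 2000.0), rng=None):
    """Build a 'random' spectrum from a true one.

    mix=1.0 gives a pure permutation of the true m/z values;
    mix=0.0 gives m/z values drawn uniformly from mz_range.
    """
    rng = rng or random.Random()
    mzs = [mz for mz, _ in true_peaks]
    rng.shuffle(mzs)  # permute observed m/z values across the peaks
    decoy = []
    for permuted_mz, (_, intensity) in zip(mzs, true_peaks):
        if rng.random() < mix:
            mz = permuted_mz
        else:
            mz = rng.uniform(*mz_range)
        decoy.append((mz, intensity))
    return sorted(decoy)
```

Scoring many such spectra against the model gives the null score distribution from which a p-value can be estimated.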

38

HMM Peptide Identification Results – DLATV

[Figure, panel "DLAT (viterbi)": histogram; x-axis Viterbi Distance (bins 0-10 through 280-290), y-axis # of spectra (0 to 240); series: True_test(0.0001), True_test(other), False_test(0.0001), False_test(other)]

[Figure, panel "DLAT (-logP)": histogram; x-axis -log(p-value) (bins 0-1 through 18-19, plus inf), y-axis # of spectra (0 to 100); series: True_test(0.0001), True_test(other), False_test(0.0001), False_test(other)]

39

Spectral Matching of Peptide Variants

DFLAGGVAAAISK

DFLAGGIAAAISK

40

HMM model extrapolation

41

Mascot Search Results

42

Peptide Identification Results

• Search engines always provide an answer

• Current search engines:
  • Hard to determine “good” scores
  • Significance estimates are unreliable

• Need better methods!

43

Common Algorithmic Framework

• Pre-process experimental spectra

• Filter peptide candidates

• Score match between peptides and spectra

• Rank peptides and assign
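The candidate-filtering step of this framework is commonly a precursor mass filter: only peptides whose mass agrees with the selected precursor are scored. The sketch below assumes that form of filter; the function name, the `(peptide, neutral_mass)` input format, and the 2 Da default tolerance are hypothetical choices for illustration.

```python
PROTON = 1.00728  # mass of a proton in Da


def filter_candidates(precursor_mz, charge, candidates, tol=2.0):
    """Keep peptides whose neutral mass is within tol Da of the precursor.

    candidates is an iterable of (peptide, neutral_mass) pairs.
    """
    neutral = (precursor_mz - PROTON) * charge
    return [pep for pep, mass in candidates if abs(mass - neutral) <= tol]
```

Only the peptides that survive this filter are passed on to the more expensive spectrum scoring and ranking steps.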

44

Comparison of search engines

• No single score is comprehensive

• Search engines disagree

• Many spectra lack confident peptide assignment

[Figure: Venn diagram comparing OMSSA, X!Tandem, and Mascot; region labels: 4%, 10%, 2%, 5%, 9%, 69%, 2%]

45

Lots of published solutions!

• Treat search engines as black-boxes

• Apply supervised machine learning to results
  • Use multiple match metrics

• Combine/refine using multiple search engines
  • Agreement suggests correctness

• Use empirical significance estimates
  • “Decoy” databases (FDR)
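A decoy-database false discovery rate estimate can be sketched as follows. It assumes the spectra were searched against a target database and a decoy database of similar size, that higher scores are better, and that the number of decoy hits above a threshold approximates the number of false target hits. The function name and input format are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimated FDR among target matches scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs, one per spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Sweeping the threshold over all observed scores gives a score-to-FDR mapping that puts the results of different search engines on a common scale.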

46

PepArML

• Peptide identification arbiter by machine learning

• Unifies these ideas within a model-free, combining machine learning framework

• Unsupervised training procedure

47

PepArML Overview

• Unify Tandem, Mascot, and OMSSA results

[Figure: X!Tandem, Mascot, OMSSA, Other → PepArML → Identified / Unidentified]

48

Voting Heuristic Combiner

• Choose peptide ID with the most votes
  • Use best FDR as confidence

• Break ties (single votes) using FDR

• Strawman for comparison
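The voting heuristic can be sketched for a single spectrum as below. The input format (one `(peptide, fdr)` pair per search engine) and the function name are assumptions; a lower FDR is treated as the better confidence.

```python
from collections import defaultdict


def vote(ids):
    """Combine per-engine identifications for one spectrum.

    ids maps engine name -> (peptide, fdr). Returns (peptide, fdr) for the
    peptide with the most votes, using the lowest FDR to break ties.
    """
    by_peptide = defaultdict(list)
    for peptide, fdr in ids.values():
        by_peptide[peptide].append(fdr)
    # Most votes first; among equal vote counts, the best (lowest) FDR wins.
    peptide, fdrs = min(
        by_peptide.items(), key=lambda kv: (-len(kv[1]), min(kv[1]))
    )
    return peptide, min(fdrs)
```

When every engine reports a different peptide, each has a single vote and the choice falls back entirely on the FDR, which is the tie-break case named above.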

49

Dataset construction

[Figure: dataset construction pipeline: Search Tools (X!Tandem, Mascot, OMSSA, Other) → Extract Features → Machine Learning. Labels listed in the diagram: Spectra compare, Matched Ions, Peak_intensity, Mass delta, # of missed cleavages, Peptide length, Tandem Score, Mascot Score, OMSSA Score]

50


• Build feature vectors

Feature columns: Tandem, Mascot, OMSSA

$(S_1, P_1) \rightarrow T$
$(S_1, P_2) \rightarrow F$
$(S_2, P_1) \rightarrow T$
...
$(S_n, P_m) \rightarrow T$
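Building one feature vector per (spectrum, peptide) pair from several search engines can be sketched as below. The input format, the engine names used as keys, and the use of a fixed `missing` value when an engine did not report a pair are all assumptions for illustration; the true/false label of each pair is attached separately.

```python
def build_feature_vectors(results, engines=("tandem", "mascot", "omssa"), missing=0.0):
    """Merge per-engine scores into one vector per (spectrum, peptide) pair.

    results maps engine name -> {(spectrum_id, peptide): score}.
    """
    pairs = sorted({pair for eng in engines for pair in results.get(eng, {})})
    return {
        pair: [results.get(eng, {}).get(pair, missing) for eng in engines]
        for pair in pairs
    }
```

Other per-match features from the dataset construction slide, such as mass delta or peptide length, would be appended to each vector in the same way.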

51


• Synthetic protein mixtures provide ground truth

• C8
  • 8 standard proteins (Calibrant Biosystems)
  • 4594 MS/MS spectra (LTQ)
  • 618 (11.2%) true positives

• S17
  • 17 standard proteins (Sashimi Repository)
  • 1389 MS/MS spectra (Q-TOF)
  • 354 (25.4%) true positives

• AURUM
  • 364 standard proteins (AURUM 1.0)
  • 7508 MS/MS spectra (MALDI-TOF-TOF)
  • 3775 (50.3%) true positives

52

Machine learning improves single search engines (S17)

53

Multiple search engines are better than single search engines (S17)

54

Feature Evaluation

55

Application to Real Data

• How well do these models generalize?

• Different instruments
  • Spectral characteristics change scores

• Search parameters
  • Different parameters change score values

• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments

56

Model Generalization

57

Rescuing Machine Learning

• Train a new machine-learning model for every dataset!
  • Generalization not required
  • No predetermined search engines, parameters, instruments, features

• Perhaps we can “guess” the true proteins
  • Most proteins not in doubt
  • Machine learning can tolerate imperfect labels

58

Unsupervised Learning

• Heuristic selection of “true” proteins
  • Train classifier, predict true peptide IDs

• Update “true” proteins
  • Heuristic selection of “true” proteins from classifier predictions

• Iterate until convergence
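The iteration above can be sketched with stand-ins for both the heuristic and the classifier. Here a protein counts as "true" when at least two distinct accepted peptides map to it, and the "classifier" is a single score cutoff placed midway between the mean scores of the two label groups. Both choices, the input format, and all names are assumptions made to keep the sketch self-contained; they are not PepArML's actual heuristic or learner.

```python
def select_proteins(psms, accepted, min_peptides=2):
    """Heuristic: a protein is 'true' if enough distinct accepted peptides hit it."""
    peptides = {}
    for psm, ok in zip(psms, accepted):
        if ok:
            peptides.setdefault(psm["protein"], set()).add(psm["peptide"])
    return {prot for prot, peps in peptides.items() if len(peps) >= min_peptides}


def train_threshold(psms, labels):
    """Stand-in classifier: cutoff midway between the two groups' mean scores."""
    pos = [p["score"] for p, label in zip(psms, labels) if label]
    neg = [p["score"] for p, label in zip(psms, labels) if not label]
    if not pos or not neg:
        return min(p["score"] for p in psms)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def iterate(psms, initial_cutoff, max_rounds=20):
    """Alternate between labeling from 'true' proteins and re-predicting IDs."""
    accepted = [p["score"] >= initial_cutoff for p in psms]
    proteins = select_proteins(psms, accepted)
    for _ in range(max_rounds):
        labels = [p["protein"] in proteins for p in psms]  # imperfect labels
        cutoff = train_threshold(psms, labels)
        accepted = [p["score"] >= cutoff for p in psms]
        updated = select_proteins(psms, accepted)
        if updated == proteins:  # converged: protein set no longer changes
            break
        proteins = updated
    return proteins, accepted
```

Labels derived from the protein set are noisy (a low-scoring match to a true protein is still labeled true), which is the imperfection the training step is expected to tolerate.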

59

Unsupervised Learning Performance

60

Unsupervised Learning Convergence

61

Conclusions

• Proteomics can inform genome annotation
  • Eukaryotic and prokaryotic
  • Functional vs silencing variants

• Peptides identify more than just proteins
  • Untapped source of disease biomarkers

• Computational inference can make a substantial impact in proteomics

62

Conclusions

• Compressed peptide sequence databases make routine EST searching feasible

• HMMatch spectral matching improves identification performance for familiar peptides

• Unsupervised, model-free, combining PepArML framework solves peptide identification interpretation problem

63

Acknowledgements

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Catherine Fenselau
  • UMCP Biochemistry

• Cheng Lee
  • Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: NIH/NCI, USDA/ARS
