Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
2
Why Tandem Mass Spectrometry?
LC-MS/MS spectra provide evidence for the amino-acid sequence and abundance of functional proteins.
Key concepts:
• Spectrum acquisition is unbiased by knowledge
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations
• Spectrum acquisition is biased by abundance
3
Sample Preparation for MS/MS
Enzymatic Digest and Fractionation
4
Single Stage MS
MS
5
Tandem Mass Spectrometry (MS/MS)
Precursor selection
6
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision-induced dissociation (CID)
MS/MS
7
Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion   fragments          ion   MW
88     b1    S | GFLEEDELK      y9    1080
145    b2    SG | FLEEDELK      y8    1022
292    b3    SGF | LEEDELK      y7    875
405    b4    SGFL | EEDELK      y6    762
534    b5    SGFLE | EDELK      y5    633
663    b6    SGFLEE | DELK      y4    504
778    b7    SGFLEED | ELK      y3    389
907    b8    SGFLEEDE | LK      y2    260
1020   b9    SGFLEEDEL | K      y1    147
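The b- and y-ion masses listed for SGFLEEDELK can be recomputed from residue masses. The following is a minimal sketch, assuming monoisotopic residue masses and singly protonated fragments. The function name `fragment_ions` and the mass table are illustrative and do not come from the slides. Under these assumptions the computed values agree with the slide's nominal masses to within 1 Da.

```python
# Monoisotopic residue masses (Da), only for the residues in SGFLEEDELK.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # charge carrier for a singly protonated fragment
WATER = 18.01056   # every y-ion retains the C-terminal H2O


def fragment_ions(peptide):
    """Return {ion_name: m/z} for singly charged b- and y-ions."""
    ions = {}
    for i in range(1, len(peptide)):
        prefix = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        suffix = sum(RESIDUE_MASS[aa] for aa in peptide[i:])
        ions[f"b{i}"] = prefix + PROTON
        ions[f"y{len(peptide) - i}"] = suffix + WATER + PROTON
    return ions


ions = fragment_ions("SGFLEEDELK")
for name in ("b1", "b4", "y1", "y6"):
    print(name, round(ions[name], 2))
```

Each cleavage position yields one b-ion (N-terminal prefix) and one complementary y-ion (C-terminal suffix), which is why b1 pairs with y9, b2 with y8, and so on.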
8
Unannotated Splice Isoform
• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
9
Unannotated Splice Isoform
10
Unannotated Splice Isoform
11
Translation start-site correction
• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
• Observed start is consistent with Glimmer 3.0 prediction(s)
12
Halobacterium sp. NRC-1 ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
[Figure: axis ticks from 0 to 440 in steps of 40]
13
Lost peptide identifications
Missing from the sequence database
Search engine strengths, weaknesses, quirks
Poor score or statistical significance
Thorough search takes too long
14
Peptide Sequence Databases
• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold size and search time reduction
• Formatted as a FASTA sequence database
• One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490
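The "all 30-mers, no redundancy" criterion can be illustrated with a small sketch. It only shows what counts as a distinct k-mer and how much repetition a naive collection of overlapping sequences carries. It is not the construction used to build the compressed FASTA databases, which the slides do not describe. The names `distinct_kmers` and `redundancy` are hypothetical.

```python
def distinct_kmers(sequences, k=30):
    """Collect every distinct length-k substring across all sequences."""
    kmers = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmers.add(seq[start:start + k])
    return kmers


def redundancy(sequences, k=30):
    """Ratio of total k-mer occurrences to distinct k-mers (1.0 means no repeats)."""
    total = sum(max(len(seq) - k + 1, 0) for seq in sequences)
    distinct = len(distinct_kmers(sequences, k))
    return total / distinct if distinct else 0.0


# Toy input with k=3 so the effect is visible on short strings.
toy = ["ACDEFG", "CDEFGH", "ACDEFG"]
print(sorted(distinct_kmers(toy, k=3)))
print(redundancy(toy, k=3))
```

A database that stores each distinct k-mer context once can be several-fold smaller than the concatenation of EST, protein, and mRNA translations it was derived from, which is the effect the slide quantifies as a 30-40 fold reduction.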
15
Combine search engine results
No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
Searle et al. JPR 7(1), 2008
[Figure: three-way overlap of X! Tandem, SEQUEST, and Mascot results; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%]
16
Combining search engine results – harder than it looks!
• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply "unsupervised" machine-learning....
  • Lots of related work unified in a single framework.
Search Engine Info. Gain
17
PepArML Workflow
• Select high-quality IDs
• Guess true proteins from search results
• Label spectra & train
• Calibrate confidence
• Guess true proteins from ML results
• Iterate!
• Estimate FDR using (external) decoy
[Flowchart boxes: Spectra; Mascot, OMSSA, Tandem, ...; Extract Peptides & Features; Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Recalibrate Confidence as FDR (D1); Select "True" Proteins; Stable? (No / Yes); Output Peptide Spectrum Assignments]
18
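The final workflow step, estimating FDR from decoy identifications, can be sketched as follows. This is a generic target-decoy estimate (decoy hits divided by target hits above a score threshold), assuming target and decoy databases of comparable size. It is not PepArML's own code, and the function names and toy scores are made up for illustration.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among target PSMs scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs; a higher score is better.
    """
    psms = list(psms)
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return float("inf") if decoys else 0.0
    return decoys / targets


def threshold_for_fdr(psms, max_fdr):
    """Lowest score threshold whose estimated FDR is <= max_fdr, or None."""
    psms = list(psms)
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Hypothetical scores: (score, is_decoy)
toy_psms = [(10, False), (9, False), (8, False), (7, True),
            (6, False), (5, True), (4, False)]
print(threshold_for_fdr(toy_psms, max_fdr=0.25))
```

Because the estimated FDR is not monotone in the threshold, the sketch scans every candidate threshold and keeps the lowest one that satisfies the bound.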
False-Discovery-Rate Curves
19
20
PepArML Meta-Search Engine
NSF TeraGrid, 1000+ CPUs
Edwards Lab, Scheduler & 80+ CPUs
Secure communication
Heterogeneous compute resources
Single, simple search request
Scales easily to 250+ simultaneous searches
X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core).
X!Tandem, KScore, OMSSA, MyriMatch.
Amazon AWS
21
PeptideMapper Web Service
I’m Feeling Lucky
22
PeptideMapper Web Service
I’m Feeling Lucky
23
PeptideMapper Web Service
• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • “Compression” makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC “known-gene” transcripts, and
  • Predetermined, embedded genomic coordinates
Systems Biology
24
molecular biology ↕ phenotype
molecular biology ↕ biology
Knowledge Databases
Structured
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation
High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics
Systems Biology
25
molecular biology ↕ phenotype
molecular biology ↕ biology
Mathematical Models
Knowledge Databases
Structured
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation
High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics
Functional Annotation Enrichment
Why not in proteomics?
• Double counting and false positives…
  • …due to traditional protein inference
• Proteomics cannot see all proteins…
  • …proteins are not equally likely to be drawn
• Good relative abundance is hard…
  • …extra chemistries, workflows, and software
  • …missing values are particularly problematic
27
In proteomics…
• Double counting and false positives…
  • Use generalized protein parsimony
• Proteomics cannot see all proteins…
  • Use identified proteins as background
• Good relative abundance is hard…
  • Model differential spectral counts directly
28
Traditional Protein Parsimony
• Select the smallest set of proteins that explain all identified peptides.
• Sensible principle, implies
  • Eliminate equivalent/subset proteins
• Equivalent proteins are problematic:
  • Which one to choose?
• Unique-protein peptides force the inclusion of proteins into solution
  • True for most tools, even probability based ones
  • Bad consequences for FDR filtered ids
29
Peptide-Spectrum Matches
• Sigma49: 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human
• Yeast: 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD.
• X!Tandem E-value (no refinement), 1% FDR
30
Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.
Many proteins are easy
• Eliminate equivalent / dominated proteins
  • Sigma49: 277 → 60 proteins
  • Yeast: 1226 → 1085 proteins
• Many components have a single protein:
  • Sigma49: 52 (3 multi-protein)
  • Yeast: 994 (43 multi-protein)
• Single peptides force protein inclusion
  • Sigma49: 16 single-peptide proteins
  • Yeast: 476 single-peptide proteins
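The two "easy" reductions, dropping equivalent or dominated proteins and splitting what remains into connected components, can be sketched as below. This is a minimal illustration on invented data. The function names, the protein labels P1..P5, and the tie-break for equivalent proteins (keep the alphabetically first accession) are assumptions, not the author's implementation.

```python
def drop_dominated(protein_peptides):
    """Keep one protein per distinct peptide set and drop proteins whose
    peptides are all contained in another kept protein."""
    kept = {}
    # Visit larger peptide sets first; break ties by accession for determinism.
    order = sorted(protein_peptides, key=lambda p: (-len(protein_peptides[p]), p))
    for prot in order:
        peps = frozenset(protein_peptides[prot])
        if not any(peps <= other for other in kept.values()):
            kept[prot] = peps
    return kept


def components(protein_peptides):
    """Group proteins linked, directly or transitively, by shared peptides."""
    remaining = {p: set(peps) for p, peps in protein_peptides.items()}
    groups = []
    while remaining:
        prot, peps = remaining.popitem()
        group, seen = {prot}, set(peps)
        changed = True
        while changed:
            changed = False
            for other in list(remaining):
                if seen & remaining[other]:
                    seen |= remaining.pop(other)
                    group.add(other)
                    changed = True
        groups.append(group)
    return groups


# Hypothetical protein -> identified peptides
toy = {"P1": {"a", "b", "c"}, "P2": {"a", "b"}, "P3": {"a", "b", "c"},
       "P4": {"c", "d"}, "P5": {"e"}}
kept = drop_dominated(toy)
print(sorted(kept))
print(sorted(sorted(g) for g in components(kept)))
```

After this step, any component with a single protein needs no further decision; only multi-protein components require an actual parsimony computation.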
31
Must eliminate redundancy
Contained proteins should not be selected
32
[Figure: protein-by-peptide matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; 37 distinct peptides]
Must eliminate redundancy
• Contained proteins should not be selected
  • Even if they have some probability mass
  • Number of sibling peptides matters less if they are shared.
33
[Figure: protein-by-peptide matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112, with two sets of values listed beside the proteins, in order: 1.0, 1.0, 0.8, 0.7, 0.0, 1.0 and 1.0, 0.0, 0.0, 0.0, 0.0, 1.0; annotation: Single AA Difference]
Must ignore some PSMs
A single additional peptide should not force protein into solution
34
[Figure: protein-by-peptide matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; annotation: Single AA Difference]
Example from Yeast
• "Inosine monophosphate dehydrogenase" 4 gene family
• Contained proteins should not be selected
• Single peptide evidence for YML056C
35
[Figure: protein-by-peptide matrix for YLR432W, YHR216W, YAR073W, and YML056C, with values listed beside the proteins, in order: 1.0, 0.6, 0.0, 1.0]
Must ignore some PSMs
• Improving peptide identification sensitivity makes things worse!
  • False PSMs don't cluster
36
[Figure labels: 10%; 2x Proteins; PSMs; PSMs]
Must ignore some PSMs
• Improving peptide identification sensitivity makes things worse!
  • False PSMs don't cluster
37
[Figure labels: Select Proteins to Explain True PSM%; PSMs; PSMs; 90%; 90%]
Must ignore some PSMs
• How do we choose?
  • Maximize # peptides?
  • Minimize FDR (naïve model)?
  • Maximize # PSMs?
38
[Figure: protein-by-peptide matrices for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, IPI00903112 and for YLR432W, YHR216W, YAR073W, YML056C]
Generalized Protein Parsimony
• Weight peptides by number of PSMs
• Constrain unique peptides per protein
• Maximize explained peptides (PSMs)
  • Match PSM filtering FDR to % uncovered PSMs
• Readily solved by branch-and-bound
  • Permits complex protein/peptide constraints
• Reduces to traditional protein parsimony
39
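One way to read the generalized parsimony criteria is as a small optimization problem: choose proteins so that each chosen protein has at least `min_unique` peptides shared with no other chosen protein, maximize the PSM-weighted peptides explained, and prefer fewer proteins on ties. The sketch below enumerates all subsets, so it only suits toy inputs. It substitutes brute force for the branch-and-bound mentioned on the slide, and the exact objective, function names, and data are assumptions rather than the author's formulation.

```python
from itertools import combinations


def unique_counts(selection, protein_peptides):
    """Peptides each selected protein shares with no other selected protein."""
    counts = {}
    for prot in selection:
        others = set()
        for other in selection:
            if other != prot:
                others |= protein_peptides[other]
        counts[prot] = len(protein_peptides[prot] - others)
    return counts


def generalized_parsimony(protein_peptides, psm_counts, min_unique=1):
    """Return (selected proteins, explained PSM count) by exhaustive search."""
    proteins = sorted(protein_peptides)
    best_key, best = (0, 0), ((), 0)
    for size in range(1, len(proteins) + 1):
        for selection in combinations(proteins, size):
            if min(unique_counts(selection, protein_peptides).values()) < min_unique:
                continue  # some protein lacks enough unique peptides
            covered = set().union(*(protein_peptides[p] for p in selection))
            weight = sum(psm_counts[pep] for pep in covered)
            key = (weight, -size)  # more PSMs first, then fewer proteins
            if key > best_key:
                best_key, best = key, (selection, weight)
    return best


# Hypothetical proteins, peptides, and PSM counts per peptide
prots = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p4"}, "D": {"p5"}}
psms = {"p1": 5, "p2": 4, "p3": 3, "p4": 2, "p5": 1}
print(generalized_parsimony(prots, psms, min_unique=1))
print(generalized_parsimony(prots, psms, min_unique=2))
```

With `min_unique=2` the toy solution leaves 3 of 15 PSMs uncovered; in the slide's terms, the uniqueness constraint would be tuned so that this uncovered fraction matches the PSM filtering FDR. With `min_unique=1` every PSM is explained by a smallest cover, which is the traditional parsimony answer.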
Match uncovered PSMs to FDR
40
Plasma membrane enrichment
• Pellicle enrichment of plasma membrane
  • Choksawangkarn et al. JPR 2013 (Fenselau Lab)
• Six replicate LC-MS/MS analyses each
  • Cell-lysate (44,861 MS/MS)
  • Fe3O4-Al2O3 pellicle (21,871 MS/MS)
• 625 3-unique proteins to match 10% FDR:
  • Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
• 89 proteins with significantly (< 10^-5) increased counts
41
Semi-quantitative LC-MS/MS
42
Precursor selection + collision-induced dissociation (CID)
MS/MS
Semi-quantitative LC-MS/MS
43
Chen and Yates. Molecular Oncology, 2007
Plasma membrane enrichment
• Na+/K+ ATPase subunit alpha-1 (P05023):
  • Lysate: 1; Pellicle: 90; p-value: 5.2 x 10^-33
• Transferrin receptor protein 1 (P02786):
  • Lysate: 17; Pellicle: 63; p-value: 2.0 x 10^-11
• DAVID Bioinformatics analysis (89/625):
  • Plasma membrane (GO:0005886): 29 (5.2 x 10^-5)
  • Transmembrane (SwissProtKW): 24 (1.3 x 10^-6)
• Transmembrane (SwissProtKW):
  • Lysate: 524; Pellicle: 1335; p-value: 2.6 x 10^-158
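A hypergeometric tail is one plausible way to turn differential spectral counts into p-values like those above. The sketch assumes the population is the pooled set of explained PSMs (18,976 lysate + 13,723 pellicle), that a protein's PSMs are a draw without replacement from that pool, and that the test is one-sided for pellicle enrichment. Those choices, and the function name, are assumptions and not a confirmed description of how the slide's values were computed.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are 'successes' (here: pellicle PSMs)."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return float(Fraction(tail, total))


LYSATE_PSMS, PELLICLE_PSMS = 18_976, 13_723
POOL = LYSATE_PSMS + PELLICLE_PSMS

# P05023: 1 lysate PSM and 90 pellicle PSMs
p_atpase = hypergeom_upper_tail(90, 1 + 90, PELLICLE_PSMS, POOL)
# P02786: 17 lysate PSMs and 63 pellicle PSMs
p_tfr = hypergeom_upper_tail(63, 17 + 63, PELLICLE_PSMS, POOL)
print(p_atpase, p_tfr)
```

Exact integer arithmetic with `math.comb` and `Fraction` avoids underflow for tails this small; a production analysis would more likely use a statistics library and correct for multiple testing across the 625 proteins.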
44
Distribution of p-values (Yeast)
45
A protein's PSMs rise and fall together!
46
A protein's PSMs rise and fall together?
47
Anomalies indicate proteoforms
48
HER2/Neu Mouse Model of Breast Cancer
• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
  • 2-unique generalized protein parsimony
49
Nascent polypeptide-associated complex subunit alpha
50
7.3 x 10^-8
51
Pyruvate kinase isozymes M1/M2
2.5 x 10^-5
52
Summary
• Improve the scope and sensitivity of peptide identification for genome annotation, using
  • Exhaustive peptide sequence databases
  • Machine-learning for combining
  • Meta-search tools to maximize consensus
  • Grid-computing for thorough search
Summary
• Functional annotation enrichment for proteomics too:
  • Careful counting (generalized parsimony)
  • Differential abundance by spectral counts
• Use (multivariate-)hypergeometric model for
  • Differential abundance by spectral counts
  • Proteoform detection
53