Improving Genome Annotation using Proteomics
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Mass Spectrometry for Proteomics
• Measure mass of many (bio)molecules simultaneously
  • High bandwidth
• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required
Mass Spectrometer
Sample → Ionizer → Mass Analyzer → Detector
• Ionizer
  • MALDI
  • Electrospray Ionization (ESI)
• Mass Analyzer
  • Time-Of-Flight (TOF)
  • Quadrupole
  • Ion-Trap
• Detector
  • Electron Multiplier (EM)
High Bandwidth
[Figure: mass spectrum, % Intensity (0 to 100) versus m/z (250 to 1000)]
Mass is fundamental!
Mass Spectrometry for Proteomics
• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to
Mass Spectrometry for Proteomics
• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?
• Ionization methods
  • MALDI, Electrospray
• Protein chemistry & automation
  • Chromatography, Gels, Computers
• Protein / genome sequences
  • A reference for comparison
Sample Preparation for Peptide Identification
[Figure: Enzymatic Digest and Fractionation]
Single Stage MS
[Figure: MS spectrum, x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection from the MS spectrum, x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection + collision induced dissociation (CID), giving the MS/MS spectrum, x-axis m/z]
Peptide Identification
• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well
• Peptide sequences from (any) sequence database
  • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...
• Automated, high-throughput peptide identification in complex mixtures
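The three steps listed for each candidate peptide can be sketched in Python. This is a minimal illustration under stated assumptions, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, standard monoisotopic residue masses, and a plain shared-peak count. The function names, the 0.5 Da tolerance, and the `min_matches` threshold are choices made for this example.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values (assumed ion types)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions + y_ions


def shared_peak_count(peptide, spectrum, tol=0.5):
    """Step 2: count theoretical fragments with an observed peak within tol Da."""
    return sum(
        any(abs(peak - mz) <= tol for peak in spectrum)
        for mz in fragment_mz(peptide)
    )


def identify(candidates, spectrum, min_matches=4, tol=0.5):
    """Step 3: keep candidates that match well, best score first."""
    scored = [(shared_peak_count(p, spectrum, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


# Toy usage: a synthetic, noise-free "spectrum" built from MVMAR's own fragments.
spectrum = fragment_mz("MVMAR")
hits = identify(["MVMAR", "MARNR"], spectrum)
print(hits)
```

Real spectra contain noise, missing fragments, and multiple charge states, so production search engines use statistical scores rather than a raw peak count.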
Peptide Identification
...can provide direct experimental evidence for the amino-acid sequence of functional proteins.
Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames
How could this help?
• Evidence for SNPs and alternative splicing stops with transcription
• No genomic or transcript evidence for translation start-site.
• Conservation doesn’t stop at coding bases!
• Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.
What can be observed?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Microexons (non-canonical splice-sites)
• Alternative translation start-sites (codons)
• Alternative translation frames
• “Dark” open-reading-frames
Splice Isoform
• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
[Figure: Splice Isoform]
[Figure: Novel Splice Isoform]
Translation Start-Site
• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.
• THOC2 gene:
  • Part of the heteromultimeric THO/TREX complex.
• Initially believed to be a “novel” ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no Swiss-Prot entry
  • GenBank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • ~100,000 bases upstream of other isoforms
[Figures: Translation Start-Site, four slides]
Easily distinguish minor sequence variations
Two B. anthracis Sterne α/β SASP annotations
• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)
• Intact proteins differ by 230 Da
  • 7441 Da vs 7211 Da
• N-terminal tryptic peptides:
  • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
  • MARNR (646.3 Da)
  • Very different MS/MS spectra
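The peptide masses quoted on this slide can be checked with a few lines of Python. This sketch assumes the quoted values are neutral monoisotopic masses (sum of residue masses plus one water); the residue table covers only the five residues that occur in these peptides, and `peptide_mass` is a name chosen for this example.

```python
# Standard monoisotopic residue masses (Da) for the residues used here.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


for pep in ("MVMAR", "MVMARNR", "MARNR"):
    print(pep, round(peptide_mass(pep), 1))

# The two annotations differ by the extra N-terminal residues M and V,
# which accounts for the roughly 230 Da gap between the intact proteins.
extra = RESIDUE_MASS["M"] + RESIDUE_MASS["V"]
print("MV residues:", round(extra, 1))
```

Although MVMAR and MARNR are only about 40 Da apart, their fragment ions differ at almost every position, which is why the MS/MS spectra are easy to tell apart.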
Bacterial Gene-Finding
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Figure labels: Stop codon, Stop codon]
• Find all the open-reading-frames...
...courtesy of Art Delcher
Bacterial Gene-Finding
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Figure labels: Stop codon, Stop codon]
…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
[Figure labels: Shifted Stop, Stop codon, Reverse strand]
• Find all the open-reading-frames...
...but they overlap – which ones are correct?
...courtesy of Art Delcher
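Enumerating open reading frames in all six frames can be sketched as follows. This is an illustrative simplification and not Glimmer's algorithm: it reports maximal stop-free codon runs (including runs that touch the ends of the fragment, since the slide's sequence is elided on both sides) and makes no attempt to score or resolve overlaps. The function names and the `min_len` threshold are assumptions for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_runs(seq, frame):
    """Yield (start, end) for each maximal run of non-stop codons in one frame."""
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOPS:
            if pos > run_start:
                yield run_start, pos
            run_start = pos + 3
        pos += 3
    if pos > run_start:
        yield run_start, pos


def find_orfs(seq, min_len=30):
    """Stop-free runs of at least min_len bases on both strands, all frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in stop_free_runs(s, frame):
                if end - start >= min_len:
                    orfs.append((strand, frame, start, end))
    return orfs


# The visible part of the slide's sequence (the elided flanks are unknown).
fragment = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
orfs = find_orfs(fragment, min_len=30)
print(orfs)
```

On this fragment, forward frame 0 yields the run between the TAG and TGA stop codons. Deciding which of several overlapping candidates is a real gene is the part that needs a coding-sequence score.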
Coding-Sequence “Score”
...courtesy of Art Delcher
Glimmer3 Performance
Genome                                                    Glimmer3 Predictions
Organism                           Length   GC%  # Genes  Matches     (%)  Correct Starts    (%)  Extra
Archaeoglobus fulgidus             2.18Mb  48.6     1165     1162   99.7%             875  75.1%   1305
Bacillus anthracis                 5.23Mb  35.4     3132     3129   99.9%            2768  88.4%   2340
Bacillus subtilis                  4.21Mb  43.5     1576     1567   99.4%            1429  90.7%   2879
Campylobacter jejuni               1.78Mb  30.3     1233     1233  100.0%            1149  93.2%    668
Carboxydothermus hydrogenoformans  2.40Mb  42.0     1753     1752   99.9%            1590  90.7%    865
Caulobacter crescentus             4.02Mb  67.2     2192     2187   99.8%            1552  70.8%   1559
Chlorobium tepidum                 2.15Mb  56.5     1292     1289   99.8%             949  73.5%    765
Clostridium perfringens            3.03Mb  28.6     1504     1503   99.9%            1385  92.1%   1178
Colwellia psychrerythraea          5.37Mb  38.0     3063     3060   99.9%            2663  86.9%   1714
Dehalococcoides ethenogenes        1.47Mb  48.9     1069     1059   99.1%             929  86.9%    483
Escherichia coli                   4.64Mb  50.8     3603     3553   98.6%            3150  87.4%    913
Geobacter sulfurreducens           3.81Mb  60.9     2351     2340   99.5%            1974  84.0%   1091
Haemophilus influenzae             1.83Mb  38.1     1170     1170  100.0%            1054  90.1%    639
Helicobacter pylori                1.67Mb  38.9      915      914   99.9%             805  88.0%    765
Listeria monocytogenes             2.91Mb  38.0     1966     1965   99.9%            1797  91.4%    845
Methylococcus capsulatus           3.30Mb  63.6     2015     2005   99.5%            1542  76.5%   1231
Mycobacterium tuberculosis         4.40Mb  65.6     2217     2205   99.5%            1493  67.3%   2104
Neisseria meningitidis             2.27Mb  51.5     1232     1217   98.8%            1042  84.6%   1329
Porphyromonas gingivalis           2.34Mb  48.3     1200     1198   99.8%             933  77.8%    887
Pseudomonas fluorescens            7.07Mb  63.3     4535     4503   99.3%            3577  78.9%   1871
Pseudomonas putida                 6.18Mb  61.5     3633     3596   99.0%            2825  77.8%   1916
Ralstonia solanacearum             3.72Mb  67.0     2512     2487   99.0%            2061  82.0%   1077
Staphylococcus epidermidis         2.62Mb  32.1     1650     1649   99.9%            1511  91.6%    771
Streptococcus agalactiae           2.16Mb  35.6     1441     1438   99.8%            1336  92.7%    683
Streptococcus pneumoniae           2.16Mb  39.7     1359     1355   99.7%            1214  89.3%    780
Thermotoga maritima                1.86Mb  46.2     1092     1090   99.8%             892  81.7%    804
Treponema denticola                2.84Mb  37.9     1463     1463  100.0%            1332  91.0%   1210
Treponema pallidum                 1.14Mb  52.8      575      572   99.5%             425  73.9%    557
Ureaplasma parvum                  0.75Mb  25.5      327      327  100.0%             300  91.7%    293
Wolbachia endosymbiont             1.08Mb  34.2      628      627   99.8%             528  84.1%    537
Averages:                                                           99.6%                  84.3%
• Glimmer3 trained & compared to RefSeq genes with annotated function
• Correct STOP: 99.6%
• Correct START: 84.3%
• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
N-terminal peptides
• (Protein) N-terminal peptides establish
  • start-site of known & unexpected ORFs
Use:
• Directly to annotate genomes
• Evaluate and improve algorithms
• Map cross-species
N-terminal peptide workflows
• Typical proteomics workflows sample peptides from the proteome “randomly”
• Caulobacter crescentus (70%)
  • 3733 Proteins (RefSeq Genome annot.)
  • 66K tryptic peptides (600 Da to 3000 Da)
  • 2085 N-terminal tryptic peptides (3%)
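Counts like these come from an in silico digest of the annotated proteome. The sketch below shows one way such a tally might be made; it is not the procedure used to produce the slide's figures. It assumes the usual trypsin rule (cleave after K or R, not before P), no missed cleavages, monoisotopic masses, and the 600 Da to 3000 Da window from the slide. The two toy protein sequences are hypothetical.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Return (start, peptide) pairs; cleave after K/R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def n_terminal_counts(proteins, lo=600.0, hi=3000.0):
    """Count (N-terminal, all) tryptic peptides inside the mass window."""
    n_term = total = 0
    for protein in proteins:
        for start, pep in tryptic_digest(protein):
            if lo <= peptide_mass(pep) <= hi:
                total += 1
                if start == 0:
                    n_term += 1
    return n_term, total


# Hypothetical toy proteome, for illustration only.
proteins = ["MVMARNRAK", "MAK" + "G" * 8 + "R"]
n_term, total = n_terminal_counts(proteins)
print(n_term, total)

# The slide's own figures: 2085 N-terminal peptides out of about 66K.
print(f"{2085 / 66000:.0%}")
```

Because N-terminal peptides are such a small share of what a random sampling workflow sees, most identifications say nothing about the start site, which motivates enriching for them.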
N-terminal peptide workflow
• Protect protein N-terminus
• Digest to peptides
• Chemically modify free peptide N-term
• Use chem. mod. to capture unwanted peptides
Nat Biotech, Vol. 21, pp. 566-569, 2003.
Increasing N-terminal peptide coverage
• Multiple (digest) enzymes:
  • trypsin-R: 60% (80%)
  • acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
• Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation
Anal Chem, Vol. 76, pp. 4193-4201, 2004.
Proteomics Informatics
• Search spectra against:
  • Entire bacterial genome;
  • All Met initiated peptides; or
  • Statistically likely Met initiated peptides.
• Easily consider initial Met loss PTM, too
• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
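A database of "all Met initiated peptides" could be generated along the following lines. This is a sketch under stated assumptions rather than the pipeline behind the talk: it considers only ATG starts on the forward strand, uses the standard codon table, takes the N-terminal tryptic peptide with up to one missed cleavage, and adds a variant with the initial Met removed for every candidate (real Met removal depends on the following residue). The DNA string is a hypothetical encoding of the MVMARNR N-terminus, and the function names are chosen for this example.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from position 0 until a stop codon or the end of the sequence."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def n_terminal_candidates(dna, missed=1):
    """N-terminal tryptic peptides for every ATG, with and without initial Met."""
    dna = dna.upper()
    candidates = set()
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        protein = translate(dna[start:])
        # Cleavage points: after K or R unless the next residue is P.
        ends = [i + 1 for i, aa in enumerate(protein)
                if aa in "KR" and protein[i + 1:i + 2] != "P"]
        if not ends or ends[-1] != len(protein):
            ends.append(len(protein))
        for end in ends[:missed + 1]:
            peptide = protein[:end]
            candidates.add(peptide)
            if len(peptide) > 1:
                candidates.add(peptide[1:])  # initial Met loss variant
    return candidates


# Hypothetical DNA encoding M V M A R N R followed by a stop codon.
dna = "ATGGTTATGGCTCGTAACCGTTAA"
candidates = n_terminal_candidates(dna)
print(sorted(candidates))
```

The resulting peptide list can be written out as a FASTA file and searched with an off-the-shelf engine, so no custom search software is needed.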
Other Practical Issues
• Suitable for commonly available instrumentation
  • Only the sample prep. is (somewhat) novel.
• Need living organism
  • Stage of life-cycle?
• Bang for buck?
  • N-terminal peptides / $$$$
Other Research Projects
• Alternative splicing and coding SNPs in clinical cancer samples
• MS/MS spectral matching using HMMs
• Combining MS/MS search engine results using machine learning
• Microorganism identification using MS (www.RMIDb.org)
• Gapped/spaced seeds for inexact sequence alignment.
• Applications of SBH-graphs and Eulerian paths
Hidden Markov Models for Spectral Matching
• Capture statistical variation and consensus in peak intensity
• Capture semantics of peaks
• Extrapolate model to other peptides
• Good specificity with superior sensitivity for peptide detection
  • Assign 1000’s of additional spectra (w/ p-value < 10^-5)
[Figures: Peptide DLATVYVDVLK, two slides]
Acknowledgements
• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry
• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
• Cheng Lee
  • Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS