Improving Genome Annotation using Proteomics

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

3

Mass Spectrometer

Sample → Ionizer → Mass Analyzer → Detector

• Ionizer
  • MALDI
  • Electro-Spray Ionization (ESI)

• Mass Analyzer
  • Time-Of-Flight (TOF)
  • Quadrupole
  • Ion-Trap

• Detector
  • Electron Multiplier (EM)

4

High Bandwidth

[Figure: example mass spectrum; x-axis m/z (0 to 1000), y-axis % intensity (0 to 100)]

5

Mass is fundamental!

6


• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

7


• Mass spectrometry has been around since the turn of the 20th century...
  • ...so why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein / genome sequences
  • A reference for comparison

8

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

9

Single Stage MS

[Figure: MS spectrum; x-axis m/z]

10

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: two spectra; x-axes m/z]

11

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision induced dissociation (CID)

[Figure: MS and MS/MS spectra; x-axes m/z]

12

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from (any) sequence database
  • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...

• Automated, high-throughput peptide identification in complex mixtures
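
The three steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring used by any particular search engine: it assumes monoisotopic residue masses, singly charged b- and y-ions only, and a plain count of matched peaks within a fixed tolerance. The function names, the 0.5 tolerance, and the decoy sequence `DLVTAYVDVLK` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b-ion: N-terminal piece
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y-ion: C-terminal piece
    return ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments with a peak within tol (m/z units)."""
    return sum(any(abs(mz - p) <= tol for p in peaks)
               for mz in fragment_mz(peptide))


def best_candidates(candidates, peaks, tol=0.5):
    """Step 3: candidates sorted from best to worst match."""
    return sorted(candidates,
                  key=lambda pep: count_matches(pep, peaks, tol),
                  reverse=True)


# Toy usage: an idealized, noise-free spectrum generated from one peptide.
peaks = fragment_mz("DLATVYVDVLK")
ranked = best_candidates(["DLVTAYVDVLK", "DLATVYVDVLK"], peaks)
```

Real spectra contain noise peaks, missing fragments, and multiple charge states, so production search engines replace the simple match count with a statistical score.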

13

Peptide Identification

...can provide direct experimental evidence for the amino-acid sequence of functional proteins.

Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames

14

How could this help?

• Evidence for SNPs and alternative splicing stops with transcription

• No genomic or transcript evidence for translation start-site.

• Conservation doesn’t stop at coding bases!

• Statistical gene-finders struggle with micro-exons, translation start-site, and short ORFs.

15

What can be observed?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Microexons (non-canonical splice-sites)

• Alternative translation start-sites (codons)

• Alternative translation frames

• “Dark” open-reading-frames

16

Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.

• LIME1 gene:
  • LCK interacting transmembrane adaptor 1

• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

• Multiple significant peptide identifications

17

Splice Isoform

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

18

Novel Splice Isoform

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr20:61839670-61839828&hgsid=68112320&est=pack

19

Translation Start-Site

• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.

• THOC2 gene:
  • Part of the heteromultimeric THO/TREX complex.

• Initially believed to be a “novel” ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no Swiss-Prot entry
  • GenBank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • ~ 100,000 bases upstream of other isoforms

20


http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000833.3.xml&uid=61235&label=AAAACWFI&homolog=AAAACWFI&id=1962.1.1&proex=-1

21


http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68062273&hgt.in1=1.5x&position=chrX%3A122566218-122566301

22


http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=68062273&hgt.in1=1.5x&position=chrX%3A122562855-122562923

23


24

Easily distinguish minor sequence variations

Two B. anthracis Sterne α/β SASP annotations

• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)

• Intact proteins differ by 230 Da
  • 7441 Da vs 7211 Da

• N-terminal tryptic peptides:
  • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
  • MARNR (646.3 Da)
  • Very different MS/MS spectra
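
The peptide masses quoted on this slide can be checked by summing residue masses and adding one water. The sketch below assumes monoisotopic masses and unmodified peptides; the slide does not state which mass convention was used, so the agreement to 0.1 Da is a consistency check rather than a statement of the author's method. `peptide_mass` is a hypothetical helper name.

```python
# Monoisotopic residue masses in Da, only for the residues used below.
RESIDUE_MASS = {"M": 131.04049, "V": 99.06841, "A": 71.03711,
                "R": 156.10111, "N": 114.04293}
WATER = 18.01056  # added once per peptide (free N- and C-termini)


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


masses = {pep: round(peptide_mass(pep), 1)
          for pep in ("MVMAR", "MVMARNR", "MARNR")}

# The two annotations differ by the leading Met-Val, about 230 Da.
start_shift = round(RESIDUE_MASS["M"] + RESIDUE_MASS["V"])
```

Under these assumptions `masses` reproduces the 606.3, 876.4, and 646.3 Da values, and `start_shift` reproduces the 230 Da difference between the intact proteins.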

25

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

[Figure labels: Stop codon, Stop codon]

• Find all the open-reading-frames...

...courtesy of Art Delcher

26

Bacterial Gene-Finding

…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…

[Figure labels, forward strand: Stop codon, Stop codon]

Reverse strand:

…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…

[Figure labels, reverse strand: Shifted stop, Stop codon]

• Find all the open-reading-frames...

...but they overlap – which ones are correct?
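
One way to "find all the open-reading-frames" is to scan each of the three reading frames on both strands for an ATG followed by an in-frame stop codon. The following is a minimal sketch, assuming the standard stop codons TAA, TAG, and TGA, ATG as the only start codon (bacteria also use alternatives such as GTG and TTG), and uppercase ACGT input. It is not Glimmer's method, which goes on to score the overlapping candidates; `find_orfs` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq):
    """Return (strand, frame, start, end, orf) tuples for ATG...stop ORFs.

    start/end are 0-based, end-exclusive positions on the strand being read.
    For each stop codon only the most upstream in-frame ATG is reported.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the stop
    return orfs


# The forward-strand fragment shown on the slide.
SEQ = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
orfs = find_orfs(SEQ)
```

Because all six frames are scanned independently, candidates on different frames and strands can overlap, which is exactly the ambiguity the slide raises.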


27

Coding-Sequence “Score”


28

Glimmer3 Performance

Genome                                                    Glimmer3 Predictions
Organism                           Length  GC%   # Genes  Matches           Correct Starts    Extra
Archaeoglobus fulgidus             2.18Mb  48.6     1165     1162    99.7%      875    75.1%   1305
Bacillus anthracis                 5.23Mb  35.4     3132     3129    99.9%     2768    88.4%   2340
Bacillus subtilis                  4.21Mb  43.5     1576     1567    99.4%     1429    90.7%   2879
Campylobacter jejuni               1.78Mb  30.3     1233     1233   100.0%     1149    93.2%    668
Carboxydothermus hydrogenoformans  2.40Mb  42.0     1753     1752    99.9%     1590    90.7%    865
Caulobacter crescentus             4.02Mb  67.2     2192     2187    99.8%     1552    70.8%   1559
Chlorobium tepidum                 2.15Mb  56.5     1292     1289    99.8%      949    73.5%    765
Clostridium perfringens            3.03Mb  28.6     1504     1503    99.9%     1385    92.1%   1178
Colwellia psychrerythraea          5.37Mb  38.0     3063     3060    99.9%     2663    86.9%   1714
Dehalococcoides ethenogenes        1.47Mb  48.9     1069     1059    99.1%      929    86.9%    483
Escherichia coli                   4.64Mb  50.8     3603     3553    98.6%     3150    87.4%    913
Geobacter sulfurreducens           3.81Mb  60.9     2351     2340    99.5%     1974    84.0%   1091
Haemophilus influenzae             1.83Mb  38.1     1170     1170   100.0%     1054    90.1%    639
Helicobacter pylori                1.67Mb  38.9      915      914    99.9%      805    88.0%    765
Listeria monocytogenes             2.91Mb  38.0     1966     1965    99.9%     1797    91.4%    845
Methylococcus capsulatus           3.30Mb  63.6     2015     2005    99.5%     1542    76.5%   1231
Mycobacterium tuberculosis         4.40Mb  65.6     2217     2205    99.5%     1493    67.3%   2104
Neisseria meningitidis             2.27Mb  51.5     1232     1217    98.8%     1042    84.6%   1329
Porphyromonas gingivalis           2.34Mb  48.3     1200     1198    99.8%      933    77.8%    887
Pseudomonas fluorescens            7.07Mb  63.3     4535     4503    99.3%     3577    78.9%   1871
Pseudomonas putida                 6.18Mb  61.5     3633     3596    99.0%     2825    77.8%   1916
Ralstonia solanacearum             3.72Mb  67.0     2512     2487    99.0%     2061    82.0%   1077
Staphylococcus epidermidis         2.62Mb  32.1     1650     1649    99.9%     1511    91.6%    771
Streptococcus agalactiae           2.16Mb  35.6     1441     1438    99.8%     1336    92.7%    683
Streptococcus pneumoniae           2.16Mb  39.7     1359     1355    99.7%     1214    89.3%    780
Thermotoga maritima                1.86Mb  46.2     1092     1090    99.8%      892    81.7%    804
Treponema denticola                2.84Mb  37.9     1463     1463   100.0%     1332    91.0%   1210
Treponema pallidum                 1.14Mb  52.8      575      572    99.5%      425    73.9%    557
Ureaplasma parvum                  0.75Mb  25.5      327      327   100.0%      300    91.7%    293
Wolbachia endosymbiont             1.08Mb  34.2      628      627    99.8%      528    84.1%    537

Averages: Matches 99.6%, Correct Starts 84.3%

• Glimmer3 trained & compared to RefSeq genes with annotated function

• Correct STOP: 99.6%

• Correct START: 84.3%

• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
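
The percentages in the table appear to be computed against the number of annotated genes, for both matches and correct starts. This is an inference from the counts, not something the slide states. A short check on three rows copied from the table (the helper name `percent` is hypothetical):

```python
def percent(part, whole):
    """Percentage rounded to one decimal, as in the table."""
    return round(100.0 * part / whole, 1)


# organism: (annotated genes, Glimmer3 matches, correct starts)
rows = {
    "Archaeoglobus fulgidus": (1165, 1162, 875),
    "Bacillus anthracis": (3132, 3129, 2768),
    "Caulobacter crescentus": (2192, 2187, 1552),
}

# (match %, correct-start %) with annotated genes as the denominator for both
rates = {org: (percent(m, g), percent(s, g)) for org, (g, m, s) in rows.items()}
```

With the gene count as the denominator, these rows reproduce the tabulated values; dividing correct starts by the number of matches instead gives slightly different figures for some rows.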

29

N-terminal peptides

• (Protein) N-terminal peptides establish
  • start-site of known & unexpected ORFs

Use:
• Directly to annotate genomes
• Evaluate and improve algorithms
• Map cross-species

30

N-terminal peptide workflows

• Typical proteomics workflows sample peptides from the proteome “randomly”

• Caulobacter crescentus (70%)
  • 3733 Proteins (RefSeq Genome annot.)
  • 66K tryptic peptides (600 Da to 3000 Da)
  • 2085 N-terminal tryptic peptides (3%)
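
Counts like these come from an in-silico digest: 2085 N-terminal peptides out of roughly 66K is about 3%. A minimal sketch of such a count follows, assuming the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and monoisotopic masses. The slide does not specify these details, and the function names and toy proteins are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Split after K or R, except before P; returns (start, peptide) pairs."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def n_terminal_counts(proteins, min_mass=600.0, max_mass=3000.0):
    """Return (N-terminal peptides, all peptides) inside the mass window."""
    n_term = total = 0
    for protein in proteins:
        for start, peptide in tryptic_digest(protein):
            if min_mass <= peptide_mass(peptide) <= max_mass:
                total += 1
                if start == 0:          # peptide begins at the protein N-terminus
                    n_term += 1
    return n_term, total


# Toy input, not real proteome data.
digest = tryptic_digest("MVMARNRPK")
counts = n_terminal_counts(["MVMARNRPK", "MARNR"])
```

Each protein contributes at most one N-terminal peptide but many internal ones, which is why random sampling of peptides rarely hits the N-terminus.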

31

N-terminal peptide workflow

• Protect protein N-terminus

• Digest to peptides
• Chemically modify free peptide N-term
• Use chem. mod. to capture unwanted peptides

Nat Biotech, Vol. 21, pp. 566-569, 2003.

32

Increasing N-terminal peptide coverage

• Multiple (digest) enzymes:
  • trypsin-R: 60% (80%)
  • acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
• Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation

Anal Chem, Vol. 76, pp. 4193-4201, 2004.

33

Proteomics Informatics

• Search spectra against:
  • Entire bacterial genome;
  • All Met initiated peptides; or
  • Statistically likely Met initiated peptides.

• Easily consider initial Met loss PTM, too

• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
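
A search database of "all Met initiated peptides" could be built along the following lines. This is a sketch under stated assumptions, not the author's pipeline: it scans only the forward strand, treats ATG as the only start codon, uses the standard genetic code, takes the first tryptic peptide of each translation, and adds the Met-removed form unconditionally. The function names and the toy DNA string (chosen to encode MVMARNR) are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG-ordered string.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate from the first base up to the first stop codon (or the end)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def first_tryptic_peptide(protein):
    """N-terminal peptide: up to the first K or R not followed by P."""
    for i, aa in enumerate(protein):
        if aa in "KR" and protein[i + 1:i + 2] != "P":
            return protein[:i + 1]
    return protein


def met_initiated_candidates(dna):
    """Candidate N-terminal peptides for every ATG, with and without Met."""
    candidates = set()
    pos = dna.find("ATG")
    while pos != -1:
        peptide = first_tryptic_peptide(translate(dna[pos:]))
        candidates.add(peptide)
        if len(peptide) > 1:
            candidates.add(peptide[1:])   # initial Met loss
        pos = dna.find("ATG", pos + 1)
    return candidates


# Toy DNA encoding MVMARNR followed by a stop codon.
candidates = met_initiated_candidates("ATGGTTATGGCTCGTAATCGTTAA")
```

The resulting peptide list can be written out as a sequence database and handed to an ordinary MS/MS search engine; a full version would also scan the reverse strand.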

34

Other Practical Issues

• Suitable for commonly available instrumentation
  • Only the sample prep. is (somewhat) novel.

• Need living organism
  • Stage of life-cycle?

• Bang for buck?
  • N-terminal peptides / $$$$

35

Other Research Projects

• Alternative splicing and coding SNPs in clinical cancer samples
• MS/MS spectral matching using HMMs
• Combining MS/MS search engine results using machine learning
• Microorganism identification using MS (www.RMIDb.org)
• Gapped/spaced seeds for inexact sequence alignment.
• Applications of SBH-graphs and Eulerian paths

36

Hidden Markov Models for Spectral Matching

• Capture statistical variation and consensus in peak intensity

• Capture semantics of peaks
• Extrapolate model to other peptides

• Good specificity with superior sensitivity for peptide detection
  • Assign 1000s of additional spectra (with p-value < 10^-5)

37

Peptide DLATVYVDVLK

38

Peptide DLATVYVDVLK

39

Acknowledgements

• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Cheng Lee
  • Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: NIH/NCI, USDA/ARS
