proteogenomics10oct2008 v2 com

Upload: s-b-mirza

Post on 09-Apr-2018


  • 8/8/2019 Proteogenomics10Oct2008 v2 com


    Annotating genomes using proteomics data

    Andy Jones

    Department of Preclinical Veterinary Science


    Overview

    Genome annotation

    Current informatics methods

    Experimental data

    How good are we at annotating genomes?

    Proteome data for genome annotation

    Study on Toxoplasma

    Challenges

    Proposed solutions


    Annotating eukaryotic genomes

    Genome annotation:

    Find start codons / transcriptional initiation

    Recognise splice acceptor and donor sequences

    Predict alternative splicing...

    [Figure: genomic DNA with start codon, exons 1 to 4 and stop codon, spliced into mRNA]


    Computational gene prediction

    De novo prediction (single genome)
    - Trained with typical gene structures: learn exon-intron signals, translation initiation and termination signals, e.g. Markov models
    - Many different predictions scored based on a training set of known genes

    Multiple genome
    - Compare confirmed gene sequences from other species
    - Coding regions are more highly conserved; conservation indicates gene position
    - Pattern searching: higher mutation rate of bases separated in multiples of three (mutations in the 3rd position of codons are often silent)

    Experimental data also contribute to many genome projects

    New methods weigh evidence from a variety of sources
    - Attempting to reproduce how a human annotator would work

    Brent, Nat Rev Genet. 2008 Jan;9(1):62-73
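
    The pattern-searching idea (substitutions concentrated at every third base) can be illustrated with a minimal sketch. It assumes two gap-free, aligned coding sequences that start in frame; the function name and the toy sequences are made up for illustration and do not come from any gene finder.

```python
def mismatches_by_codon_position(seq_a, seq_b):
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes both sequences are aligned without gaps, have equal length,
    and start in frame (index 0 is the first base of a codon).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1
    return counts


# Toy sequences (hypothetical): both encode Ala-Gly-Leu-Ser.
species_a = "GCTGGACTTTCA"
species_b = "GCCGGGCTATCG"
print(mismatches_by_codon_position(species_a, species_b))
```

    In this toy pair every difference falls on the third codon position, which is the kind of period-three signal a comparative gene finder can look for in conserved regions.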


    Experimental corroboration of models

    Expressed Sequence Tags
    - Simple to obtain large volumes of data: sequence randomly from cDNA libraries

    Problems:
    - Data sets can contain unprocessed transcripts (do not always confirm splicing)
    - Rarely cover the 5' end of the gene
    - Generally low-quality sequences

    High-throughput sequencing
    - Next-generation sequencers capable of directly sequencing mRNA
    - Likely to become more widely used in the future

    Proteome data (peptide sequence data)


    How good are gene models?

    Plasmodium falciparum (causative agent of malaria): genome sequenced in 2002, has undergone considerable curation of gene models

    Recent article: cDNA study of P. falciparum

    Suggests ~25% of gene models in Plasmodium falciparum are incorrect (85 genes out of 356 sampled)

    Majority of errors are in splice junctions (intron-exon boundaries)

    What does this mean for other genomes...?

    Likely that a high percentage of gene sequences are incorrect!

    BMC Genomics. 2007 Jul 27;8:255.


    Proteome data for genome annotation

    Motivation for using proteome data in genome annotation:

    Can rule out the possibility that transcripts are non-protein-coding

    Large volumes of proteome data are often collected for other purposes

    Certain types of proteome data are able to confirm the start codon of genes (difficult by other methods)

    Even where considerable EST / cDNA sequencing has been performed, proteins can be detected with no corresponding EST evidence


    Proteogenomic study of Toxoplasma gondii

    Proteome study of Toxoplasma gondii using three complementary techniques
    - Parasite of clinical significance, related to Plasmodium

    Study aims:
    - Identify as many components of the proteome as possible
    - Relate peptide sequence data back to the genome to confirm genes
    - Relate protein expression data to transcriptional data (EST / microarray)


    [Figure: workflow of the three complementary techniques]
    - 1D gel electrophoresis: cut bands, trypsin digestion
    - 2D gel electrophoresis: cut gel spot, trypsin digestion
    - Liquid chromatography: trypsin digestion, fractions
    - Peptides from all routes go to mass spectrometry
    - Sequence database search (compare with theoretical spectra predicted for each peptide in DB)
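
    The first step of such a sequence database search, deriving candidate tryptic peptides and their theoretical masses from a protein sequence, can be sketched as below. This is a simplified illustration, not the behaviour of any particular search engine: it ignores missed cleavages, modifications and fragment ions, and the function names and test sequence are assumptions.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Hypothetical toy sequence, chosen to show the proline rule.
for peptide in tryptic_digest("MKRPAKG"):
    print(peptide, round(peptide_mass(peptide), 4))
```

    A search engine then compares each observed precursor mass, and the fragment spectrum, against candidates like these drawn from every protein in the database.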


    Database search strategy

    [Figure: flowchart of the database search strategy; the legend distinguishes DNA sequence databases from amino acid sequence databases]

    Sequence sources (ToxoDB):
    - 60 MB genome sequence
    - Official gene models
    - Alternative gene models predicted by gene finders
    - ORFs predicted in a six-frame translation

    Steps:
    1. Concatenate databases
    2. Search all spectra
    3. Identify peptides and proteins
    4. Align peptide sequences back to the corresponding genomic region
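
    The six-frame translation step can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names, the stop-to-stop definition of an ORF and the minimum-length cut-off are assumptions, not details of the pipeline used in the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) give X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return {(strand, offset): protein} for all six reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def open_reading_frames(dna, min_length=30):
    """Stop-to-stop ORFs of at least min_length residues (assumed cut-off)."""
    orfs = []
    for frame, protein in six_frame_translation(dna).items():
        for segment in protein.split("*"):
            if len(segment) >= min_length:
                orfs.append((frame, segment))
    return orfs


frames = six_frame_translation("ATGGCCTAA")
print(frames[("+", 0)], frames[("-", 0)])
```

    Because every stretch between stop codons in all six frames becomes a candidate sequence, the resulting database is far larger than the set of predicted gene models.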


    Five-exon gene; incomplete agreement between different gene models

    Peptide evidence for all 5 exons and 2 of the 4 introns

    Note: can only provide positive evidence; no peptides matched to the 5' and 3' termini of the gene model


    - Appears to be an additional exon at the 5' end

    - None of the GLEAN, TwinScan or TigrScan algorithms appears to have made the correct prediction


    ORF / part of TgGlimmerHMM sequence:

    VVGGFSSNFLSFFSVIITSVKMSDAEDVTFETA

    DAGASHTYPMQAGAIKKNGFVMLKGNPCKV

    VDYSTSKTGKHGHAKAHIVGLDIFTGKKYED

    VCPTSHNMEVPNVKRSEFQLIDLSDDGFCTLL

    LENGETKDDLMLPKDSEGNLDEVATQVKNLF

    TDGKSVLVTVLQACGKEKIIASKEL

    50.m5694 sequence:

    MVEGVYSSFEAMIFSLPHACRTVTRT

    DLPSVKRFLTCVATSSKFPSESLGSIK

    SSFVSPFSRSSVQKPSSDKSINWNSDL

    FTFGTSML

    - All peptides matched to gene models on opposite strand


    Study outcomes

    Protein evidence for approximately 1/3 of predicted genes (2250 proteins)

    Around 2500 splicing events confirmed
    - Peptides aligned across intron-exon boundaries

    Around 400 protein IDs appear to match alternative gene models

    Genome database (ToxoDB) hosts peptide sequences aligned against gene models

    Can we use informatics to improve this strategy...?

    Xia et al. (2008) Genome Biology, 9(7), pp. R11
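
    One way to decide whether a peptide confirms a splicing event is to map its position in the protein onto the exons of the gene model. The sketch below assumes the protein is the full translation of the model and that exon coding lengths are given in transcript order; the function name and toy data are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def exons_covered(protein, peptide, exon_cds_lengths):
    """Return the 1-based exon numbers that encode the peptide.

    exon_cds_lengths holds the coding length (in nucleotides) of each exon
    in transcript order, and is assumed to add up to 3 * len(protein).
    Returns [] if the peptide does not occur in the protein.
    """
    start_aa = protein.find(peptide)
    if start_aa == -1:
        return []
    first_nt = 3 * start_aa                      # first coding base, 0-based
    last_nt = 3 * (start_aa + len(peptide)) - 1  # last coding base, 0-based
    exon_ends = list(accumulate(exon_cds_lengths))  # exclusive cumulative ends
    first_exon = bisect_right(exon_ends, first_nt)
    last_exon = bisect_right(exon_ends, last_nt)
    return list(range(first_exon + 1, last_exon + 2))


# Toy gene model: a 10-residue protein encoded by exons of 10, 8 and 12 bases.
protein = "MKTAYIAKQR"
exon_lengths = [10, 8, 12]
print(exons_covered(protein, "TAY", exon_lengths))  # spans the exon 1/2 junction
print(exons_covered(protein, "QR", exon_lengths))   # lies within exon 3
```

    A peptide that maps to more than one exon supports the splice junction between them, which is the kind of evidence counted in the splicing events above.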


    Challenges of proteogenomics

    Main informatics challenge:
    - A protein can usually only be identified if the gene sequence has been correctly predicted from the genome
    - In effect, we would like to use MS data directly for gene discovery

    But... searching a six-frame genome translation is problematic
    - All peptide and protein identifications are probabilistic
    - False positive rate is proportional to search database size

    On average only ~10-20% of spectra identify a peptide
    - Need methods that can exploit the rest of the meaningful spectra

    When gene models change, protein identifications are out of date
    - No dynamic interaction between proteome and genome data


    Automated re-annotation pipeline

    Planned improvements to the informatics workflow:

    1. Re-querying pipeline: each time gene models change, all mass spectra are automatically re-queried

    2. Integrate peptide evidence directly into gene finding software

    3. Maximising the number of informative mass spectra

    4. Attempt to optimise algorithms for de novo sequencing of peptides

    5. N-terminal proteomics
    - Could be used to confirm gene initiation point


    [Figure: three-stage pipeline for searching spectra]
    - Stage 1: multiple database search engines against the official gene set (outcome: confirmed official model)
    - Stage 2: multiple database search engines against alternative gene models (outcome: promote alternative model)
    - Stage 3: modified de novo algorithms against the genome sequence (outcome: novel ORF, splice junction)
    - Proteomic evidence from all stages is passed to the gene finder

    Spectra searched in series

    Peptide evidence confirming official gene, alternative model, new ORF:
    - Direct flow back to modified gene finder
    - Produce new set of predictions

    Iteratively improve number of spectra identified
    - In each iteration, fewer spectra flow on to stages 2 and 3
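
    The "searched in series" logic can be sketched as a loop in which only unidentified spectra flow on to the next stage. The stage names, the lookup tables standing in for search engines and the spectrum identifiers are all hypothetical placeholders; a real stage would call a database search or de novo tool.

```python
def search_in_series(spectra, stages):
    """Pass spectra through stages in order; unmatched spectra flow onwards.

    stages is a list of (name, search) pairs, where search(spectrum) returns
    an identification or None. Returns (identified, still_unidentified).
    """
    identified = {}
    remaining = list(spectra)
    for name, search in stages:
        unmatched = []
        for spectrum in remaining:
            hit = search(spectrum)
            if hit is None:
                unmatched.append(spectrum)
            else:
                identified[spectrum] = (name, hit)
        remaining = unmatched
    return identified, remaining


# Hypothetical stand-ins for the three stages: simple lookup tables.
official_hits = {"spectrum_1": "PEPTIDEA"}
alternative_hits = {"spectrum_2": "PEPTIDEB"}
stages = [
    ("official gene set", official_hits.get),
    ("alternative gene models", alternative_hits.get),
    ("de novo against genome", lambda spectrum: None),
]
identified, remaining = search_in_series(
    ["spectrum_1", "spectrum_2", "spectrum_3"], stages
)
print(identified)
print(remaining)
```

    Re-running this loop whenever the gene finder produces a new set of predictions gives the iterative behaviour described above: as the models improve, more spectra are resolved in stage 1 and fewer reach the later stages.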


    Query spectra using different search engines

    Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. PROTEOMICS, in press (2008)

    Each search engine produces a different, non-standard score of the quality of a match

    Developed a search-engine-independent score, based on analysis of false discovery rate

    Identifications made by more search engines are scored more highly

    Can generate 35% more peptide identifications than the best single search engine

    [Figure: Omssa, X!Tandem and Mascot each produce peptide identifications; a rescoring algorithm (FDR) merges them into a combined peptide list]
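
    To show how a false discovery rate can put scores from different engines on a common scale, the sketch below computes q-values with the widely used target-decoy approach. It is not the rescoring algorithm of Jones et al.; the function name, the input layout and the toy scores are assumptions for illustration.

```python
def q_values(psms):
    """Estimate q-values from (score, is_decoy) pairs; higher score is better.

    FDR at a score threshold is estimated as decoys / targets accepted at or
    above that threshold. The q-value of a match is the lowest FDR at which
    it would still be accepted. Returns (score, is_decoy, q_value) tuples
    sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # Walk from the worst score upwards, keeping the running minimum FDR.
    q, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]


# Toy matches from one hypothetical engine: (score, matched a decoy sequence?)
toy_psms = [(90, False), (80, False), (70, True), (60, False), (50, False)]
for score, is_decoy, qv in q_values(toy_psms):
    print(score, is_decoy, qv)
```

    Because a q-value has the same meaning whichever engine produced the raw score, values like these can be compared or combined across Omssa, X!Tandem and Mascot, which is the general idea behind a search-engine-independent score.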


    Conclusions

    Proteome data is able to confirm that gene models are correct

    Data currently under-exploited

    Challenges in searching mass spec data directly against the genome for gene discovery

    Build re-querying pipeline

    Iteratively improve gene models

    Improve capabilities for using multiple search engines

    Integrate peptide evidence directly into gene finders


    Acknowledgments

    Data from Wastling lab:

    Dong Xia, Sanya Sanderson, Jonathan Wastling

    ToxoDB at UPenn: David Roos, Brian Brunk

    Email: [email protected]