harnessing public data repositories for metaproteomics ... · harnessing public data repositories...

1
Harnessing public data repositories for metaproteomics using Enosi Natalie E Castellana * , Andrey D. Prjibelski + , Dmitry An:pov + * Digital Proteomic, LLC, San Diego, CA, + Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia Mo#va#on and Background Metaproteomics is the next frontier in proteomics research, yet bioinformatic tools and resources remain scarce. Lack of fully sequenced genomes or proteomes create an enormous challenge. Often metaproteomics experiments are accompanied by expensive and time-consuming metagenomic and metatranscriptomic experiments that may only incompletely represent the repertoire of proteins in the sample. Alternatively, de novo sequencing of peptides can be used, however, this approach does not report the context of the peptides in a gene. In this study we examine the opportunity for re-use of public data to replace the need for coupled, large-scale metagenomic studies with metaproteomics. All datasets used in the study are publicly available. The driving questions behind our proof-of- concept study were the following: Q: Can current proteogenomic technology 1,2 be adapted for meta- proteogenomics? Q: Are matched metagenomic datasets required for metaproteomics? Q: Can similar (but unmatched) metatranscriptomic data fill the gap between the sequenced, cultured organism and the sampled organism? Conclusion Methods Data 1. Castellana, et al. (2013). An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell Proteomics, 13, 1:157-67.. 2. Woo, et al. (2013). Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res., 13, 1:21-8. 3. Tveit, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 1:15-21. 4. Bankevich, et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 5:455-77. 5. Dobin, et al. (2013). Bioinformatics 6. Kim, et al. (2010). The generating function of CID, ETD, and CID/ ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics, 9:2840-52. 1: Transcript Assembly Enosi is a proteogenomic toolkit provided by Digital Proteomics LLC. • Similar, but un-matched, meta-proteogenomic samples exist across the many available databases (NCBI, SRA, ProteomeXchange) The Enosi database construction method effectively condenses sequence information from many RNA-Seq runs. This paves the way to creating organism or disease specific databases from large repositories. • Expanding the peptide identification to allow point mutations could potentially boost the identification rate. • Enosi accepts mass spectra from all major vendor instruments, both high accuracy and low accuracy, and in common file formats including mzML, mzXML, and mgf. Proteome 2: Reference Genome Selection, Variant Calling, Contig Mapping 3: Enosi Database Construction 4: Peptide Identification and Aggregation Originally, we attempted to map raw sequencing reads to the reference genome, however, high levels of variation prevented us from confidently mapping more than 1% of the reads. As a second pass, we pursued de novo transcript assembly to derive longer transcripts that would result in more confident mapping to the reference genome. 408 million RNA reads from both Illumina and 454 were considered. Contig assembly was done using RNAspades 4 . Contigs with more than 500 nucleotides were kept for consideration. 107,131 contigs were accepted in total. To generate a reasonable reference genome for our experiment, we searched the lituerature associated with the metatranscriptomic dataset. We determined the set of phyla that could be represented in the metatranscriptomic and metaproteomic datasets and downloaded all possible genomes from RefSeq. We downloaded genomes for bacteria, protists, fungi, archaea, and metazoa. Contigs were aligned to the genomes using STAR aligner 5 for long read alignment. The top 27 genomes were selected based on number of contigs mapped. The mapped contig coordinates for these genomes were retained. The concatenated 27 genomes became our ‘reference’ genome. Variant calling was done using mpileup from samtools. Since the reference genome was not likely to exactly match the species sampled in the metatranscriptomic experiments, we found many variants. In total, our database contained 14,451 mutations and 4,996 splice junctions. While we don’t expect splicing to exist in many of these species, the splice junctions may represent chromosomal changes between the observed organism and the reference. Accession Class gi|478476202|ref|NC_020912.1|Pseudomonas aeruginosa B136-33 Gammaproteobacteria gi|805557611|ref|NZ_LATE01000150.1|Sinorhizobium sp. PC2 B077DRAFT_scf7180000000478_quiver.150_C Alphaproteobacteria gi|83591340|ref|NC_007643.1|Rhodospirillum rubrum ATCC 11170 chromosome Alphaproteobacteria gi|820953769|ref|NZ_CP010979.1|Pseudomonas putida S13.1.2 Gammaproteobacteria gi|808032638|ref|NZ_CP011018.1|Escherichia coli strain CI5 Gammaproteobacteria gi|262193326|ref|NC_013440.1|Haliangium ochraceum DSM 14365 Deltaproteobacteria gi|389578211|ref|NZ_CM001488.1|Desulfobacter postgatei 2ac9 chromosome Deltaproteobacteria gi|507121141|ref|NZ_CM001773.1|Providencia sneebia DSM 19967 chromosome Gammaproteobacteria gi|114568554|ref|NC_008347.1|Maricaulis maris MCS10 Alphaproteobacteria gi|511097871|ref|NZ_CM001871.1|Xanthomonas campestris pv. campestris str. CN14 chromosome Gammaproteobacteria gi|655514693|ref|NZ_KI867150.1|Syntrophorhabdus aromaticivorans UI SynarDRAFT_SAI.2 Deltaproteobacteria gi|803453125|ref|NZ_JZXD01000120.1|Sinorhizobium meliloti strain L5-30 contig120 Alphaproteobacteria gi|383760955|ref|NC_017079.1|Caldilinea aerophila, complete genome Chloroflexi gi|320539756|ref|NZ_GL636115.1|Serratia symbiotica str. Tucson scaffold00283 Gammaproteobacteria gi|514340177|ref|NZ_KE150238.1|Bilophila wadsworthia 3_1_6 acCls-supercont2.1 Deltaproteobacteria gi|690979824|ref|NZ_KK737786.1|Acinetobacter baumanii BIDMC 57 aeebj-supercont1.1 Gammaproteobacteria gi|757660460|ref|NZ_KK214763.1|Enterobacter sp. BWH 27 adINx-supercont1.1 Gammaproteobacteria gi|148552929|ref|NC_009511.1|Sphingomonas wittichii RW1 Alphaproteobacteria gi|757650965|ref|NZ_KI973322.1|Enterobacter sp. MGH 2 adjFw-supercont2.1 Gammaproteobacteria gi|740145672|ref|NZ_AUNC01000049.1|Thalassospira permensis NBRC 106175 contig53 Alphaproteobacteria gi|820902542|ref|NZ_CP011254.1|Serratia fonticola strain DSM 4576 Gammaproteobacteria gi|739702482|ref|NZ_JNFC01000035.1|Sphingopyxis sp. LC363 contig40 Alphaproteobacteria gi|150395228|ref|NC_009636.1|Sinorhizobium medicae WSM419 chromosome Alphaproteobacteria gi|820948705|ref|NZ_JYFZ02000001.1|Citrobacter freundii strain MRSN 12115 scaffold00001 Gammaproteobacteria gi|482803487|ref|NZ_AJPW01000118.1|Bradyrhizobium elkanii CCBAU 43297 Scaffold1.contig123 Alphaproteobacteria gi|371486894|ref|NC_016147.2|Pseudoxanthomonas spadix BD-a59 Gammaproteobacteria gi|183236918|ref|NW_001915840.1|Entamoeba histolytica Amoebozoa Enosi creates a compact sequence database containing all possible peptides informed by the genome, variant calls, and splice junctions. After database construction, Enosi performs peptide identification using MSGF+ 6 . From the dataset available on ProteomeXchange, 52,641 MS/MS spectra were selected for our proof-of-concept study. Spectra were filtered to 5% FDR using the target-decoy approach. Clusters of co-located peptides were grouped together, with a maximum distance between any pair of peptides in a cluster set to 1000 base pairs. If a reference proteome is provided, we can assign ‘event types’ to the clusters of peptides. Mass spectra from soil samples collected at the Stordalen Mire in Sweden were downloaded from PRIDE (PXD000410). We then chose RNA-seq experiments from arctic peat soil samples (SRA identifier SRP014474) as our simulated ‘paired’ transcriptomic sample 3 Transcriptome We selected 27 bacterial and eukaryotic genomes that share similarity to the transcriptome data. The combined genome was used as the ‘reference’. Genome

Upload: others

Post on 23-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Harnessing public data repositories for metaproteomics ... · Harnessing public data repositories for metaproteomics using Enosi NatalieE’Castellana*,Andrey’D.’Prjibelski +,’Dmitry’An:pov’

Harnessing public data repositories for metaproteomics using Enosi Natalie  E  Castellana*,  Andrey  D.  Prjibelski+,  Dmitry  An:pov+  

 *Digital  Proteomic,  LLC,  San  Diego,  CA,  +Center  for  Algorithmic  Biotechnology,  St.  Petersburg  State  University,  St.  Petersburg,  Russia  

Mo#va#on  and  Background  Metaproteomics is the next frontier in proteomics research, yet bioinformatic tools and resources remain scarce. Lack of fully sequenced genomes or proteomes create an enormous challenge. Often metaproteomics experiments are accompanied by expensive and time-consuming metagenomic and metatranscriptomic experiments that may only incompletely represent the repertoire of proteins in the sample. Alternatively, de novo sequencing of peptides can be used, however, this approach does not report the context of the peptides in a gene. In this study we examine the opportunity for re-use of public data to replace the need for coupled, large-scale metagenomic studies with metaproteomics.

All datasets used in the study are publicly available. The driving questions behind our proof-of-concept study were the following:

Q: Can current proteogenomic technology1,2 be adapted for meta-proteogenomics?

Q: Are matched metagenomic datasets required for metaproteomics?

Q: Can similar (but unmatched) metatranscriptomic data fill the gap between the sequenced, cultured organism and the sampled organism?

Conclusion  

Methods  

Data  

1.  Castellana, et al. (2013). An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell Proteomics, 13, 1:157-67..

2.  Woo, et al. (2013). Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res., 13, 1:21-8.

3.  Tveit, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 1:15-21.

4.  Bankevich, et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 5:455-77.

5.  Dobin, et al. (2013). Bioinformatics 6.  Kim, et al. (2010). The generating function of CID, ETD, and CID/

ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics, 9:2840-52.

1: Transcript Assembly

Enosi is a proteogenomic toolkit provided by Digital Proteomics LLC.

•  Similar, but un-matched, meta-proteogenomic samples exist across the many available databases (NCBI, SRA, ProteomeXchange)

•  The Enosi database construction method effectively condenses sequence information from many RNA-Seq runs. This paves the way to creating organism or disease specific databases from large repositories.

•  Expanding the peptide identification to allow point mutations could potentially boost the identification rate.

•  Enosi accepts mass spectra from all major vendor instruments, both high accuracy and low accuracy, and in common file formats including mzML, mzXML, and mgf.

Proteome  

2: Reference Genome Selection, Variant Calling, Contig Mapping

3: Enosi Database Construction

4: Peptide Identification and Aggregation

Originally, we attempted to map raw sequencing reads to the reference genome, however, high levels of variation prevented us from confidently mapping more than 1% of the reads. As a second pass, we pursued de novo transcript assembly to derive longer transcripts that would result in more confident mapping to the reference genome.

408 million RNA reads from both Illumina and 454 were considered. Contig assembly was done using RNAspades4. Contigs with more than 500 nucleotides were kept for consideration. 107,131 contigs were accepted in total.

To generate a reasonable reference genome for our experiment, we searched the lituerature associated with the metatranscriptomic dataset. We determined the set of phyla that could be represented in the metatranscriptomic and metaproteomic datasets and downloaded all possible genomes from RefSeq. We downloaded genomes for bacteria, protists, fungi, archaea, and metazoa.

Contigs were aligned to the genomes using STAR aligner5 for long read alignment. The top 27 genomes were selected based on number of contigs mapped. The mapped contig coordinates for these genomes were retained. The concatenated 27 genomes became our ‘reference’ genome. Variant calling was done using mpileup from samtools.

Since the reference genome was not likely to exactly match the species sampled in the metatranscriptomic experiments, we found many variants. In total, our database contained 14,451 mutations and 4,996 splice junctions. While we don’t expect splicing to exist in many of these species, the splice junctions may represent chromosomal changes between the observed organism and the reference.

Accession   Class  

gi|478476202|ref|NC_020912.1|Pseudomonas aeruginosa B136-33 Gammaproteobacteria gi|805557611|ref|NZ_LATE01000150.1|Sinorhizobium sp. PC2 B077DRAFT_scf7180000000478_quiver.150_C Alphaproteobacteria

gi|83591340|ref|NC_007643.1|Rhodospirillum rubrum ATCC 11170 chromosome Alphaproteobacteria

gi|820953769|ref|NZ_CP010979.1|Pseudomonas putida S13.1.2 Gammaproteobacteria

gi|808032638|ref|NZ_CP011018.1|Escherichia coli strain CI5 Gammaproteobacteria

gi|262193326|ref|NC_013440.1|Haliangium ochraceum DSM 14365 Deltaproteobacteria

gi|389578211|ref|NZ_CM001488.1|Desulfobacter postgatei 2ac9 chromosome Deltaproteobacteria

gi|507121141|ref|NZ_CM001773.1|Providencia sneebia DSM 19967 chromosome Gammaproteobacteria

gi|114568554|ref|NC_008347.1|Maricaulis maris MCS10 Alphaproteobacteria gi|511097871|ref|NZ_CM001871.1|Xanthomonas campestris pv. campestris str. CN14 chromosome Gammaproteobacteria

gi|655514693|ref|NZ_KI867150.1|Syntrophorhabdus aromaticivorans UI SynarDRAFT_SAI.2 Deltaproteobacteria

gi|803453125|ref|NZ_JZXD01000120.1|Sinorhizobium meliloti strain L5-30 contig120 Alphaproteobacteria

gi|383760955|ref|NC_017079.1|Caldilinea aerophila, complete genome Chloroflexi

gi|320539756|ref|NZ_GL636115.1|Serratia symbiotica str. Tucson scaffold00283 Gammaproteobacteria

gi|514340177|ref|NZ_KE150238.1|Bilophila wadsworthia 3_1_6 acCls-supercont2.1 Deltaproteobacteria

gi|690979824|ref|NZ_KK737786.1|Acinetobacter baumanii BIDMC 57 aeebj-supercont1.1 Gammaproteobacteria

gi|757660460|ref|NZ_KK214763.1|Enterobacter sp. BWH 27 adINx-supercont1.1 Gammaproteobacteria

gi|148552929|ref|NC_009511.1|Sphingomonas wittichii RW1 Alphaproteobacteria

gi|757650965|ref|NZ_KI973322.1|Enterobacter sp. MGH 2 adjFw-supercont2.1 Gammaproteobacteria

gi|740145672|ref|NZ_AUNC01000049.1|Thalassospira permensis NBRC 106175 contig53 Alphaproteobacteria

gi|820902542|ref|NZ_CP011254.1|Serratia fonticola strain DSM 4576 Gammaproteobacteria

gi|739702482|ref|NZ_JNFC01000035.1|Sphingopyxis sp. LC363 contig40 Alphaproteobacteria

gi|150395228|ref|NC_009636.1|Sinorhizobium medicae WSM419 chromosome Alphaproteobacteria gi|820948705|ref|NZ_JYFZ02000001.1|Citrobacter freundii strain MRSN 12115 scaffold00001 Gammaproteobacteria gi|482803487|ref|NZ_AJPW01000118.1|Bradyrhizobium elkanii CCBAU 43297 Scaffold1.contig123 Alphaproteobacteria

gi|371486894|ref|NC_016147.2|Pseudoxanthomonas spadix BD-a59 Gammaproteobacteria

gi|183236918|ref|NW_001915840.1|Entamoeba histolytica Amoebozoa

Enosi creates a compact sequence database containing all possible peptides informed by the genome, variant calls, and splice junctions.

After database construction, Enosi performs peptide identification using MSGF+6. From the dataset available on ProteomeXchange, 52,641 MS/MS spectra were selected for our proof-of-concept study. Spectra were filtered to 5% FDR using the target-decoy approach.

Clusters of co-located peptides were grouped together, with a maximum distance between any pair of peptides in a cluster set to 1000 base pairs. If a reference proteome is provided, we can assign ‘event types’ to the clusters of peptides.

Mass spectra from soil samples collected at the Stordalen Mire in Sweden were downloaded from PRIDE (PXD000410).

We then chose RNA-seq experiments from arctic peat soil samples (SRA identifier SRP014474) as our simulated ‘paired’ transcriptomic sample 3

Transcriptome  

We selected 27 bacterial and eukaryotic genomes that share similarity to the transcriptome data. The combined genome was used as the ‘reference’.

Genome