discovery and annotation of novel proteins from rumen gut metagenomic sequencing data

Post on 12-Apr-2017


Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequencing Data

Mick Watson
The Roslin Institute
Edinburgh Genomics, University of Edinburgh

Edinburgh Genomics
• Genomics facility based at the University of Edinburgh
• Available for collaborations on an academic, non-profit basis
• Formed from merger of
– ARK-Genomics
– The GenePool
• Funded by three major bio UK research councils
• A range of technologies and expertise available
http://genomics.ed.ac.uk

What am I going to talk about?
• “The peril-ome” – perils of studying the microbiome
• Three projects
– Enzyme discovery
– Methane emissions
– Rumen compartments

“THE PERIL-OME” – PERILS OF STUDYING THE MICROBIOME

What is the microbiome?
“the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space”
- Joshua Lederberg

Note: includes fungi, protists, archaea, bacteria, algae, viruses, etc.

(whisper it: most “microbiome” studies only look at bacteria/archaea)

How do we study the microbiome?
• Marker gene vs shotgun metagenomics
• Marker gene
– 16S / 18S / ITS
– Amplify this and compare
• Metagenomics
– Extract all DNA
– Fragment, sequence, interpret

16S reference databases are not accurate
• Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005 71(12):7724-36.

Your 16S reads are not accurate
• Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013 79(17):5112-20.
• Amongst other things, analysed a mock community with different sequencing and bioinformatics strategies
• Three 16S regions sequenced using 2x250bp
– V4 (~250 bp), V34 (430 bp), and V45 (~375 bp) regions
– In the mock community, there should be 20 OTUs

16S sequencing strategy?
• The only strategy that got close to the correct result is complete overlap of 2x250bp MiSeq reads
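Why only V4 gives complete overlap follows from simple arithmetic on the amplicon lengths quoted above. The sketch below is illustrative only: the function name is hypothetical, the lengths are the approximate values from the slide, and primers and quality trimming are ignored.

```python
def paired_overlap(amplicon_len, read_len=250):
    """Bases covered by both reads of a pair (idealised: no primers, no trimming)."""
    overlap = 2 * read_len - amplicon_len
    # The overlap cannot exceed a single read or the amplicon itself.
    return max(0, min(overlap, read_len, amplicon_len))

# Approximate amplicon lengths taken from the slide
for region, length in [("V4", 250), ("V34", 430), ("V45", 375)]:
    print(region, paired_overlap(length))
```

Under these assumptions only the V4 amplicon is read in full by both mates, so every base is sequenced twice.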

Your sample/DNA extraction protocol has an influence

“we found that each DNA extraction method resulted in unique community patterns”

“We observed significant differences in distribution of bacterial taxa depending on the method.”

Freezing your sample risks losing Bacteroidetes

“Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples”

Your reagents are contaminated

• Sequenced a pure culture of Salmonella bongori

• Extracted DNA using different kits
• Did serial dilutions of the pure culture to assess impact of contaminating species

• Studying the microbiome is hard

• Please proceed carefully

WHY DO WE STUDY THE RUMEN?

Why do we study the rumen?
• Energy from food
"Our results indicate that the obese microbiome has an increased capacity to harvest energy from the diet. Furthermore, this trait is transmissible: colonization of germ-free mice with an 'obese microbiota' results in a significantly greater increase in total body fat than colonization with a 'lean microbiota'"
Turnbaugh et al (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444(7122):1027-31

• Novel enzyme discovery
"An initial assembly of the metagenomic sequence resulted in 179,092 scaffolds... Only 47 (0.03%) of the assembled scaffolds showed high levels of similarity to previously sequenced genomes available in GenBank. These results suggest that the vast majority of the assembled scaffolds represent segments of hitherto uncharacterized microbial genomes."
Hess M et al (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 331(6016):463-7.

• Methane emissions
Globally, ruminant livestock produce about 80 million metric tons of methane annually, accounting for about 28% of global methane emissions from human-related activities.

PROJECT 1: NOVEL ENZYME DISCOVERY

What did we sequence?

Sample    Desc                        #Reads (millions)   Read type   Gbp
Ag2       Sheep, highland pasture      61.84              100x2        12.37
Bg2       Sheep, highland pasture      87.12              100x2        17.42
1099_C1   Cattle, maize silage         56.60              100x2        11.32
1043_C2   Cattle, maize silage         55.89              100x2        11.18
1033_C1   Cattle, maize silage         63.60              100x2        12.72
983       Cattle, maize silage        217.79              100x2        43.56
D1a       Red Deer, rough grazing     149.51              150x2        29.90
D2a       Red Deer, rough grazing     125.77              150x2        25.15
D3b       Red Deer, rough grazing     171.13              150x2        34.23
D4b       Red Deer, rough grazing     160.55              150x2        32.11
R1b       Reindeer, Summer Pasture    149.40              150x2        29.88
R2b       Reindeer, Summer Pasture    209.29              150x2        41.86
Total                                                                 301.70
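For the 100x2 libraries, the Gbp column is consistent with treating #Reads as millions of read pairs. A minimal sketch of that conversion, assuming exactly that interpretation (the function name is hypothetical):

```python
def gbp(read_pairs_millions, read_len, mates=2):
    """Sequenced gigabases from millions of read pairs and the per-read length."""
    # millions of pairs x mates x bp per read = Mbp; divide by 1000 for Gbp
    return read_pairs_millions * mates * read_len / 1000.0

# Values for the two sheep samples (Ag2, Bg2) from the table
print(round(gbp(61.84, 100), 2))
print(round(gbp(87.12, 100), 2))
```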

Assembly protocol
• Trim reads to Q30 (sickle)
• Assemble using Velvet
• Manual inspection of coverage peaks
• Re-assemble using MetaVelvet
• At this stage, no optimisation for K (used K:51)

Taxon assignment
• Tried to assign scaffolds based on similarity to existing genomes
• What cut-off did we use? Using megablast, require
– HSP of at least 100bp
– % identity of 80%
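The cut-off above can be sketched as a filter over tabular BLAST output. This assumes the standard 12-column tab-separated layout (query, subject, % identity, alignment length, ...) and treats 80% as a minimum; the slide does not say which output format was used, and the function name and example rows are hypothetical.

```python
def scaffolds_with_hits(blast_lines, min_len=100, min_identity=80.0):
    """Return query IDs with at least one HSP passing both thresholds."""
    passing = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        pident = float(fields[2])   # column 3: percent identity
        length = int(fields[3])     # column 4: alignment (HSP) length
        if length >= min_len and pident >= min_identity:
            passing.add(query)
    return passing

# Made-up rows for illustration only
example = [
    "scaffold_1\tref_A\t92.5\t350\t20\t1\t1\t350\t100\t449\t1e-120\t430",
    "scaffold_2\tref_B\t75.0\t400\t90\t5\t1\t400\t1\t400\t1e-40\t160",
    "scaffold_3\tref_C\t99.0\t60\t1\t0\t1\t60\t1\t60\t1e-20\t100",
]
print(sorted(scaffolds_with_hits(example)))
```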

Sample            N50    Total       Number   Max      Hits    %
557_1 Ag2         2502   171080118    73968   250047    5867    7.93
557_2 Bg2         2620   359972055   153624   152301   12770    8.31
557_3 1099_C1     1518   107617445    68547    53793    4842    7.06
557_4 1043_C2     1623    50054937    29157    54895    2963   10.16
557_5 1033_C1     1604   129661930    77631    89904    6445    8.30
557_6 983         1432    54430150    35961    37263    1954    5.43
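For reference, the two derived columns can be reproduced as follows. The N50 function uses the usual definition (the length L such that scaffolds of length at least L contain half of the assembled bases); reading the % column as Hits divided by Number is an inference from the figures, not something the slide states.

```python
def n50(lengths):
    """Length of the scaffold at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

def percent_with_hits(hits, number):
    """Share of scaffolds that received a taxonomic hit, in percent."""
    return 100.0 * hits / number

# Toy scaffold lengths, then the Ag2 row from the table
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
print(round(percent_with_hits(5867, 73968), 2))
```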

Log coverage vs %GC

Gene predictions and domains

[Bar chart, y-axis 0 to 600000: Proteins, With Domain, Total domains and Unique domains for sheep_1, sheep_2, cow_1 to cow_4, Reindeer_1 and deer_1 to deer_4]

Sample       Proteins   With Domain   Total domains   Unique domains
sheep_1       262578      109432         294242          12972
sheep_2       534761      217719         566000          13517
cow_1         302624      105701         248925          13072
cow_2         213662       83562         200267          13031
cow_3         355222      127535         302140          13298
cow_4         218302       76966         182265          12723
Reindeer_1    411158      165709         420309          13566
deer_1        492563      194275         465101          13572
deer_2        199967       77375         185724          13017
deer_3        340798      139906         342889          13477
deer_4        414010      165756         413926          13540
Total        3745645     1463936        3621788         145785
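A quick way to read this table is the share of predicted proteins that carry at least one Pfam domain. A small sketch using three rows copied from the table (the dictionary layout is just for illustration):

```python
# (proteins, proteins with at least one domain), copied from the table
counts = {
    "sheep_1": (262578, 109432),
    "cow_4": (218302, 76966),
    "Total": (3745645, 1463936),
}

def domain_fraction(proteins, with_domain):
    """Fraction of predicted proteins that have any Pfam domain."""
    return with_domain / proteins

for sample, (proteins, with_domain) in counts.items():
    print(f"{sample}: {100 * domain_fraction(proteins, with_domain):.1f}%")
```

Across all samples the fraction is a little under 40%, so most predicted proteins have no recognised domain.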

Gene prediction protocol
• Extracted long ORFs (> 200bp) – also use Glimmer-MG
• Translate
• Compare to Pfam
– Uses pfam_scan.pl -> hmmpfam (HMMER)
• Typical output: 801aa protein
• Involved in Fe transport
• 54% identical, 72% positive to previously sequenced protein
– ferrous iron transporter B [Odoribacter laneus]
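The first two steps (extract long ORFs, translate) can be sketched in plain Python. This is not the pipeline used in the talk: it assumes an ORF runs from an ATG to the next in-frame stop codon, searches all six frames, and uses the standard genetic code; the slide does not say how ORFs were defined.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate in frame 0; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def long_orfs(seq, min_len=200):
    """Proteins from ATG..stop ORFs longer than min_len bp, in all six frames."""
    seq = seq.upper()
    proteins = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    orf = strand[start:i + 3]
                    if len(orf) > min_len:
                        proteins.append(translate(orf[:-3]))  # drop stop codon
                    start = None
    return proteins
```

The translated sequences would then be passed to pfam_scan.pl as described above.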

Clustering of Pfam families:

Clustering of known taxonomy

Cellulose degradation
• Cellulose -> Glucose
• Turning plants into fuel – what ruminants are good at!
• Focus of e.g. biofuels

Enzyme class      EC#            Pfam                       Number found
Cellulase         Various        Cellulase (PF00150)        981
Endoglucanases    EC 3.2.1.4     Glyco_hydro_45 (PF02015)    70
Exoglucanases     EC 3.2.1.91    Glyco_hydro_48 (PF02011)    18
β-glucosidases    EC 3.2.1.21    Glyco_hydro_1 (PF00232)    273
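Counts like those in the table can be obtained by tallying Pfam accessions in the domain search output. The sketch below assumes whitespace-separated pfam_scan.pl-style rows with the HMM accession in the sixth column; that column position, the function name and the example rows are assumptions, so check them against your own output.

```python
from collections import Counter

# Pfam accessions from the table above
CELLULOSE_FAMILIES = {
    "PF00150": "Cellulase",
    "PF02015": "Glyco_hydro_45",
    "PF02011": "Glyco_hydro_48",
    "PF00232": "Glyco_hydro_1",
}

def count_families(lines, families=CELLULOSE_FAMILIES):
    """Count domain hits per family of interest."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 6:
            continue  # not a domain row
        accession = fields[5].split(".")[0]  # strip version, e.g. PF00150.20
        if accession in families:
            counts[families[accession]] += 1
    return counts

# Made-up rows for illustration only
example = [
    "# <seq id> <aln start> <aln end> <env start> <env end> <hmm acc> ...",
    "prot_1 10 290 8 295 PF00150.20 Cellulase Domain 1 280 281 150.2 1.0e-44 1 CL0058",
    "prot_2 5 450 3 455 PF00232.21 Glyco_hydro_1 Domain 1 450 453 300.1 2.0e-90 1 CL0058",
    "prot_3 1 100 1 100 PF00005.30 ABC_tran Domain 1 137 137 80.0 1.0e-22 1 CL0023",
]
print(count_families(example))
```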

PROJECT 2: METHANE EMISSIONS

Methane production
• Methane is a natural product of anaerobic microbial fermentation
– Rumen is anaerobic
• Methane is a greenhouse gas (GHG) with a global warming potential 25-fold that of carbon dioxide (IPCC 2006).
• Ruminants are the major producers of methane emissions from anthropogenic activities,
– accounting for 37% of total GHG from agriculture in the UK
• Methane emissions from cattle are entirely microbial in origin

Our data set
• Steers chosen from longitudinal study
• Chose high and low methane emitters matched for breed and diet
• Submitted for metagenomic sequencing
• Approx. 11Gb per sample

Relationship to archaeal abundance
• Mapped metagenomic reads to Greengenes database
• Recorded all hits in database that are as good as best hit
• Calculated lowest common taxon (in this case, Kingdom)
• Matched for breed and diet, high methane correlates with high archaeal abundance
• qPCR confirms this
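The "all hits as good as the best hit, then lowest common taxon" step can be sketched as below. This is a minimal illustration, assuming each hit carries a root-to-leaf lineage and a comparable alignment score; the function names, lineages and scores are made up.

```python
def best_hits(hits):
    """Keep the lineage of every hit whose score ties with the top score."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    return [lineage for lineage, score in hits if score == top]

def lowest_common_taxon(lineages):
    """Longest shared prefix of the lineages, from kingdom downwards."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        common.append(ranks[0])
    return common

# Illustrative read with two tied best hits and one weaker hit
hits = [
    (("Archaea", "Euryarchaeota", "Methanobacteria"), 98),
    (("Archaea", "Euryarchaeota", "Thermoplasmata"), 98),
    (("Bacteria", "Firmicutes", "Clostridia"), 90),
]
shared = lowest_common_taxon(best_hits(hits))
kingdom = shared[0] if shared else None  # only the first rank is needed here
print(shared, kingdom)
```

Counting reads per kingdom in this way gives the archaeal abundance per sample that is compared against methane output.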

Relationship to enzyme abundance
• Mapped metagenomic reads to KEGG
• Matched for breed and diet, the abundance of several enzymes is associated with methane production

Methane pathway
• Fig on left is from Shi et al Genome Research 24(9):1517-25
• Fig on right is same enzymes in our data set, matched for breed and diet

What’s in there?
• Assembled all 8 metagenomes with MetaVelvet
– Predicted genes with Prokka
– Annotated using Pfam domains
• 1.5 million gene/protein predictions
• Less than half have any known domain
• From 44 KEGG orthologues
– 7021 in our data
– 5942 unique protein sequences
• Only 29 have exact match in NR
• Only 60 are 100% conserved
• At 90% identity, 807 / 5942 have hit
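The step from 7021 predictions to 5942 unique sequences is a plain de-duplication, and the novelty figures are easier to compare as percentages. A small sketch; the helper names and the toy records are hypothetical, the counts come from the slide.

```python
def unique_sequences(records):
    """Group (id, sequence) pairs by identical sequence."""
    groups = {}
    for seq_id, seq in records:
        groups.setdefault(seq, []).append(seq_id)
    return groups

def percent(part, whole):
    return 100.0 * part / whole

# Toy records: two predictions share the same protein sequence
toy = [("gene_1", "MKLV"), ("gene_2", "MKLV"), ("gene_3", "MSTA")]
print(len(unique_sequences(toy)))

# Figures from the slide
print(f"unique among predictions: {percent(5942, 7021):.1f}%")
print(f"with a hit at 90% identity: {percent(807, 5942):.1f}%")
```

By this arithmetic, well over 80% of the unique sequences have no hit in NR at 90% identity.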

PROJECT 3: RUMEN COMPARTMENTS

Aims
• New project, BBSRC CASE studentship
• Sequenced 4 rumen compartments from 4 cows
• Question: do samples cluster by rumen compartment, by cow, or neither?

qPCR of archaea/bacteria ratio

Kraken on metagenomics data
• Classified < 2% of our data!

Novoalign reads against GREENGENES

PCA of genera abundances (novoalign)

Prokka gene prediction results

Sample      CDS       CDS ≥100aa
557N0200    245580    174599
557N0201    343655    246490
557N0202    339668    249712
557N0203    523740    391531
557N0204    411384    303543
557N0205    367806    260613
557N0206    295141    211776
557N0207    330008    243615
557N0208    236079    172023
557N0209    248166    180944
557N0210    285507    207805
557N0211    163064    102805
Totals     3789798   2745456

Ongoing work
• How many proteins are novel?
– Compare to nr
• How many proteins in common?
– Within datasets
– Between datasets
• Annotation of protein domains
• Load data into Meta4 database (http://dx.doi.org/10.3389/fgene.2013.00168)
• Identify putative enzymes of interest
• Sequence and analysis of additional rumen samples
• Assessment of additional software tools
– Xander – focused extraction of genes from metagenomics data
– ShortBRED – functional characterisation of metagenomics data

Follow me:
Twitter: @BioMickWatson
Blog: biomickwatson.wordpress.com

Acknowledgements
Funders: BBSRC, Roslin Foundation, TSB
People: Despoina Roumpeka, Rainer Roehe, John Wallace, Edinburgh Genomics
Edinburgh Genomics: http://genomics.ed.ac.uk
The Roslin Institute: http://www.roslin.ed.ac.uk
