Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequencing Data

Mick Watson (The Roslin Institute; Edinburgh Genomics, University of Edinburgh)

Post on 12-Apr-2017


Page 1

Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequencing Data

Mick Watson
The Roslin Institute

Edinburgh Genomics
University of Edinburgh

Page 2

Edinburgh Genomics
• Genomics facility based at the University of Edinburgh
• Available for collaborations on an academic, non-profit basis
• Formed from merger of
  – ARK-Genomics
  – The GenePool
• Funded by three major UK bio research councils
• A range of technologies and expertise available

http://genomics.ed.ac.uk

Page 3

What am I going to talk about?
• “The peril-ome” – perils of studying the microbiome
• Three projects
  – Enzyme discovery
  – Methane emissions
  – Rumen compartments

Page 4

“THE PERIL-OME” – PERILS OF STUDYING THE MICROBIOME

Page 5

What is the microbiome?

“the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space”

- Joshua Lederberg

Note: includes fungi, protists, archaea, bacteria, algae, viruses etc.

(whisper it: most “microbiome” studies only look at bacteria/archaea)

Page 6

How do we study the microbiome?
• Marker gene vs shotgun metagenomics
• Marker gene
  – 16S / 18S / ITS
  – Amplify this and compare
• Metagenomics
  – Extract all DNA
  – Fragment, sequence, interpret

Page 8

• Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005 71(12):7724-36.

16S reference databases are not accurate

Page 9

Your 16S reads are not accurate
• Amongst other things, analysed a mock community with different sequencing and bioinformatics strategies
• Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013 79(17):5112-20.

Page 10

• Three 16S regions sequenced using 2x250bp
  – V4 (~250 bp), V34 (430 bp), and V45 regions (~375 bp)
  – In the Mock community, there should be 20 OTUs

Page 11

16S sequencing strategy?
• The only strategy that got close to the correct result is complete overlap of 2x250bp MiSeq reads
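
Why "complete overlap" singles out V4 can be seen with a little arithmetic. The sketch below is an illustration only: it assumes both reads are full length and start at opposite ends of the amplicon, and it uses the approximate region lengths quoted on the previous slide. The function name `paired_overlap` is made up for this example.

```python
def paired_overlap(amplicon_len: int, read_len: int = 250) -> int:
    """Number of amplicon bases covered by both reads of a pair.

    Assumes full-length reads starting at opposite amplicon ends.
    """
    overlap = 2 * read_len - amplicon_len
    # Overlap cannot be negative, and cannot exceed the amplicon itself.
    return max(0, min(overlap, amplicon_len))


# Approximate region lengths as quoted on the slide.
regions = {"V4": 250, "V34": 430, "V45": 375}

for name, length in regions.items():
    ov = paired_overlap(length)
    print(f"{name}: {ov} bp overlap ({ov / length:.0%} of amplicon)")
```

Under these assumptions V4 is read twice across its whole length, while V34 and V45 are double-covered over only about 16% and 33% of the amplicon, so most of their bases rest on a single read.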

Page 12

Your sample/DNA extraction protocol has an influence

“we found that each DNA extraction method resulted in unique community patterns”

“We observed significant differences in distribution of bacterial taxa depending on the method.”

Page 13

Freezing your sample risks losing Bacteroidetes

“Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples”

Page 14

Your reagents are contaminated
• Sequenced a pure culture of Salmonella bongori
• Extracted DNA using different kits
• Did serial dilutions of the pure culture to assess impact of contaminating species

Page 15

Page 16

• Studying the microbiome is hard

• Please proceed carefully

Page 17

WHY DO WE STUDY THE RUMEN?

Page 18

Why do we study the rumen?

• Energy from food
"Our results indicate that the obese microbiome has an increased capacity to harvest energy from the diet. Furthermore, this trait is transmissible: colonization of germ-free mice with an 'obese microbiota' results in a significantly greater increase in total body fat than colonization with a 'lean microbiota'"
Turnbaugh et al (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444(7122):1027-31

• Novel enzyme discovery
"An initial assembly of the metagenomic sequence resulted in 179,092 scaffolds... Only 47 (0.03%) of the assembled scaffolds showed high levels of similarity to previously sequenced genomes available in GenBank. These results suggest that the vast majority of the assembled scaffolds represent segments of hitherto uncharacterized microbial genomes."
Hess M et al (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 331(6016):463-7.

• Methane emissions
Globally, ruminant livestock produce about 80 million metric tons of methane annually, accounting for about 28% of global methane emissions from human-related activities.

Page 19

PROJECT 1: NOVEL ENZYME DISCOVERY

Page 20

What did we sequence?

Sample    Desc                        #Reads (millions)   Read type   Gbp
Ag2       Sheep, highland pasture     61.84               100x2       12.37
Bg2       Sheep, highland pasture     87.12               100x2       17.42
1099_C1   Cattle, maize silage        56.60               100x2       11.32
1043_C2   Cattle, maize silage        55.89               100x2       11.18
1033_C1   Cattle, maize silage        63.60               100x2       12.72
983       Cattle, maize silage        217.79              100x2       43.56
D1a       Red Deer, rough grazing     149.51              150x2       29.90
D2a       Red Deer, rough grazing     125.77              150x2       25.15
D3b       Red Deer, rough grazing     171.13              150x2       34.23
D4b       Red Deer, rough grazing     160.55              150x2       32.11
R1b       Reindeer, Summer Pasture    149.40              150x2       29.88
R2b       Reindeer, Summer Pasture    209.29              150x2       41.86
Total                                                                 301.70

Page 21

Assembly protocol
• Trim reads to Q30 (sickle)
• Assemble using Velvet
• Manual inspection of coverage peaks
• Re-assemble using MetaVelvet
• At this stage, no optimisation for K (used K=51)
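
To make the first step concrete, here is a minimal sketch of sliding-window quality trimming at a Q30 threshold. It is not sickle's actual algorithm or code: the window size of 10, the 3'-only trimming and the function names are assumptions chosen for illustration, and qualities are assumed to be Phred+33 encoded.

```python
def phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(c) - 33 for c in quality_string]


def trim_3prime(seq: str, quals: list[int], threshold: int = 30,
                window: int = 10) -> tuple[str, list[int]]:
    """Trim the 3' end where a window's mean quality first drops below threshold.

    Simplified illustration: the read is cut at the first low-quality base
    inside the first failing window.
    """
    n = len(seq)
    if n < window:
        # Short read: treat the whole read as a single window.
        keep = n if n and sum(quals) / n >= threshold else 0
        return seq[:keep], quals[:keep]
    for start in range(n - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < threshold:
            # A failing mean guarantees at least one base below threshold.
            cut = start + next(i for i, q in enumerate(win) if q < threshold)
            return seq[:cut], quals[:cut]
    return seq, quals


read = "ACGT" * 5
scores = [38] * 12 + [10] * 8   # good start, poor tail
trimmed_seq, trimmed_quals = trim_3prime(read, scores)
print(len(read), "->", len(trimmed_seq))
```

In practice the trimmed reads, not a Python script, would feed Velvet; the point here is only what "trim to Q30" means at the level of a single read.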

Page 22

Taxon assignment
• Tried to assign scaffolds based on similarity to existing genomes
• What cut-off did we use? Using megablast, require
  – HSP of at least 100bp
  – % identity of 80%

ID      Sample    N50    Total       Number   Max      Hits    %
557_1   Ag2       2502   171080118   73968    250047   5867    7.93
557_2   Bg2       2620   359972055   153624   152301   12770   8.31
557_3   1099_C1   1518   107617445   68547    53793    4842    7.06
557_4   1043_C2   1623   50054937    29157    54895    2963    10.16
557_5   1033_C1   1604   129661930   77631    89904    6445    8.30
557_6   983       1432   54430150    35961    37263    1954    5.43
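
The columns in this table can be reproduced with a few lines of code. The sketch below is a hedged illustration, not the pipeline used for the talk: it assumes megablast was run with tabular output (`-outfmt 6`, where column 3 is percent identity and column 4 is alignment length), treats the 80% cut-off as "at least 80%", and uses made-up function names. The "%" column in the table equals Hits divided by Number.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def scaffolds_with_hits(blast_lines, min_len=100, min_identity=80.0):
    """Return query IDs with at least one HSP passing both cut-offs.

    Expects BLAST tabular lines: qseqid, sseqid, pident, length, ...
    """
    passing = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, pident, aln_len = fields[0], float(fields[2]), int(fields[3])
        if aln_len >= min_len and pident >= min_identity:
            passing.add(qseqid)
    return passing


# Toy input: only scaffold s1 passes both cut-offs.
toy = ["s1\tg1\t85.0\t150", "s2\tg1\t79.9\t500", "s3\tg2\t95.0\t99"]
print(scaffolds_with_hits(toy))

# Reproducing the "%" column for sample Ag2 from the table.
print(round(100 * 5867 / 73968, 2))
```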

Page 23

Log coverage vs %GC

Page 24

Gene predictions and domains

[Figure: bar chart per sample (sheep_1, sheep_2, cow_1 to cow_4, Reindeer_1, deer_1 to deer_4) of Proteins, With Domain, Total domains and Unique domains; y-axis 0 to 600000]

Sample       Proteins   With Domain   Total domains   Unique domains
sheep_1      262578     109432        294242          12972
sheep_2      534761     217719        566000          13517
cow_1        302624     105701        248925          13072
cow_2        213662     83562         200267          13031
cow_3        355222     127535        302140          13298
cow_4        218302     76966         182265          12723
Reindeer_1   411158     165709        420309          13566
deer_1       492563     194275        465101          13572
deer_2       199967     77375         185724          13017
deer_3       340798     139906        342889          13477
deer_4       414010     165756        413926          13540
Total        3745645    1463936       3621788         145785

Page 25

Gene prediction protocol
• Extracted long ORFs (> 200bp) – also use Glimmer-MG
• Translate
• Compare to Pfam
  – Uses pfam_scan.pl -> hmmpfam (HMMER)
• Typical output: 801aa protein
  • Involved in Fe transport
  • 54% identical, 72% positive to previously sequenced protein
    – ferrous iron transporter B [Odoribacter laneus]
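
The "extract long ORFs, then translate" steps can be sketched as follows. This is an illustration under stated assumptions, not the code used for the talk: an ORF is taken to be any stop-free stretch in one of the six reading frames (no start codon required, which suits fragmented metagenomic scaffolds), the standard genetic code is used, and the function names are invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def long_orfs(seq: str, min_len: int = 200):
    """Yield (strand, frame, protein) for stop-free stretches longer than min_len nt."""
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for piece in translate(s[frame:]).split("*"):
                if len(piece) * 3 > min_len:
                    yield strand, frame, piece


scaffold = "ATG" + "GCT" * 70 + "TAA"
for strand, frame, protein in long_orfs(scaffold):
    print(strand, frame, len(protein))
```

The resulting protein strings are what would then be compared to Pfam. A dedicated gene finder such as Glimmer-MG, mentioned on the slide, applies a trained model rather than this simple length rule.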

Page 26

Clustering of Pfam families:

Page 27

Clustering of known taxonomy

Page 28

Cellulose degradation
• Cellulose -> Glucose
• Turning plants into fuel – what ruminants are good at!
• Focus of e.g. biofuels

Enzyme class     EC#            Pfam                       Number found
Cellulase        Various        Cellulase (PF00150)        981
Endoglucanases   EC 3.2.1.4     Glyco_hydro_45 (PF02015)   70
Exoglucanases    EC 3.2.1.91    Glyco_hydro_48 (PF02011)   18
β-glucosidases   EC 3.2.1.21    Glyco_hydro_1 (PF00232)    273
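
A "Number found" column like the one above could be produced by tallying predicted proteins per Pfam family. The sketch below is only an illustration: the input format (pairs of protein ID and Pfam accession) is hypothetical, and it assumes each protein is counted once per family, which the slide does not state.

```python
from collections import Counter

# Pfam accessions and names taken from the table above.
CELLULOSE_PFAMS = {
    "PF00150": "Cellulase",
    "PF02015": "Glyco_hydro_45",
    "PF02011": "Glyco_hydro_48",
    "PF00232": "Glyco_hydro_1",
}


def count_proteins_per_family(domain_hits, families=CELLULOSE_PFAMS):
    """Count distinct proteins carrying each family of interest.

    domain_hits: iterable of (protein_id, pfam_accession) pairs (hypothetical
    format); a version suffix such as 'PF00150.20' is tolerated.
    """
    seen = set()
    for protein_id, accession in domain_hits:
        acc = accession.split(".")[0]
        if acc in families:
            seen.add((protein_id, acc))
    counts = Counter(acc for _, acc in seen)
    return {families[acc]: counts.get(acc, 0) for acc in families}


toy_hits = [("p1", "PF00150.20"), ("p1", "PF00150.20"),
            ("p2", "PF00232"), ("p3", "PF99999")]
print(count_proteins_per_family(toy_hits))
```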

Page 29

PROJECT 2: METHANE EMISSIONS

Page 30

Methane production
• Methane is a natural product of anaerobic microbial fermentation
  – Rumen is anaerobic
• Methane is a greenhouse gas (GHG) with a global warming potential 25-fold that of carbon dioxide (IPCC 2006).
• Ruminants are the major producers of methane emissions from anthropogenic activities,
  – accounting for 37% of total GHG from agriculture in the UK
• Methane emissions from cattle are entirely microbial in origin

Page 31

Our data set
• Steers chosen from longitudinal study
• Chose high and low methane emitters matched for breed and diet
• Submitted for metagenomic sequencing
• Approx. 11Gb per sample

Page 32

Relationship to archaeal abundance
• Mapped metagenomic reads to Greengenes database
• Recorded all hits in database that are as good as best hit
• Calculated lowest common taxon (in this case, Kingdom)
• Matched for breed and diet, high methane correlates with high archaeal abundance
• qPCR confirms this
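
The "lowest common taxon" step can be sketched as the longest shared prefix of the lineages of all equally good hits. This is a minimal illustration, assuming Greengenes-style lineage strings with semicolon-separated ranks (kingdom first); the function names and example lineages are made up for the sketch.

```python
def parse_lineage(lineage: str) -> list[str]:
    """Split a semicolon-separated lineage string into ranks, kingdom first."""
    return [rank.strip() for rank in lineage.split(";") if rank.strip()]


def lowest_common_taxon(lineages: list[list[str]]) -> list[str]:
    """Return the ranks shared by every lineage, from kingdom downwards."""
    if not lineages:
        return []
    common = list(lineages[0])
    for lineage in lineages[1:]:
        shared = 0
        for a, b in zip(common, lineage):
            if a != b:
                break
            shared += 1
        common = common[:shared]
    return common


# Two tied best hits for one read: they agree down to phylum only.
tied_hits = [
    parse_lineage("k__Archaea; p__Euryarchaeota; c__Methanobacteria"),
    parse_lineage("k__Archaea; p__Euryarchaeota; c__Thermoplasmata"),
]
print(lowest_common_taxon(tied_hits))
```

Summarising at Kingdom level, as on the slide, then amounts to keeping only the first element of the result and counting reads per kingdom.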

Page 33

Relationship to enzyme abundance
• Mapped metagenomic reads to KEGG
• Matched for breed and diet, the abundance of several enzymes is associated with methane production

Page 34

Relationship to enzyme abundance
• Mapped metagenomic reads to KEGG
• Matched for breed and diet, the abundance of several enzymes is associated with methane production

Page 35

Methane pathway
• Fig on left is from Shi et al Genome Research 24(9):1517-25
• Fig on right is same enzymes in our data set, matched for breed and diet

Page 36

What’s in there?
• Assembled all 8 metagenomes with MetaVelvet
  – Predicted genes with Prokka
  – Annotated using Pfam domains
• 1.5 million gene/protein predictions
• Less than half have any known domain
• From 44 KEGG orthologues
  – 7021 in our data
  – 5942 unique protein sequences
• Only 29 have exact match in NR
• Only 60 are 100% conserved
• At 90% identity, 807 / 5942 have hit
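
To put these counts in proportion, here is a small sketch of the two operations behind them: collapsing identical protein sequences and expressing a count as a percentage. It assumes the 5942 unique sequences are the distinct sequences among the 7021 proteins, which is one reading of the slide; the helper names are invented for this example.

```python
def unique_sequences(proteins):
    """Collapse identical protein sequences, keeping first-seen order."""
    return list(dict.fromkeys(proteins))


def percent(part, whole):
    return 100.0 * part / whole


# Toy illustration of de-duplication.
print(unique_sequences(["MKT", "MAA", "MKT"]))

# Counts quoted on the slide.
total_proteins, unique_proteins, hits_at_90 = 7021, 5942, 807
print(f"unique among all: {percent(unique_proteins, total_proteins):.1f}%")
print(f"unique with a hit at 90% identity: {percent(hits_at_90, unique_proteins):.1f}%")
```

On the slide's own numbers, roughly 14% of the unique sequences have a hit at 90% identity, so the large majority have no close match.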

Page 37

Page 38

PROJECT 3: RUMEN COMPARTMENTS

Page 39

Aims
• New project, BBSRC CASE studentship
• Sequenced 4 rumen compartments from 4 cows
• Qu: do samples cluster by rumen compartment, by cow, or neither?

Page 40

qPCR of archaea/bacteria ratio

Page 41

Kraken on metagenomics data
• Classified < 2% of our data!

Page 42

Novoalign reads against GREENGENES

Page 43

PCA of genera abundances (novoalign)

Page 44

Prokka gene prediction results

Sample     CDS       CDS ≥100aa
557N0200   245580    174599
557N0201   343655    246490
557N0202   339668    249712
557N0203   523740    391531
557N0204   411384    303543
557N0205   367806    260613
557N0206   295141    211776
557N0207   330008    243615
557N0208   236079    172023
557N0209   248166    180944
557N0210   285507    207805
557N0211   163064    102805
Totals     3789798   2745456
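
The two rows of counts correspond to a simple length filter on predicted proteins. The sketch below shows one way to compute them from a protein FASTA file; it is an assumption-laden illustration (a minimal hand-written FASTA reader, trailing `*` stop symbols ignored), not Prokka's own reporting.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_cds(lines, min_aa=100):
    """Return (total CDS, CDS with at least min_aa residues)."""
    total = long_enough = 0
    for _, seq in read_fasta(lines):
        total += 1
        if len(seq.rstrip("*")) >= min_aa:
            long_enough += 1
    return total, long_enough


toy = [">a", "M" * 60, "K" * 60, ">b", "M" * 99, ">c", "M" * 100 + "*"]
print(count_cds(toy))

# Share of all predicted CDS that are at least 100 aa, from the Totals row.
print(f"{100 * 2745456 / 3789798:.1f}%")
```

With a real file, `count_cds(open(path))` would work the same way, since an open text file iterates over its lines.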

Page 45

Ongoing work
• How many proteins are novel?
  – Compare to nr
• How many proteins in common?
  – Within datasets
  – Between datasets
• Annotation of protein domains
• Load data into Meta4 database (http://dx.doi.org/10.3389/fgene.2013.00168)
• Identify putative enzymes of interest
• Sequence and analysis of additional rumen samples
• Assessment of additional software tools
  – Xander – focused extraction of genes from metagenomics data
  – ShortBRED – functional characterisation of metagenomics data

Page 46

Follow me:

Twitter: @BioMickWatson
Blog: biomickwatson.wordpress.com

Page 47

Acknowledgements
Funders: BBSRC, Roslin Foundation, TSB
People: Despoina Roumpeka, Rainer Roehe, John Wallace, Edinburgh Genomics
Edinburgh Genomics: http://genomics.ed.ac.uk
The Roslin Institute: http://www.roslin.ed.ac.uk