GLBIO/CCBC Microbiome Analysis Workshop: Metagenomics Morgan G.I. Langille Assistant Professor Dalhousie University May 16, 2016


Page 1: GLBIO/CCBC Metagenomics Workshop

GLBIO/CCBC Microbiome Analysis Workshop: Metagenomics

Morgan G.I. Langille
Assistant Professor
Dalhousie University
May 16, 2016

Page 2: GLBIO/CCBC Metagenomics Workshop

Learning Objectives

• Contrast 16S and metagenomic sequencing

• Taxonomy from metagenomes

• Function from metagenomes

• Applicability of assembling and gene calling with metagenomic data

• Metagenomic inference and limitations

• Tutorial on processing metagenomic data to determine functional and taxonomic profiles

Page 3: GLBIO/CCBC Metagenomics Workshop

16S vs Metagenomics

• 16S is targeted sequencing of a single gene which acts as a marker for identification

• Pros
– Well established
– Sequencing costs are relatively cheap (~50,000 reads/sample)
– Only amplifies what you want (no host contamination)

• Cons
– Primer choice can bias results towards certain organisms
– Usually not enough resolution to identify to the strain level
– Different primers are needed for archaea & eukaryotes (18S)
– Doesn’t identify viruses

Page 4: GLBIO/CCBC Metagenomics Workshop

16S vs Metagenomics

• Metagenomics: sequencing all the DNA in a sample

• Pros
– No primer bias
– Can identify all microbes (euks, viruses, etc.)
– Provides functional information (“What are they doing?”)

• Cons
– More expensive (millions of sequences needed)
– Host/site contamination can be significant
– May not be able to sequence “rare” microbes
– Complex bioinformatics

Page 5: GLBIO/CCBC Metagenomics Workshop

TAXONOMIC PROFILES
Who is there?

Page 6: GLBIO/CCBC Metagenomics Workshop

Metagenomics: Who is there?

• Goal: Identify the relative abundance of different microbes in a given sample using metagenomics

• Problems:
– Reads are all mixed together
– Reads can be short (~100bp)
– Lateral gene transfer

• Two broad approaches
1. Binning Based
2. Marker Based

Page 7: GLBIO/CCBC Metagenomics Workshop

Binning Based

• Attempts to group or “bin” reads into the genome from which they originated

• Composition-based
– Uses sequence composition such as GC%, k-mers (e.g. Naïve Bayes Classifier)
– Generally not very precise

• Sequence-based
– Compare reads to large reference database using BLAST (or some other similarity search method)
– Reads are assigned based on “Best-hit” or “Lowest Common Ancestor” approach

Page 8: GLBIO/CCBC Metagenomics Workshop

LCA: Lowest Common Ancestor

• Use all BLAST hits above a threshold and assign taxonomy at the lowest level in the tree which covers these taxa.

• Notable Examples:
– MEGAN: http://ab.inf.uni-tuebingen.de/software/megan5/
  • One of the first metagenomic tools
  • Does functional profiling too!
– MG-RAST: https://metagenomics.anl.gov/
  • Web-based pipeline (might need to wait awhile for results)
– Kraken: https://ccb.jhu.edu/software/kraken/
  • Fastest binning approach to date and very accurate.
  • Large computing requirements (e.g. >128GB RAM)
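The LCA rule can be illustrated with a short sketch. It assumes each retained hit has already been turned into a root-to-leaf lineage (a list of taxon names); the function name and the example lineages are hypothetical and are not taken from MEGAN, MG-RAST, or Kraken.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several root-to-leaf lineages.

    The last element of the result is the lowest taxon that covers every hit.
    """
    shared = []
    # zip(*lineages) walks the lineages rank by rank, stopping at the shortest one.
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages for three hits of one read that passed the score threshold.
common = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae"]
hits = [
    common + ["Escherichia", "Escherichia coli"],
    common + ["Shigella", "Shigella flexneri"],
    common + ["Salmonella", "Salmonella enterica"],
]

lca = lowest_common_ancestor(hits)
print(lca[-1])  # Enterobacteriaceae
```

A best-hit approach would instead assign the read to whichever single species scored highest, even though the read cannot distinguish between the three genera.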

Page 9: GLBIO/CCBC Metagenomics Workshop

Marker Based

• Single Gene
– Identify and extract reads hitting a single marker gene (e.g. 16S, cpn60, or other “universal” genes)
– Use existing bioinformatics pipeline (e.g. QIIME, etc.)

• Multiple Gene
– Several universal genes
  • PhyloSift (Darling et al., 2014): uses 37 universal single-copy genes
– Clade specific markers
  • MetaPhlAn2 (Truong et al., 2015)

Page 10: GLBIO/CCBC Metagenomics Workshop

Marker or Binning?

• Binning approaches
– Similarity search is computationally intensive
– Varying genome sizes and LGT can bias results

• Marker approaches
– Doesn’t allow functions to be linked directly to organisms
– Genome reconstruction/assembly is not possible
– Dependent on choice of markers

Page 11: GLBIO/CCBC Metagenomics Workshop

MetaPhlAn2

• Uses “clade-specific” gene markers

• A clade represents a set of genomes that can be as broad as a phylum or as specific as a species

• Uses ~1 million markers derived from 17,000 genomes
– ~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic

• Can identify down to the species level (and possibly even strain level)

• Can handle millions of reads on a standard computer within a few minutes

Page 12: GLBIO/CCBC Metagenomics Workshop

MetaPhlAn Marker Selection

Page 13: GLBIO/CCBC Metagenomics Workshop

MetaPhlAn Marker Selection

Page 14: GLBIO/CCBC Metagenomics Workshop

Using MetaPhlAn

• MetaPhlAn uses Bowtie2 for sequence similarity searching (nucleotide sequences vs. nucleotide database)

• Paired-end data can be used directly

• Each sample is processed individually and then multiple samples can be combined at the last step

• Output is relative abundances at different taxonomic levels

Page 15: GLBIO/CCBC Metagenomics Workshop

Absolute vs. Relative Abundance

• Absolute abundance: Numbers represent real abundance of thing being measured (e.g. the actual quantity of a particular gene or organism)

• Relative abundance: Numbers represent proportion of thing being measured within sample

• In almost all cases microbiome studies are measuring relative abundance
– This is due to DNA amplification during sequencing library preparation not being quantitative

Page 16: GLBIO/CCBC Metagenomics Workshop

Relative Abundance Use Case

• Sample A:
– Has 10^8 bacterial cells (but we don’t know this from sequencing)
– 25% of the microbiome from this sample is classified as Shigella

• Sample B:
– Has 10^6 bacterial cells (but we don’t know this from sequencing)
– 50% of the microbiome from this sample is classified as Shigella

• “Sample B contains twice as much Shigella as Sample A”
– WRONG! (If we quantified it, we would find that Sample A has more Shigella)

• “Sample B contains a greater proportion of Shigella compared to Sample A”
– Correct!
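The arithmetic behind this use case can be worked through directly. The sketch below takes the total cell counts as 10^8 for Sample A and 10^6 for Sample B, which is the reading under which Sample A ends up with more Shigella; the variable names are illustrative only.

```python
# Total cell counts are unknown in a real sequencing experiment;
# they are used here only to show why relative abundances can mislead.
samples = {
    "A": {"total_cells": 10**8, "shigella_fraction": 0.25},
    "B": {"total_cells": 10**6, "shigella_fraction": 0.50},
}

# Absolute abundance = total cells x relative abundance.
absolute = {
    name: s["total_cells"] * s["shigella_fraction"]
    for name, s in samples.items()
}

for name, cells in absolute.items():
    print(f"Sample {name}: {cells:.1e} Shigella cells")

ratio = absolute["A"] / absolute["B"]
print(f"Sample A has {ratio:.0f}x as many Shigella cells as Sample B")
```

Sample B has the larger proportion of Shigella, yet Sample A has 50 times as many Shigella cells.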

Page 17: GLBIO/CCBC Metagenomics Workshop

FUNCTIONAL COMPOSITION
What are they doing?

Page 18: GLBIO/CCBC Metagenomics Workshop

What do we mean by function?

• General categories
– Photosynthesis
– Nitrogen metabolism
– Glycolysis

• Specific gene families
– nifH
– EC: 1.1.1.1 (alcohol dehydrogenase)
– K00929 (butyrate kinase)

Page 19: GLBIO/CCBC Metagenomics Workshop

Various Functional Databases

• COG
– Well known but original classification (not updated since 2003)

• SEED
– Used by the RAST and MG-RAST systems

• PFAM
– Focused more on protein domains

• EggNOG
– Very comprehensive (~190k groups)

• UniRef
– Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
– Most comprehensive and is constantly updated

• KEGG
– Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways”
– Full access now requires a license fee

• MetaCyc
– Becoming more widely used
– More microbe focused than KEGG

Page 20: GLBIO/CCBC Metagenomics Workshop

KEGG

• We will focus on using the KEGG database during this workshop

• KEGG Orthologs (KOs)
– Most specific. Thought to be homologs that perform exactly the same “function”
– ~12,000 KOs in the database
– These can be linked into KEGG Modules and KEGG Pathways
– Identifiers: K01803, K00231, etc.

Page 21: GLBIO/CCBC Metagenomics Workshop

KEGG (cont.)

• KEGG Modules
– Manually defined functional units
– Small groups of KOs that function together
– ~750 KEGG Modules
– Identifiers: M00002, M00011, etc.

Page 22: GLBIO/CCBC Metagenomics Workshop

KEGG (cont.)

• KEGG Pathways

– Groups KOs into large pathways (~230)

– Each pathway has a graphical map

– Individual KOs or Modules can be highlighted within these maps

– Pathways can be collapsed into very general functional terms (e.g. Amino Acid Metabolism, Carbohydrate Metabolism, etc.)

Page 23: GLBIO/CCBC Metagenomics Workshop

Metagenomic Annotation Systems

• Web-based
– Provide functional and taxonomic analysis, and host your data
– EBI Metagenomics Server
– MG-RAST
– IMG/M

• GUI based
– MEGAN
  • Taxonomy and functional annotation
– CloVR
  • Virtual Machine based, contains SOP, hasn’t been updated recently

• Command-line based
– MetAMOS
  • Built-in assembly, highly customizable, some features can be buggy
– Humann
  • Functional annotation
– DIY
  • Set up your own in-house custom computational pipeline

Page 24: GLBIO/CCBC Metagenomics Workshop

Humann

(Abubucker et al. 2012)

Page 25: GLBIO/CCBC Metagenomics Workshop

Humann Step 1

• Reads are searched against a protein database (e.g. KEGG)
– Can use BLASTX, but much faster methods are now available (e.g. BLAT, USEARCH, RapSearch2, DIAMOND)

Buchfink et al., 2015

Page 26: GLBIO/CCBC Metagenomics Workshop

Humann

(Abubucker et al. 2012)

Page 27: GLBIO/CCBC Metagenomics Workshop

Humann Step 2

• Normalize and weight search results

• The relative abundance of each KO is calculated:

– Number of reads mapping to a gene sequence in that KO

– Weighted by the inverse p-value of each mapping

– Normalized by the average length of the KO
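A toy sketch of how these three ingredients could combine is shown below. It is not HUMAnN's code: the weight `1 - p_value` is an assumed reading of "inverse p-value", and the function name, hit list, and KO lengths are hypothetical.

```python
from collections import defaultdict


def ko_relative_abundance(hits, ko_mean_length):
    """Combine weighted read counts and length normalization per KO.

    hits: iterable of (ko_id, p_value) pairs, one per read-to-gene mapping.
    ko_mean_length: dict of ko_id -> average gene length (bp) for that KO.
    """
    weighted_counts = defaultdict(float)
    for ko, p_value in hits:
        # ASSUMPTION: confident hits (small p) count almost fully,
        # weak hits count less. The exact weighting is not given on the slide.
        weighted_counts[ko] += 1.0 - p_value

    # Longer genes attract more reads, so divide by the average KO length.
    per_length = {ko: w / ko_mean_length[ko] for ko, w in weighted_counts.items()}

    # Rescale so that abundances sum to 1 within the sample.
    total = sum(per_length.values())
    return {ko: value / total for ko, value in per_length.items()}


# Hypothetical data: K00001 gets twice the reads but is twice as long.
example_hits = [("K00001", 0.0), ("K00001", 0.0), ("K00002", 0.0)]
example_lengths = {"K00001": 1000, "K00002": 500}

abundance = ko_relative_abundance(example_hits, example_lengths)
print(abundance)
```

In this example the two KOs end up with equal relative abundance, because the extra read for K00001 is explained by its greater length.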

Page 28: GLBIO/CCBC Metagenomics Workshop

Humann

(Abubucker et al. 2012)

Page 29: GLBIO/CCBC Metagenomics Workshop

Humann Step 3

• Reduce number of pathways

• A KO can map to one or more KEGG Pathways

– Just because a KO is found in a pathway doesn’t mean that the complete pathway exists in the community

– If a pathway has 20 KOs and only 2 KOs are observed in the community (but at high abundances) what should be the abundance of the pathway?

– MinPath (Ye, 2009) attempts to estimate the abundance of these pathways and remove spurious noise
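The idea of explaining the observed KOs with as few pathways as possible can be sketched with a greedy heuristic. This is a simplified stand-in and not MinPath's actual algorithm; the pathway names and KO identifiers are made up for illustration.

```python
def pathway_coverage(pathways, observed):
    """Fraction of each pathway's KOs that were observed in the sample."""
    return {name: len(kos & observed) / len(kos) for name, kos in pathways.items()}


def greedy_minimal_pathways(pathways, observed):
    """Pick a small set of pathways that together explain all observed KOs.

    Greedy set cover: repeatedly take the pathway that explains the most
    still-unexplained KOs. This is a heuristic, not an exact minimum.
    """
    # Only KOs that belong to at least one pathway can be explained.
    remaining = {ko for ko in observed
                 if any(ko in kos for kos in pathways.values())}
    chosen = []
    while remaining:
        best = max(sorted(pathways), key=lambda name: len(pathways[name] & remaining))
        chosen.append(best)
        remaining -= pathways[best]
    return chosen


# Hypothetical pathway definitions (sets of KO identifiers).
pathways = {
    "pathway_A": {"K1", "K2", "K3"},
    "pathway_B": {"K3", "K4"},
    "pathway_C": {"K2"},
}
observed = {"K1", "K2", "K3"}

coverage = pathway_coverage(pathways, observed)
chosen = greedy_minimal_pathways(pathways, observed)
print(coverage)
print(chosen)
```

Here pathway_B and pathway_C both contain observed KOs, but pathway_A alone already explains every observation, so a parsimony-style approach would treat the other two as likely spurious.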

Page 30: GLBIO/CCBC Metagenomics Workshop

Humann

(Abubucker et al. 2012)

Page 31: GLBIO/CCBC Metagenomics Workshop

Humann Step 4

• Reduce false positive pathways further and normalize by KO copy number

• Using the organism information from the KEGG hits

– Pathways that are not found in any of the observed organisms AND are made up mostly of KOs mapping to a different pathway are removed

– KO abundance can be divided by the estimated copy number of that KO as observed from the KEGG organism database

Page 32: GLBIO/CCBC Metagenomics Workshop

Humann

Page 33: GLBIO/CCBC Metagenomics Workshop

Humann Step 5

• Smoothing pathways by gap filling

– Sequencing depth or poor sequence searches could lead to some KOs within pathways being absent or in low abundance

– KOs whose abundance is more than 1.5 interquartile ranges below the pathway median are raised to the pathway median
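The gap-filling rule can be sketched as follows, assuming it means "abundance lower than median minus 1.5 × IQR". The function name and the example abundances are hypothetical, and the quartile method (Python's default in `statistics.quantiles`) is an arbitrary choice rather than the one HUMAnN uses.

```python
import statistics


def smooth_pathway(ko_abundance):
    """Raise unusually low KO abundances within one pathway to the pathway median.

    ko_abundance: dict of ko_id -> abundance for the KOs of a single pathway
    (needs at least two KOs so that quartiles are defined).
    """
    values = list(ko_abundance.values())
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    threshold = median - 1.5 * (q3 - q1)
    return {
        ko: (median if value < threshold else value)
        for ko, value in ko_abundance.items()
    }


# Hypothetical pathway in which one KO (K6) was missed entirely.
pathway = {"K1": 10, "K2": 11, "K3": 9, "K4": 10, "K5": 12, "K6": 0}
smoothed = smooth_pathway(pathway)
print(smoothed)
```

Only K6 falls below the threshold and is raised to the median of 10; the other KOs keep their observed values.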

Page 34: GLBIO/CCBC Metagenomics Workshop

Humann

(Abubucker et al. 2012)

Page 35: GLBIO/CCBC Metagenomics Workshop

What about assembly?

• Assembly is often used in genomics to join raw reads into longer contigs and scaffolds

Page 36: GLBIO/CCBC Metagenomics Workshop

Assembly for Metagenomics?

• Pros
– Less computation time for similarity search (sequences are collapsed)
– Can allow annotation when reads are too short (<100bp)
– Can sometimes (partially) reconstruct genomes

• Cons
– Assembly is computationally intensive (high memory machines needed)
– Collapsed reads must be added back to get relative abundances (not all assemblers do this natively)
– Low read depth and high diversity can cause assemblers to fail
– Reads are not all from the same genome so chimeras are possible
– Some organisms/genes will assemble more easily (e.g. more abundant ones) which could lead to annotation bias

Page 37: GLBIO/CCBC Metagenomics Workshop

What about gene calling?

• In genomics, normally you would predict the start and stop positions of genes using a gene prediction program before annotating the genes

• In metagenomics:
– Pros:
  • May result in fewer false positives from annotating “non-real” genes
  • Lowers the number of similarity searches
– Cons:
  • Computationally intensive
  • No good learning dataset
  • Raw reads will not cover an entire gene
  • Often requires assembled data
– Possible tools: FragGeneScan, MetaGeneAnnotator
– Alternative: Do six-frame translation (e.g. BLASTX)
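The six-frame alternative can be illustrated with a minimal sketch: each read is translated in three offsets on each strand, and all six peptides are searched. The code below uses the standard genetic code and hypothetical function names; it illustrates the concept and is not how BLASTX is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Return the six translations of a read, keyed by strand and frame."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, peptide in frames.items():
    print(name, peptide)
```

For the toy read above, frame +1 gives "MA*" (Met, Ala, stop). In practice most of the six frames of a real read are noise, and the similarity search is what identifies the frame that matches a known protein.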

Page 38: GLBIO/CCBC Metagenomics Workshop

Community Function Potential

• It is important to remember that this is metagenomics, not metatranscriptomics or metaproteomics

• These annotations suggest the functional potential of the community

• The presence of these genes/functions does not mean that they are biologically active (e.g. may not be transcribed)

Page 39: GLBIO/CCBC Metagenomics Workshop

PICRUSt
Predicting function from 16S profiles

Page 40: GLBIO/CCBC Metagenomics Workshop

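[Workflow diagram. Labels on the slide: 16S rRNA gene, QIIME, PICRUSt, Shotgun Metagenomics, MetaPhlAn, HUMAnN, STAMP]

OTU table shown in the diagram:

         Sample 1   Sample 2   Sample 3
OTU 1    4          0          2
OTU 2    1          0          0
OTU 3    2          4          2

KO table shown in the diagram:

         Sample 1   Sample 2   Sample 3
K00001   20         15         18
K00002   1          2          0
K00003   4          5          4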

Page 41: GLBIO/CCBC Metagenomics Workshop


• PICRUSt

• Phylogenetic Investigation of Communities by Reconstruction of Unobserved States

• http://picrust.github.com

Page 42: GLBIO/CCBC Metagenomics Workshop

PICRUSt: How does it work?

Page 43: GLBIO/CCBC Metagenomics Workshop

Predicting the abundance of a single function

[Figure legend: Known gene abundance, Ancestral gene abundance, Predicted gene abundance]

Page 44: GLBIO/CCBC Metagenomics Workshop

Predicting the abundance of a single function

[Figure legend: Known gene abundance, Ancestral gene abundance, Predicted gene abundance]

Repeat for each function (~8000X)

Repeat for all unknown tips (>100,000)

Page 45: GLBIO/CCBC Metagenomics Workshop

PICRUSt: Predicting Metagenomes

OTU Table

        S1   S2   S3
12345   10   0    5
67890   1    0    0
66666   4    8    2

PICRUSt 16S Predictions

        16S Copy Number
12345   5
67890   1
66666   2

Normalized OTU Table

        S1   S2   S3
12345   2    0    1
67890   1    0    0
66666   2    4    1

Page 46: GLBIO/CCBC Metagenomics Workshop

PICRUSt: Predicting Metagenomes

OTU Table

        S1   S2   S3
12345   10   0    5
67890   1    0    0
66666   4    8    2

PICRUSt 16S Predictions

        16S Copy Number
12345   5
67890   1
66666   2

Normalized OTU Table

        S1   S2   S3
12345   2    0    1
67890   1    0    0
66666   2    4    1

PICRUSt KEGG Predictions

        K0001   K0002   K0003
12345   4       0       2
67890   1       0       0
66666   2       4       2

Metagenome Prediction

        S1   S2   S3
K0001   13   8    6
K0002   8    16   4
K0003   8    8    4
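The numbers on this slide can be reproduced in two steps: divide each OTU's counts by its predicted 16S copy number, then sum, for every KO, the normalized OTU abundance times that OTU's predicted KO count. The sketch below reads the slide's tables as three OTUs (12345, 67890, 66666) across samples S1 to S3; the variable names are illustrative and this is not PICRUSt's own code.

```python
# Values as read from the slide's example tables.
otu_table = {            # OTU -> counts in S1, S2, S3
    "12345": [10, 0, 5],
    "67890": [1, 0, 0],
    "66666": [4, 8, 2],
}
copy_number = {"12345": 5, "67890": 1, "66666": 2}   # predicted 16S copies
ko_counts = {            # OTU -> predicted gene copies per KO
    "12345": {"K0001": 4, "K0002": 0, "K0003": 2},
    "67890": {"K0001": 1, "K0002": 0, "K0003": 0},
    "66666": {"K0001": 2, "K0002": 4, "K0003": 2},
}

# Step 1: normalize OTU counts by 16S copy number.
normalized = {
    otu: [count / copy_number[otu] for count in counts]
    for otu, counts in otu_table.items()
}

# Step 2: metagenome[KO][sample] = sum over OTUs of
#         normalized abundance x predicted KO copies in that OTU.
kos = ["K0001", "K0002", "K0003"]
n_samples = 3
metagenome = {
    ko: [
        sum(normalized[otu][s] * ko_counts[otu][ko] for otu in otu_table)
        for s in range(n_samples)
    ]
    for ko in kos
}

for ko in kos:
    print(ko, metagenome[ko])
```

For example, K0001 in S1 is 2×4 + 1×1 + 2×2 = 13, which matches the Metagenome Prediction table.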

Page 47: GLBIO/CCBC Metagenomics Workshop

• PICRUSt predictions across body sites

Langille et al., 2013, Nature Biotechnology

Page 48: GLBIO/CCBC Metagenomics Workshop


Page 49: GLBIO/CCBC Metagenomics Workshop


Page 50: GLBIO/CCBC Metagenomics Workshop


Page 51: GLBIO/CCBC Metagenomics Workshop

VISUALIZATION AND STATISTICS
What is important?

Page 52: GLBIO/CCBC Metagenomics Workshop

Visualization and Statistics

• Various tools are available to determine statistically significant taxonomic differences across groups of samples
– Excel
– SigmaPlot
– Past
– R (many libraries)
– Python (matplotlib)
– STAMP

Page 53: GLBIO/CCBC Metagenomics Workshop

STAMP

Page 54: GLBIO/CCBC Metagenomics Workshop
Page 55: GLBIO/CCBC Metagenomics Workshop

STAMP Plots

Page 56: GLBIO/CCBC Metagenomics Workshop

STAMP

• Input
1. “Profile file”: Table of features (samples by OTUs, samples by functions, etc.)
  • Features can form a hierarchy (e.g. Phylum, Class, Order, etc.) to allow data to be collapsed within the program
2. “Group file”: Contains different metadata for grouping samples
  • Can be two groups (e.g. Healthy vs Sick) or multiple groups (e.g. Water depth at 2 m, 4 m, and 6 m)

• Output
– PCA, heatmap, box, and bar plots
– Tables of significantly different features

Page 57: GLBIO/CCBC Metagenomics Workshop

METAGENOMICS WORKFLOW
Putting it all together

Page 58: GLBIO/CCBC Metagenomics Workshop

Microbiome Helper

• Standard Operating Procedures (SOPs)
– 16S
– Shotgun Metagenomics

• Scripts to wrap and integrate existing tools
– Available as an Ubuntu VirtualBox image

• Tutorials/Walkthroughs

• https://github.com/mlangill/microbiome_helper/wiki

Page 59: GLBIO/CCBC Metagenomics Workshop

IMR: Integrated Microbiome Resource

• Offers sequencing and bioinformatics for microbiome projects (http://cgeb-imr.ca)

Page 60: GLBIO/CCBC Metagenomics Workshop

QUESTIONS?

Page 61: GLBIO/CCBC Metagenomics Workshop

Tutorial