k-mers in metagenomics - bioinformaticsbioinformatics.org.au/.../sites/9/...donovan-parks.pdf ·...

K-mers in Metagenomics
by Donovan Parks

metagenomics

environmental sample

extract and sequence DNA

QC and error correct reads (K‐mers!)

assemble (K‐mers!)

bin genomes (K‐mers!)

assign taxonomy (and function) (K‐mers!)

refine genomes (K‐mers!)

Assigning Taxonomic Labels to Metagenomic DNA Sequences

a plethora of approaches

Homology: BLAST, MEGAN

Composition: Kraken, CLARK, Naïve Bayes

Hybrid: PhymmBL, FCP, PhyloPythia

Phylogenetic: Treephyler, AMPHORA, GraftM

Marker genes: 16S profiling, MetaPhlAn, PhyloSift

[Side labels on slide: classify all reads; classify subset]

exploiting genomic (K‐mer) signatures

PhymmBL (K ≤ 8): interpolated Markov model

PhyloPythia (K ≈ 6): multiclass support vector machine

Naïve Bayes (K ≈ 15): probability of observing a K‐mer

Kraken (K ≈ 31): exact K‐mer matching

CLARK (K ≈ 31): exact matching of discriminative K‐mers

[Side labels on slide: dense profiles; sparse profiles]

Kraken: K‐mer LCA database

Wood and Salzberg, Genome Biology, 2014

Reference genomes (2,256 RefSeq genomes)

Extract K‐mers (default, K = 31)

Lowest common ancestor database

K‐mer        LCA
ACC … GT     g__Escherichia
ACG … GT     s__E. coli
AGT … AA     p__Proteobacteria
…
TGA … TT     d__Bacteria
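
A minimal sketch of how such a K‐mer to LCA table could be built. It is not Kraken's actual implementation: the toy taxonomy, the made-up sequences, and the names `PARENT`, `lineage`, `lca` and `build_lca_db` are all assumptions for illustration, K is shrunk to 3, and reverse complements are ignored.

```python
from typing import Dict, List, Optional

# Toy taxonomy (child -> parent); a hypothetical subset in the slide's label style.
PARENT: Dict[str, Optional[str]] = {
    "d__Bacteria": None,
    "p__Proteobacteria": "d__Bacteria",
    "g__Escherichia": "p__Proteobacteria",
    "s__E. coli": "g__Escherichia",
    "s__E. albertii": "g__Escherichia",
    "p__Firmicutes": "d__Bacteria",
    "s__B. subtilis": "p__Firmicutes",
}


def lineage(taxon: Optional[str]) -> List[str]:
    """Path from a taxon up to the root of the toy taxonomy."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a: str, b: str) -> str:
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)


def build_lca_db(genomes: Dict[str, str], k: int) -> Dict[str, str]:
    """Map every K-mer to the LCA of all genomes that contain it."""
    db: Dict[str, str] = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


# Made-up sequences; K = 3 keeps the toy readable (the slide's default is K = 31).
genomes = {
    "s__E. coli": "ACGTAC",
    "s__E. albertii": "ACGTTT",
    "s__B. subtilis": "GGACGA",
}
db = build_lca_db(genomes, k=3)
print(db["GTA"], db["CGT"], db["ACG"])
```

In this toy, a K‐mer seen in only one genome keeps a species label, one shared by both Escherichia genomes moves up to the genus, and one shared across phyla moves up to the domain.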

Kraken: classification tree

Wood and Salzberg, Genome Biology, 2014
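
A rough sketch of the classification step described by Wood and Salzberg: each K‐mer in a read is looked up in the LCA table, the hit taxa form a tree weighted by hit counts, and the read is assigned to the leaf of the highest-scoring root-to-leaf path (ties fall back to the LCA of the tied leaves). The taxonomy, the table `LCA_DB`, and the function names are hypothetical, and K = 3 is used only to keep the toy small.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent).
PARENT: Dict[str, Optional[str]] = {
    "d__Bacteria": None,
    "p__Proteobacteria": "d__Bacteria",
    "g__Escherichia": "p__Proteobacteria",
    "s__E. coli": "g__Escherichia",
    "p__Firmicutes": "d__Bacteria",
}

# Hypothetical K-mer -> LCA table (K = 3).
LCA_DB: Dict[str, str] = {
    "ACG": "d__Bacteria",
    "CGT": "g__Escherichia",
    "GTA": "s__E. coli",
    "GGA": "p__Firmicutes",
}


def lineage(taxon: Optional[str]) -> List[str]:
    """Path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify(read: str, k: int = 3) -> Optional[str]:
    """Assign a read to a taxon, or None if no K-mer is in the table."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(LCA_DB[km] for km in kmers if km in LCA_DB)
    if not hits:
        return None  # unclassified read
    # Score of a taxon = K-mer hits summed along its path to the root.
    scores = {t: sum(hits.get(a, 0) for a in lineage(t)) for t in hits}
    best = max(scores.values())
    tied = [t for t, s in scores.items() if s == best]
    # On a tie, back off to the lowest common ancestor of the tied leaves.
    common = set(lineage(tied[0]))
    for t in tied[1:]:
        common &= set(lineage(t))
    return next(a for a in lineage(tied[0]) if a in common)


print(classify("ACGTA"), classify("GGACGT"), classify("TTTTT"))
```

Reads with no K‐mer in the table stay unclassified, which is one way a low classified fraction can arise when the sample is poorly represented in the reference database.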

assessment of methods

Results from Ounit et al., BMC Genomics, 2015 and Wood and Salzberg, Genome Biology, 2014

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million

Precision: (correct classifications) / (total classifications)

Sensitivity: (correct classifications) / (total reads)

Speed: reads per minute

Results for simple simulated dataset
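
The two definitions can be written out directly. This is only a worked illustration: the function names and the read counts in the example are made up and are not taken from the cited benchmarks.

```python
def precision(correct: int, classified: int) -> float:
    """Percent of classified reads that were classified correctly."""
    if classified == 0:
        raise ValueError("no reads were classified")
    return 100.0 * correct / classified


def sensitivity(correct: int, total_reads: int) -> float:
    """Percent of all reads that were classified correctly."""
    if total_reads == 0:
        raise ValueError("no reads in the dataset")
    return 100.0 * correct / total_reads


# Hypothetical run: 10,000 reads, 7,800 classified, 7,750 of those correct.
total_reads, classified, correct = 10_000, 7_800, 7_750
print(round(precision(correct, classified), 1))
print(round(sensitivity(correct, total_reads), 1))
```

A classifier that leaves many reads unclassified can therefore score high on precision while sensitivity stays much lower, since unclassified reads only enter the second denominator.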

impact of K and reference database size

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             ‐
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             ‐
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million
Kraken‐GB (K = 31)     99.5           93.8             ‐

Performance is sensitive to K

Kraken‐GB: 8,517 reference genomes instead of 2,256

impact of taxonomic novelty

Results from Wood and Salzberg, Genome Biology, 2014

                Taxonomic novelty
Measured rank   Species  Genus  Family
Domain          24.4     7.9    2.8
Phylum          23.9     7.2    2.5
Class           24.7     7.1    2.0
Order           24.1     6.8    2.0
Family          25.4     8.5    -
Genus           26.3     -      -

Sensitivity decreases rapidly with taxonomic novelty

Kraken: some practical numbers

Applied to metagenome from coalbed methane well

~82 million paired end reads (2 x 100bp)

~30 minutes to process with 8 threads

Reference database requires ~70GB of RAM

Classified 7.7% of reads

[Figure: relative abundance (%), axis 0 to 60; series: 16S profile, Kraken]

take away points

K‐mers widely used to assign taxonomy to metagenomic reads

Active area of research

Resolution limited by reference genomes

16S profiling still the gold standard

change is coming…

Recovering Population Genomes from Metagenomic Data

metagenome → (shotgun sequencing) → reads → (assembly) → contigs

bin contigs into genomes (genome‐centric metagenomics)

recovering genomes from metagenomic data

metagenome → (shotgun sequencing) → reads → (assembly) → contigs → (binning) → population genomes

binning: classify using coverage and k‐mer profiles

identify strain‐specific SNPs

differential coverage signal 

contigs with similar coverage profiles likely belong to the same genome!
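
One simple way to make "similar coverage profiles" concrete is to correlate each contig's mean coverage across several samples. The sketch below uses Pearson correlation as a stand-in; actual binning tools use their own statistics, and the contig names and coverage values are invented.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no differential signal
    return cov / (sx * sy)


# Hypothetical mean coverage of three contigs in three samples.
coverage: Dict[str, List[float]] = {
    "contig_1": [10.0, 40.0, 5.0],
    "contig_2": [11.0, 42.0, 6.0],
    "contig_3": [30.0, 2.0, 25.0],
}

print(round(pearson(coverage["contig_1"], coverage["contig_2"]), 3))
print(round(pearson(coverage["contig_1"], coverage["contig_3"]), 3))
```

Here contig_1 and contig_2 rise and fall together across samples, so they are candidates for the same genome, while contig_3 follows the opposite pattern.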

K‐mers and coverage: complementary signals

microbial community from coalbed methane well

[Figure axes: coverage; tetranucleotide (PC1)]

Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
Methanobacteriaceae 1       98.4       1.6        2.32
Methanobacteriaceae 2       96.8       0.8        2.23
Methanobacteriaceae 3       88.6       0.0        1.57
Methanobacteriaceae 4       96.0       0.0        1.71
Bacteria
Actinobacteria 1            95.0       0.9        2.56
Actinobacteria 2            90.5       2.7        2.72
Actinobacteria 3            88.4       2.7        2.48
Clostridiales 1             92.6       9.4        2.91
Clostridiales 2             80.2       0.0        2.74
Elusimicrobia               95.7       2.2        2.03
Thermodesulfovibrionaceae   83.9       0.0        2.66
Syntrophus                  92.9       0.8        2.31
Rikenellaceae               86.7       2.3        2.72
Candidate Phylum OP1        83.9       0.0        1.66
Rhodocyclaceae              69.0       1.63       3.73

many ways to combine coverage + K‐mer profiles

GroopM: http://minillinim.github.io/GroopM/

DBB: https://github.com/dparks1134/DBB

CONCOCT: https://github.com/BinPro/CONCOCT

MetaWatt: http://sourceforge.net/projects/metawatt/

MetaBAT: https://bitbucket.org/berkeleylab/metabat

MetaBAT overview

Kang et al., bioRxiv, 2014

MetaBAT: statistical model of tetranucleotide signatures

Empirical parameters from ~1500 reference genomes

Posterior probability that two contigs are from different genomes:

Kang et al., bioRxiv, 2014

[Figure, two panels: probability P(inter|D) against tetranucleotide distance D; contig size = 10kb]
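
A minimal sketch of the quantity on the x-axis, assuming D is taken as the Euclidean distance between tetranucleotide frequency vectors. The mapping from D to P(inter|D) relies on the empirically fitted parameters and is not reproduced here. The names `tnf` and `tnf_distance` are illustrative, and this toy keeps all 256 tetramers rather than merging reverse complements as binning tools typically do.

```python
import math
from itertools import product
from typing import List

# All 256 tetramers in a fixed order, so vectors are comparable.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tnf(seq: str) -> List[float]:
    """Tetranucleotide frequency vector of a sequence."""
    counts = dict.fromkeys(TETRAMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def tnf_distance(a: str, b: str) -> float:
    """Euclidean distance D between two tetranucleotide profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(tnf(a), tnf(b))))


# Made-up sequences: identical composition vs. completely different composition.
print(tnf_distance("ACGTACGTACGT", "ACGTACGTACGT"))
print(tnf_distance("AAAAAAAA", "CCCCCCCC"))
```

Short contigs give noisy frequency estimates, which is consistent with the slide reporting the probability curve for a stated contig size.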

rapidly filling out tree of life

60 bacterial phyla

>3000 population genomes

23 habitats

51 phyla with population genome representatives

take away points

Population genomes can be recovered from metagenomic samples

K‐mer profiles complement differential coverage signal

Rapidly expanding reference genomes

Improve gene‐centric metagenomics

Assessing and Refining Population Genomes

estimating quality of population genomes

Additional markers refine quality estimates

Scaffolds

Gammaproteobacteria sp.: 80% complete, 20% contaminated

105 bacterial marker genes
estimates: 92% comp., 17% cont.

281 clade-specific marker genes
estimates: 83% comp., 22% cont.

Parks et al., Genome Res., 2015

Estimates ± 5%
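
The basic idea behind marker-based estimates can be sketched as follows, assuming each marker gene is expected exactly once per genome: completeness is the share of markers found, and contamination is the number of extra copies relative to the marker count. This is a simplified stand-in, not the method of Parks et al.; the function name and the marker counts are hypothetical.

```python
from typing import Dict, Tuple


def estimate_quality(marker_copies: Dict[str, int]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) from single-copy marker counts."""
    n = len(marker_copies)
    if n == 0:
        raise ValueError("at least one marker gene is required")
    present = sum(1 for c in marker_copies.values() if c >= 1)
    extra = sum(c - 1 for c in marker_copies.values() if c > 1)
    return 100.0 * present / n, 100.0 * extra / n


# Hypothetical bin with 10 markers: 6 found once, 2 found twice, 2 missing.
example = {f"marker_{i:02d}": 1 for i in range(6)}
example.update({"marker_06": 2, "marker_07": 2, "marker_08": 0, "marker_09": 0})

completeness, contamination = estimate_quality(example)
print(completeness, contamination)
```

With only a handful of markers each gene moves the estimate by several percent, which is one reason larger or clade-specific marker sets can shift the numbers.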

varying quality of recovered genomes

microbial community from coalbed methane well

[Figure axes: coverage; tetranucleotide (PC1)]

Genome                      Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
Methanobacteriaceae 1       98.4       1.6        2.32
Methanobacteriaceae 2       96.8       0.8        2.23
Methanobacteriaceae 3       88.6       0.0        1.57
Methanobacteriaceae 4       96.0       0.0        1.71
Bacteria
Actinobacteria 1            95.0       0.9        2.56
Actinobacteria 2            90.5       2.7        2.72
Actinobacteria 3            88.4       2.7        2.48
Clostridiales 1             92.6       9.4        2.91
Clostridiales 2             80.2       0.0        2.74
Elusimicrobia               95.7       2.2        2.03
Thermodesulfovibrionaceae   83.9       0.0        2.66
Syntrophus                  92.9       0.8        2.31
Rikenellaceae               86.7       2.3        2.72
Candidate Phylum OP1        83.9       0.0        1.66
Rhodocyclaceae              69.0       1.63       3.73

identifying potential contamination

95th percentile

outliers… treat with caution
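
One possible reading of the 95th percentile cutoff, sketched under stated assumptions: take a reference distribution of some per-scaffold statistic (for example a tetranucleotide distance), find its 95th percentile, and flag scaffolds in a bin that exceed it. The nearest-rank percentile, the function names, and all values are assumptions for illustration.

```python
import math
from typing import Dict, List


def percentile(values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_outliers(stat: Dict[str, float], reference: List[float], q: int = 95) -> List[str]:
    """Names of scaffolds whose statistic exceeds the reference percentile."""
    cutoff = percentile(reference, q)
    return [name for name, value in stat.items() if value > cutoff]


# Hypothetical reference distribution and per-scaffold values.
reference = [i / 1000 for i in range(1, 101)]
scaffolds = {"scaffold_1": 0.020, "scaffold_2": 0.200, "scaffold_3": 0.090}

flagged = flag_outliers(scaffolds, reference)
print(flagged)
```

By construction about 5% of genuine scaffolds would also exceed such a cutoff, so a flagged scaffold is a candidate for inspection rather than proof of contamination.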

K‐mer modeling: impact of evolution

Bacteria vs. Archaea (Intra‐genome 95th percentile; K=4)

Classes of Proteobacteria (Intra‐genome 95th percentiles; K=4)

final thoughts

K‐mers widely used in gene‐ and genome‐centric metagenomics

Population genomes substantially improving diversity of available reference genomes
  Big win for taxonomic attribution methods
  And CheckM, and many other bioinformatic programs

How best to exploit population genomes
  Looking at 100,000+ reference genomes in next few years
  Issues in terms of scalability
  Using ‘noisy’ population genomes raises interesting questions

Thank you!