K-mers in Metagenomics
by Donovan Parks
2 of 27
metagenomics
environmental sample
extract and sequence DNA
QC and error-correct reads (K‐mers!)
assemble (K‐mers!)
bin genomes (K‐mers!)
assign taxonomy (and function) (K‐mers!)
refine genomes (K‐mers!)
Assigning Taxonomic Labels to Metagenomic DNA Sequences
4 of 27
a plethora of approaches
Homology: BLAST, MEGAN
Composition: Kraken, CLARK, Naïve Bayes
Hybrid: PhymmBL, FCP, PhyloPythia
Phylogenetic: Treephyler, AMPHORA, GraftM
Marker genes: 16S profiling, MetaPhlAn, PhyloSift
classify all reads
classify subset
5 of 27
exploiting genomic (K‐mer) signatures
PhymmBL (K ≤ 8): interpolated Markov model
PhyloPythia (K ≈ 6): multiclass support vector machine
Naïve Bayes (K ≈ 15): probability of observing a K‐mer
Kraken (K ≈ 31): exact K‐mer matching
CLARK (K ≈ 31): exact matching of discriminative K‐mers
dense profiles
sparse profiles
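A minimal sketch of the dense/sparse distinction, not taken from any of the tools listed above. For small K, every possible K-mer can be counted in a fixed-length (dense) frequency vector. For large K (e.g. 31), only the K-mers actually observed are stored (sparse). The function names and the toy sequence are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Yield every overlapping K-mer of seq that contains only A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def dense_profile(seq, k):
    """Frequency vector over all 4**k possible K-mers (practical for small k)."""
    counts = Counter(kmers(seq, k))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total if total else 0.0 for km in all_kmers]


def sparse_profile(seq, k):
    """Set of K-mers actually observed (the feasible choice for large k)."""
    return set(kmers(seq, k))


seq = "ACGTACGTTAGCCGATAGCTAGGCTAACGT"  # toy sequence, not real data
dense = dense_profile(seq, 2)     # 16 entries, one per dinucleotide
sparse = sparse_profile(seq, 11)  # at most len(seq) - 11 + 1 entries
print(len(dense), len(sparse))
```

A dense vector for K = 31 would need 4^31 entries, which is why tools using long K-mers rely on exact lookups of observed K-mers instead.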
6 of 27
Kraken: K‐mer LCA database
Wood and Salzberg, Genome Biology, 2014
Reference Genomes (2,256 RefSeq Genomes)
Extract K-mers (default, K = 31)
Lowest common ancestor database

K‐mer        LCA
ACC … GT     g__Escherichia
ACG … GT     s__E. coli
AGT … AA     p__Proteobacteria
…
TGA … TT     d__Bacteria
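The database construction can be sketched as follows: each K-mer is mapped to the lowest common ancestor (LCA) of every reference genome that contains it. This is a simplified illustration, not Kraken's implementation. The taxonomy, the toy "genomes", and the function names are hypothetical, K is set to 5 only for readability, and reverse complements are ignored.

```python
# Toy taxonomy as child -> parent links (hypothetical, heavily truncated).
PARENT = {
    "s__E. coli": "g__Escherichia",
    "s__E. albertii": "g__Escherichia",
    "g__Escherichia": "p__Proteobacteria",
    "s__B. subtilis": "p__Firmicutes",
    "p__Proteobacteria": "d__Bacteria",
    "p__Firmicutes": "d__Bacteria",
    "d__Bacteria": None,
}


def lineage(taxon):
    """Return the path from taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors_of_a:
            return taxon
    raise ValueError("taxa do not share a root")


def build_lca_db(genomes, k):
    """Map each K-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


# Made-up sequences standing in for reference genomes.
genomes = {
    "s__E. coli": "ACGTTGCAAT",
    "s__E. albertii": "ACGTTGGGAT",
    "s__B. subtilis": "TTACGTTCCA",
}
db = build_lca_db(genomes, k=5)
for kmer in ("GCAAT", "CGTTG", "ACGTT"):
    print(kmer, db[kmer])
```

In this toy data, a K-mer unique to one genome keeps a species label, one shared by the two Escherichia genomes is labeled at genus level, and one shared by all three genomes is labeled at domain level.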
7 of 27
Kraken: classification tree
Wood and Salzberg, Genome Biology, 2014
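The classification step described by Wood and Salzberg looks up each K-mer of a read in the LCA database, weights the hit taxa by their hit counts, and assigns the read to the end of the highest-scoring root-to-leaf path. The sketch below illustrates that idea under simplifying assumptions: the taxonomy and K-mer table are hypothetical, K = 5 is used for readability, and ties are broken arbitrarily rather than as in the paper.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent) and K-mer -> LCA table.
PARENT = {
    "s__E. coli": "g__Escherichia",
    "g__Escherichia": "p__Proteobacteria",
    "p__Proteobacteria": "d__Bacteria",
    "p__Firmicutes": "d__Bacteria",
    "d__Bacteria": None,
}
LCA_DB = {
    "ACGTT": "d__Bacteria",
    "CGTTG": "g__Escherichia",
    "GTTGC": "s__E. coli",
    "TTGCA": "s__E. coli",
    "CGTTC": "p__Firmicutes",
}


def path_to_root(taxon):
    """Taxa from the given node up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify(read, k):
    """Label a read with the taxon that ends the heaviest root-to-leaf path."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = LCA_DB.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no K-mer found in the database: read stays unclassified

    def score(taxon):
        # Hits on the taxon itself plus hits on all of its ancestors.
        return sum(hits[t] for t in path_to_root(taxon))

    return max(hits, key=score)


print(classify("ACGTTGCA", 5))
print(classify("ACGTTC", 5))
print(classify("GGGGGGG", 5))
```

Because every hit count is positive, the best-scoring taxon is always the deepest hit node on its path, so scoring only the hit taxa is enough for this sketch.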
8 of 27
assessment of methods
Results from Ounit et al., BMC Genomics, 2015 and Wood and Salzberg, Genome Biology, 2014

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million

Precision: (correct classifications) / (total classifications)
Sensitivity: (correct classifications) / (total reads)
Speed: reads per minute
Results for simple simulated dataset
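The two definitions above can be written out as a short function. This is an illustrative sketch: the function name, the use of None for unclassified reads, and the example labels are assumptions.

```python
def precision_sensitivity(predicted, truth):
    """Return (precision, sensitivity) in percent.

    predicted: one label per read, or None when the read was left unclassified.
    truth: the true label of each read.
    """
    classified = sum(p is not None for p in predicted)
    correct = sum(p is not None and p == t for p, t in zip(predicted, truth))
    precision = 100.0 * correct / classified if classified else 0.0
    sensitivity = 100.0 * correct / len(truth) if truth else 0.0
    return precision, sensitivity


# Ten reads: 7 labeled correctly, 1 labeled incorrectly, 2 left unclassified.
truth = ["E. coli"] * 10
predicted = ["E. coli"] * 7 + ["B. subtilis"] + [None] * 2
print(precision_sensitivity(predicted, truth))
```

Under these definitions, a classifier that labels every read has equal precision and sensitivity, which is consistent with the rows where the two columns match.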
9 of 27
impact of K and reference database size

Classifier             Precision (%)  Sensitivity (%)  Speed (reads/min)
Megablast              99.0           79.0             -
Naïve Bayes (K = 15)   82.3           82.3             8
Naïve Bayes (K = 11)   59.0           59.0             20
PhymmBL                82.3           82.3             -
CLARK                  99.3           77.2             3.1 million
Kraken (K = 31)        99.3           77.8             2.3 million
Kraken (K = 20)        80.2           82.7             1.5 million
Kraken‐GB (K = 31)     99.5           93.8             -

Performance is sensitive to K
Kraken‐GB: 8,517 reference genomes instead of 2,256
10 of 27
impact of taxonomic novelty
Results from Wood and Salzberg, Genome Biology, 2014
                 Taxonomic Novelty
Measured Rank    Species   Genus   Family
Domain           24.4      7.9     2.8
Phylum           23.9      7.2     2.5
Class            24.7      7.1     2.0
Order            24.1      6.8     2.0
Family           25.4      8.5     -
Genus            26.3      -       -
Sensitivity decreases rapidly with taxonomic novelty
11 of 27
Kraken: some practical numbers
Applied to metagenome from coalbed methane well
~82 million paired-end reads (2 x 100bp)
~30 minutes to process with 8 threads
Reference database requires ~70GB of RAM
Classified 7.7% of reads
Figure: relative abundance (%), 16S profile vs. Kraken
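Some quick arithmetic on the practical numbers for this slide, assuming that "~82 million" counts individual reads rather than read pairs (the slide does not say which).

```python
reads = 82e6                 # ~82 million reads (assumed to be individual reads)
minutes = 30                 # ~30 minutes with 8 threads
classified_fraction = 0.077  # 7.7% of reads classified

reads_per_minute = reads / minutes
classified_reads = reads * classified_fraction
print(f"{reads_per_minute / 1e6:.1f} million reads per minute")
print(f"{classified_reads / 1e6:.1f} million reads classified")
```

Under that assumption, the throughput is of the same order as the Kraken speeds reported in the assessment table, while more than 90% of the reads remain unlabeled.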
12 of 27
take away points
K‐mers widely used to assign taxonomy to metagenomic reads
Active area of research
Resolution limited by reference genomes
16S profiling still the gold standard
change is coming…
Recovering Population Genomes from Metagenomic Data
metagenome → shotgun sequencing → reads → assembly → contigs
bin contigs into genomes (genome‐centric metagenomics)
14 of 27
recovering genomes from metagenomic data
metagenome → shotgun sequencing → reads → assembly → contigs → binning → population genomes
binning: classify using coverage and k‐mer profiles
identify strain‐specific SNPs
15 of 27
differential coverage signal
contigs with similar coverage profiles likely belong to the same genome!
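One simple way to quantify "similar coverage profiles" is the correlation between the per-sample coverage of two contigs. The sketch below uses Pearson correlation as an illustrative choice, not the measure of any particular binning tool. The contig names and coverage values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Mean coverage of each contig in three samples (made-up numbers).
coverage = {
    "contig_1": [40.0, 5.0, 22.0],
    "contig_2": [38.0, 6.0, 20.0],
    "contig_3": [3.0, 55.0, 10.0],
}
print(round(pearson(coverage["contig_1"], coverage["contig_2"]), 3))
print(round(pearson(coverage["contig_1"], coverage["contig_3"]), 3))
```

Here contig_1 and contig_2 rise and fall together across samples, so they are candidates for the same genome, while contig_3 follows a different abundance pattern.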
16 of 27
K‐mers and coverage: complementary signals
microbial community from coalbed methane well
Figure: contigs plotted by coverage against tetranucleotide (PC1)

Genome                       Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
Methanobacteriaceae 1        98.4       1.6        2.32
Methanobacteriaceae 2        96.8       0.8        2.23
Methanobacteriaceae 3        88.6       0.0        1.57
Methanobacteriaceae 4        96.0       0.0        1.71
Bacteria
Actinobacteria 1             95.0       0.9        2.56
Actinobacteria 2             90.5       2.7        2.72
Actinobacteria 3             88.4       2.7        2.48
Clostridiales 1              92.6       9.4        2.91
Clostridiales 2              80.2       0.0        2.74
Elusimicrobia                95.7       2.2        2.03
Thermodesulfovibrionaceae    83.9       0.0        2.66
Syntrophus                   92.9       0.8        2.31
Rikenellaceae                86.7       2.3        2.72
Candidate Phylum OP1         83.9       0.0        1.66
Rhodocyclaceae               69.0       1.63       3.73
17 of 27
many ways to combine coverage + K‐mer profiles
GroopM: http://minillinim.github.io/GroopM/
DBB: https://github.com/dparks1134/DBB
CONCOCT: https://github.com/BinPro/CONCOCT
MetaWatt: http://sourceforge.net/projects/metawatt/
MetaBAT: https://bitbucket.org/berkeleylab/metabat
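As a deliberately naive illustration of combining the two signals, the sketch below adds a tetranucleotide distance and a log-coverage distance into one score. It is not the method of any tool listed above. The weighting scheme, function names, and toy contigs are all assumptions, and the two distances are not rescaled to a common range.

```python
from collections import Counter
from itertools import product
from math import log, sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_freq(seq):
    """Relative frequency of each of the 256 tetranucleotides in seq."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def combined_distance(a, b, weight=0.5):
    """Weighted sum of tetranucleotide distance and log-coverage distance."""
    d_tetra = euclidean(tetra_freq(a["seq"]), tetra_freq(b["seq"]))
    d_cov = euclidean([log(c + 1) for c in a["cov"]],
                      [log(c + 1) for c in b["cov"]])
    return weight * d_tetra + (1 - weight) * d_cov


# Toy contigs: repeated motifs stand in for sequence, coverage is per sample.
contig_a = {"seq": "ACGTTGCA" * 40, "cov": [40, 5, 22]}
contig_b = {"seq": "ACGTTGCA" * 35 + "ACGTAGCA" * 5, "cov": [38, 6, 20]}
contig_c = {"seq": "GGCCGCGC" * 40, "cov": [3, 55, 10]}
print(combined_distance(contig_a, contig_b) < combined_distance(contig_a, contig_c))
```

A distance of this kind could feed any standard clustering method. The tools above differ mainly in how they model and weight the two signals.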
18 of 27
MetaBAT overview
Kang et al., bioRxiv, 2014
19 of 27
MetaBAT: statistical model of tetranucleotide signatures
Empirical parameters from ~1500 reference genomes
Posterior probability that two contigs are from different genomes:
Kang et al., bioRxiv, 2014
Figure (contig size = 10kb): probability P(inter|D) plotted against tetranucleotide distance D (two panels)
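The formula from the original slide did not survive extraction. By Bayes' rule, a posterior of this kind has the general form below, where P(D | inter) and P(D | intra) are the distributions of tetranucleotide distance D for contig pairs from different genomes and from the same genome. The prior terms P(inter) and P(intra) are an assumption of this sketch and cancel if the two are taken to be equal. Kang et al. give the exact model and its fitted parameters.

```latex
P(\mathrm{inter} \mid D) =
  \frac{P(D \mid \mathrm{inter})\, P(\mathrm{inter})}
       {P(D \mid \mathrm{inter})\, P(\mathrm{inter})
        + P(D \mid \mathrm{intra})\, P(\mathrm{intra})}
```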
20 of 27
rapidly filling out tree of life
60 bacterial phyla
>3000 population genomes
23 habitats
51 phyla with population genome representatives
21 of 27
take away points
Population genomes can be recovered from metagenomic samples
K‐mer profiles complement differential coverage signal
Rapidly expanding reference genomes
Improve gene‐centric metagenomics
Assessing and Refining Population Genomes
23 of 27
estimating quality of population genomes
Additional markers refine quality estimates
Scaffolds
Gammaproteobacteria sp.: 80% complete, 20% contaminated
105 bacterial marker genes, estimates: 92% comp., 17% cont.
281 clade-specific marker genes, estimates: 83% comp., 22% cont.
Parks et al., Genome Res., 2015
Estimates ± 5%
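The idea behind marker-based estimates can be sketched in a simplified form: completeness is the share of expected single-copy marker genes that are found, and contamination is the share found in extra copies. The approach in Parks et al. is more refined than this, so the function below is an illustration only. The marker names and counts are hypothetical.

```python
def quality_estimates(marker_copies):
    """Estimate (completeness, contamination) in percent from marker gene counts.

    marker_copies maps each expected single-copy marker gene to the number of
    copies found in the genome bin.
    """
    n = len(marker_copies)
    present = sum(1 for c in marker_copies.values() if c >= 1)
    extra = sum(c - 1 for c in marker_copies.values() if c > 1)
    return 100.0 * present / n, 100.0 * extra / n


# Hypothetical bin: 10 markers expected, 8 found, 2 of those found twice.
copies = {f"marker_{i}": 1 for i in range(6)}
copies.update({"marker_6": 2, "marker_7": 2, "marker_8": 0, "marker_9": 0})
print(quality_estimates(copies))
```

With only a handful of markers, a single missing or duplicated gene shifts the estimate by many percentage points, which is one reason larger marker sets give tighter estimates.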
24 of 27
varying quality of recovered genomes
microbial community from coalbed methane well
Figure: contigs plotted by coverage against tetranucleotide (PC1)

Genome                       Comp. (%)  Cont. (%)  Length (Mbp)
Archaea
Methanobacteriaceae 1        98.4       1.6        2.32
Methanobacteriaceae 2        96.8       0.8        2.23
Methanobacteriaceae 3        88.6       0.0        1.57
Methanobacteriaceae 4        96.0       0.0        1.71
Bacteria
Actinobacteria 1             95.0       0.9        2.56
Actinobacteria 2             90.5       2.7        2.72
Actinobacteria 3             88.4       2.7        2.48
Clostridiales 1              92.6       9.4        2.91
Clostridiales 2              80.2       0.0        2.74
Elusimicrobia                95.7       2.2        2.03
Thermodesulfovibrionaceae    83.9       0.0        2.66
Syntrophus                   92.9       0.8        2.31
Rikenellaceae                86.7       2.3        2.72
Candidate Phylum OP1         83.9       0.0        1.66
Rhodocyclaceae               69.0       1.63       3.73
25 of 27
identifying potential contamination
95th percentile
outliers… treat with caution
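A percentile cutoff of this kind can be sketched as follows: take a reference distribution of distances between genome fragments and their own genome's signature, compute its 95th percentile, and flag contigs in a bin that exceed it. The distances, contig names, and the stand-in reference distribution below are made up, and the function names are assumptions.

```python
from statistics import quantiles


def percentile_95(values):
    """95th percentile, using the 'inclusive' method of statistics.quantiles."""
    return quantiles(values, n=20, method="inclusive")[-1]


def flag_outliers(contig_distances, reference_distances):
    """Return the contigs whose distance exceeds the reference 95th percentile."""
    threshold = percentile_95(reference_distances)
    return sorted(name for name, d in contig_distances.items() if d > threshold)


# Stand-in for an intra-genome distance distribution (not real data).
reference = [0.01 * i for i in range(1, 101)]
# Distance of each contig's tetranucleotide signature from its bin's signature.
contigs = {"contig_1": 0.20, "contig_2": 0.55, "contig_3": 0.99}
print(flag_outliers(contigs, reference))
```

By construction about 5% of genuine fragments also exceed such a cutoff, so a flagged contig is a candidate for inspection rather than proof of contamination.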
26 of 27
K‐mer modeling: impact of evolution
Bacteria vs. Archaea (Intra‐genome 95th percentile; K = 4)
Classes of Proteobacteria (Intra‐genome 95th percentiles; K = 4)
27 of 27
final thoughts
K‐mers widely used in gene‐ and genome‐centric metagenomics
Population genomes substantially improving diversity of available reference genomes
Big win for taxonomic attribution methods
And CheckM, and many other bioinformatic programs
How best to exploit population genomes
Looking at 100,000+ reference genomes in next few years
Issues in terms of scalability
Using ‘noisy’ population genomes raises interesting questions
Thank you!