2013 hmp-assembly-webinar

C. Titus Brown
Assistant Professor

CSE, MMG, BEACON
Michigan State University

[email protected]

HMP – Metagenome assembly

Acknowledgements

Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher

Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI

Funding

USDA NIFA; NSF IOS; BEACON.

Open, online science

All of the software and approaches I’m talking about today are available:

Assembling large, complex metagenomes: arxiv.org/abs/1212.2832

khmer software: github.com/ged-lab/khmer/

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown


Illumina! De Bruijn graphs!

• Today I’ll be talking about Illumina data sets, and de Bruijn graph assembly (k-mer assembly).

• This is because my research has largely focused on scaling to large data sets (soil metagenomics!) and Illumina is the real scaling challenge.

Assembler heuristics

• In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.

• These heuristics may not be appropriate for your sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.

Evaluating assembly

Evaluating correctness of metagenomes is still undiscovered country.

Shotgun sequencing

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
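The coverage definition above can be sketched in a few lines. This is a minimal illustration, not part of the khmer software; the function names and the toy numbers are assumptions made for the example.

```python
def expected_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base: (N * L) / G."""
    return num_reads * read_length / genome_size


def per_base_depth(read_starts, read_length, genome_size):
    """Count how many reads overlap each position ("draw a line straight down")."""
    depth = [0] * genome_size
    for start in read_starts:
        # Clip reads that would run off the end of the genome.
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # 500,000 reads of 100 bp over a 5 Mbp genome.
    print(expected_coverage(500_000, 100, 5_000_000))
    # Three 4 bp reads tiled over an 8 bp toy genome.
    print(per_base_depth([0, 2, 4], 4, 8))
```

The first function gives the average; the second shows that actual depth varies from base to base around that average.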

Reducing to k-mer overlaps

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.

Errors create new k-mers

Each single base error generates ~k new k-mers.

Generally, erroneous k-mers show up only once – errors are random.
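To see where the "~k" comes from, here is a small sketch that introduces one substitution into a sequence and lists the k-mers that did not exist before. The helper names and the toy sequence are hypothetical. The count is at most k, and can be lower when the error is near a read end or when a mutated k-mer happens to match a real one.

```python
def kmer_set(seq, k):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def new_kmers_from_error(true_seq, pos, wrong_base, k):
    """K-mers present in the erroneous read but absent from the true sequence."""
    read = true_seq[:pos] + wrong_base + true_seq[pos + 1:]
    return kmer_set(read, k) - kmer_set(true_seq, k)


if __name__ == "__main__":
    novel = new_kmers_from_error("ACGTTGCAAGGCTA", pos=7, wrong_base="T", k=4)
    print(sorted(novel), len(novel))
```

Only the k-mers that span the erroneous base can change, which is why a single error touches at most k of them.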

So, k-mer abundance plots are mixtures of true and false k-mers.

Counting k-mers - histograms

Low-abundance peak (errors)

Counting k-mers - histograms

High-abundance peak (true k-mers)
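A k-mer abundance histogram like the ones on these slides can be computed as follows. This is a minimal sketch using an exact in-memory counter, which would not scale to real Illumina data sets; the function name is an assumption, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers seen that many times."""
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(kmer_counts.values())


if __name__ == "__main__":
    hist = kmer_abundance_histogram(["ACGT", "ACGT", "ACGA"], k=3)
    for abundance in sorted(hist):
        print(abundance, hist[abundance])
```

On real data, the entries at abundance 1 would be dominated by erroneous k-mers, and the true k-mers would cluster around the coverage value.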

Approach: Digital normalization

(a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
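The arithmetic behind the A-versus-B example can be written out as a short sketch. The function name and dictionary layout are assumptions for illustration only.

```python
def required_depth(relative_abundance, target_coverage):
    """Coverage each genome ends up with once the rarest one reaches target_coverage."""
    rarest = min(relative_abundance.values())
    return {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }


if __name__ == "__main__":
    # A is 10 times as abundant as B; we want 10x coverage of B.
    print(required_depth({"A": 10, "B": 1}, target_coverage=10))
```

Sequencing deep enough to reach the target for the rare genome forces proportionally more coverage onto the abundant one, and that excess is the redundancy the slide proposes to discard.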

Digital normalization

Digital normalization approach

A digital analog to cDNA library normalization, diginorm:

• Reference free.

• Is single pass: looks at each read only once;

• Does not “collect” the majority of errors;

• Keeps all low-coverage reads;

• Smooths out coverage of regions.
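The properties listed above can be illustrated with a small single-pass filter. This is a sketch, not the khmer implementation: it assumes the median k-mer count of a read is used as its coverage estimate (my understanding of the published diginorm description), it uses an exact Python counter instead of a memory-efficient counting structure, and the names `digital_normalize`, `k` and `cutoff` are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All k-mers of a read, in order (reverse complements ignored for brevity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:  # single pass: each read is examined exactly once
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Reference-free coverage estimate from k-mers seen so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute k-mers
            kept.append(read)
    return kept


if __name__ == "__main__":
    abundant = "ACGTTGCAAGGCTA"
    rare = "TTTTCCCCGGGGAA"
    kept = digital_normalize([abundant] * 50 + [rare], k=4, cutoff=5)
    print(len(kept), kept.count(abundant), rare in kept)
```

Because discarded reads never add their k-mers to the counter, errors in redundant reads are not accumulated, while reads from low-coverage regions always pass the cutoff.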

Coverage before digital normalization:

(MDA amplified)

Coverage after digital normalization:

Normalizes coverage

Discards redundancy

Eliminates majority of errors

Scales assembly dramatically.

Assembly is 98% identical.

In our experience…

• Digital normalization produces “good” metagenome assemblies.

• Smooths out abundance variation, strain variation.

• Reduces computational requirements for assembly.

• It also kinda makes sense :)

Additional Approach for Metagenomes: Data partitioning

(a computational version of cell sorting)

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

“Divide and conquer”

Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
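Partitioning on connectivity can be sketched with a union-find over reads that share k-mers. This is only an illustration of the idea: it stores every k-mer exactly in a dictionary rather than using the memory-efficient graph representation the slide refers to, it ignores reverse complements, and the function name is hypothetical.

```python
from collections import defaultdict


def partition_reads(reads, k=20):
    """Group reads into partitions connected through shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Path-halving find for the union-find structure.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_read_with = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_read_with:
                a, b = find(idx), find(first_read_with[kmer])
                if a != b:
                    parent[a] = b  # merge the two partitions
            else:
                first_read_with[kmer] = idx

    partitions = defaultdict(list)
    for idx, read in enumerate(reads):
        partitions[find(idx)].append(read)
    return list(partitions.values())


if __name__ == "__main__":
    parts = partition_reads(["ACGTTGCA", "TTGCAAGG", "CCCCGGGG"], k=4)
    print(sorted(len(p) for p in parts))
```

Each resulting partition can then be assembled independently, which is where the divide-and-conquer saving comes from.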

Partitioning separates reads by genome.

Strain variants co-partition.

When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green).

Partitions containing spiked data are indicated with a *.

Adina Howe

Conclusions re strain variation/chimerism (previous slide)

• When spiking in intentionally complex mixtures, only a small fraction of partitions are chimeric.

• This means that only a small fraction of contigs could be chimeric.

• Strain variants will almost certainly assemble together.

• Can separate on abundance. See Sharon et al., 2013, PMID 22936250, for Banfield work on this.

Looking at k-mer histograms…

Diginorm shifts left

Partitioning picks out diff genomes

Error correction “fixes” k-mers

Jason Pell

Our experience

• Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate.

• Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding.

• See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.

Metagenomic assemblies are highly variable

Adina Howe et al., arXiv 1212.0159

High coverage is needed.

Low coverage is the dominant problem blocking assembly of your soil metagenome.

Strain variation (soil)

[Figure: top two allele frequencies (y-axis) vs. position within contig (x-axis)]

Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.

Can measure by read mapping.
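One way to measure this from mapped reads is sketched below. It assumes the pileup has already been reduced to one string of observed bases per contig position, and it assumes a position counts as polymorphic when the second most common base reaches a chosen frequency. Both the input format and the `min_minor_freq` threshold are illustrative choices, not the definition used for the figure.

```python
from collections import Counter


def top_two_allele_freqs(column):
    """Frequencies of the two most common bases at one position."""
    counts = Counter(column).most_common(2)
    depth = len(column)
    major = counts[0][1] / depth
    minor = counts[1][1] / depth if len(counts) > 1 else 0.0
    return major, minor


def polymorphism_rate(columns, min_minor_freq=0.2):
    """Fraction of covered positions whose second allele reaches min_minor_freq."""
    covered = [col for col in columns if col]  # skip positions with no reads
    if not covered:
        return 0.0
    variable = sum(
        1 for col in covered if top_two_allele_freqs(col)[1] >= min_minor_freq
    )
    return variable / len(covered)


if __name__ == "__main__":
    pileup = ["AAAAAAAAAA", "AAAAACCCCC", "GGGGGGGGGT", "TTTT"]
    print(polymorphism_rate(pileup))
```

Requiring a minimum minor-allele frequency keeps isolated sequencing errors from being counted as strain variation.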

Overconfident predictions

• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth)

• Strain variation confuses assembly, but does not prevent useful results.
– Diginorm is systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at differential abundance.
– Kostas K. results suggest that there will be a species gap sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.

Reasons why you shouldn’t believe me

1) Strain variation – when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches.

2) We, by definition, are not yet seeing anything that doesn’t assemble.

3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.

Metagenome assemblers

In addition to khmer prefiltering,

• SPAdes
• IDBA-UD
• MetaVelvet
• Ray Meta

Assembling in the cloud

• Most metagenomes require 50-150 GB of RAM.

• Many people don’t have access to computers of that size.

• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.

• I will post instructions and sample data sets for using Amazon today at ged.msu.edu/angus/.

Current research

• Optimizing our programs => faster.

• Building an evaluation framework for metagenome assemblers.

• Error correction!

De novo metagenome error correction makes reads more mappable.

Jason Pell, unpub.

Concluding thoughts

• Achieving one or more assemblies is fairly straightforward.

• Evaluating them is challenging, however, and is where you should be thinking hardest about assembly.

• There are relatively few pipelines available for analyzing assembled metagenomic data. MG-RAST does support this; others?
