2013 hmp-assembly-webinar

C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University [email protected] HMP – Metagenome assembly

Upload: ctitusbrown

Post on 10-May-2015


TRANSCRIPT

Page 1: 2013 hmp-assembly-webinar

C. Titus Brown
Assistant Professor

CSE, MMG, BEACON
Michigan State University

[email protected]

HMP – Metagenome assembly

Page 2: 2013 hmp-assembly-webinar

Acknowledgements

Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Jordan Fish
• Chris Welcher

Collaborators
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI

Funding

USDA NIFA; NSF IOS; BEACON.

Page 3: 2013 hmp-assembly-webinar

Open, online science

All of the software and approaches I’m talking about today are available:

Assembling large, complex metagenomes: arxiv.org/abs/1212.2832

khmer software: github.com/ged-lab/khmer/

Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown

Page 4: 2013 hmp-assembly-webinar

Illumina! De Bruijn graphs!

• Today I’ll be talking about Illumina data sets, and de Bruijn graph assembly (k-mer assembly).

• This is because my research has largely focused on scaling to large data sets (soil metagenomics!) and Illumina is the real scaling challenge.

Page 5: 2013 hmp-assembly-webinar

Assembler heuristics

• In order to build assemblies, each assembler makes choices – uses heuristics – to reach a conclusion.

• These heuristics may not be appropriate for your sample!
– High polymorphism?
– Mixed population vs clonal?
– Genomic vs metagenomic vs mRNA
– Low coverage drives differences in assembly.

Page 6: 2013 hmp-assembly-webinar

Evaluating assembly

Evaluating correctness of metagenomes is still undiscovered country.

Page 7: 2013 hmp-assembly-webinar

Shotgun sequencing

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
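The definition of coverage above can be sketched in a few lines of Python. The function names and the toy numbers are illustrative assumptions, not taken from the talk: `mean_coverage` is the usual back-of-the-envelope estimate, and `per_base_coverage` counts reads position by position, which is the "line straight down" picture.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def per_base_coverage(genome_size, read_starts, read_length):
    """Count how many reads overlap each genome position.

    read_starts holds the 0-based start position of each read (hypothetical input;
    in practice these would come from read mapping).
    """
    depth = [0] * genome_size
    for start in read_starts:
        # A read covers read_length consecutive positions, clipped at the genome end.
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # 1000 reads of 100 bp over a 10 kb genome.
    print(mean_coverage(1000, 100, 10_000))
    print(per_base_coverage(10, [0, 2, 4], 4))
```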

Page 8: 2013 hmp-assembly-webinar

Reducing to k-mer overlaps

Note that k-mer abundance is not properly represented here! Each blue k-mer will be present around 10 times.

Page 9: 2013 hmp-assembly-webinar

Errors create new k-mers

Each single base error generates ~k new k-mers. Generally, erroneous k-mers show up only once – errors are random.
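A small sketch of why one substitution produces about k new k-mers: every k-mer window that spans the erroneous base changes. The helper names and the toy sequence are assumptions made up for illustration, and reverse complements are ignored for simplicity.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def new_kmers_from_error(true_read, error_read, k):
    """k-mers present in the erroneous read but absent from the true read."""
    return set(kmers(error_read, k)) - set(kmers(true_read, k))


if __name__ == "__main__":
    true_read = "ACGTAGGCTTCAGATC"
    # Same read with a single substitution (T -> A) at 0-based position 8.
    error_read = "ACGTAGGCATCAGATC"
    novel = new_kmers_from_error(true_read, error_read, k=4)
    print(len(novel), sorted(novel))
```

An error near the end of a read is spanned by fewer windows, which is one reason the count is "~k" rather than exactly k.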

Page 10: 2013 hmp-assembly-webinar

So, k-mer abundance plots are mixtures of true and false k-mers.

Page 11: 2013 hmp-assembly-webinar

Counting k-mers - histograms

Low-abundance peak (errors)

Page 12: 2013 hmp-assembly-webinar

Counting k-mers - histograms

High-abundance peak (true k-mers)
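The two peaks can be reproduced on toy data. This is a minimal sketch using an exact in-memory counter; it is not how khmer counts k-mers, it ignores reverse complements, and the reads are invented for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads (exact counting, fine for toy data only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    # Ten error-free copies of one read plus one copy with a single substitution.
    reads = ["ACGTAGGCTTCAGATC"] * 10 + ["ACGTAGGCATCAGATC"]
    hist = abundance_histogram(kmer_counts(reads, k=4))
    for abundance in sorted(hist):
        print(abundance, hist[abundance])
```

In this toy set, the k-mers at abundance 1 come from the erroneous read, while the true k-mers sit near the sequencing coverage.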

Page 13: 2013 hmp-assembly-webinar

Approach: Digital normalization (a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
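The arithmetic behind the A-versus-B example, written out as a sketch. The function name and the dictionary of relative abundances are hypothetical; the only assumption is that sequencing depth scales with relative abundance.

```python
def required_coverage(abundances, target, target_coverage):
    """Coverage every member receives once `target` reaches `target_coverage`.

    abundances maps a name to its relative abundance, e.g. {"A": 10, "B": 1}.
    """
    # Depth is assumed proportional to abundance, so one scale factor applies to all.
    scale = target_coverage / abundances[target]
    return {name: abundance * scale for name, abundance in abundances.items()}


if __name__ == "__main__":
    print(required_coverage({"A": 10, "B": 1}, target="B", target_coverage=10))
```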

Page 14: 2013 hmp-assembly-webinar

Digital normalization

Page 15: 2013 hmp-assembly-webinar

Digital normalization

Page 16: 2013 hmp-assembly-webinar

Digital normalization

Page 17: 2013 hmp-assembly-webinar

Digital normalization

Page 18: 2013 hmp-assembly-webinar

Digital normalization

Page 19: 2013 hmp-assembly-webinar

Digital normalization

Page 20: 2013 hmp-assembly-webinar

Digital normalization approach

A digital analog to cDNA library normalization, diginorm:

• Reference free.

• Is single pass: looks at each read only once;

• Does not “collect” the majority of errors;

• Keeps all low-coverage reads;

• Smooths out coverage of regions.
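The properties listed above follow from a simple streaming rule. What follows is a minimal sketch of that rule as I understand it from the diginorm preprint: estimate a read's coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. It is not khmer's implementation: khmer uses a fixed-memory probabilistic counting structure, whereas this sketch uses an exact `Counter`, ignores reverse complements, and the defaults `k=20` and `cutoff=20` are assumptions.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield the reads whose estimated coverage is still below `cutoff`.

    Single pass and reference free: each read is examined once, against counts
    accumulated only from reads that were kept earlier.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        # Median k-mer count is a coverage estimate that tolerates a few erroneous k-mers.
        if median(counts[km] for km in read_kmers) < cutoff:
            for km in read_kmers:
                counts[km] += 1
            yield read
        # Otherwise the read is redundant and dropped, so its k-mers
        # (including any erroneous ones) never enter the table.


if __name__ == "__main__":
    reads = ["ACGTAGGCTTCAGATC"] * 50 + ["TTTTGGGGCCCCAAAA"]
    kept = list(digital_normalization(reads, k=4, cutoff=5))
    print(len(kept), kept[-1])
```

The high-coverage read is capped at the cutoff while the single low-coverage read is retained, which is the "keeps all low-coverage reads" behavior.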

Page 21: 2013 hmp-assembly-webinar

Coverage before digital normalization:

(MDA amplified)

Page 22: 2013 hmp-assembly-webinar

Coverage after digital normalization:

Normalizes coverage

Discards redundancy

Eliminates majority of errors

Scales assembly dramatically.

Assembly is 98% identical.

Page 23: 2013 hmp-assembly-webinar

In our experience…

• Digital normalization produces “good” metagenome assemblies.

• Smooths out abundance variation, strain variation.

• Reduces computational requirements for assembly.

• It also kinda makes sense :)

Page 24: 2013 hmp-assembly-webinar

Additional Approach for Metagenomes: Data partitioning

(a computational version of cell sorting)

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

“Divide and conquer”

Memory-efficient implementation helps to scale assembly.

Pell et al., 2012, PNAS
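Partitioning by connectivity can be illustrated with a union-find over shared k-mers. This is a simplified stand-in, not the memory-efficient graph approach of Pell et al.: here two reads are treated as connected whenever they share at least one exact k-mer, everything is held in ordinary dictionaries, and reverse complements are ignored. All names are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into partitions of transitively k-mer-connected reads."""
    parent = list(range(len(reads)))

    def find(i):
        # Find the partition representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    partitions = {}
    for idx, read in enumerate(reads):
        partitions.setdefault(find(idx), []).append(read)
    return list(partitions.values())


if __name__ == "__main__":
    reads = ["ACGTAGGC", "TAGGCTTC", "TTTTGGGG", "GGGGCCCC"]
    for part in partition_reads(reads, k=4):
        print(part)
```

Each partition can then be assembled independently, which is where the "divide and conquer" scaling comes from.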

Page 25: 2013 hmp-assembly-webinar

Partitioning separates reads by genome. Strain variants co-partition.

When computationally spiking HMP mock data with one E. coli genome (left) or multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs multi-genome partitions (green).

Partitions containing spiked data are indicated with a *.

Adina Howe

Page 26: 2013 hmp-assembly-webinar

Conclusions re strain variation/chimerism (previous slide)

• When spiking in intentionally complex mixtures, only a small fraction of partitions are chimeric.

• This means that only a small fraction of contigs could be chimeric.

• Strain variants will almost certainly assemble together.

• Can separate on abundance. See Sharon et al., 2013, PMID 22936250, for Banfield work on this.

Page 27: 2013 hmp-assembly-webinar

Looking at k-mer histograms…

Page 28: 2013 hmp-assembly-webinar

Diginorm shifts left

Page 29: 2013 hmp-assembly-webinar

Partitioning picks out different genomes

Page 30: 2013 hmp-assembly-webinar

Error correction “fixes” k-mers

Jason Pell

Page 31: 2013 hmp-assembly-webinar

Our experience

• Our metagenome assemblies compare well with others, but we have little in the way of ground truth with which to evaluate.

• Scaffold assembly is tricky; we believe in contig assembly for metagenomes, but not scaffolding.

• See arXiv paper, “Assembling large, complex metagenomes”, for our suggested pipeline and statistics & references.

Page 32: 2013 hmp-assembly-webinar

Metagenomic assemblies are highly variable

Adina Howe et al., arXiv 1212.0159

Page 33: 2013 hmp-assembly-webinar

High coverage is needed.

Low coverage is the dominant problem blocking assembly of your soil metagenome.

Page 34: 2013 hmp-assembly-webinar

Strain variation (soil)

[Figure: top two allele frequencies vs. position within contig]

Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5%

Can measure by read mapping.
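One way to turn mapped reads into the quantities on this slide, sketched under stated assumptions. The input is a hypothetical per-position table of base counts (a pileup) for one contig. The slide does not define "polymorphism rate", so the definition used here (fraction of positions whose second-most-common allele reaches a minimum frequency) and the `minor_threshold` default are assumptions for illustration only.

```python
def top_two_allele_frequencies(base_counts):
    """Return (major, minor) allele frequencies at one position.

    base_counts maps a base to the number of mapped reads showing it, e.g. {"A": 8, "G": 2}.
    """
    total = sum(base_counts.values())
    if total == 0:
        return 0.0, 0.0
    freqs = sorted((count / total for count in base_counts.values()), reverse=True)
    freqs += [0.0] * (2 - len(freqs))  # pad when only one allele was observed
    return freqs[0], freqs[1]


def polymorphism_rate(pileup, minor_threshold=0.1):
    """Fraction of positions whose second allele frequency is >= minor_threshold."""
    if not pileup:
        return 0.0
    polymorphic = sum(
        1 for counts in pileup
        if top_two_allele_frequencies(counts)[1] >= minor_threshold
    )
    return polymorphic / len(pileup)


if __name__ == "__main__":
    pileup = [{"A": 10}, {"A": 8, "G": 2}, {"C": 10}, {"T": 9, "C": 1}]
    print([top_two_allele_frequencies(p) for p in pileup])
    print(polymorphism_rate(pileup, minor_threshold=0.15))
```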

Page 35: 2013 hmp-assembly-webinar

Overconfident predictions

• We can assemble virtually anything but soil ;).
– Genomes, transcriptomes, MDA, mixtures, etc.
– Repeat resolution will be fundamentally limited by sequencing technology (insert size; sampling depth)

• Strain variation confuses assembly, but does not prevent useful results.
– Diginorm is a systematic strategy to enable assembly.
– Banfield has shown how to deconvolve strains at differential abundance.
– Kostas K. results suggest that there will be a species gap sufficient to prevent contig misassembly.
– Even genes “chimeric” between strains are useful.

Page 36: 2013 hmp-assembly-webinar

Reasons why you shouldn’t believe me

1) Strain variation – when we get deeper in soil, we should see more (?). Not sure what will happen, and we do not (yet) have proven approaches.

2) We, by definition, are not yet seeing anything that doesn’t assemble.

3) We have not tackled scaffolding much. Serious investigation of scaffolding will be necessary for any good genome assembly, and scaffolding is a weak point.

Page 37: 2013 hmp-assembly-webinar

Metagenome assemblers

In addition to khmer prefiltering,

• SPADES
• IDBA-UD
• MetaVelvet
• Ray Meta

Page 38: 2013 hmp-assembly-webinar

Assembling in the cloud

• Most metagenomes require 50-150 GB of RAM.

• Many people don’t have access to computers of that size.

• Amazon Web Services (aws.amazon.com) will happily rent you such computers for $1-2/hr.

• I will post instructions and sample data sets for using Amazon today at ged.msu.edu/angus/.

Page 39: 2013 hmp-assembly-webinar

Current research

• Optimizing our programs => faster.

• Building an evaluation framework for metagenome assemblers.

• Error correction!

Page 40: 2013 hmp-assembly-webinar

De novo metagenome error correction makes reads more mappable.

Jason Pell, unpub.

Page 41: 2013 hmp-assembly-webinar

Concluding thoughts

• Achieving one or more assemblies is fairly straightforward.

• Evaluating them is challenging, however, and is where you should be thinking hardest about assembly.

• There are relatively few pipelines available for analyzing assembled metagenomic data. MG-RAST does support this; others?