Transcript
Page 1: Big Data Field Museum

RIDING THE BIG DATA

TIDAL WAVE OF MODERN

MICROBIOLOGY

Adina Howe

Argonne National Laboratory / Michigan State University

Iowa State University, Ag & Biosystems Engr (January)

Page 2: Big Data Field Museum

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Page 3: Big Data Field Museum

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Kim Lewis, 2010

Page 4: Big Data Field Museum

Gene / Genome Sequencing

Collect samples

Extract DNA

Sequence DNA

“Analyze” DNA to identify its content and origin

Taxonomy

(e.g., pathogenic E. Coli)

Function

(e.g., degrades cellulose)

Page 5: Big Data Field Museum

Effects of low cost

sequencing…

First free-living bacterium sequenced

for billions of dollars and years of

analysis

Personal genome can be

mapped in a few days and

hundreds to few thousand

dollars

Page 6: Big Data Field Museum

The experimental continuum

Single Isolate

Pure Culture

Enrichment

Mixed CulturesNatural systems

Page 7: Big Data Field Museum

The era of big data in biology

Stein, Genome Biology, 2010

Computational Hardware

(doubling time 14 months)

Sanger Sequencing

(doubling time 19 months)

NGS (Shotgun) Sequencing

(doubling time 5 months)

1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012

Year

0

1

10

100

1,000

10,000

100,000

1,000,000

Dis

k S

tora

ge,

Mb/$

0.1

1

10

100

1,000

10,000

100,000

1,000,000

DN

A S

equencin

g, M

bp

per $

10,000,000

100,000,000

0.1

1

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

Page 8: Big Data Field Museum

Postdoc experience with data

2003-2008 Cumulative sequencing in PhD = 2000 bp

2008-2009 Postdoc Year 1 = 50 Gbp

2009-2010 Postdoc Year 2 = 450 Gbp

2014 = 50 Tbp

2015 = 500 Tbp budgeted

Page 9: Big Data Field Museum

“Soil Census” to “Soil Catalogs”: Who is there?

Targetting conserved regions

of known genes

Most popular:

16S ribosomal RNA gene –

conserved in bacteria and

archaea

Who is there - community

profiling based on sequence

similarity

Must have previous

knowledge of genes

Must infer function based on

phylogeny – not advised

TARGETTED SEQUENCING

STRATEGY

Page 10: Big Data Field Museum

“Soil Census” to “Soil Catalogs”: Who is there?

Targetting conserved regions

of known genes

Most popular:

16S ribosomal RNA gene –

conserved in bacteria and

archaea

Who is there - community

profiling based on sequence

similarity

Must have previous

knowledge of genes

Must infer function based on

phylogeny – not advised

$15 / sample

TARGETTED SEQUENCING

STRATEGY

Page 11: Big Data Field Museum

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

Page 12: Big Data Field Museum

THE DIRT ON SOIL

Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress

MAGNIFICENT BIODIVERSITY

Page 13: Big Data Field Museum

THE DIRT ON SOIL

SPATIAL HETEROGENEITY

http://www.fao.org/ www.cnr.uidaho.edu

Page 14: Big Data Field Museum

THE DIRT ON SOIL

DYNAMIC

Page 15: Big Data Field Museum

THE DIRT ON SOIL

INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES

Philippot, 2013, Nature Reviews Microbiology

Page 16: Big Data Field Museum

Our shared challenges

Climate Change

Energy Supply

USGCRP 2009

www.alutiiq.com

http://guardianlv.com/

Human Health

An understanding

of microbial ecology

Page 17: Big Data Field Museum

SOIL MICROBIOLOGY: CARBON

REGULATION

The anthropogenic CO2 production is only 10% of that of the soil

Sustainable agriculture permits carbon

sequestration in the range of 0.3 – 1 ton of

C/ha.yr ~ 10% of all carbon emitted by cars

(Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)

Page 18: Big Data Field Museum

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

Page 19: Big Data Field Museum

Lesson #1: Accessing information in

data

http://siliconangle.com/files/2010/09/image_thumb69.png

Page 20: Big Data Field Museum

de novo assembly

Compresses dataset size significantly

Improved data quality (longer sequences, gene order)

Reference not necessary (novelty)

Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes

Page 21: Big Data Field Museum

Metagenome assembly…a scaling

problem.

Page 22: Big Data Field Museum

Shotgun sequencing and de novo

assembly

It was the Gest of times, it was the wor

, it was the worst of timZs, it was the

isdom, it was the age of foolisXness

, it was the worVt of times, it was the

mes, it was Ahe age of wisdom, it was th

It was the best of times, it Gas the wor

mes, it was the age of witdom, it was th

isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

Page 23: Big Data Field Museum

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of

“computer

crunching” on a

super computer

Page 24: Big Data Field Museum

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of

“computer

crunching” on a

super computerAssembly of 300 Gbp can be

done with any assembly program

in less than 14 GB RAM and less

than 24 hours.

Page 25: Big Data Field Museum

Natural community characteristics

Diverse

Many organisms

(genomes)

Page 26: Big Data Field Museum

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x

Page 27: Big Data Field Museum

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x Sample 10x

Page 28: Big Data Field Museum

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x Sample 10x

Overkill

Page 29: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Page 30: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Page 31: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Page 32: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Page 33: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Page 34: Big Data Field Museum

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Zhang et al., 2014, PLOS One

Scales datasets for assembly up to 95% - same assembly

outputs.

Genomes, mRNA-seq, metagenomes (soils, gut, water)

Page 35: Big Data Field Museum

Tackling Soil Biodiversity

Source: Chuck Haney

C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)

Janet Jansson, Susannah Tringe (JGI)

Page 36: Big Data Field Museum

The reality?

Page 37: Big Data Field Museum

More like…

Source: Chuck HaneyHowe et. al, 2014, PNAS

Page 38: Big Data Field Museum

SOIL METAGENOME REALITY CHECK

Grand Challenge effort –

10% of soil biodiversity

sampled

Incredible soil biodiversity

(estimate required 10

Tbp/sample)

“To boldly go where no man

has gone before”: >60%

Unknown

0

100

200

300

400

am

ino a

cid

meta

bolis

m

carb

ohydra

te m

eta

bo

lism

mem

bra

ne tra

nspo

rt

sig

nal tr

ansdu

ction

transla

tion

fold

ing

, sort

ing a

nd d

egra

da

tion

meta

bolis

m o

f co

facto

rs a

nd v

itam

ins

energ

y m

eta

bolis

m

transp

ort

and

cata

bolis

m

lipid

meta

bolis

m

tra

nscri

ption

ce

ll g

row

th a

nd

dea

th

replic

ation

and

rep

air

xen

obio

tics b

iod

egra

datio

n a

nd m

eta

bo

lism

nucle

otide m

eta

bolis

m

gly

can b

iosynth

esis

and m

eta

bolis

m

meta

bolis

m o

f te

rpenoid

s a

nd

poly

ke

tides

cell

motilit

y

Tota

l C

ount

KO

corn and prairie

corn only

prairie only

Howe et al, 2014, PNAS

Managed agriculture soils exhibit less

diversity, likely from its history of

cultivation.

Page 39: Big Data Field Museum

Frustrating, but helpful

“Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)

Evaluation of sequencing as a tool

Broad characterization

“Right” kind of data

How much should I sequence?

Data characteristics

Breadth vs. depth of sampling

Computational tool development

Tr

Page 40: Big Data Field Museum

Lesson #2: Connecting the dots

from data to information

If 80% is

unknown…what

can one do?

Page 41: Big Data Field Museum

The idea of co-occurrence

Page 42: Big Data Field Museum

Co-occurrence to detect novelty

Williams et al., 2014, Frontiers of Micro

Page 43: Big Data Field Museum

Causation vs correlation

Wikipedia.com

Page 44: Big Data Field Museum
Page 45: Big Data Field Museum
Page 46: Big Data Field Museum

Success story in Human

Microbiome

Page 47: Big Data Field Museum

#3 Is more data better?

Bottlenecks for the emerging

microbiologists

Page 48: Big Data Field Museum

Technical challenges – many

solutions

Access to data and its value

Access to resources

Data volume and velocity “clog”

Data is very heterogeneous

Page 49: Big Data Field Museum

Software Developers

Computer Scientists

Clinicians

PIs

Data generators

Microbiologists

Data Analyzers

Statisticians

Bioinformaticians

http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html

Data intensive microbiology

Page 50: Big Data Field Museum
Page 51: Big Data Field Museum
Page 52: Big Data Field Museum

Social obstacles – the main

challenge

Shift of costs do not mean shift of

expectations

http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/

Dear PI,

It will take longer than

the time it took you to do

your experiment to

analyze the data. Please

do not write me for

results within 24 hours of

your sequences

becoming available.

- Adina

Page 53: Big Data Field Museum

Culture of sharing

http://www.heathershumaker.com/

Metagenomic Datasets

Page 54: Big Data Field Museum

Training / Incentives

Emails between collaborators don’t contain as

much “science” as I’d like:

Page 55: Big Data Field Museum

All analysis: accessible,

reproducible, and automated

Page 56: Big Data Field Museum

RIDING THE BIG DATA

TIDAL WAVE OF MODERN

MICROBIOLOGY

Adina Howe

Argonne National Laboratory / Michigan State University

Iowa State University, Ag & Biosystems Engr (January)

“”

Page 57: Big Data Field Museum

RIDING THE BIG DATA

TIDAL WAVE OF MODERN

MICROBIOLOGY

Adina Howe

Argonne National Laboratory / Michigan State University

Iowa State University, Ag & Biosystems Engr (January)

Page 58: Big Data Field Museum

Acknowledgements

C. Titus Brown (MSU)

James Tiedje (MSU)

Daina Ringus (UC)

Folker Meyer (ANL)

Eugene Chang (UC)

NSF Biology Postdoc Fellowship

DOE Great Lakes Bioenergy Research Center


Top Related