Jonathan Eisen talk at #ievobio 2010

Phylogenomics of microbes: the dark matter of biology Jonathan A. Eisen UC Davis Talk for iEVOBIO June 29, 2010 Tuesday, June 29, 2010

Upload: jonathan-eisen

Post on 17-Jul-2015

TRANSCRIPT

Page 1: Jonathan Eisen talk at #ievobio 2010

Phylogenomics of microbes: the dark matter of biology

Jonathan A. Eisen, UC Davis

Talk for iEVOBIO, June 29, 2010

Tuesday, June 29, 2010

Page 2: Jonathan Eisen talk at #ievobio 2010

Eisen Lab - Phylogenomics of Novelty

Origin of New Functions and Processes

• New genes
• Changes in old genes
• Changes in pathways

Species Evolution

• Phylogenetic history
• Vertical vs. horizontal descent
• Needed to track gain/loss of processes, infer convergence

Genome Dynamics

• Evolvability
• Repair and recombination processes
• Intragenomic variation

Page 3: Jonathan Eisen talk at #ievobio 2010


Page 4: Jonathan Eisen talk at #ievobio 2010

Social Networking in Science


Page 5: Jonathan Eisen talk at #ievobio 2010

Bacteria evolve

Page 6: Jonathan Eisen talk at #ievobio 2010

• There are known knowns. These are things we know that we know.

• There are known unknowns. That is to say, there are things that we know we don't know.

• But there are also unknown unknowns. There are things we don't know we don't know.

An homage to Donald Rumsfeld


Page 7: Jonathan Eisen talk at #ievobio 2010

Outline

• Known knowns (background)
– rRNA Tree of Life
– Genomics
– rRNA PCR
– Metagenomics
• Known unknowns
– GEBA project - past
– GEBA project - present
– GEBA project - future
• Unknown unknowns?

Page 8: Jonathan Eisen talk at #ievobio 2010

Known Knowns 1:

rRNA Tree of Life


Page 9: Jonathan Eisen talk at #ievobio 2010


Page 10: Jonathan Eisen talk at #ievobio 2010

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria

Page 11: Jonathan Eisen talk at #ievobio 2010

The Tree of Life: Three Main Domains

Eukaryotes

Archaea

Bacteria

The Tree of Life

Unrooted Tree of Life from Barton et al., “Evolution”

Page 12: Jonathan Eisen talk at #ievobio 2010

Known Knowns 2:

Genomics and Phylogenomics


Page 13: Jonathan Eisen talk at #ievobio 2010

Fleischmann et al. 1995


Page 14: Jonathan Eisen talk at #ievobio 2010

Microbial genomes

From http://genomesonline.org

Page 15: Jonathan Eisen talk at #ievobio 2010

Genome Sequences Have Revolutionized Microbiology

• Predictions of metabolic processes

• Better vaccine and drug design

• New insights into mechanisms of evolution

• Genomes serve as template for functional studies

• New enzymes and materials for engineering and synthetic biology


Page 16: Jonathan Eisen talk at #ievobio 2010

Microbes Run the Planet


Page 17: Jonathan Eisen talk at #ievobio 2010

Microbes in the world I: rRNA PCR

Perna et al. 2003

Page 18: Jonathan Eisen talk at #ievobio 2010

Lateral Transfer

from Doolittle, 1999

Page 19: Jonathan Eisen talk at #ievobio 2010

from Lerat et al.

Page 20: Jonathan Eisen talk at #ievobio 2010

Known Knowns 3:

rRNA PCR


Page 21: Jonathan Eisen talk at #ievobio 2010

Great Plate Count Anomaly

Culturing Count vs. Microscope Count

Page 22: Jonathan Eisen talk at #ievobio 2010

Great Plate Count Anomaly

Culturing Count <<<< Microscope Count

Page 23: Jonathan Eisen talk at #ievobio 2010

Great Plate Count Anomaly

Culturing Count <<<< Microscope Count

DNA

Page 24: Jonathan Eisen talk at #ievobio 2010

PCR Revolution

• Extract DNA
• PCR w/ universal rDNA primers
• Sequence
• Align and compare to other rDNAs
• Phylogenetic classification
• OTUs, ecology
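The last steps of this workflow (comparing aligned rDNA sequences and grouping them into OTUs) can be sketched in Python as a greedy clustering. This is a minimal illustration, not the method used in the talk: the 97% identity cutoff, the function names `identity` and `greedy_otus`, and the toy reads are all assumptions, and the input is assumed to be aligned already.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_otus(aligned: dict[str, str], threshold: float = 0.97) -> list[list[str]]:
    """Put each sequence into the first OTU whose seed it matches at >= threshold."""
    otus: list[tuple[str, list[str]]] = []  # (seed sequence, member names)
    for name, seq in aligned.items():
        for seed, members in otus:
            if identity(seed, seq) >= threshold:
                members.append(name)
                break
        else:
            # No existing seed is close enough, so this sequence seeds a new OTU.
            otus.append((seq, [name]))
    return [members for _, members in otus]


# Hypothetical aligned reads, for illustration only.
reads = {
    "read1": "ACGTACGTAC",
    "read2": "ACGTACGTAC",
    "read3": "TTGTACCTAA",
}
print(greedy_otus(reads))
```

Greedy seeding is order dependent, which is one reason automated OTU methods still need care on real data.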

Page 25: Jonathan Eisen talk at #ievobio 2010

Uses of rDNA PCR

Bohannan and Hughes 2003

Hugenholtz 2002

Page 26: Jonathan Eisen talk at #ievobio 2010

rRNA challenges

• Massive amounts of data from next-gen sequencing
• Need for full automation but
– Non-overlapping
– Alignments not always straightforward
– BLAST insufficient
– Phylogenetic methods that have been automated still need work
• Tree of everything might be useful

Page 27: Jonathan Eisen talk at #ievobio 2010

Known Knowns 4:

Metagenomics


Page 28: Jonathan Eisen talk at #ievobio 2010

Microbes in the world I: rRNA PCR

Perna et al. 2003

Page 29: Jonathan Eisen talk at #ievobio 2010

Metagenomics

shotgun

clone


Page 30: Jonathan Eisen talk at #ievobio 2010


Page 31: Jonathan Eisen talk at #ievobio 2010

Metagenomics Challenge


Page 32: Jonathan Eisen talk at #ievobio 2010

Metagenomics Challenge


Page 33: Jonathan Eisen talk at #ievobio 2010

Metagenomics Challenge

Who is out there?

What are they doing?

Page 34: Jonathan Eisen talk at #ievobio 2010

rRNA phylotyping from metagenomics

Venter et al., 2004


Page 35: Jonathan Eisen talk at #ievobio 2010

Shotgun Sequencing Allows Use of Alternative Anchors (e.g., RecA)

Venter et al., 2004


Page 36: Jonathan Eisen talk at #ievobio 2010

Shotgun Sequencing Allows Use of Other Markers

[Figure: bar chart “Sargasso Phylotypes”. x-axis: major phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). y-axis: weighted % of clones (0 to 0.5). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Venter et al., 2004

Page 37: Jonathan Eisen talk at #ievobio 2010

Functional Inference from Metagenomics

• Can work well for individual genes
• Predicting “community” function is challenging because treating community as a bag of genes does not work well
• Better to “compartmentalize” data ...

Page 38: Jonathan Eisen talk at #ievobio 2010

ABCDEFG

TUVWXYZ

Binning challenge


Page 39: Jonathan Eisen talk at #ievobio 2010

ABCDEFG

TUVWXYZ

Binning challenge

Best binning method: reference genomes


Page 40: Jonathan Eisen talk at #ievobio 2010

Reference Genomes Coming from Select Environment


Page 41: Jonathan Eisen talk at #ievobio 2010

ABCDEFG

TUVWXYZ

Binning challenge

No reference genome? What do you do?


Page 42: Jonathan Eisen talk at #ievobio 2010

ABCDEFG

TUVWXYZ

Binning challenge

No reference genome? What do you do?

Assembly? Composition? Get more references?
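Of the options named on this slide, composition is the easiest to sketch in code. The Python below bins a fragment by comparing its tetranucleotide frequencies with those of candidate reference sequences. It is only an illustration under stated assumptions: the choice of 4-mers, cosine similarity, the function names, and the toy sequences are all hypothetical and are not taken from the talk.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq: str) -> list[float]:
    """Relative frequency of each tetranucleotide in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bin_fragment(fragment: str, references: dict[str, str]) -> str:
    """Return the name of the reference whose composition is most similar."""
    profile = tetra_profile(fragment)
    return max(references, key=lambda name: cosine(profile, tetra_profile(references[name])))


# Hypothetical reference sequences, for illustration only.
references = {
    "AT_rich": "ATATTTAAATTATAATATTT",
    "GC_rich": "GCGCGGCCGCGCGCCGGCGC",
}
print(bin_fragment("GCGCGCGCGCGC", references))
```

Composition signals are weak on short reads, which is part of why the talk turns to phylogeny next.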

Page 43: Jonathan Eisen talk at #ievobio 2010

ABCDEFG

TUVWXYZ

Binning challenge

No reference genome? What do you do?

Phylogeny ....

Page 44: Jonathan Eisen talk at #ievobio 2010

Metagenomic challenges

• Massive amounts of data from next-gen sequencing
• Need for full automation but
– Data fragmentary
– BLAST insufficient
– Automation of phylogenetic methods a bit better for protein coding genes b/c alignments better
– Reference databases incomplete

Page 45: Jonathan Eisen talk at #ievobio 2010

Known Unknowns 1:

GEBA Past


Page 46: Jonathan Eisen talk at #ievobio 2010

Microbial genomes

From http://genomesonline.org

Page 47: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla. Labels: Acidobacteria, Bacteroides, Fibrobacteres, Gemmimonas, Verrucomicrobia, Planctomycetes, Chloroflexi, Proteobacteria, Chlorobi, Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Chlamydia, Spirochaetes, Deinococcus-Thermus, Aquificae, Thermotogae, TM6, OS-K, Termite Group, OP8, Marine Group A, WS3, OP9, NKB19, OP3, OP10, TM7, OP1, OP11, Nitrospira, Synergistes, Deferribacteres, Thermodesulfobacteria, Chrysiogenetes, Thermomicrobia, Dictyoglomus, Coprothermobacter]

• At least 40 phyla of bacteria

2002

Based on Hugenholtz, 2002

Page 48: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

Based on Hugenholtz, 2002

2002

Page 49: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

Based on Hugenholtz, 2002

2002

Page 50: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

• Same trend in Archaea

Based on Hugenholtz, 2002

2002

Page 51: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

• Same trend in Eukaryotes

Based on Hugenholtz, 2002

2002

Page 52: Jonathan Eisen talk at #ievobio 2010

The Tree is not Happy

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria

Page 53: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

• Solution I: sequence more phyla

• NSF-funded Tree of Life Project

• A genome from each of eight phyla

Eisen & Ward, PIs

Page 54: Jonathan Eisen talk at #ievobio 2010


Page 55: Jonathan Eisen talk at #ievobio 2010

The Tree of Life is Still Angry

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Eukaryotes

Bacteria

Archaea

Page 56: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47. Legend: well sampled phyla]

• At least 100 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled
• Solution - use tree to really fill gaps

Page 57: Jonathan Eisen talk at #ievobio 2010

A Genomic Encyclopedia of Bacteria and Archaea (GEBA)


Page 58: Jonathan Eisen talk at #ievobio 2010

GEBA Pilot Project Overview

• Identify major branches in rRNA tree for which no genomes are available
• Identify branches with a cultured representative in DSMZ
• Grow > 200 of these and prep. DNA
• Sequence and finish 100 (covering breadth of bacterial/archaeal diversity)
• Annotate, analyze, release data
• Assess benefits of tree guided sequencing

Page 59: Jonathan Eisen talk at #ievobio 2010

GEBA Pilot Project: Components

• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al.)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)

Page 60: Jonathan Eisen talk at #ievobio 2010

GEBA and Openness

• All data released as quickly as possible w/ no restrictions to IMG-GEBA, GenBank, etc.

• Data also available in Biotorrents (http://biotorrents.net)

• Individual genome reports published in OA “Standards in Genomic Sciences (SIGS)”

• 1st GEBA paper in Nature freely available and published using Creative Commons License

Page 61: Jonathan Eisen talk at #ievobio 2010

Known Unknowns 2:

GEBA present


Page 62: Jonathan Eisen talk at #ievobio 2010

GEBA Lesson 1

rRNA Tree is Useful for Identifying Phylogenetically Novel Genomes


Page 63: Jonathan Eisen talk at #ievobio 2010

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria

Page 64: Jonathan Eisen talk at #ievobio 2010

Network of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria


Page 65: Jonathan Eisen talk at #ievobio 2010

“Whole Genome” Concatenation Tree w/ AMPHORA

http://bobcat.genomecenter.ucdavis.edu/AMPHORA/

See Wu and Eisen, Genome Biology 2008 9: R151

Page 66: Jonathan Eisen talk at #ievobio 2010

Compare PD in Trees


Page 67: Jonathan Eisen talk at #ievobio 2010

PD of rRNA, Genome Trees Similar

From Wu et al. 2009 Nature 462, 1056-1060
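The slides do not spell out how PD (phylogenetic diversity) is computed. A minimal Python sketch, assuming one common definition (the summed length of all branches on the paths from the root to the sampled tips), looks like this. The tree encoding, the function name, and the toy branch lengths are hypothetical.

```python
def phylogenetic_diversity(parent: dict[str, tuple[str, float]], taxa: set[str]) -> float:
    """Sum the lengths of all branches between the root and the sampled taxa.

    `parent` maps each non-root node to (parent node, length of the branch above it).
    """
    covered: set[str] = set()
    for taxon in taxa:
        node = taxon
        # Climb toward the root, stopping once we reach a branch already counted.
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node][0]
    return sum(parent[node][1] for node in covered)


# Hypothetical rooted tree: ((A:0.1, B:0.1):0.2, C:0.5)
TOY_TREE = {
    "A": ("n1", 0.1),
    "B": ("n1", 0.1),
    "n1": ("root", 0.2),
    "C": ("root", 0.5),
}

close_pair = phylogenetic_diversity(TOY_TREE, {"A", "B"})
distant_pair = phylogenetic_diversity(TOY_TREE, {"A", "C"})
print(close_pair, distant_pair)
```

In this toy tree the distant pair covers more branch length than the close pair, which is the intuition behind choosing genomes from unsampled branches.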

Page 68: Jonathan Eisen talk at #ievobio 2010

GEBA Lesson 2

Phylogeny-driven genome selection helps discover new genetic diversity


Page 69: Jonathan Eisen talk at #ievobio 2010

Network of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria

Page 70: Jonathan Eisen talk at #ievobio 2010

Protein Family Rarefaction Curves

• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
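The three steps above can be sketched in Python as follows. The sketch assumes the MCL clustering has already been run and summarized as a mapping from genome name to the set of protein family IDs it contains. The function name, the averaging over random genome orders, and the toy data are assumptions rather than details from the talk.

```python
import random


def rarefaction_curve(
    genome_families: dict[str, set[str]],
    permutations: int = 100,
    seed: int = 0,
) -> list[float]:
    """Mean number of distinct protein families after sampling 1..N genomes."""
    rng = random.Random(seed)
    names = list(genome_families)
    totals = [0.0] * len(names)
    for _ in range(permutations):
        rng.shuffle(names)  # a new random order in which genomes are added
        seen: set[str] = set()
        for i, name in enumerate(names):
            seen |= genome_families[name]
            totals[i] += len(seen)
    return [total / permutations for total in totals]


# Hypothetical family assignments, for illustration only.
genomes = {
    "g1": {"famA", "famB"},
    "g2": {"famB", "famC"},
    "g3": {"famC", "famD"},
}
curve = rarefaction_curve(genomes, permutations=50)
for n_genomes, n_families in enumerate(curve, start=1):
    print(n_genomes, n_families)
```

A curve that keeps rising as genomes are added means new protein families are still being discovered.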

Page 71: Jonathan Eisen talk at #ievobio 2010

Page 72: Jonathan Eisen talk at #ievobio 2010

Page 73: Jonathan Eisen talk at #ievobio 2010

Page 74: Jonathan Eisen talk at #ievobio 2010

Page 75: Jonathan Eisen talk at #ievobio 2010

Page 76: Jonathan Eisen talk at #ievobio 2010

Synapomorphies exist


Page 77: Jonathan Eisen talk at #ievobio 2010

GEBA Lesson 3

Phylogeny-driven genome selection improves genome annotation


Page 78: Jonathan Eisen talk at #ievobio 2010

Predicting Function

• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• Comparative and evolutionary analysis greatly improves most predictions

Page 79: Jonathan Eisen talk at #ievobio 2010

Phylogeny vs. Blast

[Figure, based on Eisen, 1998 Genome Res 8: 163-167: phylogenomic prediction of gene function. Method steps: choose gene(s) of interest; identify homologs; align sequences; calculate gene tree; overlay known functions onto tree; infer likely function of gene(s) of interest. Panels: Example A, Example B, and the actual evolution (assumed to be unknown) of the genes in Species 1, Species 2, and Species 3. Tree annotations: “Duplication?”, “Duplication”, “Ambiguous”.]

• Many methods focus on “top blast hits”
• But much better to build phylogenetic trees of genes and compare to relatives
• Allows better integration of evolutionary history (e.g., orthologs and paralogs)

Page 81: Jonathan Eisen talk at #ievobio 2010

Most/All Functional Prediction Improves w/ Better Phylogenetic Sampling

• Conversion of hypothetical into conserved hypotheticals
• Improved phylogenomics
• Linking distantly related members of protein families
• Improved non-homology prediction

Page 82: Jonathan Eisen talk at #ievobio 2010

Known Unknowns 3:

GEBA future


Page 83: Jonathan Eisen talk at #ievobio 2010

GEBA Future 1

How much further should we go?


Page 84: Jonathan Eisen talk at #ievobio 2010

rRNA Tree of Life

Figure from Barton, Eisen et al. “Evolution”, CSHL Press.

Based on tree from Pace NR, 2003.

Archaea

Eukaryotes

Bacteria

Page 85: Jonathan Eisen talk at #ievobio 2010

Phylogenetic Diversity: Sequenced Bacteria & Archaea

From Wu et al. 2009

Page 89: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47. Legend: well sampled phyla; poorly sampled; no cultured taxa]

• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled

Page 90: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47. Legend: well sampled phyla; poorly sampled; no cultured taxa]

• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled

Page 91: Jonathan Eisen talk at #ievobio 2010

Uncultured Lineages: Technical Approaches

• Get into culture
• Enrichment cultures
• If abundant in low diversity ecosystems
• Flow sorting
• Microbeads
• Microfluidic sorting
• Single cell amplification

Page 92: Jonathan Eisen talk at #ievobio 2010

GEBA Future 2

How many gene families are there?


Page 93: Jonathan Eisen talk at #ievobio 2010

Page 94: Jonathan Eisen talk at #ievobio 2010

Page 95: Jonathan Eisen talk at #ievobio 2010

Compare PD in Trees


Page 96: Jonathan Eisen talk at #ievobio 2010

Gene Families vs PD

[Figure: plot “PD vs. Gene Families (per genome)”. x-axis: gene families / genome (0 to 1100). y-axis: PD / genome (0 to 0.4).]

Page 97: Jonathan Eisen talk at #ievobio 2010

How many protein families?

From Wu et al. 2009

GEBA Genomes PD/Genome ~0.1

PFAMs/Genome ~1000

PFAMs/PD ~10000

Total PFAMs ~10,000,000
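The arithmetic behind this back-of-envelope estimate can be retraced in Python. The first two inputs and the stated total are the slide's rounded figures. The total PD of about 1000 is not on the slide; it is only the value implied by those numbers. The variable names are illustrative.

```python
# Rounded per-genome figures quoted on the slide for the GEBA genomes.
pd_per_genome = 0.1           # phylogenetic diversity added per genome
families_per_genome = 1000    # protein families found per genome

# Families discovered per unit of phylogenetic diversity.
families_per_pd = families_per_genome / pd_per_genome

# The slide's total estimate, and the total PD it implies (an inference, not a slide value).
total_families = 10_000_000
implied_total_pd = total_families / families_per_pd

print(f"families per unit PD: {families_per_pd:,.0f}")
print(f"implied total PD: {implied_total_pd:,.0f}")
```

The estimate is a linear extrapolation, so it inherits every caveat about how family discovery scales with PD.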

Page 98: Jonathan Eisen talk at #ievobio 2010

Caveats (of many)

• Novel protein families per genome likely taxon specific

• Parameters other than PD clearly important

• Does not include viruses, eukaryotes


Page 99: Jonathan Eisen talk at #ievobio 2010

GEBA Future 3

Need to better leverage improved phylogenetic sampling


Page 100: Jonathan Eisen talk at #ievobio 2010

Example 1: Protein Family Space

• Much less biased sampling of protein family space now available

• Need to rebuild / reassess many protein family databases (e.g., HMMs)

• Structural space


Page 101: Jonathan Eisen talk at #ievobio 2010

Example 2: Experiments


Page 102: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

As of 2002

Based on Hugenholtz, 2002

Page 103: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Experimental studies are mostly from three phyla

As of 2002

Based on Hugenholtz, 2002

Page 104: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Experimental studies are mostly from three phyla

• Some studies in other phyla

As of 2002

Based on Hugenholtz, 2002

Page 105: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

• Same trend in Eukaryotes

As of 2002

Based on Hugenholtz, 2002

Page 106: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47]

• At least 40 phyla of bacteria

• Genome sequences are mostly from three phyla

• Some other phyla are only sparsely sampled

• Same trend in Viruses

As of 2002

Based on Hugenholtz, 2002

Page 107: Jonathan Eisen talk at #ievobio 2010

[Figure: tree of bacterial phyla, same labels as on Page 47. Scale bar: 0.1]

Tree based on Hugenholtz (2002) with some modifications.

Need experimental studies from across the tree too

Page 108: Jonathan Eisen talk at #ievobio 2010

Example 3: Improving the tree

• To make best use of GEBA data we need a better tree


Page 109: Jonathan Eisen talk at #ievobio 2010

Concatenated alignment “whole genome tree” built using AMPHORA

Page 110: Jonathan Eisen talk at #ievobio 2010

Whole genome tree built using AMPHORA by Martin Wu and Dongying Wu

Why does the tree matter?

Page 111: Jonathan Eisen talk at #ievobio 2010

Page 112: Jonathan Eisen talk at #ievobio 2010

Page 113: Jonathan Eisen talk at #ievobio 2010

Page 114: Jonathan Eisen talk at #ievobio 2010

Many Alternatives to Concatenation

• Gene presence/absence
• Supertrees / consensus methods
• Separate phylogeny of genes and then integration of results (e.g., networks)
• Models that incorporate gain/loss as well as gene phylogeny

Page 115: Jonathan Eisen talk at #ievobio 2010

Example 4: Metagenomic Analysis


Page 116: Jonathan Eisen talk at #ievobio 2010

Phylogeny for Typing and Binning

[Figure: “Sargasso Phylotypes” bar chart of weighted % of clones by major phylogenetic group for markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA; same chart as on Page 36]

Venter et al., 2004

Page 117: Jonathan Eisen talk at #ievobio 2010

Phylogeny for Typing and Binning

[Figure: “Sargasso Phylotypes” bar chart of weighted % of clones by major phylogenetic group for markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA; same chart as on Page 36]

Venter et al., 2004

Should improve with better genomic sampling

Page 118: Jonathan Eisen talk at #ievobio 2010

Phylogeny for Typing and Binning

[Figure: “Sargasso Phylotypes” bar chart of weighted % of clones by major phylogenetic group for markers EFG, EFTu, HSP70, RecA, RpoB, and rRNA; same chart as on Page 36]

Venter et al., 2004

Only improved a little

Page 119: Jonathan Eisen talk at #ievobio 2010

How to improve phylogenetic analysis of metagenomic data

• Better phylogenetic and OTU methods for fragmented data

• Better assessment of which genes to use?

• More automation of all methods


Page 120: Jonathan Eisen talk at #ievobio 2010

Phylogenetic challenge

How to place all in one tree?

How to identify OTUs including all fragments?

Can you analyze more than 1 gene family at a time?

Page 121: Jonathan Eisen talk at #ievobio 2010

Approach 1: Place Reads on Reference Tree

• Examples
– AMPHORA (Wu and Eisen)
– pplacer (Erick Matsen)
– FastTree (Morgan Price)
• General approach
– Precompute reference tree for full length sequences
– Place individual reads on reference tree
– Merge trees

Page 122: Jonathan Eisen talk at #ievobio 2010

Variants

• Use concatenated alignment of markers not just individual genes (Steven Kembel)

• Apply to OTU identification not just classification (Thomas Sharpton)

• CoBinning: look for linkage among fragments/genes (Aaron Darling)


Page 123: Jonathan Eisen talk at #ievobio 2010

How to improve phylogenetic analysis of metagenomic data

• Better phylogenetic and OTU methods for fragmented data

• Better assessment of which genes to use?

• More automation


Page 124: Jonathan Eisen talk at #ievobio 2010

New “Marker Genes”

• 100 representative genomes, including many GEBAs
• MCL gene families
• Identify gene families w/
– High universality
– High uniformity of copy number
– Phylogenetic tree similar to “whole genome tree”

Page 125: Jonathan Eisen talk at #ievobio 2010

Distances between gene trees and the AMPHORA concatenated genome tree

[Figure: two bar-chart panels, “NODAL distance” (axis 0 to 6) and “SPLIT distance” (axis 0 to 0.9), each with one bar per gene: 16S rRNA plus protein-coding genes such as ruvB, nusA, rplB, purA, rpsJ, and secY. Legend categories: ribosomal protein; transcription/translation related protein; DNA repair protein; protein of other function; AMPHORA marker. Reference: distance between the genome tree and 100 random trees (average ± standard deviation).]

Page 126: Jonathan Eisen talk at #ievobio 2010

Screen gene markers for different taxonomic groups

Phylum                  Genome Number   Gene Number
Actinobacteria          63              267783
Alphaproteobacteria     94              347287
Betaproteobacteria      56              266362
Gammaproteobacteria     126             483632
Deltaproteobacteria     25              102115
Epsilonproteobacteria   18              33416
Bacteroidetes           25              71531
Chlamydiae              13              13823
Chloroflexi             10              33577
Cyanobacteria           36              124080
Firmicutes              106             312309
Spirochaetes            18              38832
Thermi                  5               14160
Thermotogae             9               17037

Page 127: Jonathan Eisen talk at #ievobio 2010

Keep only the families with:

Universality * Evenness * monophyly >= 90*90*90

Phylogenetic group      Genome Number   Gene Number   Marker Candidates
Archaea                 62              145415        102
Actinobacteria          63              267783        136
Alphaproteobacteria     94              347287        142
Betaproteobacteria      56              266362        294
Gammaproteobacteria     126             483632        141
Deltaproteobacteria     25              102115        44
Epsilonproteobacteria   18              33416         446
Bacteroidetes           25              71531         179
Chlamydiae              13              13823         561
Chloroflexi             10              33577         140
Cyanobacteria           36              124080        532
Firmicutes              106             312309        80
Spirochaetes            18              38832         72
Thermi                  5               14160         727
Thermotogae             9               17037         646
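The filter on this slide can be sketched in Python by reading the criterion literally, as a product of three percentage scores compared with 90*90*90. How each score was actually computed is not given on the slide, so the field descriptions, the class and function names, and the example families below are all assumptions. Requiring each score to reach 90 on its own would be a stricter alternative reading.

```python
from dataclasses import dataclass


@dataclass
class FamilyStats:
    name: str
    universality: float  # assumed: % of genomes in the group that carry the family
    evenness: float      # assumed: % score for uniformity of copy number
    monophyly: float     # assumed: % score for how well the family tree recovers the group


def is_marker_candidate(family: FamilyStats, cutoff: float = 90.0) -> bool:
    """Literal reading of the slide: product of the three scores >= cutoff cubed."""
    score = family.universality * family.evenness * family.monophyly
    return score >= cutoff ** 3


# Hypothetical families, for illustration only.
families = [
    FamilyStats("famA", universality=100.0, evenness=95.0, monophyly=92.0),
    FamilyStats("famB", universality=100.0, evenness=100.0, monophyly=70.0),
]
candidates = [f.name for f in families if is_marker_candidate(f)]
print(candidates)
```

Under the product reading, a family can fall slightly below 90 on one score if the other two are high enough to compensate.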

Page 128: Jonathan Eisen talk at #ievobio 2010

How to improve phylogenetic analysis of metagenomic data

• Better phylogenetic and OTU methods for fragmented data

• Better assessment of which genes to use?

• More automation


Page 129: Jonathan Eisen talk at #ievobio 2010

AMPHORA

Guide tree

Page 130: Jonathan Eisen talk at #ievobio 2010

Phylogenetic Binning Using AMPHORA

[Figure: bar chart. x-axis groups: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria. y-axis: 0 to 0.7. Series (marker genes): dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]

AMPHORA - each read on its own tree

Page 131: Jonathan Eisen talk at #ievobio 2010

Zorro

• http://sourceforge.net/projects/probmask/
• ZORRO is a probabilistic masking program that assigns confidence scores to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines.
• Wu, Chatterji, Eisen submitted
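One simple way such per-column confidence scores could be used downstream is to drop low-confidence columns before building a tree. The Python sketch below shows that idea only. It does not call ZORRO or reproduce its scoring, and the function name, the threshold, and the toy scores are hypothetical.

```python
def mask_alignment(
    alignment: dict[str, str],
    column_scores: list[float],
    min_score: float,
) -> dict[str, str]:
    """Keep only the alignment columns whose confidence score is at least min_score."""
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("every sequence must have one residue per scored column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical alignment and per-column scores, for illustration only.
alignment = {"taxon_a": "ACGT", "taxon_b": "A-GT"}
scores = [9.0, 2.0, 8.0, 1.0]
masked = mask_alignment(alignment, scores, min_score=5.0)
print(masked)
```

A hard cutoff discards information; weighting columns by their score is another option, where the tree-building software supports it.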

Page 132: Jonathan Eisen talk at #ievobio 2010

Conclusions

• Phylogeny-driven sampling produces many benefits immediately
• For the most benefits to come, we need to re-direct many informatics efforts to take advantage of less biased data
• Still a long way away from full benefits
• Note - most of the benefits can come from (aack) - unfinished genomes

Page 133: Jonathan Eisen talk at #ievobio 2010


Page 134: Jonathan Eisen talk at #ievobio 2010

MICROBES


Page 135: Jonathan Eisen talk at #ievobio 2010

A Happy Tree of Life
