Jonathan Eisen talk at #ievobio 2010
TRANSCRIPT
Phylogenomics of microbes: the dark matter of biology
Jonathan A. Eisen, UC Davis
Talk for iEVOBIO, June 29, 2010
Tuesday, June 29, 2010
Eisen Lab - Phylogenomics of Novelty
Origin of New Functions and Processes
• New genes
• Changes in old genes
• Changes in pathways
Species Evolution
• Phylogenetic history
• Vertical vs. horizontal descent
• Needed to track gain/loss of processes, infer convergence
Genome Dynamics
• Evolvability
• Repair and recombination processes
• Intragenomic variation
Social Networking in Science
Bacteria evolve
• There are known knowns. These are things we know that we know.
• There are known unknowns. That is to say, there are things that we know we don't know.
• But there are also unknown unknowns. There are things we don't know we don't know.
An homage to Donald Rumsfeld
• Known knowns (background)
  – rRNA Tree of Life
  – Genomics
  – rRNA PCR
  – Metagenomics
• Known unknowns
  – GEBA project - past
  – GEBA project - present
  – GEBA project - future
• Unknown unknowns?
Outline
Known Knowns 1:
rRNA Tree of Life
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
The Tree of Life: Three Main Domains
Eukaryotes
Archaea
Bacteria
The Tree of Life
Unrooted Tree of Life from Barton et al. Evolution
Known Knowns 2:
Genomics and Phylogenomics
Fleischmann et al. 1995
Microbial genomes
From http://genomesonline.org
Genome Sequences Have Revolutionized Microbiology
• Predictions of metabolic processes
• Better vaccine and drug design
• New insights into mechanisms of evolution
• Genomes serve as template for functional studies
• New enzymes and materials for engineering and synthetic biology
Microbes Run the Planet
4.
Microbes in the world I: rRNA PCR
Perna et al. 2003
Lateral Transfer
from Doolittle, 1999
from Lerat et al
Known Knowns 3:
rRNA PCR
Great Plate Count Anomaly
Culturing count <<<< Microscope count
DNA
PCR Revolution
Extract DNA
PCR w/ Universal rDNA Primers
Sequence
Align and compare to other rDNAs
Phylogenetic classification
OTUs
Ecology
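The OTU step in the pipeline above can be sketched as a greedy clustering of aligned rDNA sequences by percent identity. This is a minimal illustration, not the method used in the talk: the 97% default threshold, the function names, and the seed-based greedy rule are all assumptions made for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches at >= threshold.

    seqs: dict mapping sequence name -> aligned sequence string.
    Returns a list of (seed_name, [member_names]) in order of creation.
    The 0.97 default is an assumed, commonly used cutoff, not a value from the slides.
    """
    otus = []
    for name, seq in seqs.items():
        for seed, members in otus:
            if identity(seqs[seed], seq) >= threshold:
                members.append(name)
                break
        else:
            # No existing seed is similar enough, so this sequence seeds a new OTU.
            otus.append((name, [name]))
    return otus


if __name__ == "__main__":
    toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "TTTTACGTAC"}
    for seed, members in greedy_otus(toy, threshold=0.9):
        print(seed, members)
```

Greedy seeding depends on input order, which is one reason the later slides ask for better OTU methods for fragmented data.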
Uses of rDNA PCR
Bohannan and Hughes 2003
Hugenholtz 2002
rRNA challenges
• Massive amounts of data from next-gen
• Need for full automation but
  – Non overlapping
  – Alignments not always straightforward
  – BLAST insufficient
  – Phylogenetic methods that have been automated still need work
• Tree of everything might be useful
Known Knowns 4:
Metagenomics
4.
Microbes in the world I: rRNA PCR
Perna et al. 2003
Metagenomics
shotgun
clone
Metagenomics Challenge
Who is out there?
What are they doing?
rRNA phylotyping from metagenomics
Venter et al., 2004
Shotgun Sequencing Allows Use of Alternative Anchors (e.g., RecA)
Venter et al., 2004
[Figure: Sargasso Phylotypes. Bar chart of weighted % of clones (y-axis, 0 to 0.5) by major phylogenetic group (x-axis: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota), with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Shotgun Sequencing Allows Use of Other Markers
Venter et al., 2004
Functional Inference from Metagenomics
• Can work well for individual genes
• Predicting “community” function is challenging because treating community as a bag of genes does not work well
• Better to “compartmentalize” data ...
ABCDEFG
TUVWXYZ
Binning challenge
Best binning method: reference genomes
Reference Genomes Coming from Select Environment
No reference genome? What do you do?
Assembly? Composition? Get more references?
Phylogeny ....
Metagenomic challenges
• Massive amounts of data from next-gen
• Need for full automation but
  – Data fragmentary
  – BLAST insufficient
  – Automation of phylogenetic methods a bit better for protein coding genes b/c alignments better
  – Reference databases incomplete
Known Unknowns 1:
GEBA Past
Microbial genomes
From http://genomesonline.org
[Figure: tree of bacterial phyla as of 2002. Labeled branches: Acidobacteria, Bacteroides, Fibrobacteres, Gemmimonas, Verrucomicrobia, Planctomycetes, Chloroflexi, Proteobacteria, Chlorobi, Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Chlamydia, Spirochaetes, Deinococcus-Thermus, Aquificae, Thermotogae, TM6, OS-K, Termite Group, OP8, Marine Group A, WS3, OP9, NKB19, OP3, OP10, TM7, OP1, OP11, Nitrospira, Synergistes, Deferribacteres, Thermodesulfobacteria, Chrysiogenetes, Thermomicrobia, Dictyoglomus, Coprothermobacter.]
• At least 40 phyla of bacteria
Based on Hugenholtz, 2002
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Archaea
• Same trend in Eukaryotes
The Tree is not Happy
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
[Figure: same tree of bacterial phyla, based on Hugenholtz, 2002.]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Solution I: sequence more phyla
• NSF-funded Tree of Life Project
• A genome from each of eight phyla
Eisen & Ward, PIs
The Tree of Life is Still Angry
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Eukaryotes
Bacteria
Archaea
[Figure: same tree of bacterial phyla, with well sampled phyla highlighted.]
• At least 100 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled
• Solution - use tree to really fill gaps
A Genomic Encyclopedia of Bacteria and Archaea (GEBA)
GEBA Pilot Project Overview
• Identify major branches in rRNA tree for which no genomes are available
• Identify branches with a cultured representative in DSMZ
• Grow > 200 of these and prep. DNA
• Sequence and finish 100 (covering breadth of bacterial/archaeal diversity)
• Annotate, analyze, release data
• Assess benefits of tree guided sequencing
GEBA Pilot Project: Components
• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)
GEBA and Openness
• All data released as quickly as possible w/ no restrictions to IMG-GEBA, GenBank, etc.
• Data also available in Biotorrents (http://biotorrents.net)
• Individual genome reports published in OA “Standards in Genomic Sciences (SIGS)”
• 1st GEBA paper in Nature freely available and published using Creative Commons License
Known Unknowns 2:
GEBA present
GEBA Lesson 1
rRNA Tree is Useful for Identifying Phylogenetically Novel Genomes
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
Network of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
“Whole Genome” Concatenation Tree w/ AMPHORA
http://bobcat.genomecenter.ucdavis.edu/AMPHORA/
See Wu and Eisen, Genome Biology 2008 9: R151
Compare PD in Trees
PD of rRNA, Genome Trees Similar
From Wu et al. 2009 Nature 462, 1056-1060
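The slides compare PD (phylogenetic diversity) across trees without defining it. A minimal sketch, assuming the common Faith-style definition in its rooted form (the sum of the lengths of all distinct branches on the paths from a set of tips to the root), is shown below. The tree encoding (a dict mapping each node to its parent and branch length) and the toy branch lengths are hypothetical.

```python
def phylogenetic_diversity(parent, tips):
    """Sum the lengths of all distinct branches on the paths from the given tips to the root.

    parent: dict mapping node -> (parent_node, branch_length); the root has no entry.
    tips: iterable of tip names to include.
    """
    counted = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk toward the root, stopping once we reach a branch already counted.
        while node in parent and node not in counted:
            counted.add(node)
            up, length = parent[node]
            total += length
            node = up
    return total


if __name__ == "__main__":
    # Toy tree: ((A:1.0, B:2.0)n1:0.5, C:3.0)root
    toy_tree = {
        "A": ("n1", 1.0),
        "B": ("n1", 2.0),
        "n1": ("root", 0.5),
        "C": ("root", 3.0),
    }
    before = phylogenetic_diversity(toy_tree, ["A", "B"])
    after = phylogenetic_diversity(toy_tree, ["A", "B", "C"])
    print("PD gained by adding C:", after - before)
```

Under this definition, the PD contributed by a new genome is the branch length it adds that no previously sampled genome already covers, which is why genomes from unsampled lineages add the most.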
GEBA Lesson 2
Phylogeny-driven genome selection helps discover new genetic diversity
Network of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
Protein Family Rarefaction Curves
• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
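The three steps above can be sketched as follows. This is a minimal illustration that assumes the protein family assignment (the MCL step) has already been done and is supplied as a mapping from genome to a set of family identifiers; the function name, the number of permutations, and the averaging over random genome orders are choices made for the example, not details from the slides.

```python
import random


def rarefaction_curve(genome_families, n_permutations=100, seed=0):
    """Average number of distinct protein families seen after adding 1..N genomes.

    genome_families: dict mapping genome name -> set of protein family IDs.
    Genomes are added in random order; the curve is averaged over permutations.
    """
    rng = random.Random(seed)
    genomes = list(genome_families)
    totals = [0.0] * len(genomes)
    for _ in range(n_permutations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_families[genome]
            totals[i] += len(seen)
    return [total / n_permutations for total in totals]


if __name__ == "__main__":
    toy = {"g1": {1, 2}, "g2": {2, 3}, "g3": {3, 4}}
    for n, families in enumerate(rarefaction_curve(toy), start=1):
        print(n, families)
```

A curve that keeps climbing as genomes are added indicates that new families are still being discovered; a curve that flattens indicates the sampled genomes are largely redundant.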
Synapomorphies exist
GEBA Lesson 3
Phylogeny-driven genome selection improves genome annotation
Predicting Function
• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• Comparative and evolutionary analysis greatly improves most predictions
Phylogeny vs. Blast
[Figure: phylogenomic function prediction. Method steps: choose gene(s) of interest; identify homologs; align sequences; calculate gene tree; overlay known functions onto tree; infer likely function of gene(s) of interest. Two worked examples (Example A, Example B) use gene labels 1-6 and 1A-3A, 1B-3B from Species 1, Species 2 and Species 3. The actual evolution is assumed to be unknown; candidate duplication events are marked on the trees, and one case is labeled ambiguous.]
Based on Eisen, 1998 Genome Res 8: 163-167.
Many methods focus on “top blast hits”
But much better to build phylogenetic trees of genes and compare to relatives
Allows better integration of evolutionary history (e.g., orthologs and paralogs)
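The "overlay known functions onto tree, then infer" idea can be sketched with a deliberately simplified rule: look at the smallest clade that contains the query gene and at least one annotated gene, and accept a function only if every annotated gene in that clade agrees. This rule, the nested-tuple tree encoding, and the function labels are assumptions made for illustration; they are not the procedure from Eisen 1998.

```python
def leaves(tree):
    """Return the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def clades_containing(tree, query):
    """Clades (subtrees) that contain the query, ordered from smallest to largest."""
    if isinstance(tree, str):
        return []
    for child in tree:
        if query in leaves(child):
            return clades_containing(child, query) + [tree]
    return []


def infer_function(tree, known, query):
    """Infer the query's function from annotated genes in its smallest informative clade.

    known: dict mapping gene name -> function label for annotated genes.
    Returns the shared function, or None if the clade is ambiguous or nothing is annotated.
    """
    for clade in clades_containing(tree, query):
        functions = {known[g] for g in leaves(clade) if g != query and g in known}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return None  # annotated relatives disagree: ambiguous
    return None


if __name__ == "__main__":
    # Two paralogous subfamilies (A and B) across three species; labels are hypothetical.
    gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
    known = {"1A": "function A", "3A": "function A",
             "1B": "function B", "3B": "function B"}
    print(infer_function(gene_tree, known, "2A"))
```

Unlike a top-hit approach, the prediction here depends on where the query sits relative to duplication events in the tree, so a paralog with a high similarity score does not override the query's own subfamily.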
Wu et al. 2005 PLoS Genetics 1: e65.
Most/All Functional Prediction Improves w/ Better Phylogenetic Sampling
• Conversion of hypothetical into conserved hypotheticals
• Improved phylogenomics
• Linking distantly related members of protein families
• Improved non-homology prediction
Known Unknowns 3:
GEBA future
GEBA Future 1
How much further should we go?
rRNA Tree of Life
Figure from Barton, Eisen et al. “Evolution”, CSHL Press.
Based on tree from Pace NR, 2003.
Archaea
Eukaryotes
Bacteria
Phylogenetic Diversity: Sequenced Bacteria & Archaea
From Wu et al. 2009
Phylogenetic Diversity with GEBA
From Wu et al. 2009
Phylogenetic Diversity: Isolates
From Wu et al. 2009
Phylogenetic Diversity: All
From Wu et al. 2009
[Figure: same tree of bacterial phyla, with legend: well sampled phyla, poorly sampled, no cultured taxa.]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Most phyla with cultured species are sparsely sampled
• Lineages with no cultured taxa even more poorly sampled
Uncultured Lineages: Technical Approaches
• Get into culture
• Enrichment cultures
• If abundant in low diversity ecosystems
• Flow sorting
• Microbeads
• Microfluidic sorting
• Single cell amplification
GEBA Future 2
How many gene families are there?
Compare PD in Trees
Gene Families vs PD
[Figure: PD vs. Gene Families (per genome). Plot of PD / genome (axis 0 to 0.4) against gene families / genome (axis 0 to 1100).]
How many protein families?
From Wu et al. 2009
GEBA genomes: PD/Genome ~0.1
PFAMs/Genome ~1,000
PFAMs/PD ~10,000
Total PFAMs ~10,000,000
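The figures above chain together as a back-of-envelope linear extrapolation. A minimal sketch of the arithmetic follows; note that the total PD of roughly 1,000 is not stated on the slide and is only the value implied by dividing the stated total by PFAMs/PD. The function names are hypothetical.

```python
def families_per_unit_pd(pfams_per_genome, pd_per_genome):
    """Protein families expected per unit of phylogenetic diversity."""
    return pfams_per_genome / pd_per_genome


def implied_total_pd(total_pfams, pfams_per_pd):
    """Total PD that would yield the stated total under a linear extrapolation."""
    return total_pfams / pfams_per_pd


# Values taken from the slide (all approximate).
pfams_per_pd = families_per_unit_pd(pfams_per_genome=1000, pd_per_genome=0.1)

# Assumption: the slide's total of ~10,000,000 is PFAMs/PD multiplied by total PD,
# so the total PD it implies can be back-calculated.
total_pd = implied_total_pd(total_pfams=10_000_000, pfams_per_pd=pfams_per_pd)

if __name__ == "__main__":
    print("PFAMs per unit PD:", round(pfams_per_pd))
    print("Implied total PD:", round(total_pd))
```

The estimate is only as good as the assumption that families accumulate linearly with PD.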
Caveats (of many)
• Novel protein families per genome likely taxon specific
• Parameters other than PD clearly important
• Does not include viruses, eukaryotes
GEBA Future 3
Need to better leverage improved phylogenetic sampling
Example 1: Protein Family Space
• Much less biased sampling of protein family space now available
• Need to rebuild / reassess many protein family databases (e.g., HMMs)
• Structural space
Example 2: Experiments
[Figure: same tree of bacterial phyla, as of 2002, based on Hugenholtz, 2002.]
• At least 40 phyla of bacteria
• Experimental studies are mostly from three phyla
• Some studies in other phyla
[Figure: same tree of bacterial phyla, as of 2002, based on Hugenholtz, 2002.]
• At least 40 phyla of bacteria
• Genome sequences are mostly from three phyla
• Some other phyla are only sparsely sampled
• Same trend in Eukaryotes
• Same trend in Viruses
[Figure: same tree of bacterial phyla, with scale bar 0.1.]
Tree based on Hugenholtz (2002) with some modifications.
Need experimental studies from across the tree too
Example 3: Improving the tree
• To make best use of GEBA data we need a better tree
Concatenated alignment “whole genome tree” built using AMPHORA
Whole genome tree built using AMPHORA by Martin Wu and Dongying Wu
Why does the tree matter?
Many Alternatives to Concatenation
• Gene presence/absence
• Supertrees / consensus methods
• Separate phylogeny of genes and then integration of results (e.g., networks)
• Models that incorporate gain/loss as well as gene phylogeny
Example 4: Metagenomic Analysis
[Figure: Sargasso Phylotypes. Bar chart of weighted % of clones (0 to 0.5) by major phylogenetic group, with one series per marker: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]
Phylogeny for Typing and Binning
Venter et al., 2004
Should improve with better genomic sampling
Only improved a little
How to improve phylogenetic analysis of metagenomic data
• Better phylogenetic and OTU methods for fragmented data
• Better assessment of which genes to use?
• More automation of all methods
Phylogenetic challenge
How to place all in one tree?
How to identify OTUs including all fragments?
Can you analyze more than 1 gene family at a time?
Approach 1: Place Reads on Reference Tree
• Examples
  – AMPHORA (Wu and Eisen)
  – PPlacer (Erick Matsen)
  – FastTree (Morgan Price)
• General approach
  – Precompute reference tree for full length sequences
  – Place individual reads on reference tree
  – Merge trees
Variants
• Use concatenated alignment of markers not just individual genes (Steven Kembel)
• Apply to OTU identification not just classification (Thomas Sharpton)
• CoBinning: look for linkage among fragments/genes (Aaron Darling)
How to improve phylogenetic analysis of metagenomic data
• Better phylogenetic and OTU methods for fragmented data
• Better assessment of which genes to use?
• More automation
New “Marker Genes”
• 100 representative genomes, including many GEBAs
• MCL gene families
• Identify gene families w/
  – High universality
  – High uniformity of copy number
  – Phylogenetic tree similar to “whole genome tree”
[Figure: Distances between gene trees and the AMPHORA concatenated genome tree. Two panels, one bar per gene: NODAL distance (axis 0 to 6) and SPLIT distance (axis 0 to 0.9). Legend categories: ribosomal protein, transcription/translation related protein, DNA repair protein, protein of other function, AMPHORA marker. Also shown: distance between the genome tree and 100 random trees (average ± standard deviation).
Gene labels, first panel: rRNA16S, ruvB, nusA, rplB, purA, rpsJ, secY, rpsI, pyrH, rpsE, rplP, rplN, rpsC, ruvA, rplF, rplA, serS, rplK, rpsK, priA, smpB, rpsG, guaA, rpsQ, rpsL, rplU, rplO, rpsM, infC, rplS, rplV, rplC, rpsP, rplE, rplT, rplL, rplQ, rpsH, mraW, rpsO, rpsB, rplI, rplM, rplR, ttf, frr, tsf, rplD, radA, rpsS, trmD, coaE, rpmA.
Gene labels, second panel: nusA, rpsC, rpsE, priA, rplB, secY, rRNA16S, rpsJ, rpsB, ruvB, guaA, rplN, serS, rplF, frr, rplA, rplE, rplC, infC, rplD, rplK, purA, radA, ruvA, rpsM, pyrH, rplI, rplM, rpsG, rpsL, mraW, rpsI, ttf, rplS, trmD, tsf, rplU, rpsK, rpsP, rplO, rplT, rplV, rpsS, rplP, rpsO, smpB, rpsH, rplQ, rplR, rpsQ, rplL, rpmA, coaE.]
Screen gene markers for different taxonomic groups
Phylum                  Genome Number   Gene Number
Actinobacteria                     63        267783
Alphaproteobacteria                94        347287
Betaproteobacteria                 56        266362
Gammaproteobacteria               126        483632
Deltaproteobacteria                25        102115
Epsilonproteobacteria              18         33416
Bacteroidetes                      25         71531
Chlamydiae                         13         13823
Chloroflexi                        10         33577
Cyanobacteria                      36        124080
Firmicutes                        106        312309
Spirochaetes                       18         38832
Thermi                              5         14160
Thermotogae                         9         17037
Keep only the families with:
Universality * Evenness * monophyly >= 90*90*90
Phylogenetic group      Genome Number   Gene Number   Marker Candidates
Archaea                            62        145415                 102
Actinobacteria                     63        267783                 136
Alphaproteobacteria                94        347287                 142
Betaproteobacteria                 56        266362                 294
Gammaproteobacteria               126        483632                 141
Deltaproteobacteria                25        102115                  44
Epsilonproteobacteria              18         33416                 446
Bacteroidetes                      25         71531                 179
Chlamydiae                         13         13823                 561
Chloroflexi                        10         33577                 140
Cyanobacteria                      36        124080                 532
Firmicutes                        106        312309                  80
Spirochaetes                       18         38832                  72
Thermi                              5         14160                 727
Thermotogae                         9         17037                 646
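The screening rule on this slide (universality * evenness * monophyly >= 90*90*90) can be sketched as a simple filter. The slides do not say how each score is computed, so the definitions below are assumptions for illustration: universality as the percentage of genomes with at least one copy, evenness as the percentage of those genomes with exactly one copy, and monophyly supplied as a precomputed percentage.

```python
THRESHOLD = 90 * 90 * 90  # 729,000, as written on the slide


def universality(copies):
    """Percentage of genomes carrying at least one copy (assumed definition)."""
    return 100.0 * sum(c > 0 for c in copies) / len(copies)


def evenness(copies):
    """Percentage of genomes with the family that carry exactly one copy (assumed definition)."""
    present = [c for c in copies if c > 0]
    if not present:
        return 0.0
    return 100.0 * sum(c == 1 for c in present) / len(present)


def passes_screen(universality_pct, evenness_pct, monophyly_pct):
    """Apply the slide's product rule to three percentage scores (0 to 100)."""
    return universality_pct * evenness_pct * monophyly_pct >= THRESHOLD


if __name__ == "__main__":
    # Hypothetical family: single copy in 19 genomes, two copies in 1 genome.
    copies = [1] * 19 + [2]
    u, e = universality(copies), evenness(copies)
    print(u, e, passes_screen(u, e, monophyly_pct=95.0))
```

Because the rule is a product, a family can score slightly below 90 on one criterion and still pass if the other two are high enough.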
How to improve phylogenetic analysis of metagenomic data
• Better phylogenetic and OTU methods for fragmented data
• Better assessment of which genes to use?
• More automation
AMPHORA
Guide tree
Phylogenetic Binning Using AMPHORA
[Figure: bar chart (axis 0 to 0.7) by phylogenetic group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria, Unclassified Proteobacteria, Cyanobacteria, Chlamydiae, Acidobacteria, Bacteroidetes, Actinobacteria, Aquificae, Planctomycetes, Spirochaetes, Firmicutes, Chloroflexi, Chlorobi, Unclassified Bacteria), with one series per marker gene: dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf.]
AMPHORA - each read on its own tree
Zorro
• http://sourceforge.net/projects/probmask/
• ZORRO is a probabilistic masking program that assigns confidence scores to each column in a multiple sequence alignment. These scores can then be used to account for alignment accuracy in phylogenetic inference pipelines
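One simple way per-column confidence scores could be used downstream is to drop low-confidence columns before tree building. The sketch below does not call ZORRO or read its output format; it assumes the scores are already available as a list of numbers, one per alignment column, and the cutoff value is hypothetical.

```python
def mask_alignment(alignment, column_scores, min_score):
    """Keep only alignment columns whose confidence score is >= min_score.

    alignment: dict mapping sequence name -> aligned sequence (all the same length).
    column_scores: one confidence score per alignment column (assumed precomputed).
    min_score: hypothetical cutoff; choose it to suit the scoring scale in use.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(column_scores)}:
        raise ValueError("every sequence must have one score per column")
    keep = [i for i, score in enumerate(column_scores) if score >= min_score]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


if __name__ == "__main__":
    toy_alignment = {"seq1": "ACGT", "seq2": "A-GT"}
    toy_scores = [9.5, 1.0, 8.0, 0.2]
    print(mask_alignment(toy_alignment, toy_scores, min_score=5.0))
```

Hard masking is the bluntest option; the scores could instead be used as column weights where the tree-building program supports that.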
• Wu, Chatterji, Eisen submitted
Conclusions
• Phylogeny-driven sampling produces many benefits immediately
• For the most benefits to come, we need to re-direct many informatics efforts to take advantage of less biased data
• Still a long way away from full benefits
• Note - most of the benefits can come from (aack) - unfinished genomes
MICROBES
A Happy Tree of Life