"phylogenomic approaches to microbial diversity" talk by jonathan eisen at...

Phylogenomic Approaches to the Study of Microbial Diversity September 6, 2012 Bay Area Illumina User’s Meeting Jonathan A. Eisen University of California, Davis @phylogenomics Thursday, September 6, 12


DESCRIPTION

Talk by Jonathan Eisen at the Bay Area Illumina Users meeting, 9/6/12: "Phylogenomic approaches to microbial diversity"

TRANSCRIPT

Page 1: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenomic Approaches to the Study of Microbial Diversity

September 6, 2012

Bay Area Illumina User’s Meeting

Jonathan A. Eisen

University of California, Davis

@phylogenomics

Thursday, September 6, 12

Page 2: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenomic Approaches to Studying Microbial Diversity

Example 1: Phylotyping and Phylogenetic Diversity

Page 3: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

DNA extraction

PCR

Sequence rRNA genes

Sequence alignment = Data matrix

PCR: Makes lots of copies of the rRNA genes in sample

rRNA1 5’...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’

rRNA2 5’...TACAGTATAGGTGGAGCTAGCGACGATCGA... 3’

rRNA3 5’...ACGGCAAAATAGGTGGATTCTAGCGATATAGA... 3’

rRNA4 5’...ACGGCCCGATAGGTGGATTCTAGCGCCATAGA... 3’

Data matrix:

rRNA1 A C A C A C

rRNA2 T A C A G T

rRNA3 C A C T G T

rRNA4 C A C A G T

E. coli A G A C A G

Humans T A T A G T

Yeast T A C A G T

rRNA Phylotyping
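The data matrix on this slide can be read as aligned rows whose columns are compared position by position. As a minimal sketch (not the speaker's code), the snippet below counts differing columns between the three rows that are printed whole on the slide (rRNA3, rRNA4, Yeast). The function name `mismatches` is made up for illustration, and a plain mismatch count is an assumption standing in for whatever distance a real phylotyping tool would use.

```python
from itertools import combinations

# Rows copied from the slide's data matrix (six aligned columns each).
matrix = {
    "rRNA3": "CACTGT",
    "rRNA4": "CACAGT",
    "Yeast": "TACAGT",
}

def mismatches(a, b):
    """Count aligned columns at which two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

# Every unordered pair of rows, with its mismatch count.
pairwise = {
    (p, q): mismatches(matrix[p], matrix[q])
    for p, q in combinations(matrix, 2)
}

for (p, q), d in pairwise.items():
    print(f"{p} vs {q}: {d} differing column(s)")
```

With only six columns the counts are illustrative; the point is that an alignment turns sequences into a matrix that can be compared column by column.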

Page 4: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylotyping


Page 5: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: tree with reference taxa E. coli, Humans, Yeast]

Phylotyping

Page 6: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: tree with reference taxa E. coli, Humans, Yeast and with OTU1, OTU2, OTU3, OTU4]

Phylotyping

Page 7: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: panel labeled A, B, C; step "Cluster"]

Phylotyping

Page 8: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"]

Phylotyping

Page 9: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"; resulting groups OTU1, OTU2, OTU3, OTU4]

Phylotyping
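The slides show reads being clustered into OTUs but do not say which clustering algorithm is used. As a hedged illustration only, the sketch below uses a simple greedy, seed-based scheme on pre-aligned reads: each read joins the first OTU whose seed it matches at or above an identity threshold, otherwise it starts a new OTU. The reads, the 0.9 threshold, and the function names are all invented for the example.

```python
def percent_identity(a, b):
    """Fraction of matching columns between two pre-aligned, equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(reads, threshold=0.9):
    """Assign each read to the first OTU whose seed it matches at >= threshold."""
    seeds = []  # one representative read per OTU
    otus = []   # otus[i] holds the names of reads assigned to seeds[i]
    for name, seq in reads.items():
        for i, seed in enumerate(seeds):
            if percent_identity(seq, seed) >= threshold:
                otus[i].append(name)
                break
        else:
            # No existing seed is close enough: this read founds a new OTU.
            seeds.append(seq)
            otus.append([name])
    return otus

# Hypothetical 10-column aligned reads, invented for illustration.
reads = {
    "read1": "ACGTACGTAC",
    "read2": "ACGTACGTAT",  # 9 of 10 columns identical to read1
    "read3": "TTGCATGCAA",
    "read4": "TTGCATGCAA",  # identical to read3
    "read5": "GGGGCCCCAA",
}

otus = greedy_otus(reads, threshold=0.9)
for i, members in enumerate(otus, start=1):
    print(f"OTU{i}: {members}")
```

Greedy clustering depends on input order and on the threshold, which is one reason different pipelines can report different OTU counts for the same reads.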

Page 10: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"; OTU1, OTU2, OTU3, OTU4 shown on a tree with E. coli, Humans, Yeast]

Phylotyping

Page 11: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: tree with reference taxa E. coli, Humans, Yeast]

Phylotyping

Page 12: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: tree with reference taxa E. coli, Humans, Yeast; label "Just Phylogeny"]

Phylotyping

Page 13: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"; OTU1, OTU2, OTU3, OTU4 shown on a tree with E. coli, Humans, Yeast; second tree labeled "Just Phylogeny"]

Phylotyping

Page 14: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

• OTUs
• Taxonomic lists
• Relative abundance of taxa
• Ecological metrics (alpha and beta diversity)

• Phylogenetic metrics
• Binning
• Identification of novel groups
• Clades
• Rates of change
• LGT
• Convergence
• PD
• Phylogenetic ecology (e.g., UniFrac)

Phylotyping
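Two of the OTU-based outputs listed on this slide, relative abundance of taxa and alpha diversity, are easy to sketch once reads have been counted per OTU. The example below is an illustration under assumptions: the counts are hypothetical, and the Shannon index is chosen here as one common alpha-diversity metric; the slide does not name a specific one.

```python
import math

def relative_abundance(counts):
    """Convert per-OTU read counts into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return {otu: n / total for otu, n in counts.items()}

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTUs with nonzero counts."""
    fractions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in fractions if p > 0)

# Hypothetical read counts per OTU for two samples.
sample = {"OTU1": 50, "OTU2": 30, "OTU3": 15, "OTU4": 5}
even = {"OTU1": 25, "OTU2": 25, "OTU3": 25, "OTU4": 25}

print(relative_abundance(sample))
print(f"H(sample) = {shannon(sample):.3f}")
print(f"H(even)   = {shannon(even):.3f}")
```

For a fixed number of OTUs the index is largest when all OTUs are equally abundant, so the even sample scores higher than the skewed one.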

Page 15: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

What’s New in Phylotyping


Page 16: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

What’s New in Phylotyping I

• More PCR products

• Deeper sequencing
• The rare biosphere
• Relative abundance estimates

• More samples (with barcoding)
• Time series
• Spatially diverse sampling
• Fine scale sampling

Page 17: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Earth Microbiome Project


Page 18: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting


Page 19: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Things You Could Do

• Mississippi River: 2320 miles long


Page 20: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Things You Could Do

• Mississippi River: 2320 miles long
• 1 site / mile
• 3 samples / site
• 6960 samples

• rRNA PCR w/ barcodes
• metagenomics w/ barcodes

• MiSeq run:
• 30 million sequence reads
• 4310 sequences / sample

• HiSeq 2000
• 6 billion sequence reads
• 862,068 sequences / sample

Page 21: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Things You Could Do

• Mississippi River: 12,249,600 feet long
• 1 site / 500 feet
• 3 samples / site
• 73497 samples

• rRNA PCR w/ barcodes
• metagenomics w/ barcodes

• MiSeq run:
• 30 million sequence reads
• 408 sequences / sample

• HiSeq 2000
• 6 billion sequence reads
• 81,635 sequences / sample
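The per-sample read depths on the two "Things You Could Do" slides follow from simple division. The sketch below reproduces that arithmetic; the read totals, river length, spacing, and samples per site come from the slides, while the function name `plan` and the choice to round down at each step are assumptions made so the results match the printed figures.

```python
MISEQ_READS = 30_000_000      # reads per MiSeq run, as stated on the slide
HISEQ_READS = 6_000_000_000   # reads per HiSeq 2000 run, as stated on the slide
RIVER_MILES = 2320
FEET_PER_MILE = 5280

def plan(river_length, site_spacing, samples_per_site=3):
    """Return (samples, MiSeq reads per sample, HiSeq reads per sample), rounded down."""
    samples = river_length * samples_per_site // site_spacing
    return samples, MISEQ_READS // samples, HISEQ_READS // samples

# One site per mile versus one site per 500 feet.
per_mile = plan(RIVER_MILES, 1)
per_500_feet = plan(RIVER_MILES * FEET_PER_MILE, 500)

print("1 site / mile:    ", per_mile)
print("1 site / 500 feet:", per_500_feet)
```

Going from one site per mile to one site per 500 feet multiplies the sample count by about ten and divides the reads available per sample by the same factor.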

Page 22: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

What’s New in Phylotyping II

• Metagenomics avoids biases of rRNA PCR

shotgun sequence

Page 23: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Metagenomic Phylotyping

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"; OTU1, OTU2, OTU3, OTU4 shown on a tree with E. coli, Humans, Yeast; second tree labeled "Just Phylogeny"]

Page 24: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenetic Challenge

??


Page 25: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenetic Challenge

??


Page 26: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenetic Challenge

Multiple approaches


Page 27: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 1: Each is an island


Page 28: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 1: Each is an island

• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time

Page 29: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 1: Each is an island

• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time

Page 30: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 1: Each is an island

• Build alignment, models, trees for full length seqs
• Analyze fragmented reads one at a time

Page 31: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

STAP

STAP database, and the query sequence is aligned to them using the CLUSTALW profile alignment algorithm [40] as described above for domain assignment. By adapting the profile alignment algorithm, the alignments from the STAP database remain intact, while gaps are inserted and nucleotides are trimmed for the query sequence according to the profile defined by the previous alignments from the databases. Thus the accuracy and quality of the alignment generated at this step depends heavily on the quality of the Bacterial/Archaeal ss-rRNA alignments from the Greengenes project or the Eukaryotic ss-rRNA alignments from the RDPII project.

Phylogenetic analysis using multiple sequence alignments rests on the assumption that the residues (nucleotides or amino acids) at the same position in every sequence in the alignment are homologous. Thus, columns in the alignment for which "positional homology" cannot be robustly determined must be excluded from subsequent analyses. This process of evaluating homology and eliminating questionable columns, known as masking, typically requires time-consuming, skillful, human intervention. We designed an automated masking method for ss-rRNA alignments, thus eliminating this bottleneck in high-throughput processing.

First, an alignment score is calculated for each aligned column by a method similar to that used in the CLUSTALX package [42]. Specifically, an R-dimensional sequence space representing all the possible nucleotide character states is defined. Then for each aligned column, the nucleotide populating that column in each of the aligned sequences is assigned a score in each of the R dimensions (Sr) according to the IUB matrix [42]. The consensus "nucleotide" for each column (X) also has R dimensions, with the score for each dimension (Xr) calculated as the average of the scores for that column in that dimension (average of Sr). Thus the score of the consensus nucleotide is a mathematical expression describing the average "nucleotide" in that column for that alignment.

Figure 2. Domain assignment. In Step 1, STAP assigns a domain to each query sequence based on its position in a maximum likelihood tree of representative ss-rRNA sequences. Because the tree illustrated here is not rooted, domain assignment would not be accurate and reliable (sequence similarity based methods cannot make an accurate assignment in this case either). However the figure illustrates an important role of the tree-based domain assignment step, namely automatic identification of deep-branching environmental ss-rRNAs. doi:10.1371/journal.pone.0002566.g002

Figure 1. A flow chart of the STAP pipeline. doi:10.1371/journal.pone.0002566.g001

ss-rRNA Taxonomy Pipeline

PLoS ONE | www.plosone.org 5 July 2008 | Volume 3 | Issue 7 | e2566

Wu et al. 2008 PLoS One
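The STAP excerpt describes a per-column consensus "nucleotide" whose score in each dimension is the average of the per-sequence scores in that dimension. The sketch below illustrates only that averaging step, under loud assumptions: R is taken to be 4 (A, C, G, T), and a one-hot vector stands in for the IUB-matrix lookup, which the excerpt does not spell out. It is not the STAP implementation, and it stops short of the masking decision, which the excerpt does not describe.

```python
ALPHABET = "ACGT"  # assumed R = 4 dimensions; gaps and ambiguity codes are ignored

def score_vector(base):
    """Stand-in for an IUB-matrix lookup: 1.0 in the base's own dimension, else 0.0."""
    return [1.0 if base == r else 0.0 for r in ALPHABET]

def consensus(column):
    """Average the per-sequence score vectors (S_r) to get the consensus (X_r)."""
    vectors = [score_vector(b) for b in column if b in ALPHABET]
    if not vectors:
        raise ValueError("column has no scorable nucleotides")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(ALPHABET))]

# Hypothetical three-column alignment of four sequences.
alignment = ["ACG", "ACT", "ATG", "ACG"]
columns = ["".join(col) for col in zip(*alignment)]
profiles = [consensus(col) for col in columns]

for col, profile in zip(columns, profiles):
    print(col, dict(zip(ALPHABET, profile)))
```

A fully conserved column gives a consensus that coincides with one base, while a mixed column gives a consensus spread across dimensions; that spread is the kind of signal a masking rule could act on.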


Each sequence analyzed separately


Page 32: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

AMPHORA

Guide tree

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151


Page 33: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylotyping w/ Proteins

Wu and Eisen Genome Biology 2008 9:R151 doi:10.1186/gb-2008-9-10-r151


Page 34: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 2: Most in the Family


Page 35: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylogenetic Challenge

??

xxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxx


Page 36: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 2: Most in family

One tree for those w/ overlap

xxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxx


Page 39: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure: Sargasso Phylotypes. Chart of Weighted % of Clones (axis ticks 0, 0.125, 0.250, 0.375, 0.500) by Major Phylogenetic Group: Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota. Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA]

Sargasso Phylotyping

Venter et al., Science 304: 66. 2004

Page 40: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

STAP, QIIME, Mothur


Combine all into one alignment


Page 41: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 3: All in the family


Page 42: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

??

Phylogenetic Challenge


Page 43: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

A single tree with everything?

Phylogenetic Challenge


Page 44: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

rRNA analysis

[Figure: two panels labeled A, B, C; steps "Cluster" and "OTUs"; OTU1, OTU2, OTU3, OTU4 shown on a tree with E. coli, Humans, Yeast; second tree labeled "Just Phylogeny"]

Page 45: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

alignment used to build the profile, resulting in a multiplesequence alignment of full-length reference sequences andmetagenomic reads. The final step of the alignment process is aquality control filter that 1) ensures that only homologous SSU-rRNA sequences from the appropriate phylogenetic domain areincluded in the final alignment, and 2) masks highly gappedalignment columns (see Text S1).We use this high quality alignment of metagenomic reads and

references sequences to construct a fully-resolved, phylogenetictree and hence determine the evolutionary relationships betweenthe reads. Reference sequences are included in this stage of theanalysis to guide the phylogenetic assignment of the relativelyshort metagenomic reads. While the software can be easilyextended to incorporate a number of different phylogenetic toolscapable of analyzing metagenomic data (e.g., RAxML [27],pplacer [28], etc.), PhylOTU currently employs FastTree as adefault method due to its relatively high speed-to-performanceratio and its ability to construct accurate trees in the presence ofhighly-gapped data [29]. After construction of the phylogeny,lineages representing reference sequences are pruned from thetree. The resulting phylogeny of metagenomic reads is then used tocompute a PD distance matrix in which the distance between apair of reads is defined as the total tree path distance (i.e., branchlength) separating the two reads [30]. This tree-based distancematrix is subsequently used to hierarchically cluster metagenomicreads via MOTHUR into OTUs in a fashion similar to traditionalPID-based analysis [31]. As with PID clustering, the hierarchicalalgorithm can be tuned to produce finer or courser clusters,corresponding to different taxonomic levels, by adjusting theclustering threshold and linkage method.To evaluate the performance of PhylOTU, we employed

statistical comparisons of distance matrices and clustering resultsfor a variety of data sets. These investigations aimed 1) to compare

PD versus PID clustering, 2) to explore overlap between PhylOTUclusters and recognized taxonomic designations, and 3) to quantifythe accuracy of PhylOTU clusters from shotgun reads relative tothose obtained from full-length sequences.

PhylOTU Clusters Recapitulate PID Clusters

We sought to identify how PD-based clustering compares to commonly employed PID-based clustering methods by applying the two methods to the same set of sequences. Both PID-based clustering and PhylOTU may be used to identify OTUs from overlapping sequences. Therefore we applied both methods to a dataset of 508 full-length bacterial SSU-rRNA sequences (reference sequences; see above) obtained from the Ribosomal Database Project (RDP) [25]. Recent work has demonstrated that PID is more accurately calculated from pairwise alignments than multiple sequence alignments [32–33], so we used ESPRIT, which implements pairwise alignments, to obtain a PID distance matrix for the reference sequences [32]. We used PhylOTU to compute a PD distance matrix for the same data. Then, we used MOTHUR to hierarchically cluster sequences into OTUs based on both PID and PD. For each of the two distance matrices, we employed a range of clustering thresholds and three different definitions of linkage in the hierarchical clustering algorithm: nearest-neighbor, average, and furthest-neighbor.

To statistically evaluate the similarity of cluster composition between each pair of clustering results, we used two summary statistics that together capture the frequency with which sequences are co-clustered in both analyses: true conjunction rate (i.e., the proportion of pairs of sequences derived from the same cluster in the first analysis that also are clustered together in the second analysis) and true disjunction rate (i.e., the proportion of pairs of sequences derived from different clusters in the first analysis that also are not clustered together in the second analysis) (see Methods

Figure 1. PhylOTU Workflow. Computational processes are represented as squares and databases are represented as cylinders in this generalized workflow of PhylOTU. See Results section for details. doi:10.1371/journal.pcbi.1001061.g001
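The true conjunction rate and true disjunction rate defined in the text above can be computed directly from two cluster assignments. The sketch below follows the definitions as quoted; the function name, the sequence IDs, and the cluster labels are hypothetical, and the paper's Methods may handle edge cases differently.

```python
from itertools import combinations


def co_clustering_rates(first, second):
    """first, second: dicts mapping sequence ID -> cluster label.

    Returns (true conjunction rate, true disjunction rate), both measured
    relative to the pairs defined by the first analysis.
    """
    same_first = same_both = diff_first = diff_both = 0
    for a, b in combinations(sorted(first), 2):
        together_first = first[a] == first[b]
        together_second = second[a] == second[b]
        if together_first:
            same_first += 1
            same_both += together_second        # also together in second analysis
        else:
            diff_first += 1
            diff_both += not together_second    # also apart in second analysis
    tcr = same_both / same_first if same_first else float("nan")
    tdr = diff_both / diff_first if diff_first else float("nan")
    return tcr, tdr


# Invented example: four sequences clustered two different ways.
pid_clusters = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
pd_clusters = {"s1": "x", "s2": "x", "s3": "x", "s4": "y"}
tcr, tdr = co_clustering_rates(pid_clusters, pd_clusters)
```

Both rates equal 1 when the two analyses produce identical partitions, so values below 1 indicate where the PID and PD clusterings disagree.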

Finding Metagenomic OTUs

PLoS Computational Biology | www.ploscompbiol.org 3 January 2011 | Volume 7 | Issue 1 | e1001061

Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O'Dwyer JP, Green JL, Eisen JA, Pollard KS. (2011) PhylOTU: A High-Throughput Procedure Quantifies Microbial Community Diversity and Resolves Novel Taxa from Metagenomic Data. PLoS Comput Biol 7(1): e1001061. doi:10.1371/journal.pcbi.1001061

PhylOTU


Page 46: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Figure panels: GOS 1, GOS 2, GOS 3, GOS 4, GOS 5]

RecA, RpoB in GOS

Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, et al. (2011) Stalking the Fourth Domain in Metagenomic Data: Searching for, Discovering, and Interpreting Novel, Deep Branches in Marker Gene Phylogenetic Trees. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011


Page 47: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Phylosift/ pplacer

Aaron Darling, Guillaume Jospin, Holly Bik, Erik Matsen, Eric Lowe, and others


Page 48: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Method 4: All in the genome


Page 49: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Multiple Genes?

A single tree with everything?


Page 50: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Kembel Combiner

Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214


Page 51: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Kembel Combiner

cally defined by a sequence similarity threshold) in the sample as equally related. Newer β diversity measures that incorporate phylogenetic information are more powerful because they account for the degree of divergence between sequences (13, 18, 29, 30). Phylogenetic β diversity measures can also be either quantitative or qualitative depending on whether abundance is taken into account. The original, unweighted UniFrac measure (13) is a qualitative measure. Unweighted UniFrac measures the distance between two communities by calculating the fraction of the branch length in a phylogenetic tree that leads to descendants in either, but not both, of the two communities (Fig. 1A). The fixation index (FST), which measures the distance between two communities by comparing the genetic diversity within each community to the total genetic diversity of the communities combined (18), is a quantitative measure that accounts for different levels of divergence between sequences. The phylogenetic test (P test), which measures the significance of the association between environment and phylogeny (18), is typically used as a qualitative measure because duplicate sequences are usually removed from the tree. However, the P test may be used in a semiquantitative manner if all clones, even those with identical or near-identical sequences, are included in the tree (13).

Here we describe a quantitative version of UniFrac that we call “weighted UniFrac.” We show that weighted UniFrac behaves similarly to the FST test in situations where both are applicable. However, weighted UniFrac has a major advantage over FST because it can be used to combine data in which different parts of the 16S rRNA were sequenced (e.g., when nonoverlapping sequences can be combined into a single tree using full-length sequences as guides). We use two different data sets to illustrate how analyses with quantitative and qualitative β diversity measures can lead to dramatically different conclusions about the main factors that structure microbial diversity. Specifically, qualitative measures that disregard relative abundance can better detect effects of different founding populations, such as the source of bacteria that first colonize the gut of newborn mice and the effects of factors that are restrictive for microbial growth such as temperature. In contrast, quantitative measures that account for the relative abundance of microbial lineages can reveal the effects of more transient factors such as nutrient availability.

MATERIALS AND METHODS

Weighted UniFrac. Weighted UniFrac is a new variant of the original unweighted UniFrac measure that weights the branches of a phylogenetic tree based on the abundance of information (Fig. 1B). Weighted UniFrac is thus a quantitative measure of β diversity that can detect changes in how many sequences from each lineage are present, as well as detect changes in which taxa are present. This ability is important because the relative abundance of different kinds of bacteria can be critical for describing community changes. In contrast, the original, unweighted UniFrac (Fig. 1A) is a qualitative β diversity measure because duplicate sequences contribute no additional branch length to the tree (by definition, the branch length that separates a pair of duplicate sequences is zero, because no substitutions separate them).

The first step in applying weighted UniFrac is to calculate the raw weighted UniFrac value (u), according to the first equation:

$$u = \sum_{i}^{n} b_i \times \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$$

Here, $n$ is the total number of branches in the tree, $b_i$ is the length of branch $i$, $A_i$ and $B_i$ are the numbers of sequences that descend from branch $i$ in communities A and B, respectively, and $A_T$ and $B_T$ are the total numbers of sequences in communities A and B, respectively. In order to control for unequal sampling effort, $A_i$ and $B_i$ are divided by $A_T$ and $B_T$.
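As a worked illustration of the raw weighted UniFrac sum, the sketch below evaluates it on a small invented tree. The tree encoding (node mapped to parent and branch length), the taxon names, and the sequence counts are assumptions made for this example; this is not the UniFrac software itself.

```python
# Hypothetical toy tree: node -> (parent, branch length). "root" has no entry.
TREE = {
    "x": ("root", 1.0),
    "t1": ("x", 0.5),
    "t2": ("x", 0.5),
    "t3": ("root", 2.0),
}


def raw_weighted_unifrac(tree, counts_a, counts_b):
    """counts_a, counts_b: leaf -> number of sequences observed in A and B.

    Every leaf named in the counts is assumed to be a node in `tree`.
    """
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    below_a = {node: 0 for node in tree}
    below_b = {node: 0 for node in tree}
    for counts, below in ((counts_a, below_a), (counts_b, below_b)):
        for leaf, n in counts.items():
            node = leaf
            while node in tree:        # walk from the leaf up to the root
                below[node] += n       # the branch above `node` leads to these n sequences
                node = tree[node][0]
    # Sum over branches of: length * |A_i / A_T - B_i / B_T|
    return sum(
        length * abs(below_a[node] / total_a - below_b[node] / total_b)
        for node, (_, length) in tree.items()
    )


counts_a = {"t1": 2, "t2": 2}
counts_b = {"t2": 1, "t3": 3}
u = raw_weighted_unifrac(TREE, counts_a, counts_b)
```

For these counts the four branches contribute 0.75, 0.25, 0.125, and 1.5, so u is 2.625; two identical communities give u = 0.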

If the phylogenetic tree is not ultrametric (i.e., if different sequences in the sample have evolved at different rates), clustering with weighted UniFrac will place more emphasis on communities that contain quickly evolving taxa. Since these taxa are assigned more branch length, a comparison of the communities that contain them will tend to produce higher values of u. In some situations, it may be desirable to normalize u so that it has a value of 0 for identical communities and 1 for nonoverlapping communities. This is accomplished by dividing u by a scaling factor (D), which is the average distance of each sequence from the root, as shown in the following equation:

$$D = \sum_{j}^{n} d_j \times \left( \frac{A_j}{A_T} + \frac{B_j}{B_T} \right)$$

Here, $d_j$ is the distance of sequence $j$ from the root, $A_j$ and $B_j$ are the numbers of times the sequences were observed in communities A and B, respectively, and $A_T$ and $B_T$ are the total numbers of sequences from communities A and B, respectively.
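The scaling factor D can be sketched the same way. The toy tree, taxon names, and counts below are invented for illustration, and the raw value `raw_u` is worked out by hand for that toy tree from the first equation rather than produced by any published tool.

```python
# Hypothetical toy tree: node -> (parent, branch length). "root" has no entry.
TREE = {
    "x": ("root", 1.0),
    "t1": ("x", 0.5),
    "t2": ("x", 0.5),
    "t3": ("root", 2.0),
}


def root_distance(leaf, tree):
    """Sum of branch lengths from a leaf up to the root (d_j)."""
    dist, node = 0.0, leaf
    while node in tree:
        parent, length = tree[node]
        dist += length
        node = parent
    return dist


def scaling_factor(tree, counts_a, counts_b):
    """D = sum over sequences j of d_j * (A_j / A_T + B_j / B_T)."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    leaves = sorted(set(counts_a) | set(counts_b))
    return sum(
        root_distance(leaf, tree)
        * (counts_a.get(leaf, 0) / total_a + counts_b.get(leaf, 0) / total_b)
        for leaf in leaves
    )


counts_a = {"t1": 2, "t2": 2}
counts_b = {"t2": 1, "t3": 3}
D = scaling_factor(TREE, counts_a, counts_b)

raw_u = 2.625                 # hand-computed raw weighted UniFrac for these counts
normalized_u = raw_u / D      # 0 for identical, 1 for nonoverlapping communities
```

Here D is 3.375, so the normalized value is 2.625 / 3.375, or about 0.78. With one community holding only t1 and the other only t3, both u and D equal 3.5 and the normalized value is 1.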

Clustering with normalized u values treats each sample equally instead of

TABLE 1. Measurements of diversity

Measure | Measurement of α diversity | Measurement of β diversity
Only presence/absence of taxa considered | Qualitative (species richness) | Qualitative
Additionally accounts for the no. of times that each taxon was observed | Quantitative (species richness and evenness) | Quantitative

FIG. 1. Calculation of the unweighted and the weighted UniFrac measures. Squares and circles represent sequences from two different environments. (a) In unweighted UniFrac, the distance between the circle and square communities is calculated as the fraction of the branch length that has descendants from either the square or the circle environment (black) but not both (gray). (b) In weighted UniFrac, branch lengths are weighted by the relative abundance of sequences in the square and circle communities; square sequences are weighted twice as much as circle sequences because there are twice as many total circle sequences in the data set. The width of branches is proportional to the degree to which each branch is weighted in the calculations, and gray branches have no weight. Branches 1 and 2 have heavy weights since the descendants are biased toward the square and circles, respectively. Branch 3 contributes no value since it has an equal contribution from circle and square sequences after normalization.
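The unweighted calculation in panel (a) reduces to a ratio of branch lengths. The following sketch assumes a toy tree and invented taxon names, and takes the denominator to be the total length of branches that lead to at least one observed sequence; it is an illustration of the definition, not the UniFrac program.

```python
# Hypothetical toy tree: node -> (parent, branch length). "root" has no entry.
TREE = {
    "x": ("root", 1.0),
    "t1": ("x", 0.5),
    "t2": ("x", 0.5),
    "t3": ("root", 2.0),
}


def unweighted_unifrac(tree, leaves_a, leaves_b):
    """Fraction of branch length leading to descendants in only one community."""

    def branches_above(leaves):
        seen = set()
        for node in leaves:
            # Stop early once we reach a branch already recorded.
            while node in tree and node not in seen:
                seen.add(node)
                node = tree[node][0]
        return seen

    in_a, in_b = branches_above(leaves_a), branches_above(leaves_b)
    unique = sum(tree[n][1] for n in in_a ^ in_b)   # either, but not both
    total = sum(tree[n][1] for n in in_a | in_b)    # any observed descendant
    return unique / total


distance = unweighted_unifrac(TREE, {"t1", "t2"}, {"t2", "t3"})
```

In this example the branches to t1 and t3 are unique (length 2.5) out of a total of 4.0, giving 0.625. Abundance plays no role, which is why the measure is qualitative.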

VOL. 73, 2007 PHYLOGENETICALLY COMPARING MICROBIAL COMMUNITIES 1577

Kembel SW, Eisen JA, Pollard KS, Green JL (2011) The Phylogenetic Diversity of Metagenomes. PLoS ONE 6(8): e23214. doi:10.1371/journal.pone.0023214


Page 52: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Uses of Phylogeny in Genomics and Metagenomics

Example 2:

Functional Diversity and Functional Predictions


Page 53: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: METHOD flowchart with two worked examples (EXAMPLE A and EXAMPLE B) for genes from Species 1, Species 2, and Species 3. Steps: CHOOSE GENE(S) OF INTEREST; IDENTIFY HOMOLOGS; ALIGN SEQUENCES; CALCULATE GENE TREE; OVERLAY KNOWN FUNCTIONS ONTO TREE; INFER LIKELY FUNCTION OF GENE(S) OF INTEREST. Bottom panel: ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Tree labels: genes 1 to 6 in one example and 1A, 2A, 3A, 1B, 2B, 3B in the other; annotations "Duplication?", "Duplication", and "Ambiguous".]

Based on Eisen, 1998 Genome Res 8: 163-167.


Page 54: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Diversity of Proteorhodopsins

Venter et al., 2004. Science 304: 66.


Page 55: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Improving Functional Predictions

• Same methods discussed for phylotyping improve phylogenomic functional prediction for protein families

• Increase in sequence diversity helps too


Page 56: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

NMF in Metagenomes

Characterizing the niche-space distributions of components

[Figure 3, panels (a), (b), (c). Rows: GOS sampling sites from the North American East Coast, Eastern Tropical Pacific, Polynesia Archipelagos, Galapagos Islands, Indian Ocean, Sargasso Sea, and Caribbean Sea. Panel (a) columns: Component 1 to Component 5. Panel (c) columns: Salinity, Sample Depth, Chlorophyll, Temperature, Insolation, Water Depth. Color scales: 0.1 to 0.6 and 0.2 to 1.0. Legends: General (High, Medium, Low, NA); Water depth (>4000m, 2000-4000m, 900-2000m, 100-200m, 20-100m, 0-20m).]

Figure 3: a) Niche-space distributions for our five components ($H^T$); b) the site-similarity matrix ($H^T H$); c) environmental variables for the sites. The matrices are aligned so that the same row corresponds to the same site in each matrix. Sites are ordered by applying spectral reordering to the similarity matrix (see Materials and Methods). Rows are aligned across the three matrices.

Figure 3a shows the estimated niche-space distribution for each of the five components. Components 2 (Photosystem) and 4 (Unidentified) are broadly distributed; Components 1 (Signalling) and 5 (Unidentified) are largely restricted to a handful of sites; and component 3 shows an intermediate pattern. There is a great deal of overlap between niche-space distributions for different components.

Figure 3b shows the pattern of filtered similarity between sites. We see clear patterns of grouping, that do not emerge when we calculate functional distances without filtering, or using PCA rather than NMF filtering (Figure 3 in Text S1). As with the Pfams, we see clusters roughly associated with our components, but there is more overlapping than with the Pfam clusters (Figure 2b).

Figure 3c shows the distribution of environmental variables measured at each site. Inspection of Figure 3 reveals qualitative correspondence between environmental factors and clusters of similar sites in the similarity matrix. For example, the “North American East Coast” samples are divided into two groups, one in the top left and the other in the bottom right of the similarity matrix. Inspection of the environmental features suggests that the split in these samples could be mostly due to the differences in insolation and water depth.
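To make the roles of $H^T$ and $H^T H$ concrete, here is a minimal pure-Python sketch of non-negative matrix factorization. It assumes a functions-by-sites abundance matrix V factored as V ≈ WH and uses the standard Lee and Seung multiplicative updates as a stand-in, since the excerpt does not state which NMF algorithm was used; the matrix values, rank, and function names are all invented.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (functions x sites) as W (functions x k) times H (k x sites)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


# Invented abundance matrix: 4 protein families (rows) across 3 sites (columns).
V = [[5, 4, 0], [4, 5, 0], [0, 1, 6], [0, 0, 5]]
W, H = nmf(V, k=2)
niche_space = transpose(H)                # sites x components, the H^T of Figure 3a
site_similarity = matmul(niche_space, H)  # sites x sites, the H^T H of Figure 3b
```

Each row of `niche_space` describes how strongly a site expresses each component, and `site_similarity` compares sites through those components rather than through the raw abundances.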

We can also examine patterns of similarity between the components themselves, using niche-site distributions or functional profiles (see Figure 5 in Text S1). All 5

Non-negative matrix factorization

Jiang et al. In press PLoS One.

w/ Weitz, Dushoff, Langille, Neches, Levin, etc


Page 57: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Uses of Phylogeny in Genomics and Metagenomics

Example 3:

Selecting Organisms for Study


Page 58: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA

http://www.jgi.doe.gov/programs/GEBA/pilot.html


Page 59: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA: Components

• Project overview (Phil Hugenholtz, Nikos Kyrpides, Jonathan Eisen, Eddy Rubin, Jim Bristow)
• Project management (David Bruce, Eileen Dalin, Lynne Goodwin)
• Culture collection and DNA prep (DSMZ, Hans-Peter Klenk)
• Sequencing and closure (Eileen Dalin, Susan Lucas, Alla Lapidus, Mat Nolan, Alex Copeland, Cliff Han, Feng Chen, Jan-Fang Cheng)
• Annotation and data release (Nikos Kyrpides, Victor Markowitz, et al)
• Analysis (Dongying Wu, Kostas Mavrommatis, Martin Wu, Victor Kunin, Neil Rawlings, Ian Paulsen, Patrick Chain, Patrik D’Haeseleer, Sean Hooper, Iain Anderson, Amrita Pati, Natalia N. Ivanova, Athanasios Lykidis, Adam Zemla)
• Adopt a microbe education project (Cheryl Kerfeld)
• Outreach (David Gilbert)
• $$$ (DOE, Eddy Rubin, Jim Bristow)


Page 60: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA Now

• 300+ genomes
• Rich sampling of major groups of cultured organisms


Page 61: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA Lesson 1


Page 62: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Protein Family Rarefaction

• Take data set of multiple complete genomes
• Identify all protein families using MCL
• Plot # of genomes vs. # of protein families
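The rarefaction steps listed above can be sketched once each genome has been reduced to its set of protein family IDs. The sketch assumes the MCL clustering has already been run elsewhere; the genome names, family IDs, and the choice to average over random genome orderings are illustrative assumptions rather than the procedure used for the slides.

```python
import random


def family_rarefaction(genome_families, permutations=100, seed=0):
    """genome_families: dict genome -> set of protein family IDs.

    Returns the mean number of distinct families seen after sampling
    1, 2, ..., N genomes, averaged over random genome orderings.
    """
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    totals = [0] * len(genomes)
    for _ in range(permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for i, genome in enumerate(order):
            seen |= genome_families[genome]   # add this genome's families
            totals[i] += len(seen)
    return [t / permutations for t in totals]


# Invented example: three genomes with overlapping family content.
example = {
    "genome_1": {"fam1", "fam2", "fam3"},
    "genome_2": {"fam2", "fam3", "fam4"},
    "genome_3": {"fam1", "fam5"},
}
curve = family_rarefaction(example)
```

Plotting `curve` against the genome count 1..N gives the rarefaction curve; a curve that keeps climbing suggests that additional genomes are still contributing new protein families.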


Page 68: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Synapomorphies exist

Wu et al. 2009 Nature 462, 1056-1060


Page 69: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA Lesson 2


Page 70: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Bar chart: Sargasso Phylotypes. x-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). y-axis: Weighted % of Clones (0 to 0.500). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

Metagenomic Phylotyping

GEBA benefits phylotyping & functional prediction

Venter et al., Science 304: 66-74. 2004

Page 71: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA improves genome annotation

• Took 56 GEBA genomes and compared results vs. 56 randomly sampled new genomes
• Better definition of protein family sequence “patterns”
• Greatly improves “comparative” and “evolutionary” based predictions
• Conversion of hypothetical into conserved hypotheticals
• Linking distantly related members of protein families
• Improved non-homology prediction


Page 72: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

[Bar chart: Sargasso Phylotypes. x-axis: Major Phylogenetic Group (Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Deltaproteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Chlorobi, CFB, Chloroflexi, Spirochaetes, Fusobacteria, Deinococcus-Thermus, Euryarchaeota, Crenarchaeota). y-axis: Weighted % of Clones (0 to 0.500). Series: EFG, EFTu, HSP70, RecA, RpoB, rRNA.]

But not a lot

Metagenomic Phylotyping

Venter et al., Science 304: 66-74. 2004

Page 73: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Improving Functional Predictions


Page 74: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Sifting Families

[Workflow figure, boxes as labeled: Representative Genomes; Extract Protein Annotation; All v. All BLAST; Homology Clustering (MCL); SFams; Align & Build HMMs; HMMs; Screen for Homologs; New Genomes; Extract Protein Annotation]

Figure 1, Sharpton et al. submitted


Page 75: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Improving Phylotyping


Page 76: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

More Markers

Phylogenetic group | Genome Number | Gene Number | Marker Candidates
Archaea | 62 | 145415 | 106
Actinobacteria | 63 | 267783 | 136
Alphaproteobacteria | 94 | 347287 | 121
Betaproteobacteria | 56 | 266362 | 311
Gammaproteobacteria | 126 | 483632 | 118
Deltaproteobacteria | 25 | 102115 | 206
Epsilonproteobacteria | 18 | 33416 | 455
Bacteroides | 25 | 71531 | 286
Chlamydiae | 13 | 13823 | 560
Chloroflexi | 10 | 33577 | 323
Cyanobacteria | 36 | 124080 | 590
Firmicutes | 106 | 312309 | 87
Spirochaetes | 18 | 38832 | 176
Thermi | 5 | 14160 | 974
Thermotogae | 9 | 17037 | 684


Page 77: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Better Reference Tree

Morgan et al. submitted


Page 78: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA Lesson 3

We have still only scratched the surface of microbial diversity


Page 80: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

76

Number of SAGs from Candidate Phyla

Site | OD1 | OP11 | OP3 | SAR406
Site A: Hydrothermal vent | 4 | 1 | - | -
Site B: Gold Mine | 6 | 13 | 2 | -
Site C: Tropical gyres (Mesopelagic) | - | - | - | 2
Site D: Tropical gyres (Photic zone) | 1 | - | - | -

Sample collections at 4 additional sites are underway.

Phil Hugenholtz

GEBA uncultured


Page 81: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

GEBA Lesson 4

Need Experiments from Across the Tree of Life too


Page 82: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Conclusion


Page 83: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting


Page 84: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

MICROBES


Page 85: "Phylogenomic approaches to microbial diversity" Talk by Jonathan Eisen at #IlluminaBayArea meeting

Acknowledgements

• $$$
• DOE
• NSF
• GBMF
• Sloan
• DARPA
• DSMZ
• DHS

• People, places
• DOE JGI: Eddy Rubin, Phil Hugenholtz, Nikos Kyrpides
• UC Davis: Aaron Darling, Dongying Wu, Holly Bik, Russell Neches, Jenna Morgan-Lang
• Other: Jessica Green, Katie Pollard, Martin Wu, Tom Slezak, Jack Gilbert, Steven Kembel, J. Craig Venter, Naomi Ward, Hans-Peter Klenk
