uc davis eve161 lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 Lecture 18: EVE 161: Microbial Phylogenomics Lecture #18: Era IV: Metagenomics Case Study UC Davis, Winter 2014 Instructor: Jonathan Eisen 1

Upload: jonathan-eisen

Post on 10-May-2015


Category:

Education



DESCRIPTION

Slides for Lecture 18 in EVE 161 Course by Jonathan Eisen at UC Davis

TRANSCRIPT

Page 1: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Lecture 18:

EVE 161: Microbial Phylogenomics

Lecture #18:

Era IV: Metagenomics Case Study

UC Davis, Winter 2014 Instructor: Jonathan Eisen

Page 2: UC Davis EVE161 Lecture 18 by @phylogenomics


• Next week Student Presentations

• Each student gets 10 minutes total

• Eight minutes to present and 2 minutes for questions

• Possible presentation timing
  • 2 minutes Overview and Methods
  • 4 minutes R & D
  • 2 minutes Conclusions and Future Ideas
  • 2 minutes Questions

• Contact Holly Ganz [email protected] for non-Eisen guidance

Page 3: UC Davis EVE161 Lecture 18 by @phylogenomics


ARTICLES

A human gut microbial gene catalogue established by metagenomic sequencing

Junjie Qin1*, Ruiqiang Li1*, Jeroen Raes2,3, Manimozhiyan Arumugam2, Kristoffer Solvsten Burgdorf4, Chaysavanh Manichanh5, Trine Nielsen4, Nicolas Pons6, Florence Levenez6, Takuji Yamada2, Daniel R. Mende2, Junhua Li1,7, Junming Xu1, Shaochuan Li1, Dongfang Li1,8, Jianjun Cao1, Bo Wang1, Huiqing Liang1, Huisong Zheng1, Yinlong Xie1,7, Julien Tap6, Patricia Lepage6, Marcelo Bertalan9, Jean-Michel Batto6, Torben Hansen4, Denis Le Paslier10, Allan Linneberg11, H. Bjørn Nielsen9, Eric Pelletier10, Pierre Renault6, Thomas Sicheritz-Ponten9, Keith Turner12, Hongmei Zhu1, Chang Yu1, Shengting Li1, Min Jian1, Yan Zhou1, Yingrui Li1, Xiuqing Zhang1, Songgang Li1, Nan Qin1, Huanming Yang1, Jian Wang1, Søren Brunak9, Joel Dore6, Francisco Guarner5, Karsten Kristiansen13, Oluf Pedersen4,14, Julian Parkhill12, Jean Weissenbach10, MetaHIT Consortium†, Peer Bork2, S. Dusko Ehrlich6 & Jun Wang1,13

To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.

It has been estimated that the microbes in our bodies collectively make up to 100 trillion cells, tenfold the number of human cells, and suggested that they encode 100-fold more unique genes than our own genome1. The majority of microbes reside in the gut, have a profound influence on human physiology and nutrition, and are crucial for human life2,3. Furthermore, the gut microbes contribute to energy harvest from food, and changes of gut microbiome may be associated with bowel diseases or obesity4–8.

To understand and exploit the impact of the gut microbes on human health and well-being it is necessary to decipher the content, diversity and functioning of the microbial gut community. 16S ribosomal RNA gene (rRNA) sequence-based methods9 revealed that two bacterial divisions, the Bacteroidetes and the Firmicutes, constitute over 90% of the known phylogenetic categories and dominate the distal gut microbiota10. Studies also showed substantial diversity of the gut microbiome between healthy individuals4,8,10,11. Although this difference is especially marked among infants12, later in life the gut microbiome converges to more similar phyla.

Metagenomic sequencing represents a powerful alternative to rRNA sequencing for analysing complex microbial communities13–15. Applied to the human gut, such studies have already generated some 3 gigabases (Gb) of microbial sequence from faecal samples of 33 individuals from the United States or Japan8,16,17. To get a broader overview of the human gut microbial genes we used the Illumina Genome Analyser (GA) technology to carry out deep sequencing of total DNA from faecal samples of 124 European adults. We generated 576.7 Gb of sequence, almost 200 times more than in all previous studies, assembled it into contigs and predicted 3.3 million unique open reading frames (ORFs). This gene catalogue contains virtually all of the prevalent gut microbial genes in our cohort, provides a broad view of the functions important for bacterial life in the gut and indicates that many bacterial species are shared by different individuals. Our results also show that short-read metagenomic sequencing can be used for global characterization of the genetic potential of ecologically complex environments.

Metagenomic sequencing of gut microbiomes

As part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project, we collected faecal specimens from 124 healthy, overweight and obese individual human adults, as well as inflammatory bowel disease (IBD) patients, from Denmark and Spain (Supplementary Table 1). Total DNA was extracted from the faecal specimens18 and an average of 4.5 Gb (ranging between 2 and 7.3 Gb) of sequence was generated for each sample, allowing us to capture most of the

*These authors contributed equally to this work.
†Lists of authors and affiliations appear at the end of the paper.

1BGI-Shenzhen, Shenzhen 518083, China. 2European Molecular Biology Laboratory, 69117 Heidelberg, Germany. 3VIB—Vrije Universiteit Brussel, 1050 Brussels, Belgium. 4Hagedorn Research Institute, DK 2820 Copenhagen, Denmark. 5Hospital Universitari Val d’Hebron, Ciberehd, 08035 Barcelona, Spain. 6Institut National de la Recherche Agronomique, 78350 Jouy en Josas, France. 7School of Software Engineering, South China University of Technology, Guangzhou 510641, China. 8Genome Research Institute, Shenzhen University Medical School, Shenzhen 518000, China. 9Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark. 10Commissariat a l’Energie Atomique, Genoscope, 91000 Evry, France. 11Research Center for Prevention and Health, DK-2600 Glostrup, Denmark. 12The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. 13Department of Biology, University of Copenhagen, DK-2200 Copenhagen, Denmark. 14Institute of Biomedical Sciences, University of Copenhagen & Faculty of Health Science, University of Aarhus, 8000 Aarhus, Denmark.

Vol 464 | 4 March 2010 | doi:10.1038/nature08821

©2010 Macmillan Publishers Limited. All rights reserved

Page 4: UC Davis EVE161 Lecture 18 by @phylogenomics


THAT'S A LOT OF AUTHORS

Page 5: UC Davis EVE161 Lecture 18 by @phylogenomics


METHODS

Page 6: UC Davis EVE161 Lecture 18 by @phylogenomics


Human faecal samples were collected, frozen immediately and DNA was purified by standard methods22. For all 124 individuals, paired-end libraries were constructed with different clone insert sizes and subjected to Illumina GA sequencing. All reads were assembled using SOAPdenovo19, with specific parameter ‘-M 3’ for metagenomics data. MetaGene was used for gene prediction. A non-redundant gene set was constructed by pair-wise comparison of all genes, using BLAT36 under the criteria of identity >95% and overlap >90%. Gene taxonomic assignments were made on the basis of BLASTP37 search (e-value <1 × 10⁻⁵) of the NCBI-NR database and 126 known gut bacteria genomes. Gene functional annotations were made by BLASTP search (e-value <1 × 10⁻⁵) with eggNOG and KEGG (v48.2) databases. The total and shared number of orthologous groups and/or gene families were computed using a random combination of n individuals (with n = 2 to 124, 100 replicates per bin).
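The last step of the methods above (total and shared gene families over random combinations of n individuals, 100 replicates per bin) can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the input format (a dict mapping each sample to the set of families detected in it), the function name `rarefy` and the toy data are all assumptions.

```python
import random

def rarefy(families_by_sample, n, replicates=100, seed=0):
    """Mean total (union) and shared (intersection) family counts
    over random draws of n samples. The input format is hypothetical."""
    rng = random.Random(seed)
    samples = list(families_by_sample)
    totals, shared = [], []
    for _ in range(replicates):
        chosen = rng.sample(samples, n)  # n distinct individuals
        sets = [families_by_sample[s] for s in chosen]
        totals.append(len(set.union(*sets)))
        shared.append(len(set.intersection(*sets)))
    return sum(totals) / replicates, sum(shared) / replicates

# Toy data standing in for per-individual gene family sets.
toy = {
    "A": {"f1", "f2", "f3"},
    "B": {"f2", "f3", "f4"},
    "C": {"f3", "f5"},
}

for n in range(2, len(toy) + 1):
    mean_total, mean_shared = rarefy(toy, n)
    print(n, mean_total, mean_shared)
```

Plotting the two means against n would give the familiar pair of curves: the total rising toward a plateau and the shared count falling toward a core.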

Page 7: UC Davis EVE161 Lecture 18 by @phylogenomics


As part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project, we collected faecal specimens from 124 healthy, overweight and obese individual human adults, as well as inflammatory bowel disease (IBD) patients, from Denmark and Spain (Supplementary Table 1).


Page 9: UC Davis EVE161 Lecture 18 by @phylogenomics


Supplementary Tables

Table 1 | DNA sample information.

All Danish individuals in the present subsample were originally recruited from a larger population-based sample of middle-aged people living in the northern part of Copenhagen region and sampled from the centralized personal number register. At the original recruitment the individuals included in the present study had normal fasting plasma glucose and normal 2 hour plasma glucose following an oral glucose tolerance test. At the time of fecal sampling all were examined in the fasting state and had non-diabetic fasting plasma glucose levels below 7.0 mmol/l. All of the IBD patients were in clinical remission at the time of fecal sampling. N refers to no IBD, CD & UC to Crohn’s disease and ulcerative colitis, respectively.

Sample Name Country Gender Age BMI IBD

MH0001 Denmark female 49 25.55 N

MH0002 Denmark female 59 27.28 N

MH0003 Denmark male 69 33.19 N

MH0004 Denmark male 59 31.18 N

MH0005 Denmark male 64 21.68 N

MH0006 Denmark female 59 22.38 N

MH0007 Denmark male 69 33.60 N

MH0008 Denmark male 59 24.35 N

MH0009 Denmark male 64 29.04 N

MH0010 Denmark male 64 33.27 N

MH0011 Denmark female 0 22.31 N

MH0012 Denmark female 42 32.10 N

MH0013 Denmark male 54 20.46 N

MH0014 Denmark female 54 38.49 N

MH0015 Denmark male 59 25.47 N

MH0016 Denmark female 49 30.50 N

MH0017 Denmark male 64 21.81 N

MH0018 Denmark male 49 31.37 N

MH0019 Denmark female 44 20.01 N

MH0020 Denmark female 63 33.23 N

MH0021 Denmark female 49 25.42 N

MH0022 Denmark male 64 24.42 N

MH0023 Denmark male 69 31.74 N

MH0024 Denmark female 59 22.72 N

MH0025 Denmark female 49 34.20 N

MH0026 Denmark female 49 37.32 N

MH0027 Denmark female 59 23.07 N

MH0028 Denmark female 44 22.70 N

MH0030 Denmark male 59 35.21 N

MH0031 Denmark male 69 22.34 N

MH0032 Denmark male 69 35.28 N

MH0033 Denmark female 59 31.95 N

www.nature.com / nature 12

Page 10: UC Davis EVE161 Lecture 18 by @phylogenomics


METHODS & RESULTS (mixed)

Page 11: UC Davis EVE161 Lecture 18 by @phylogenomics


Total DNA was extracted from the faecal specimens18 and an average of 4.5 Gb (ranging between 2 and 7.3 Gb) of sequence was generated for each sample, allowing us to capture most of the novelty (see Methods and Supplementary Table 2). In total, we obtained 576.7 Gb of sequence (Supplementary Table 3).



Page 13: UC Davis EVE161 Lecture 18 by @phylogenomics


Table 2 | Summary of Sanger reads. The reads were sequenced by 3730xl. Low-quality sequences at both ends with phred score less than 20 were trimmed. Very short reads with length less than 100 bp were filtered.

Sample ID # Sanger reads Average length (bp) Total length (bp)

MH0006 237,567 660.65 156,949,306

MH0012 230,768 670.26 154,675,458


Page 14: UC Davis EVE161 Lecture 18 by @phylogenomics


Table 3 | Summary of Illumina GA reads. We constructed libraries with three different insert sizes of about 135 bp, 200 bp, and 400 bp. The insert sizes of each library were estimated by re-aligning the paired-end reads on the assembled contigs.

Sample ID  Paired-end insert size (bp)  Read length (bp)  # of reads  Data (Gb)  human reads, %  # of high quality reads
MH0047  136/378  75  35,355,400  2.65  0.18  26,932,064
MH0021  134/354  75  36,454,400  2.73  0.12  26,258,326
MH0079  135/360  75  38,011,600  2.85  0.40  27,418,899
MH0078  146/373  75  38,038,200  2.85  1.56  26,051,537
MH0052  141/367  75  39,538,000  2.97  0.08  28,575,036
MH0049  134/343  75  40,444,200  3.03  0.06  30,654,842
MH0076  134/409  75  40,697,000  3.05  0.42  30,650,106
MH0051  143/374  75  41,911,800  3.14  0.32  25,963,104
MH0048  143/349  75  42,923,600  3.22  0.26  26,972,970
O2.UC-14  141/355  75  43,343,000  3.25  0.06  26,942,750
MH0015  235  44  44,671,400  1.97  0.04  33,014,675
MH0018  233  44  45,081,400  1.98  2.14  36,609,695
MH0027  238  44  45,190,000  1.99  0.09  32,377,390
MH0017  223  44  45,557,200  2.00  0.04  36,154,362
MH0022  256  44  46,415,000  2.04  0.21  37,112,508
MH0023  237  44  48,598,400  2.14  0.04  37,782,998
MH0019  249  44  49,229,400  2.17  0.06  38,856,780
MH0026  156/398  75  49,812,000  3.74  0.05  37,484,066
MH0013  238  44  50,257,200  2.21  1.63  40,028,120
MH0005  237  44  50,704,800  2.23  0.23  39,407,333
MH0007  195  44  50,719,800  2.23  0.31  36,956,284
MH0008  219  44  51,411,000  2.26  0.10  38,156,496
V1.UC-7  141/356  75  51,911,400  3.89  14.67  36,788,540
MH0010  220  44  52,218,200  2.30  0.08  39,169,850
V1.CD-12  148/361  75  53,519,400  4.01  0.02  40,609,134
O2.UC-20  141/362  75  53,637,200  4.02  0.03  38,376,747
V1.CD-15  143/351  75  53,938,600  4.05  2.85  40,560,446
O2.UC-19  133/352  75  54,537,600  4.09  0.01  38,459,550
MH0004  218  44  55,829,800  2.46  0.95  40,288,492
MH0062  144/357  75  57,128,400  4.28  14.32  36,809,224
MH0066  147/429  75  57,234,200  4.29  0.05  36,114,997
O2.UC-21  142/362  75  57,856,000  4.34  0.03  34,832,308
V1.CD-13  139/352  75  58,145,800  4.36  0.04  42,560,831
MH0080  140/376  75  58,220,800  4.37  0.13  46,590,749
V1.UC-13  131/352  75  58,381,400  4.38  9.77  38,553,580
MH0032  142/370  75  58,822,400  3.93  0.39  50,110,067
O2.UC-12  153/384  75  58,927,800  4.42  0.12  36,908,526
MH0001  214  44  59,239,200  2.61  0.06  45,016,612
V1.UC-6  142/376  75  59,270,800  4.45  0.41  43,150,856
MH0060  142/367  75  60,156,000  4.51  0.07  41,112,227
MH0053  137/416  75  60,788,600  4.56  0.07  43,283,564
MH0002  139/370  75  61,077,000  4.58  0.15  46,570,095
O2.UC-11  142/377  75  61,253,800  4.59  0.14  38,507,042
MH0059  136/370  75  61,574,600  4.62  0.18  41,025,606
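As a quick sanity check on Table 3, the "Data (Gb)" column appears to be the number of reads multiplied by the read length. The sketch below recomputes it for two rows copied from the table; the helper name `data_gb` is an assumption, and the relationship is inferred from the numbers rather than stated in the source.

```python
def data_gb(n_reads, read_length_bp):
    """Sequence yield in gigabases: reads x read length / 1e9."""
    return n_reads * read_length_bp / 1e9

# (sample, # of reads, read length in bp, reported Data (Gb)) from Table 3
rows = [
    ("MH0047", 35_355_400, 75, 2.65),
    ("MH0015", 44_671_400, 44, 1.97),
]

for sample, n_reads, length, reported in rows:
    print(sample, round(data_gb(n_reads, length), 2), reported)
```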


Page 15: UC Davis EVE161 Lecture 18 by @phylogenomics


Wanting to generate an extensive catalogue of microbial genes from the human gut, we first assembled the short Illumina reads into longer contigs, which could then be analysed and annotated by standard methods. Using SOAPdenovo19, a de Bruijn graph-based tool specially designed for assembling very short reads, we performed de novo assembly for all of the Illumina GA sequence data. Because a high diversity between individuals is expected8,16,17, we first assembled each sample independently (Supplementary Fig. 3). As much as 42.7% of the Illumina GA reads was assembled into a total of 6.58 million contigs of a length >500 bp, giving a total contig length of 10.3 Gb, with an N50 length of 2.2 kb (Supplementary Fig. 4) and the range of 12.3 to 237.6 Mb (Supplementary Table 4). Almost 35% of reads from any one sample could be mapped to contigs from other samples, indicating the existence of a common sequence core.
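For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the contig length at which the running total, taken from the longest contig down, first reaches half of the total assembly length. The function names and toy lengths are illustrative assumptions; the 500 bp cutoff mirrors the filter described for the contig set.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

def filter_contigs(lengths, min_length=500):
    """Drop contigs shorter than min_length bp before computing statistics."""
    return [length for length in lengths if length >= min_length]

# Toy contig lengths in bp; the 300 bp contig is removed by the filter.
toy_lengths = [300, 500, 600, 800, 1200, 2200, 4700]
kept = filter_contigs(toy_lengths)
print(len(kept), sum(kept), n50(kept))
```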


Page 17: UC Davis EVE161 Lecture 18 by @phylogenomics


Figure 3 | Flowchart of human gut microbiome data analysis process. We performed de novo short reads assembly for each sample independently, then all the unassembled reads were pooled for another round of assembly. ORFs were predicted in each of the contig set, and were merged by removing redundancy. The non-redundant gene set was used in all further analysis.
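The "merged by removing redundancy" step in the flowchart can be illustrated with a small sketch. The thresholds (identity above 95%, overlap above 90%) come from the methods slide; everything else is an assumption made for illustration, including the greedy longest-first strategy, the tuple format of the pairwise hits and the function name `nonredundant`. The source only says that genes were compared pairwise with BLAT.

```python
def nonredundant(gene_lengths, hits, min_identity=0.95, min_overlap=0.90):
    """Greedy longest-first selection of a non-redundant gene set.

    gene_lengths: dict gene -> length in bp (hypothetical input format)
    hits: iterable of (gene_a, gene_b, identity, overlap) pairwise results
    """
    redundant_with = {}
    for a, b, identity, overlap in hits:
        if identity > min_identity and overlap > min_overlap:
            redundant_with.setdefault(a, set()).add(b)
            redundant_with.setdefault(b, set()).add(a)

    kept, kept_set = [], set()
    # Visit longer genes first so the longest member of each group survives.
    for gene in sorted(gene_lengths, key=lambda g: (-gene_lengths[g], g)):
        if redundant_with.get(gene, set()) & kept_set:
            continue  # already represented by a kept gene
        kept.append(gene)
        kept_set.add(gene)
    return kept

toy_lengths = {"g1": 900, "g2": 880, "g3": 450}
toy_hits = [("g1", "g2", 0.98, 0.95),  # passes both thresholds
            ("g1", "g3", 0.97, 0.60)]  # overlap too low, so not redundant
print(nonredundant(toy_lengths, toy_hits))
```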


Page 18: UC Davis EVE161 Lecture 18 by @phylogenomics


Figure 4 | Length distribution of assembled contigs. The number of contigs in different length bins for each individual was computed, and the data from all 124 individuals were pooled. Boxes denote 25% and 75% percentiles, the red line corresponds to the median, and the “whiskers” indicate interquartile range from either or both ends of the box


Page 19: UC Davis EVE161 Lecture 18 by @phylogenomics


Table 4 | Summary of de novo assembly results. Assembled sequences with length below 500 bp were excluded from the contig set.

Sample ID  # of contigs  Contig N50 (bp)  Total length (Mb)  % reads assembled  Unassembled reads (Gb)
MH0001  14,301  1,618  19.69  46.34  1.06
MH0002  65,392  1,680  88.77  45.31  1.91
MH0003  68,658  2,640  119.59  54.40  1.72
MH0004  23,793  1,681  31.92  41.54  1.05
MH0005  14,339  1,684  19.62  40.22  1.04
MH0006  144,440  2,025  217.77  52.39  5.24
MH0007  28,108  1,270  32.00  29.15  1.16
MH0008  26,506  1,768  37.24  43.53  0.95
MH0009  70,014  2,440  112.96  44.14  2.45
MH0010  25,674  1,815  36.52  48.77  0.88
MH0011  86,201  2,158  134.25  46.09  2.37
MH0012  140,991  2,478  237.58  42.77  7.99
MH0013  20,495  2,332  32.20  41.22  1.05
MH0014  66,724  2,957  120.54  50.90  2.08
MH0015  25,933  1,645  34.46  35.53  0.94
MH0016  64,124  2,915  114.03  53.89  1.88
MH0017  24,948  1,679  34.06  39.57  0.96
MH0018  13,247  1,619  17.73  35.23  1.07
MH0019  28,786  1,977  41.95  46.76  0.91
MH0020  44,930  4,708  98.78  56.81  1.49
MH0021  54,101  1,608  70.67  46.49  1.06
MH0022  21,872  1,773  30.00  35.04  1.06
MH0023  16,214  2,100  25.57  35.80  1.07
MH0024  43,145  1,512  54.45  33.21  2.41
MH0025  76,287  1,968  111.48  42.24  2.11
MH0026  33,408  3,769  69.38  43.97  1.58
MH0027  20,369  985  19.72  19.96  1.14
MH0028  61,004  2,630  104.54  49.65  1.80
MH0030  39,267  2,828  66.60  36.73  2.15
MH0031  53,292  1,878  75.84  36.41  2.33
MH0032  37,287  1,921  54.46  20.72  2.60
MH0033  61,782  2,616  102.04  49.87  1.69
MH0034  24,508  2,107  37.63  14.53  2.40
MH0035  68,287  2,075  102.94  48.22  1.93
MH0036  58,690  2,330  94.35  49.44  1.81
MH0037  48,356  2,526  80.44  48.21  1.63
MH0038  50,381  2,921  90.35  47.75  1.79
MH0039  66,509  2,087  104.17  47.66  1.68
MH0040  73,068  2,225  115.15  49.53  1.68


Page 20: UC Davis EVE161 Lecture 18 by @phylogenomics


To assess the quality of the Illumina GA-based assembly we mapped the contigs of samples MH0006 and MH0012 to the Sanger reads from the same samples (Supplementary Table 2). A total of 98.7% of the contigs that map to at least one Sanger read were collinear over 99.6% of the mapped regions. This is comparable to the contigs that were generated by 454 sequencing for one of the two samples (MH0006) as a control, of which 97.9% were collinear over 99.5% of the mapped regions. We estimate assembly errors to be 14.2 and 20.7 per megabase (Mb) of Illumina- and 454-based contigs, respectively (see Methods and Supplementary Fig. 5), indicating that the short- and long-read-based assemblies have comparable accuracies.


Page 23: UC Davis EVE161 Lecture 18 by @phylogenomics


Figure 5 | Validating Illumina contigs using Sanger reads. Illumina/454 contigs from samples MH0006 and MH0012 were mapped to Sanger reads from the same samples. Aligned regions were scanned for breakage of collinearity, and each unique break is counted as an error. a. number of errors per Mb of Illumina/454 contigs mapped to Sanger reads. b. percentage of collinear Illumina/454 contigs and collinear basepairs in those contigs.
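One way to picture "breakage of collinearity" is sketched below. The caption does not give the exact procedure, so this is a guess at the general idea only: the block format (contig start, read start, block length), the offset-based test, the 10 bp tolerance and both function names are assumptions.

```python
def count_breaks(blocks, tolerance=10):
    """Count collinearity breaks between consecutive alignment blocks.

    blocks: list of (contig_start, read_start, length) on the same strand.
    A break is counted when the contig-to-read offset shifts by more than
    `tolerance` bp from one block to the next (assumed definition).
    """
    ordered = sorted(blocks)
    breaks = 0
    for (c1, r1, _), (c2, r2, _) in zip(ordered, ordered[1:]):
        if abs((r2 - c2) - (r1 - c1)) > tolerance:
            breaks += 1
    return breaks

def errors_per_mb(total_breaks, mapped_bp):
    """Normalise a break count to errors per megabase of mapped contig."""
    return total_breaks * 1_000_000 / mapped_bp

# Toy example: the third block jumps to a distant read position.
toy_blocks = [(0, 100, 200), (200, 300, 150), (350, 900, 100)]
print(count_breaks(toy_blocks), errors_per_mb(1, 500_000))
```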


Page 24: UC Davis EVE161 Lecture 18 by @phylogenomics


To complete the contig set we pooled the unassembled reads from all 124 samples, and repeated the de novo assembly process. About 0.4 million additional contigs were thus generated, having a length of 370 Mb and an N50 length of 939 bp. The total length of our final contig set was thus 10.7 Gb. Some 80% of the 576.7 Gb of Illumina GA sequence could be aligned to the contigs at a threshold of 90% identity, allowing for accommodation of sequencing errors and strain variability in the gut (Fig. 1), almost twice the 42.7% of sequence that was assembled into contigs by SOAPdenovo, because assembly uses more stringent criteria. This indicates that a vast majority of the Illumina sequence is represented by our contigs.
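The two numbers in this passage can be reproduced with simple arithmetic. The sketch below first checks that 10.3 Gb plus 370 Mb rounds to the quoted 10.7 Gb, then shows how a "fraction of reads aligned at 90% identity" figure could be computed from per-read best-hit identities. The input list and the function name `fraction_aligned` are hypothetical; the source does not describe the aligner output format.

```python
def fraction_aligned(best_identities, threshold=0.90):
    """Share of reads whose best alignment meets the identity threshold.

    best_identities: one value per read, None when the read has no hit
    (hypothetical input format).
    """
    aligned = sum(1 for identity in best_identities
                  if identity is not None and identity >= threshold)
    return aligned / len(best_identities)

# Per-sample contigs (10.3 Gb) plus pooled-read contigs (370 Mb = 0.37 Gb).
total_contig_gb = round(10.3 + 0.37, 1)
print(total_contig_gb)

# Toy reads: three of five reach the 90% identity threshold.
toy_identities = [0.99, 0.92, 0.85, None, 0.90]
print(fraction_aligned(toy_identities))
```

Lowering the threshold from an assembler's stricter overlap criteria to 90% identity is what lets reads from sequencing errors and strain variants count as represented, which is consistent with the gap between 42.7% assembled and about 80% aligned.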


Page 26: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

novelty (see Methods and Supplementary Table 2). In total, weobtained 576.7 Gb of sequence (Supplementary Table 3).

Wanting to generate an extensive catalogue of microbial genes fromthe human gut, we first assembled the short Illumina reads into longercontigs, which could then be analysed and annotated by standardmethods. Using SOAPdenovo19, a de Bruijn graph-based tool speciallydesigned for assembling very short reads, we performed de novoassembly for all of the Illumina GA sequence data. Because a highdiversity between individuals is expected8,16,17, we first assembled eachsample independently (Supplementary Fig. 3). As much as 42.7% ofthe Illumina GA reads was assembled into a total of 6.58 millioncontigs of a length .500 bp, giving a total contig length of 10.3 Gb,with an N50 length of 2.2 kb (Supplementary Fig. 4) and the range of12.3 to 237.6 Mb (Supplementary Table 4). Almost 35% of reads fromany one sample could be mapped to contigs from other samples,indicating the existence of a common sequence core.

To assess the quality of the Illumina GA-based assembly we mapped the contigs of samples MH0006 and MH0012 to the Sanger reads from the same samples (Supplementary Table 2). A total of 98.7% of the contigs that map to at least one Sanger read were collinear over 99.6% of the mapped regions. This is comparable to the contigs that were generated by 454 sequencing for one of the two samples (MH0006) as a control, of which 97.9% were collinear over 99.5% of the mapped regions. We estimate assembly errors to be 14.2 and 20.7 per megabase (Mb) of Illumina- and 454-based contigs, respectively (see Methods and Supplementary Fig. 5), indicating that the short- and long-read-based assemblies have comparable accuracies.

To complete the contig set we pooled the unassembled reads from all 124 samples, and repeated the de novo assembly process. About 0.4 million additional contigs were thus generated, having a length of 370 Mb and an N50 length of 939 bp. The total length of our final contig set was thus 10.7 Gb. Some 80% of the 576.7 Gb of Illumina GA sequence could be aligned to the contigs at a threshold of 90% identity, allowing for accommodation of sequencing errors and strain variability in the gut (Fig. 1), almost twice the 42.7% of sequence that was assembled into contigs by SOAPdenovo, because assembly uses more stringent criteria. This indicates that a vast majority of the Illumina sequence is represented by our contigs.

To compare the representation of the human gut microbiome in our contigs with that from previous work, we aligned them to the reads from the two largest published gut metagenome studies (1.83 Gb of Roche/454 sequencing reads from 18 US adults (ref. 8), and 0.79 Gb of Sanger reads from 13 Japanese adults and infants (ref. 17)), using the 90% identity threshold. A total of 70.1% and 85.9% of the reads from the Japanese and US samples, respectively, could be aligned to our contigs (Fig. 1), showing that the contigs include a high fraction of sequences from previous studies. In contrast, 85.7% and 69.5% of our contigs were not covered by the reads from the Japanese and US samples, respectively, highlighting the novelty we captured.

Only 31.0–48.8% of the reads from the two previous studies and the present study could be aligned to 194 public human gut bacterial genomes (Supplementary Table 5), and 7.6–21.2% to the bacterial genomes deposited in GenBank (Fig. 1). This indicates that the reference gene set obtained by sequencing genomes of isolated bacterial strains is still of a limited scale.

A gene catalogue of the human gut microbiome

To establish a non-redundant human gut microbiome gene set we first used the MetaGene (ref. 20) program to predict ORFs in our contigs and found 14,048,045 ORFs longer than 100 bp (Supplementary Table 6). They occupied 86.7% of the contigs, comparable to the value found for fully sequenced genomes (~86%). Two-thirds of the ORFs appeared incomplete, possibly due to the size of our contigs (N50 of 2.2 kb). We next removed the redundant ORFs, by pair-wise comparison, using a very stringent criterion of 95% identity over 90% of the shorter ORF length, which can fuse orthologues but avoids inflation of the data set due to possible sequencing errors (see Methods). Yet, the final non-redundant gene set contained as many as 3,299,822 ORFs with an average length of 704 bp (Supplementary Table 7).

We term the genes of the non-redundant set ‘prevalent genes’, as they are encoded on contigs assembled from the most abundant reads (see Methods). The minimal relative abundance of the prevalent genes was ~6 × 10^-7, as estimated from the minimum sequence coverage of the unique genes (close to 3), and the total Illumina sequence length generated for each individual (on average, 4.5 Gb), assuming the average gene length of 0.85 kb (that is, 3 × 0.85 × 10^3 / 4.5 × 10^9).

We mapped the 3.3 million gut ORFs to the 319,812 genes (target genes) of the 89 frequent reference microbial genomes in the human gut. At a 90% identity threshold, 80% of the target genes had at least 80% of their length covered by a single gut ORF (Fig. 2b). This indicates that the gene set includes most of the known human gut bacterial genes.

We examined the number of prevalent genes identified across all individuals as a function of the extent of sequencing, demanding at least two supporting reads for a gene call (Fig. 2a). The incidence-based coverage richness estimator (ICE), determined at 100 individuals (the highest number the EstimateS (ref. 21) program could accommodate), indicates that our catalogue captures 85.3% of the prevalent genes. Although this is probably an underestimate, it nevertheless indicates that the catalogue contains an overwhelming majority of the prevalent genes of the cohort.

Each individual carried 536,112 ± 12,167 (mean ± s.e.m.) prevalent genes (Supplementary Fig. 6b), indicating that most of the 3.3 million gene pool must be shared. However, most of the prevalent genes were found in only a few individuals: 2,375,655 were present in less than 20%, whereas 294,110 were found in at least 50% of individuals (we term these ‘common’ genes). These values depend on the sampling depth; sequencing of MH0006 and MH0012 revealed more of the catalogue genes, present at a low abundance (Supplementary Fig. 7). Nevertheless, even at our routine sampling depth, each individual harboured 204,056 ± 3,603 (mean ± s.e.m.) common genes, indicating that about 38% of an individual’s total gene pool is shared. Interestingly, the IBD patients harboured, on average, 25% fewer genes than the individuals not suffering from IBD (Supplementary Fig. 8), consistent with the observation that the former have lower bacterial diversity than the latter (ref. 22).
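The percentages in this paragraph follow directly from the counts it quotes. A small sketch of the arithmetic, using only numbers stated in the text (the variable names are chosen here for illustration):

```python
# Counts quoted in the paragraph above.
catalogue_size = 3_299_822            # non-redundant (prevalent) genes
rare_genes = 2_375_655                # present in < 20% of individuals
common_genes = 294_110                # present in >= 50% of individuals
prevalent_per_individual = 536_112    # mean per individual
common_per_individual = 204_056       # mean per individual

# Share of one individual's genes that belong to the common set.
shared_fraction = common_per_individual / prevalent_per_individual
# Shares of the whole catalogue that are rare or common.
rare_fraction = rare_genes / catalogue_size
common_fraction = common_genes / catalogue_size

print(f"shared per individual: {shared_fraction:.1%}")
print(f"rare in catalogue:     {rare_fraction:.1%}")
print(f"common in catalogue:   {common_fraction:.1%}")
```

The first ratio is the "about 38%" in the text; the other two show that roughly 72% of the catalogue is rare while under 9% is common.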

Common bacterial core

Deep metagenomic sequencing provides the opportunity to explore the existence of a common set of microbial species (common core) in

Figure 1 | Coverage of human gut microbiome. The three human microbial sequencing read sets, namely Illumina GA reads generated from 124 individuals in this study (black; n = 124), Roche/454 reads from 18 human twins and their mothers (grey; n = 18) and Sanger reads from 13 Japanese individuals (white; n = 13), were aligned to each of the reference sequence sets. Mean values ± s.e.m. are plotted. [Bar chart. y-axis: Coverage of sequencing reads (%), 0 to 100. x-axis: Assembled contig set; Known human gut bacteria; GenBank bacteria.]

ARTICLES NATURE | Vol 464 | 4 March 2010

60 ©2010 Macmillan Publishers Limited. All rights reserved

Page 27: UC Davis EVE161 Lecture 18 by @phylogenomics

To compare the representation of the human gut microbiome in our contigs with that from previous work, we aligned them to the reads from the two largest published gut metagenome studies (1.83 Gb of Roche/454 sequencing reads from 18 US adults (ref. 8), and 0.79 Gb of Sanger reads from 13 Japanese adults and infants (ref. 17)), using the 90% identity threshold. A total of 70.1% and 85.9% of the reads from the Japanese and US samples, respectively, could be aligned to our contigs (Fig. 1), showing that the contigs include a high fraction of sequences from previous studies. In contrast, 85.7% and 69.5% of our contigs were not covered by the reads from the Japanese and US samples, respectively, highlighting the novelty we captured.



Page 30: UC Davis EVE161 Lecture 18 by @phylogenomics

Only 31.0–48.8% of the reads from the two previous studies and the present study could be aligned to 194 public human gut bacterial genomes (Supplementary Table 5), and 7.6–21.2% to the bacterial genomes deposited in GenBank (Fig. 1). This indicates that the reference gene set obtained by sequencing genomes of isolated bacterial strains is still of a limited scale.



Page 33: UC Davis EVE161 Lecture 18 by @phylogenomics

A gene catalogue of the human gut microbiome

To establish a non-redundant human gut microbiome gene set we first used the MetaGene (ref. 20) program to predict ORFs in our contigs and found 14,048,045 ORFs longer than 100 bp (Supplementary Table 6). They occupied 86.7% of the contigs, comparable to the value found for fully sequenced genomes (~86%). Two-thirds of the ORFs appeared incomplete, possibly due to the size of our contigs (N50 of 2.2 kb). We next removed the redundant ORFs, by pair-wise comparison, using a very stringent criterion of 95% identity over 90% of the shorter ORF length, which can fuse orthologues but avoids inflation of the data set due to possible sequencing errors (see Methods). Yet, the final non-redundant gene set contained as many as 3,299,822 ORFs with an average length of 704 bp (Supplementary Table 7).
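The redundancy criterion (95% identity over 90% of the shorter ORF) can be written as a simple predicate. The sketch below pairs it with a greedy, longest-first filter; that filtering strategy, the function names and the input layout are assumptions for illustration, since the excerpt does not describe how the pair-wise comparison was organised.

```python
def is_redundant(identity_pct, aligned_len, len_a, len_b,
                 min_identity=95.0, min_cover=0.90):
    """True if an alignment meets the stated criterion: identity of at least
    95% over at least 90% of the shorter of the two ORFs."""
    shorter = min(len_a, len_b)
    return identity_pct >= min_identity and aligned_len >= min_cover * shorter


def non_redundant(orf_lengths, alignments):
    """Greedy filter (an assumed strategy): visit ORFs from longest to
    shortest and keep one unless it is redundant with an ORF already kept.

    orf_lengths: dict orf id -> length in bp
    alignments: dict frozenset({id_a, id_b}) -> (identity_pct, aligned_len)
    """
    kept = []
    for orf in sorted(orf_lengths, key=orf_lengths.get, reverse=True):
        redundant = False
        for rep in kept:
            hit = alignments.get(frozenset((orf, rep)))
            if hit and is_redundant(hit[0], hit[1],
                                    orf_lengths[orf], orf_lengths[rep]):
                redundant = True
                break
        if not redundant:
            kept.append(orf)
    return kept


# Toy example: "b" is covered by "a" (560 of 600 bp at 97%), "c" is not
# (only 500 of 700 bp aligned), so "a" and "c" survive.
toy_lengths = {"a": 900, "b": 600, "c": 700}
toy_alignments = {
    frozenset(("a", "b")): (97.0, 560),
    frozenset(("a", "c")): (96.0, 500),
}
print(non_redundant(toy_lengths, toy_alignments))
```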


Page 35: UC Davis EVE161 Lecture 18 by @phylogenomics

Table 7 | Non-redundant genes. Genes were compared at a 95% identity cut-off. Those that overlapped over 90% of their length were considered redundant and removed. Common and rare genes were present in >50% and <20% of individuals, respectively.

                         # of genes   Total length (bp)   Mean length (bp)
Non-redundant gene set   3,299,822    2,323,171,095       704.03
Common                   294,110      292,960,308         996.09
Rare                     2,375,655    1,510,527,924       635.84
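As a consistency check on Table 7, the mean length in each row should equal the total length divided by the number of genes. A short sketch using only the values in the table:

```python
# Rows of Table 7: (number of genes, total length in bp, reported mean in bp).
table7 = {
    "Non-redundant gene set": (3_299_822, 2_323_171_095, 704.03),
    "Common": (294_110, 292_960_308, 996.09),
    "Rare": (2_375_655, 1_510_527_924, 635.84),
}


def mean_length(n_genes, total_bp):
    """Mean gene length in bp."""
    return total_bp / n_genes


for name, (n_genes, total_bp, reported) in table7.items():
    computed = mean_length(n_genes, total_bp)
    print(f"{name}: computed {computed:.2f} bp, reported {reported:.2f} bp")
```

The rows agree to two decimal places, and they also show that common genes are on average longer than rare ones.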

www.nature.com / nature 31

Page 36: UC Davis EVE161 Lecture 18 by @phylogenomics

We term the genes of the non-redundant set ‘prevalent genes’, as they are encoded on contigs assembled from the most abundant reads (see Methods). The minimal relative abundance of the prevalent genes was ~6 × 10^-7, as estimated from the minimum sequence coverage of the unique genes (close to 3), and the total Illumina sequence length generated for each individual (on average, 4.5 Gb), assuming the average gene length of 0.85 kb (that is, 3 × 0.85 × 10^3 / 4.5 × 10^9).
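The parenthetical estimate at the end of this paragraph is easy to reproduce: a gene seen at about 3-fold coverage contributes roughly 3 × 0.85 kb of sequence out of the 4.5 Gb generated per individual. A sketch with the three quantities quoted in the text (the variable names are illustrative):

```python
min_coverage = 3               # minimum sequence coverage of a unique gene
mean_gene_length_bp = 0.85e3   # assumed average gene length, 0.85 kb
bases_per_individual = 4.5e9   # average Illumina sequence per individual

# Sequence attributable to one minimally covered gene, as a share of all
# sequence generated for one individual.
min_relative_abundance = (min_coverage * mean_gene_length_bp
                          / bases_per_individual)
print(f"{min_relative_abundance:.1e}")  # 5.7e-07
```

The result, about 5.7 × 10^-7, rounds to the ~6 × 10^-7 quoted in the text.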


Page 38: UC Davis EVE161 Lecture 18 by @phylogenomics

We mapped the 3.3 million gut ORFs to the 319,812 genes (target genes) of the 89 frequent reference microbial genomes in the human gut. At a 90% identity threshold, 80% of the target genes had at least 80% of their length covered by a single gut ORF (Fig. 2b). This indicates that the gene set includes most of the known human gut bacterial genes.


Page 40: UC Davis EVE161 Lecture 18 by @phylogenomics

We examined the number of prevalent genes identified across all individuals as a function of the extent of sequencing, demanding at least two supporting reads for a gene call (Fig. 2a). The incidence-based coverage richness estimator (ICE), determined at 100 individuals (the highest number the EstimateS (ref. 21) program could accommodate), indicates that our catalogue captures 85.3% of the prevalent genes. Although this is probably an underestimate, it nevertheless indicates that the catalogue contains an overwhelming majority of the prevalent genes of the cohort.


Page 42: UC Davis EVE161 Lecture 18 by @phylogenomics

the cohort. For this purpose, we used a non-redundant set of 650 sequenced bacterial and archaeal genomes (see Methods). We aligned the Illumina GA reads of each human gut microbial sample onto the genome set, using a 90% identity threshold, and determined the proportion of the genomes covered by the reads that aligned onto only a single position in the set. At a 1% coverage, which for a typical gut bacterial genome corresponds to an average length of about 40 kb, some 25-fold more than that of the 16S gene generally used for species identification, we detected 18 species in all individuals, 57 in ≥90% and 75 in ≥50% of individuals (Supplementary Table 8). At 10% coverage, requiring ~10-fold higher abundance in a sample, we still found 13 of the above species in ≥90% of individuals and 35 in ≥50%.
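The species counts here come from two thresholds applied in sequence: a genome counts as present in an individual if its coverage reaches a cut-off (1% or 10%), and a species is reported if it is present in a given share of individuals. A minimal sketch of that counting on toy data; the input layout and function name are assumptions for illustration.

```python
def species_by_prevalence(coverage, min_coverage=0.01, min_prevalence=0.9):
    """Return species present in at least `min_prevalence` of individuals.

    coverage: dict species -> list of genome-coverage fractions, one entry
    per individual (hypothetical layout). A species is 'present' in an
    individual when its genome coverage is >= min_coverage.
    """
    selected = []
    for species, per_individual in coverage.items():
        present = sum(1 for c in per_individual if c >= min_coverage)
        if present / len(per_individual) >= min_prevalence:
            selected.append(species)
    return sorted(selected)


# Toy cohort of four individuals (values are invented).
toy_coverage = {
    "species_A": [0.30, 0.12, 0.05, 0.02],
    "species_B": [0.20, 0.00, 0.004, 0.03],
    "species_C": [0.00, 0.00, 0.02, 0.00],
}
print(species_by_prevalence(toy_coverage, min_prevalence=1.0))
print(species_by_prevalence(toy_coverage, min_prevalence=0.5))
```

Raising `min_coverage` to 0.10 mimics the stricter 10% analysis and, as in the text, shrinks the list.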

When the cumulated sequence length increased from 3.96 Gb to 8.74 Gb and from 4.41 Gb to 11.6 Gb, for samples MH0006 and MH0012, respectively, the number of strains common to the two at the 1% coverage threshold increased by 25%, from 135 to 169. This indicates the existence of a significantly larger common core than the one we could observe at the sequence depth routinely used for each individual.

The variability of abundance of microbial species in individuals can greatly affect identification of the common core. To visualize this variability, we compared the number of sequencing reads aligned to different genomes across the individuals of our cohort. Even for the most common 57 species present in ≥90% of individuals with genome coverage >1% (Supplementary Table 8), the inter-individual variability was between 12- and 2,187-fold (Fig. 3). As expected (refs 10, 23), Bacteroidetes and Firmicutes had the highest abundance.

A complex pattern of species relatedness, characterized by clustersat the genus and family levels, emerges from the analysis of the net-work based on the pair-wise Pearson correlation coefficients of 155species present in at least one individual at $1% coverage(Supplementary Fig. 9). Prominent clusters include some of the mostabundant gut species, such as members of the Bacteroidetes andDorea/Eubacterium/Ruminococcus groups and also bifidobacteria,Proteobacteria and streptococci/lactobacilli groups. These observa-tions indicate that similar constellations of bacteria may be present indifferent individuals of our cohort, for reasons that remain to beestablished.

The above result indicates that the Illumina-based bacterial pro-filing should reveal differences between the healthy individuals andpatients. To test this hypothesis we compared the IBD patients andhealthy controls (Supplementary Table 1), as it was previouslyreported that the two have different microbiota22. The principal com-ponent analysis, based on the same 155 species, clearly separatespatients from healthy individuals and the ulcerative colitis fromthe Crohn’s disease patients (Fig. 4), confirming our hypothesis.

Functions encoded by the prevalent gene set

We classified the predicted genes by aligning them to the integratedNCBI-NR database of non-redundant protein sequences, the genes inthe KEGG (Kyoto Encyclopedia of Genes and Genomes)24 pathways,and COG (Clusters of Orthologous Groups)25 and eggNOG26 data-bases. There were 77.1% genes classified into phylotypes, 57.5% toeggNOG clusters, 47.0% to KEGG orthology and 18.7% genesassigned to KEGG pathways, respectively (Supplementary Table 9).

[Figure 3 plot: box-and-whisker plot of relative abundance (log10; axis ticks –4 to –1) for each genome. Genomes labelled on the plot: Blautia hansenii, Clostridium scindens, Enterococcus faecalis TX0104, Clostridium asparagiforme, Bacteroides fragilis 3_1_12, Bacteroides intestinalis, Ruminococcus gnavus, Anaerotruncus colihominis, Bacteroides pectinophilus, Clostridium nexile, Clostridium sp. L2−50, Parabacteroides johnsonii, Bacteroides finegoldii, Butyrivibrio crossotus, Bacteroides eggerthii, Clostridium sp. M62 1, Coprococcus eutactus, Bacteroides stercoris, Holdemania filiformis, Clostridium leptum, Streptococcus thermophilus LMD−9, Bacteroides capillosus, Subdoligranulum variabile, Ruminococcus obeum A2−162, Bacteroides dorei, Eubacterium ventriosum, Bacteroides sp. D4, Bacteroides sp. D1, Coprococcus comes SL7 1, Bacteroides xylanisolvens XB1A, Eubacterium rectale M104 1, Bacteroides sp. 2_2_4, Bacteroides sp. 4_3_47FAA, Bacteroides ovatus, Bacteroides sp. 9_1_42FAA, Parabacteroides distasonis ATCC 8503, Eubacterium siraeum 70 3, Bacteroides sp. 2_1_7, Roseburia intestinalis M50 1, Bacteroides vulgatus ATCC 8482, Dorea formicigenerans, Collinsella aerofaciens, Ruminococcus lactaris, Faecalibacterium prausnitzii SL3 3, Ruminococcus sp. SR1 5, Unknown sp. SS3 4, Ruminococcus torques L2−14, Eubacterium hallii, Bacteroides thetaiotaomicron VPI−5482, Clostridium sp. SS2−1, Bacteroides caccae, Ruminococcus bromii L2−63, Dorea longicatena, Parabacteroides merdae, Alistipes putredinis, Bacteroides uniformis.]

Figure 3 | Relative abundance of 57 frequent microbial genomes among individuals of the cohort. See Fig. 2c for definition of box and whisker plot. See Methods for computation.

[Figure 2 plots, panels a, b and c. Axis labels: Number of individuals sampled; Number of non-redundant genes (×10^6); Fraction of gene length covered; Number of target genes covered; Fraction of target genes covered; Number of samples; Number of orthologous groups/gene families (×10^3). Curve labels: OGs + novel gene families; Known + unknown OGs; Known OGs. Other labels: 85%, 90%, 95%.]

Figure 2 | Predicted ORFs in the human gut microbiome. a, Number of unique genes as a function of the extent of sequencing. The gene accumulation curve corresponds to the Sobs (Mao Tau) values (number of observed genes), calculated using EstimateS21 (version 8.2.0) on randomly chosen 100 samples (due to memory limitation). b, Coverage of genes from 89 frequent gut microbial species (Supplementary Table 12). c, Number of functions captured by number of samples investigated, based on known (well characterized) orthologous groups (OGs; bottom), known plus unknown orthologous groups (including, for example, putative, predicted, conserved hypothetical functions; middle) and orthologous groups plus novel gene families (>20 proteins) recovered from the metagenome (top). Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Circles denote outliers beyond the whiskers.

NATURE | Vol 464 | 4 March 2010 ARTICLES

61 | ©2010 Macmillan Publishers Limited. All rights reserved

Page 44: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Each individual carried 536,112 ± 12,167 (mean ± s.e.m.) prevalent genes (Supplementary Fig. 6b), indicating that most of the 3.3 million gene pool must be shared. However, most of the prevalent genes were found in only a few individuals: 2,375,655 were present in less than 20%, whereas 294,110 were found in at least 50% of individuals (we term these ‘common’ genes). These values depend on the sampling depth; sequencing of MH0006 and MH0012 revealed more of the catalogue genes, present at a low abundance (Supplementary Fig. 7). Nevertheless, even at our routine sampling depth, each individual harboured 204,056 ± 3,603 (mean ± s.e.m.) common genes, indicating that about 38% of an individual’s total gene pool is shared. Interestingly, the IBD patients harboured, on average, 25% fewer genes than the individuals not suffering from IBD (Supplementary Fig. 8), consistent with the observation that the former have lower bacterial diversity than the latter22.
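
The proportions in this passage can be re-derived from the quoted counts. The short Python sketch below is only an arithmetic check, not part of the paper's pipeline; the variable names are made up for illustration, and the catalogue size uses the rounded "3.3 million" from the text.

```python
# Numbers quoted in the passage above.
CATALOGUE_SIZE = 3_300_000          # "3.3 million" gene pool (rounded in the text)
PREVALENT_PER_INDIVIDUAL = 536_112  # mean prevalent genes per individual
COMMON_PER_INDIVIDUAL = 204_056     # mean 'common' genes per individual
RARE_GENES = 2_375_655              # genes present in fewer than 20% of individuals
COMMON_GENES = 294_110              # genes present in at least 50% of individuals

# Share of one individual's prevalent genes that are 'common' genes.
shared_fraction = COMMON_PER_INDIVIDUAL / PREVALENT_PER_INDIVIDUAL
# Share of the whole catalogue that a single individual carries.
carried_fraction = PREVALENT_PER_INDIVIDUAL / CATALOGUE_SIZE
# Shares of the catalogue that are rare or common across the cohort.
rare_fraction = RARE_GENES / CATALOGUE_SIZE
common_fraction = COMMON_GENES / CATALOGUE_SIZE

print(f"common genes within one individual: {shared_fraction:.1%}")
print(f"catalogue carried by one individual: {carried_fraction:.1%}")
print(f"catalogue genes in <20% of people:   {rare_fraction:.1%}")
print(f"catalogue genes in >=50% of people:  {common_fraction:.1%}")
```

The first ratio comes out at about 38%, matching the figure quoted in the text.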

Page 46: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Figure 7 | Number of unique genes identified with increase of sequencing depth in sample MH0006 and MH0012.

www.nature.com / nature 7

Page 47: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Figure 8 | Distribution of nonredundant bacterial genes in IBD patients and healthy controls. The proportion of individuals having a given number of genes (classes of 100 thousand genes were used) is shown. The average gene number for IBD patients and individuals not suffering from IBD was 425,397 ± 126,685 (s.d.; n=25) and 564,070 ± 121,962 (s.d.; n=99), respectively; p < 10^-6 (one-tailed Student t test).
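
As a rough check on this caption, the t statistic can be recomputed from the summary statistics alone. The sketch below assumes the equal-variance (pooled) form of Student's test, which the caption does not specify; the function name `pooled_t` is hypothetical. It reports the statistic and degrees of freedom but does not compute a p-value.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample Student t statistic (pooled variance) from summary statistics.

    Returns (t, degrees_of_freedom); t is positive when group b has the larger mean.
    """
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / standard_error, dof

# Values from the caption: IBD patients (n=25) vs individuals without IBD (n=99).
t_stat, dof = pooled_t(425_397, 126_685, 25, 564_070, 121_962, 99)

# Relative gene deficit in IBD patients, which the main text rounds to 25%.
relative_deficit = 1 - 425_397 / 564_070

print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
print(f"IBD patients carry {relative_deficit:.1%} fewer genes on average")
```

A t statistic of about 5 on 122 degrees of freedom is in line with the very small one-tailed p-value reported.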

Page 48: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Common bacterial core

Deep metagenomic sequencing provides the opportunity to explore the existence of a common set of microbial species (common core) in the cohort. For this purpose, we used a non-redundant set of 650 sequenced bacterial and archaeal genomes (see Methods). We aligned the Illumina GA reads of each human gut microbial sample onto the genome set, using a 90% identity threshold, and determined the proportion of the genomes covered by the reads that aligned onto only a single position in the set. At a 1% coverage, which for a typical gut bacterial genome corresponds to an average length of about 40 kb, some 25-fold more than that of the 16S gene generally used for species identification, we detected 18 species in all individuals, 57 in ≥90% and 75 in ≥50% of individuals (Supplementary Table 8). At 10% coverage, requiring ~10-fold higher abundance in a sample, we still found 13 of the above species in ≥90% of individuals and 35 in ≥50%.
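
The detection rule described here (a species counts as present when uniquely aligned reads cover at least 1% of its genome) can be sketched in Python as follows. This is an illustration only: the interval representation, the function names, the 4 Mb genome length (inferred from "1% is about 40 kb") and the 75 bp read length are all assumptions, and the alignment step itself is not shown.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of a genome covered by at least one aligned read.

    intervals: iterable of (start, end) half-open positions of uniquely
    aligned reads; overlapping or touching reads are counted only once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length

def species_detected(intervals, genome_length, threshold=0.01):
    """Apply the coverage threshold (1% by default, as in the text)."""
    return covered_fraction(intervals, genome_length) >= threshold

GENOME_LENGTH = 4_000_000  # assumed: 1% of 4 Mb is the ~40 kb quoted above
# Hypothetical uniquely aligned reads: one 75 bp read every 1,000 bp over 600 kb.
reads = [(start, start + 75) for start in range(0, 600_000, 1_000)]
breadth = covered_fraction(reads, GENOME_LENGTH)
print(f"breadth of coverage: {breadth:.3%}, detected: {species_detected(reads, GENOME_LENGTH)}")
```

The same 40 kb divided by 25 gives about 1.6 kb, which is roughly the length of a 16S rRNA gene, so the "25-fold" comparison in the text is self-consistent.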

Page 50: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

When the cumulated sequence length increased from 3.96 Gb to 8.74 Gb and from 4.41 Gb to 11.6 Gb, for samples MH0006 and MH0012, respectively, the number of strains common to the two at the 1% coverage threshold increased by 25%, from 135 to 169. This indicates the existence of a significantly larger common core than the one we could observe at the sequence depth routinely used for each individual.

Page 52: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

The variability of abundance of microbial species in individuals can greatly affect identification of the common core. To visualize this variability, we compared the number of sequencing reads aligned to different genomes across the individuals of our cohort. Even for the most common 57 species present in ≥90% of individuals with genome coverage >1% (Supplementary Table 8), the inter-individual variability was between 12- and 2,187-fold (Fig. 3). As expected10,23, Bacteroidetes and Firmicutes had the highest abundance.

Page 55: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

A complex pattern of species relatedness, characterized by clusters at the genus and family levels, emerges from the analysis of the network based on the pair-wise Pearson correlation coefficients of 155 species present in at least one individual at ≥1% coverage (Supplementary Fig. 9). Prominent clusters include some of the most abundant gut species, such as members of the Bacteroidetes and Dorea/Eubacterium/Ruminococcus groups and also bifidobacteria, Proteobacteria and streptococci/lactobacilli groups. These observations indicate that similar constellations of bacteria may be present in different individuals of our cohort, for reasons that remain to be established.

Page 57: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Figure 9 | Relations between the most abundant bacterial species. The network was deduced from the analysis of 155 bacterial species present in at least 1 individual at a genome coverage of ≥1%. Size of the nodes (circles) indicates species abundance over the cohort, width of the edges (lines connecting the circles) indicates the value of the Pearson correlation coefficient (only the 342 values above 0.4 or below -0.4 out of a total of 11,935 were used for the network).
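
The network construction in this caption (all species pairs, Pearson correlation, keep edges with |r| above 0.4) can be sketched as below. The abundance values and species names are invented toy data, the function names are hypothetical, and whether the original analysis transformed the abundances first is not stated in the caption, so raw values are assumed here. The total of 11,935 is simply the number of unordered pairs among 155 species.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_edges(abundance, cutoff=0.4):
    """Return (species_a, species_b, r) for every pair with |r| > cutoff."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if abs(r) > cutoff:
            edges.append((a, b, r))
    return edges

# Number of unordered species pairs for the 155 species in the caption.
n_pairs = math.comb(155, 2)

# Toy abundance profiles across five individuals (hypothetical values).
abundance = {
    "sp_A": [1, 2, 3, 4, 5],
    "sp_B": [2, 4, 6, 8, 10.5],
    "sp_C": [5, 3, 4, 1, 2],
    "sp_D": [1, 3, 2, 3, 1],
}
edges = correlation_edges(abundance)
print(n_pairs, [(a, b, round(r, 2)) for a, b, r in edges])
```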

Page 58: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

The above result indicates that the Illumina-based bacterial profiling should reveal differences between the healthy individuals and patients. To test this hypothesis we compared the IBD patients and healthy controls (Supplementary Table 1), as it was previously reported that the two have different microbiota22. The principal component analysis, based on the same 155 species, clearly separates patients from healthy individuals and the ulcerative colitis from the Crohn’s disease patients (Fig. 4), confirming our hypothesis.

Page 60: UC Davis EVE161 Lecture 18 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

Almost all (99.96%) of the phylogenetically assigned genes belonged to the Bacteria and Archaea, reflecting their predominance in the gut. Genes that were not mapped to orthologous groups were clustered into gene families (see Methods). To investigate the functional content of the prevalent gene set we computed the total number of orthologous groups and/or gene families present in any combination of n individuals (with n = 2–124; see Fig. 2c). This rarefaction analysis shows that the ‘known’ functions (annotated in eggNOG or KEGG) quickly saturate (a value of 5,569 groups was observed): when sampling any subset of 50 individuals, most have been detected. However, three-quarters of the prevalent gut functionalities consists of uncharacterized orthologous groups and/or completely novel gene families (Fig. 2c). When including these groups, the rarefaction curve only starts to plateau at the very end, at a much higher level (19,338 groups were detected), confirming that the extensive sampling of a large number of individuals was necessary to capture this considerable amount of novel/unknown functionality.
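
The rarefaction idea in this paragraph, counting how many distinct groups appear in subsets of n individuals, can be sketched as below. Enumerating every combination of 124 individuals is not feasible, so this sketch averages over random subsets; that sampling scheme, the function name and the toy data are assumptions, not the paper's stated procedure.

```python
import random

def rarefaction_curve(groups_per_individual, n_draws=20, seed=0):
    """Mean number of distinct groups observed in random subsets of each size.

    groups_per_individual: dict mapping individual id -> set of group ids.
    Returns a dict mapping subset size n -> mean count over n_draws subsets.
    """
    rng = random.Random(seed)
    individuals = list(groups_per_individual)
    curve = {}
    for n in range(1, len(individuals) + 1):
        totals = []
        for _ in range(n_draws):
            subset = rng.sample(individuals, n)
            seen = set().union(*(groups_per_individual[i] for i in subset))
            totals.append(len(seen))
        curve[n] = sum(totals) / n_draws
    return curve

# Toy data: four individuals sharing some groups and each adding a few new ones.
toy = {
    "ind1": {"OG1", "OG2", "OG3"},
    "ind2": {"OG1", "OG2", "OG4"},
    "ind3": {"OG1", "OG3", "OG5", "OG6"},
    "ind4": {"OG1", "OG2", "OG3"},
}
curve = rarefaction_curve(toy)
print(curve)
```

A curve that flattens early corresponds to the quickly saturating 'known' functions; one that keeps climbing corresponds to the novel families described in the text.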

Bacterial functions important for life in the gut

The extensive non-redundant catalogue of the bacterial genes from the human intestinal tract provides an opportunity to identify bacterial functions important for life in this environment. There are functions necessary for a bacterium to thrive in a gut context (that is, the ‘minimal gut genome’) and those involved in the homeostasis of the whole ecosystem, encoded across many species (the ‘minimal gut metagenome’). The first set of functions is expected to be present in most or all gut bacterial species; the second set in most or all individuals’ gut samples.

To identify the functions encoded by the minimal gut genome we use the fact that they should be present in most or all gut bacterial species and therefore appear in the gene catalogue at a frequency above that of the functions present in only some of the gut bacterial species. The relative frequency of different functions can be deduced from the number of genes recruited to different eggNOG clusters, after normalization for gene length and copy number (Supplementary Fig. 10a, b). We ranked all the clusters by gene frequencies and determined the range that included the clusters specifying well-known essential bacterial functions, such as those determined experimentally for a well-studied firmicute, Bacillus subtilis27, hypothesizing that additional clusters in this range are equally important. As expected, the range that included most of B. subtilis essential clusters (86%) was at the very top of the ranking order (Fig. 5). Some 76% of the clusters with essential genes of Escherichia coli28 were within this range, confirming the validity of our approach. This suggests that 1,244 metagenomic clusters found within the range (Supplementary Table 10; termed ‘range clusters’ hereafter) specify functions important for life in the gut.

We found two types of functions among the range clusters: those required in all bacteria (housekeeping) and those potentially specific for the gut. Among many examples of the first category are the functions that are part of main metabolic pathways (for example, central carbon metabolism, amino acid synthesis), and important protein complexes (RNA and DNA polymerase, ATP synthase, general secretory apparatus). Not surprisingly, projection of the range clusters on the KEGG metabolic pathways gives a highly integrated picture of the global gut cell metabolism (Fig. 6a).

The putative gut-specific functions include those involved in adhesion to the host proteins (collagen, fibrinogen, fibronectin) or in harvesting sugars of the globoseries glycolipids, which are carried on blood and epithelial cells. Furthermore, 15% of range clusters encode functions that are present in <10% of the eggNOG genomes (see Supplementary Fig. 11) and are largely (74.3%) not defined (Fig. 6b). Detailed studies of these should lead to a deeper comprehension of bacterial life in the gut.

To identify the functions encoded by the minimal gut metagenome, we computed the orthologous groups that are shared by individuals of our cohort. This minimal set, of 6,313 functions, is much larger than the one estimated in a previous study8. There are only 2,069 functionally annotated orthologous groups, showing that they gravely underestimate the true size of the common functional complement among individuals (Fig. 6c). The minimal gut metagenome includes a considerable fraction of functions (~45%) that are present in <10% of the sequenced bacterial genomes (Fig. 6c, inset). These otherwise rare functionalities that are found in each of the 124 individuals may be necessary for the gut ecosystem. Eighty per cent of these orthologous groups contain genes with at best poorly characterized function, underscoring our limited knowledge of gut functioning.
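The idea of a set of orthologous groups shared by every individual can be read as a plain set intersection. The sketch below is only an illustration under that assumption: the function name, the toy cohort and its group labels are hypothetical and are not taken from the paper's methods. The final line computes the annotated share from the two counts quoted above, on the assumption that the 2,069 annotated groups are a subset of the 6,313.

```python
def shared_groups(groups_by_individual):
    """Return the orthologous groups present in every individual."""
    group_sets = [set(groups) for groups in groups_by_individual.values()]
    return set.intersection(*group_sets)


# Hypothetical toy cohort of three individuals (labels are made up).
toy_cohort = {
    "ind1": {"OG_a", "OG_b", "OG_c"},
    "ind2": {"OG_a", "OG_b", "OG_d"},
    "ind3": {"OG_a", "OG_b", "OG_c", "OG_e"},
}
core = shared_groups(toy_cohort)  # groups found in all three individuals

# Share of the reported minimal set that is functionally annotated,
# assuming the 2,069 annotated groups are a subset of the 6,313.
annotated_share = 2069 / 6313
```

On these counts roughly a third of the minimal set carries a functional annotation, which fits the text's point that annotated groups alone understate the shared complement.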

Of the known fraction, about 5% codes for (pro)phage-related proteins, implying a universal presence and possible important ecological role of bacteriophages in gut homeostasis. The most striking secondary metabolism that seems crucial for the minimal metagenome relates, not unexpectedly, to biodegradation of complex sugars and glycans harvested from the host diet and/or intestinal lining. Examples include degradation and uptake pathways for pectin (and its monomer, rhamnose) and sorbitol, sugars which are omnipresent in fruits and vegetables, but which are not or poorly absorbed by humans. As some gut microorganisms were found to degrade both of them29,30, this capacity seems to be selected for by the gut ecosystem as a non-competitive source of energy. Besides these, capacity to ferment, for example, mannose, fructose, cellulose and sucrose is also part of the minimal metagenome. Together, these emphasize the

Figure 5 | Clusters that contain the B. subtilis essential genes. The clusters were ranked by the number of genes they contain, normalized by average length and copy number (see Supplementary Fig. 10), and the proportion of clusters with the essential B. subtilis genes was determined for successive groups of 100 clusters. Range indicates the part of the cluster distribution that contains 86% of the B. subtilis essential genes.

[Plot axes: cluster rank, 1 to 10,001 (x) against cluster (%), 0 to 40 (y); one region of the ranking is marked ‘Range’.]

Figure 4 | Bacterial species abundance differentiates IBD patients and healthy individuals. Principal component analysis with health status as instrumental variables, based on the abundance of 155 species with ≥1% genome coverage by the Illumina reads in at least 1 individual of the cohort, was carried out with 14 healthy individuals and 25 IBD patients (21 ulcerative colitis and 4 Crohn’s disease) from Spain (Supplementary Table 1). Two first components (PC1 and PC2) were plotted and represented 7.3% of whole inertia. Individuals (represented by points) were clustered and centre of gravity computed for each class; P-value of the link between health status and species abundance was assessed using a Monte-Carlo test (999 replicates).

[Plot axes: PC1 (x) against PC2 (y); classes: healthy, Crohn’s disease, ulcerative colitis; P value: 0.031.]

ARTICLES NATURE | Vol 464 | 4 March 2010

©2010 Macmillan Publishers Limited. All rights reserved

Page 61: UC Davis EVE161 Lecture 18 by @phylogenomics

Functions encoded by the prevalent gene set

We classified the predicted genes by aligning them to the integrated NCBI-NR database of non-redundant protein sequences, the genes in the KEGG (Kyoto Encyclopedia of Genes and Genomes)24 pathways, and COG (Clusters of Orthologous Groups)25 and eggNOG26 databases. There were 77.1% genes classified into phylotypes, 57.5% to eggNOG clusters, 47.0% to KEGG orthology and 18.7% genes assigned to KEGG pathways, respectively (Supplementary Table 9). Almost all (99.96%) of the phylogenetically assigned genes belonged to the Bacteria and Archaea, reflecting their predominance in the gut. Genes that were not mapped to orthologous groups were clustered into gene families (see Methods). To investigate the functional content of the prevalent gene set we computed the total number of orthologous groups and/or gene families present in any combination of n individuals (with n = 2–124; see Fig. 2c). This rarefaction analysis shows that the ‘known’ functions (annotated in eggNOG or KEGG) quickly saturate (a value of 5,569 groups was observed): when sampling any subset of 50 individuals, most have been detected. However, three-quarters of the prevalent gut functionalities consists of uncharacterized orthologous groups and/or completely novel gene families (Fig. 2c). When including these groups, the rarefaction curve only starts to plateau at the very end, at a much higher level (19,338 groups were detected), confirming that the extensive sampling of a large number of individuals was necessary to capture this considerable amount of novel/unknown functionality.
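The rarefaction analysis described above counts how many distinct orthologous groups or gene families turn up in subsets of n individuals. A minimal sketch of that idea follows. It is not the authors' pipeline: it draws random subsets rather than enumerating every combination, and the function name, the number of draws and the toy data are all assumptions made for illustration.

```python
import random


def rarefaction_curve(groups_by_individual, n_values, draws=100, seed=0):
    """Mean number of distinct groups seen in random subsets of n individuals."""
    rng = random.Random(seed)
    individuals = list(groups_by_individual)
    curve = {}
    for n in n_values:
        totals = []
        for _ in range(draws):
            subset = rng.sample(individuals, n)
            # Union of the groups carried by the sampled individuals.
            seen = set().union(*(groups_by_individual[i] for i in subset))
            totals.append(len(seen))
        curve[n] = sum(totals) / draws
    return curve


# Hypothetical toy data: OG1-OG3 are shared, the "fam_*" groups are sporadic.
toy = {
    "ind1": {"OG1", "OG2", "OG3", "fam_a"},
    "ind2": {"OG1", "OG2", "OG3", "fam_b"},
    "ind3": {"OG1", "OG2", "OG3", "fam_c"},
    "ind4": {"OG1", "OG2", "OG3", "fam_d"},
    "ind5": {"OG1", "OG2", "OG3"},
    "ind6": {"OG1", "OG2", "OG3", "fam_a"},
}
curve = rarefaction_curve(toy, n_values=[2, 4, 6])
```

In the toy data the shared groups are found in every subset, while the sporadic families keep adding to the count as n grows. This mirrors the contrast the text draws between quickly saturating known functions and the late plateau once novel families are included.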

Page 63: UC Davis EVE161 Lecture 18 by @phylogenomics

the cohort. For this purpose, we used a non-redundant set of 650 sequenced bacterial and archaeal genomes (see Methods). We aligned the Illumina GA reads of each human gut microbial sample onto the genome set, using a 90% identity threshold, and determined the proportion of the genomes covered by the reads that aligned onto only a single position in the set. At a 1% coverage, which for a typical gut bacterial genome corresponds to an average length of about 40 kb, some 25-fold more than that of the 16S gene generally used for species identification, we detected 18 species in all individuals, 57 in ≥90% and 75 in ≥50% of individuals (Supplementary Table 8). At 10% coverage, requiring ~10-fold higher abundance in a sample, we still found 13 of the above species in ≥90% of individuals and 35 in ≥50%.
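The coverage criterion above can be illustrated in two steps: compute the fraction of a genome covered by the union of aligned reads, then ask in what share of individuals a species reaches the threshold. The sketch below assumes half-open (start, end) alignment intervals and a hypothetical 4 Mb genome, chosen so that 1% coverage equals the 40 kb mentioned in the text. The function names and the numbers are illustrative only.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of a genome covered by the union of (start, end) alignments."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent read: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def prevalence(coverage_by_individual, threshold=0.01):
    """Share of individuals in which a species meets the coverage threshold."""
    hits = sum(1 for c in coverage_by_individual if c >= threshold)
    return hits / len(coverage_by_individual)


# Hypothetical 4 Mb genome with three uniquely aligned stretches.
genome_length = 4_000_000
reads = [(0, 25_000), (20_000, 30_000), (100_000, 110_000)]
frac = covered_fraction(reads, genome_length)  # 40 kb of 4 Mb

# Hypothetical per-individual coverage values for one species.
share = prevalence([0.02, 0.005, 0.013, 0.3])
```

A species would then count as present in ≥90% of individuals when `prevalence` returns at least 0.9 at the chosen threshold.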

When the cumulated sequence length increased from 3.96 Gb to 8.74 Gb and from 4.41 Gb to 11.6 Gb, for samples MH0006 and MH0012, respectively, the number of strains common to the two at the 1% coverage threshold increased by 25%, from 135 to 169. This indicates the existence of a significantly larger common core than the one we could observe at the sequence depth routinely used for each individual.

The variability of abundance of microbial species in individuals can greatly affect identification of the common core. To visualize this variability, we compared the number of sequencing reads aligned to different genomes across the individuals of our cohort. Even for the most common 57 species present in ≥90% of individuals with genome coverage >1% (Supplementary Table 8), the inter-individual variability was between 12- and 2,187-fold (Fig. 3). As expected10,23, Bacteroidetes and Firmicutes had the highest abundance.

A complex pattern of species relatedness, characterized by clusters at the genus and family levels, emerges from the analysis of the network based on the pair-wise Pearson correlation coefficients of 155 species present in at least one individual at ≥1% coverage (Supplementary Fig. 9). Prominent clusters include some of the most abundant gut species, such as members of the Bacteroidetes and Dorea/Eubacterium/Ruminococcus groups and also bifidobacteria, Proteobacteria and streptococci/lactobacilli groups. These observations indicate that similar constellations of bacteria may be present in different individuals of our cohort, for reasons that remain to be established.
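A network built from pair-wise Pearson correlations can be sketched as follows. The text does not state which correlation cut-off defines an edge, so `min_r=0.8` is an assumption. The species names and abundance vectors are likewise hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_edges(abundance, min_r=0.8):
    """List species pairs whose abundance profiles correlate at or above min_r."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= min_r:  # assumed cut-off, not taken from the paper
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical abundances of three species across four individuals.
abundance = {
    "species_A": [1.0, 2.0, 3.0, 4.0],
    "species_B": [2.1, 3.9, 6.2, 8.0],
    "species_C": [4.0, 1.0, 3.5, 0.5],
}
edges = correlation_edges(abundance)
```

Species whose abundances rise and fall together across individuals end up connected, which is the kind of clustering at genus and family level that the text reports.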

The above result indicates that the Illumina-based bacterial profiling should reveal differences between the healthy individuals and patients. To test this hypothesis we compared the IBD patients and healthy controls (Supplementary Table 1), as it was previously reported that the two have different microbiota22. The principal component analysis, based on the same 155 species, clearly separates patients from healthy individuals and the ulcerative colitis from the Crohn’s disease patients (Fig. 4), confirming our hypothesis.
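The Figure 4 caption reports a Monte-Carlo test with 999 replicates for the link between health status and species abundance. One common way to run such a test is to permute the class labels and recompute a separation statistic each time. The sketch below uses the share of total variance explained by class centroids as that statistic. This is an assumption for illustration, not a statement of the authors' procedure, and the sample coordinates are made up.

```python
import random


def between_class_ratio(samples, labels):
    """Share of total variance (inertia) explained by the class centroids."""
    n = len(samples)
    dims = range(len(samples[0]))
    grand = [sum(s[d] for s in samples) / n for d in dims]
    total = sum((s[d] - grand[d]) ** 2 for s in samples for d in dims)
    between = 0.0
    for group in sorted(set(labels)):
        members = [s for s, lab in zip(samples, labels) if lab == group]
        centroid = [sum(m[d] for m in members) / len(members) for d in dims]
        between += len(members) * sum((centroid[d] - grand[d]) ** 2 for d in dims)
    return between / total


def permutation_p_value(samples, labels, replicates=999, seed=0):
    """Monte-Carlo P value: how often shuffled labels separate as well."""
    rng = random.Random(seed)
    observed = between_class_ratio(samples, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if between_class_ratio(samples, shuffled) >= observed - 1e-12:
            at_least += 1
    return (at_least + 1) / (replicates + 1)


# Hypothetical two-dimensional scores for two well-separated classes.
samples = [
    (0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.0, 0.3), (0.4, 0.0),
    (10.1, 9.8), (9.7, 10.2), (10.3, 10.0), (9.9, 9.6), (10.0, 10.4),
]
labels = ["healthy"] * 5 + ["patient"] * 5
observed_ratio = between_class_ratio(samples, labels)
p_value = permutation_p_value(samples, labels)
```

With 999 replicates the smallest attainable P value is 1/1000, because the observed labelling is counted as one of the 1,000 outcomes.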

Functions encoded by the prevalent gene set

We classified the predicted genes by aligning them to the integrated NCBI-NR database of non-redundant protein sequences, the genes in the KEGG (Kyoto Encyclopedia of Genes and Genomes)24 pathways, and COG (Clusters of Orthologous Groups)25 and eggNOG26 databases. There were 77.1% genes classified into phylotypes, 57.5% to eggNOG clusters, 47.0% to KEGG orthology and 18.7% genes assigned to KEGG pathways, respectively (Supplementary Table 9).

Figure 3 | Relative abundance of 57 frequent microbial genomes among individuals of the cohort. See Fig. 2c for definition of box and whisker plot. See Methods for computation.

[Plot axes: relative abundance (log10), –4 to –1 (x) against genome (y). Genomes listed: Blautia hansenii; Clostridium scindens; Enterococcus faecalis TX0104; Clostridium asparagiforme; Bacteroides fragilis 3_1_12; Bacteroides intestinalis; Ruminococcus gnavus; Anaerotruncus colihominis; Bacteroides pectinophilus; Clostridium nexile; Clostridium sp. L2−50; Parabacteroides johnsonii; Bacteroides finegoldii; Butyrivibrio crossotus; Bacteroides eggerthii; Clostridium sp. M62 1; Coprococcus eutactus; Bacteroides stercoris; Holdemania filiformis; Clostridium leptum; Streptococcus thermophilus LMD−9; Bacteroides capillosus; Subdoligranulum variabile; Ruminococcus obeum A2−162; Bacteroides dorei; Eubacterium ventriosum; Bacteroides sp. D4; Bacteroides sp. D1; Coprococcus comes SL7 1; Bacteroides xylanisolvens XB1A; Eubacterium rectale M104 1; Bacteroides sp. 2_2_4; Bacteroides sp. 4_3_47FAA; Bacteroides ovatus; Bacteroides sp. 9_1_42FAA; Parabacteroides distasonis ATCC 8503; Eubacterium siraeum 70 3; Bacteroides sp. 2_1_7; Roseburia intestinalis M50 1; Bacteroides vulgatus ATCC 8482; Dorea formicigenerans; Collinsella aerofaciens; Ruminococcus lactaris; Faecalibacterium prausnitzii SL3 3; Ruminococcus sp. SR1 5; Unknown sp. SS3 4; Ruminococcus torques L2−14; Eubacterium hallii; Bacteroides thetaiotaomicron VPI−5482; Clostridium sp. SS2−1; Bacteroides caccae; Ruminococcus bromii L2−63; Dorea longicatena; Parabacteroides merdae; Alistipes putredinis; Bacteroides uniformis.]

Figure 2 | Predicted ORFs in the human gut microbiome. a, Number of unique genes as a function of the extent of sequencing. The gene accumulation curve corresponds to the Sobs (Mao Tau) values (number of observed genes), calculated using EstimateS21 (version 8.2.0) on randomly chosen 100 samples (due to memory limitation). b, Coverage of genes from 89 frequent gut microbial species (Supplementary Table 12). c, Number of functions captured by number of samples investigated, based on known (well characterized) orthologous groups (OGs; bottom), known plus unknown orthologous groups (including, for example, putative, predicted, conserved hypothetical functions; middle) and orthologous groups plus novel gene families (>20 proteins) recovered from the metagenome (top). Boxes denote the interquartile range (IQR) between the first and third quartiles (25th and 75th percentiles, respectively) and the line inside denotes the median. Whiskers denote the lowest and highest values within 1.5 times IQR from the first and third quartiles, respectively. Circles denote outliers beyond the whiskers.

[Plot axes: a, number of samples (x) against number of non-redundant genes, ×10^6 (y). b, fraction of gene length covered (x) against number and fraction of target genes covered (y), with curves labelled 85%, 90% and 95%. c, number of individuals sampled, 1 to 124 (x) against number of orthologous groups/gene families, ×10^3 (y), with series ‘Known OGs’, ‘Known + unknown OGs’ and ‘OGs + novel gene families’.]
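The box-and-whisker convention in the caption (box at the quartiles, whiskers at the most extreme values within 1.5 times IQR, circles for anything beyond) can be written out directly. The sketch below uses Python's `statistics.quantiles` with the inclusive method. The quartile method is an assumption, since plotting packages differ, and the example values are made up.

```python
import statistics


def box_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # lowest value within the lower fence
        "whisker_high": max(inside),   # highest value within the upper fence
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical values with one clear outlier.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

Here the quartiles are 3, 5 and 7, so the fences sit at -3 and 13: the whiskers stop at 1 and 8, and 30 is drawn as an outlier.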


Page 64: UC Davis EVE161 Lecture 18 by @phylogenomics

Bacterial functions important for life in the gut

The extensive non-redundant catalogue of the bacterial genes from the human intestinal tract provides an opportunity to identify bacterial functions important for life in this environment. There are functions necessary for a bacterium to thrive in a gut context (that is, the ‘minimal gut genome’) and those involved in the homeostasis of the whole ecosystem, encoded across many species (the ‘minimal gut metagenome’). The first set of functions is expected to be present in most or all gut bacterial species; the second set in most or all individuals’ gut samples.

Page 65: UC Davis EVE161 Lecture 18 by @phylogenomics

To identify the functions encoded by the minimal gut genome we use the fact that they should be present in most or all gut bacterial species and therefore appear in the gene catalogue at a frequency above that of the functions present in only some of the gut bacterial species. The relative frequency of different functions can be deduced from the number of genes recruited to different eggNOG clusters, after normalization for gene length and copy number (Supplementary Fig. 10a, b). We ranked all the clusters by gene frequencies and determined the range that included the clusters specifying well-known essential bacterial functions, such as those determined experimentally for a well-studied firmicute, Bacillus subtilis27, hypothesizing that additional clusters in this range are equally important. As expected, the range that included most of B. subtilis essential clusters (86%) was at the very top of the ranking order (Fig. 5). Some 76% of the clusters with essential genes of Escherichia coli28 were within this range, confirming the validity of our approach. This suggests that 1,244 metagenomic clusters found within the range (Supplementary Table 10; termed ‘range clusters’ hereafter) specify functions important for life in the gut.
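The ranking step above can be pictured as sorting clusters by normalized gene frequency and then finding how far down the ranking one must go to capture a given share of known essential clusters. The sketch below takes already-normalized frequencies as input and uses a simple top-prefix rule. The rule, the function name and the toy values are assumptions, since the text does not spell out how the range boundary was set.

```python
import math


def top_range(frequency, essential, target=0.86):
    """Smallest top-ranked prefix holding `target` of the essential clusters."""
    ranked = sorted(frequency, key=frequency.get, reverse=True)
    needed = math.ceil(target * len(essential))
    found = 0
    for rank, cluster in enumerate(ranked, start=1):
        if cluster in essential:
            found += 1
        if found >= needed:
            return ranked[:rank]
    return ranked  # target never reached: fall back to the full ranking


# Hypothetical normalized frequencies for ten clusters, c01 most frequent.
frequency = {f"c{i:02d}": 100.0 / i for i in range(1, 11)}
# Hypothetical essential set: seven top clusters plus one low-ranked cluster.
essential = {f"c{i:02d}" for i in range(1, 8)} | {"c10"}
range_clusters = top_range(frequency, essential)
```

Every cluster inside the returned prefix, essential or not, then becomes a candidate ‘range cluster’, which is the hypothesis the text goes on to test with the E. coli essential genes.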

Page 68: UC Davis EVE161 Lecture 18 by @phylogenomics

We found two types of functions among the range clusters: those required in all bacteria (housekeeping) and those potentially specific for the gut. Among many examples of the first category are the functions that are part of main metabolic pathways (for example, central carbon metabolism, amino acid synthesis), and important protein complexes (RNA and DNA polymerase, ATP synthase, general secretory apparatus). Not surprisingly, projection of the range clusters on the KEGG metabolic pathways gives a highly integrated picture of the global gut cell metabolism (Fig. 6a).

The putative gut-specific functions include those involved in adhesion to the host proteins (collagen, fibrinogen, fibronectin) or in harvesting sugars of the globoseries glycolipids, which are carried on blood and epithelial cells. Furthermore, 15% of range clusters encode functions that are present in <10% of the eggNOG genomes (see Supplementary Fig. 11) and are largely (74.3%) not defined (Fig. 6b). Detailed studies of these should lead to a deeper comprehension of bacterial life in the gut.

Page 70: UC Davis EVE161 Lecture 18 by @phylogenomics

strong dependence of the gut ecosystem on complex sugar degradation for its functioning.

Functional complementarities of the genome and metagenome

Detailed analysis of the complementarities between the gut metagenome and the human genome is beyond the scope of the present work. To provide an overview, we considered two factors: conservation of the functions in the minimal metagenome and presence/absence of functions in one or the other (Supplementary Table 11). Gut bacteria use mostly fermentation to generate energy, converting sugars, in part, to short-chain fatty acid, that are used by the host as energy source. Acetate is important for muscle, heart and brain cells31, propionate is used in host hepatic neoglucogenic processes, whereas, in addition, butyrate is important for enterocytes32. Beyond short-chain fatty acid, a number of amino acids are indispensable to humans33 and can be provided by bacteria34. Similarly, bacteria can contribute certain vitamins3 (for example, biotin, phylloquinone) to the host. All of the steps of biosynthesis of these molecules are encoded by the minimal metagenome.

Gut bacteria seem to be able to degrade numerous xenobiotics, including non-modified and halogenated aromatic compounds (Supplementary Table 11), even if the steps of most pathways are not part of the minimal metagenome and are found in a fraction of individuals only. A particularly interesting example is that of benzoate, which is a common food supplement, known as E211. Its degradation by the coenzyme-A ligation pathway, encoded in the minimal metagenome, leads to pimeloyl-coenzyme-A, which is a precursor of biotin, indicating that this food supplement can have a potentially beneficial role for human health.

Figure 6 | Characterization of the minimal gut genome and metagenome. a, Projection of the minimal gut genome on the KEGG pathways using the iPath tool38. b, Functional composition of the minimal gut genome and metagenome. Rare and frequent refer to the presence in sequenced eggNOG genomes. c, Estimation of the minimal gut metagenome size. Known orthologous groups (red), known plus unknown orthologous groups (blue) and orthologous groups plus novel gene families (>20 proteins; grey) are shown (see Fig. 2c for definition of box and whisker plot). The inset shows composition of the gut minimal microbiome. Large circle: classification in the minimal metagenome according to orthologous group occurrence in STRING739 bacterial genomes. Common (25%), uncommon (35%) and rare (45%) refer to functions that are present in >50%, <50% but >10%, and <10% of STRING bacteria genomes, respectively. Small circle: composition of the rare orthologous groups. Unknown (80%) have no annotation or are poorly characterized, whereas known bacterial (19%) and phage-related (1%) orthologous groups have functional description.

[Panel a labels: KEGG pathway map identifiers grouped under Carbohydrate metabolism; Glycan biosynthesis and metabolism; Amino acid metabolism; Energy metabolism; Lipid metabolism; Biodegradation of xenobiotics; Metabolism of other amino acids; Nucleotide metabolism; Metabolism of cofactors and vitamins; Biosynthesis of secondary metabolites. Named pathways: β-Lactam resistance; Limonene and pinene degradation; Alkaloid biosynthesis II; Tetracycline biosynthesis; Penicillins and cephalosporins biosynthesis; Monoterpenoid biosynthesis; Diterpenoid biosynthesis; Streptomycin biosynthesis; Novobiocin biosynthesis.]

[Panel b: percentage scale from 0.0% to 70.0%; series: Rare minimal genome, Rare minimal metagenome, Frequent minimal genome, Frequent minimal metagenome; functional categories: General function - R; Translation - J; Amino acids - E; DNA - L; Unknown - S; Envelope - M; Carbohydrates - G; Energy - C; Transcription - K; Coenzymes - H; Nucleotides - F; Inorganic - P; Protein turnover - O; Lipids - I; Signal transduction - T; Secretion - U; Cell cycle - D; Defence - V; Second metabolites - Q; Cell motility - N; RNA - A; Chromatin - B; Extracellular - W; Nuclear structure - Y; Cytoskeleton - Z.]

[Panel c axes: number of individuals sampled, 1 to 100 (x) against minimal metagenome size, 2,000 to 14,000 (y). Inset labels: Common, Uncommon, Rare; Unknown, Known, Phage-associated.]

NATURE | Vol 464 | 4 March 2010 | ARTICLES

63 | ©2010 Macmillan Publishers Limited. All rights reserved


Page 71: UC Davis EVE161 Lecture 18 by @phylogenomics


strong dependence of the gut ecosystem on complex sugar degradation for its functioning.

Functional complementarities of the genome and metagenome

Detailed analysis of the complementarities between the gut metagenome and the human genome is beyond the scope of the present work. To provide an overview, we considered two factors: conservation of the functions in the minimal metagenome and presence/absence of functions in one or the other (Supplementary Table 11). Gut bacteria use mostly fermentation to generate energy, converting sugars, in part, to short-chain fatty acid, that are used by the host as energy source. Acetate is important for muscle, heart and brain cells³¹, propionate is used in host hepatic neoglucogenic processes, whereas, in addition, butyrate is important for enterocytes³². Beyond short-chain fatty acid, a number of amino acids are indispensable to humans³³ and can be provided by bacteria³⁴. Similarly, bacteria can contribute certain vitamins³ (for example, biotin, phylloquinone) to the host. All of the steps of biosynthesis of these molecules are encoded by the minimal metagenome.

Gut bacteria seem to be able to degrade numerous xenobiotics, including non-modified and halogenated aromatic compounds (Supplementary Table 11), even if the steps of most pathways are not part of the minimal metagenome and are found in a fraction of individuals only. A particularly interesting example is that of benzoate, which is a common food supplement, known as E211. Its degradation by the coenzyme-A ligation pathway, encoded in the minimal metagenome, leads to pimeloyl-coenzyme-A, which is a precursor of biotin, indicating that this food supplement can have a potentially beneficial role for human health.


Page 73: UC Davis EVE161 Lecture 18 by @phylogenomics


To identify the functions encoded by the minimal gut metagenome, we computed the orthologous groups that are shared by individuals of our cohort. This minimal set, of 6,313 functions, is much larger than the one estimated in a previous study⁸. There are only 2,069 functionally annotated orthologous groups, showing that they gravely underestimate the true size of the common functional complement among individuals (Fig. 6c). The minimal gut metagenome includes a considerable fraction of functions (~45%) that are present in <10% of the sequenced bacterial genomes (Fig. 6c, inset). These otherwise rare functionalities that are found in each of the 124 individuals may be necessary for the gut ecosystem. Eighty per cent of these orthologous groups contain genes with at best poorly characterized function, underscoring our limited knowledge of gut functioning.
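
The "shared by individuals of our cohort" definition above can be sketched as a set intersection. This is only a minimal illustration and not the authors' pipeline: the helper name `minimal_metagenome`, the individual labels and the orthologous group identifiers are all made up for the example.

```python
def minimal_metagenome(og_by_individual):
    """Return the orthologous groups (OGs) detected in every individual.

    `og_by_individual` maps an individual label to the collection of OG
    identifiers found in that individual's sample (hypothetical input format).
    """
    og_sets = [set(ogs) for ogs in og_by_individual.values()]
    if not og_sets:
        return set()
    # An OG belongs to the minimal set only if no individual lacks it.
    return set.intersection(*og_sets)


# Toy, invented data: three individuals and the OGs detected in each.
toy_cohort = {
    "ind1": {"OG1", "OG2", "OG3", "OG5"},
    "ind2": {"OG1", "OG2", "OG4", "OG5"},
    "ind3": {"OG1", "OG2", "OG5", "OG6"},
}

core = minimal_metagenome(toy_cohort)
print(sorted(core))  # ['OG1', 'OG2', 'OG5']
```

With this definition the minimal set can only shrink or stay the same as more individuals are added, which is one way to read the trend plotted against the number of individuals sampled in Fig. 6c.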


Page 75: UC Davis EVE161 Lecture 18 by @phylogenomics


Of the known fraction, about 5% codes for (pro)phage-related proteins, implying a universal presence and possible important ecological role of bacteriophages in gut homeostasis. The most striking secondary metabolism that seems crucial for the minimal metagenome relates, not unexpectedly, to biodegradation of complex sugars and glycans harvested from the host diet and/or intestinal lining. Examples include degradation and uptake pathways for pectin (and its monomer, rhamnose) and sorbitol, sugars which are omnipresent in fruits and vegetables, but which are not or poorly absorbed by humans. As some gut microorganisms were found to degrade both of them²⁹,³⁰, this capacity seems to be selected for by the gut ecosystem as a non-competitive source of energy. Besides these, capacity to ferment, for example, mannose, fructose, cellulose and sucrose is also part of the minimal metagenome. Together, these emphasize the strong dependence of the gut ecosystem on complex sugar degradation for its functioning.


Page 77: UC Davis EVE161 Lecture 18 by @phylogenomics


Functional complementarities of the genome and metagenome

Detailed analysis of the complementarities between the gut metagenome and the human genome is beyond the scope of the present work. To provide an overview, we considered two factors: conservation of the functions in the minimal metagenome and presence/absence of functions in one or the other (Supplementary Table 11). Gut bacteria use mostly fermentation to generate energy, converting sugars, in part, to short-chain fatty acid, that are used by the host as energy source. Acetate is important for muscle, heart and brain cells³¹, propionate is used in host hepatic neoglucogenic processes, whereas, in addition, butyrate is important for enterocytes³². Beyond short-chain fatty acid, a number of amino acids are indispensable to humans³³ and can be provided by bacteria³⁴. Similarly, bacteria can contribute certain vitamins³ (for example, biotin, phylloquinone) to the host. All of the steps of biosynthesis of these molecules are encoded by the minimal metagenome.

Page 78: UC Davis EVE161 Lecture 18 by @phylogenomics


Gut bacteria seem to be able to degrade numerous xenobiotics, including non-modified and halogenated aromatic compounds (Supplementary Table 11), even if the steps of most pathways are not part of the minimal metagenome and are found in a fraction of individuals only. A particularly interesting example is that of benzoate, which is a common food supplement, known as E211. Its degradation by the coenzyme-A ligation pathway, encoded in the minimal metagenome, leads to pimeloyl-coenzyme-A, which is a precursor of biotin, indicating that this food supplement can have a potentially beneficial role for human health.

Page 79: UC Davis EVE161 Lecture 18 by @phylogenomics


DISCUSSION

Page 80: UC Davis EVE161 Lecture 18 by @phylogenomics


Discussion

We have used extensive Illumina GA short-read-based sequencing of total faecal DNA from a cohort of 124 individuals of European (Nordic and Mediterranean) origin to establish a catalogue of non-redundant human intestinal microbial genes. The catalogue contains 3.3 million microbial genes, 150-fold more than the human gene complement, and includes an overwhelming majority (>86%) of prevalent genes harboured by our cohort. The catalogue probably contains a large majority of prevalent intestinal microbial genes in the human population, for the following reasons: (1) over 70% of the metagenomic reads from three previous studies, including American and Japanese individuals⁸,¹⁶,¹⁷, can be mapped on our contigs; (2) about 80% of the microbial genes from 89 frequent gut reference genomes are present in our set. This result represents a proof of principle that short-read sequencing can be used to characterize complex microbiomes.

Page 81: UC Davis EVE161 Lecture 18 by @phylogenomics


The full bacterial gene complement of each individual was not sampled in our work. Nevertheless, we have detected some 536,000 prevalent unique genes in each, out of the total of 3.3 million carried by our cohort. Inevitably, the individuals largely share the genes of the common pool. At the present depth of sequencing, we found that almost 40% of the genes from each individual are shared with at least half of the individuals of the cohort. Future studies of world-wide span, envisaged within the International Human Microbiome Consortium, will complete, as necessary, our gene catalogue and establish boundaries to the proportion of shared genes.

Page 82: UC Davis EVE161 Lecture 18 by @phylogenomics


Essentially all (99.1%) of the genes of our catalogue are of bacterial origin, the remainder being mostly archaeal, with only 0.1% of eukaryotic and viral origins. The gene catalogue is therefore equivalent to that of some 1,000 bacterial species with an average-sized genome, encoding about 3,364 non-redundant genes. We estimate that no more than 15% of prevalent genes of our cohort may be missing from the catalogue, and suggest that the cohort harbours no more than ~1,150 bacterial species abundant enough to be detected by our sampling. Given the large overlap between microbial sequences in this and previous studies we suggest that the number of abundant intestinal bacterial species may be not much higher than that observed in our cohort. Each individual of our cohort harbours at least 160 such bacterial species, as estimated by the average prevalent gene number, and many must thus be shared.
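
The species estimates in this paragraph can be reproduced as back-of-the-envelope arithmetic. The sketch below is one reading of the text, not a calculation taken from the paper: it assumes gene counts are converted to "genome equivalents" by dividing by 3,364 genes per average genome, that the 536,000 prevalent genes per individual (quoted earlier in the Discussion) are the "average prevalent gene number", and that the upper bound comes from scaling the catalogue up for a 15% shortfall. All variable names are made up.

```python
CATALOGUE_GENES = 3_300_000               # non-redundant genes in the catalogue
GENES_PER_GENOME = 3_364                  # average-sized bacterial genome (from the text)
PREVALENT_GENES_PER_INDIVIDUAL = 536_000  # quoted earlier in the Discussion
MAX_MISSING_FRACTION = 0.15               # "no more than 15%" may be missing


def genome_equivalents(n_genes, genes_per_genome=GENES_PER_GENOME):
    """Convert a gene count into a number of average-sized genomes."""
    return n_genes / genes_per_genome


catalogue_species = genome_equivalents(CATALOGUE_GENES)
species_per_individual = genome_equivalents(PREVALENT_GENES_PER_INDIVIDUAL)
# Assumption: if up to 15% of prevalent genes are missing, the catalogue
# covers at least 85% of the total, so divide by 0.85 for an upper bound.
species_upper_bound = catalogue_species / (1 - MAX_MISSING_FRACTION)

print(round(catalogue_species), round(species_per_individual), round(species_upper_bound))
# 981 159 1154
```

Under these assumptions the three values land close to the "some 1,000", "at least 160" and "~1,150" figures quoted in the text.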

Page 83: UC Davis EVE161 Lecture 18 by @phylogenomics


We assigned about 12% of the reference set genes (404,000) to the 194 sequenced intestinal bacterial genomes, and can thus associate them with bacterial species. Sequencing of at least 1,000 human-associated bacterial genomes is foreseen within the International Human Microbiome Consortium, via the Human Microbiome Project and MetaHIT. This is commensurate with the number of dominant species in our cohort and expected more broadly in human gut, and should enable a much more extensive gene to species assignment. Nevertheless, we used the presently available sequenced genomes to explore further the concept of largely shared species among our cohort and identified 75 species common to >50% of individuals and 57 species common to >90%. These numbers are likely to increase with the number of sequenced reference strains and a deeper sampling. Indeed, a 2–3-fold increase in sequencing depth raised by 25% the number of species that we could detect as shared between two individuals. A large number of shared species supports the view that the prevalent human microbiome is of a finite and not overly large size.

Page 84: UC Davis EVE161 Lecture 18 by @phylogenomics


How can this view be reconciled with that of a considerable interpersonal diversity of innumerable bacterial species in the gut, arising from most previous studies using the 16S RNA marker gene⁴,⁸,¹⁰,¹¹? Possibly the depth of sampling of these studies was insufficient to reveal common species when present at low abundance, and emphasized the difference in the composition of a relatively few dominant species. We found a very high variability of abundance (12- to 2,200-fold) for the 57 most common species across the individuals of our cohort. Nevertheless, a recent 16S rRNA-based study concluded that a common bacterial species ‘core’, shared among at least 50% of individuals under study, exists³⁵.

Page 85: UC Davis EVE161 Lecture 18 by @phylogenomics


Detailed comparisons of bacterial genes across the individuals of our cohort will be carried out in the future, within the context of the ongoing MetaHIT clinical studies of which they are part. Nevertheless, clustering of the genes in families allowed us to capture a virtually full functional potential of the prevalent gene set and revealed a considerable novelty, extending the functional categories by some 30% in regard to previous work⁸. Similarly, this analysis has revealed a functional core, conserved in each individual of the cohort, which reflects the full minimal human gut metagenome, encoded across many species and probably required for the proper functioning of the gut ecosystem. The size of this minimal metagenome exceeds several-fold that of the core metagenome reported previously⁸. It includes functions known to be important to the host–bacterial interaction, such as degradation of complex polysaccharides, synthesis of short-chain fatty acids, indispensable amino acids and vitamins. Finally, we also identified functions that we attribute to a minimal gut bacterial genome, likely to be required by any bacterium to thrive in this ecosystem. Besides general housekeeping functions, the minimal genome encompasses many genes of unknown function, rare in sequenced genomes and possibly specifically required in the gut.

Page 86: UC Davis EVE161 Lecture 18 by @phylogenomics


Beyond providing the global view of the human gut microbiome, the extensive gene catalogue we have established enables future studies of association of the microbial genes with human phenotypes and, even more broadly, human living habits, taking into account the environment, including diet, from birth to old age. We anticipate that these studies will lead to a much more complete understanding of human biology than the one we presently have.

Page 87: UC Davis EVE161 Lecture 18 by @phylogenomics


LETTERS

A core gut microbiome in obese and lean twins

Peter J. Turnbaugh¹, Micah Hamady³, Tanya Yatsunenko¹, Brandi L. Cantarel⁵, Alexis Duncan², Ruth E. Ley¹, Mitchell L. Sogin⁶, William J. Jones⁷, Bruce A. Roe⁸, Jason P. Affourtit⁹, Michael Egholm⁹, Bernard Henrissat⁵, Andrew C. Heath², Rob Knight⁴ & Jeffrey I. Gordon¹

The human distal gut harbours a vast ensemble of microbes (the microbiota) that provide important metabolic capabilities, including the ability to extract energy from otherwise indigestible dietary polysaccharides¹⁻⁶. Studies of a few unrelated, healthy adults have revealed substantial diversity in their gut communities, as measured by sequencing 16S rRNA genes⁶⁻⁸, yet how this diversity relates to function and to the rest of the genes in the collective genomes of the microbiota (the gut microbiome) remains obscure. Studies of lean and obese mice suggest that the gut microbiota affects energy balance by influencing the efficiency of calorie harvest from the diet, and how this harvested energy is used and stored³⁻⁵. Here we characterize the faecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and their mothers, to address how host genotype, environmental exposure and host adiposity influence the gut microbiome. Analysis of 154 individuals yielded 9,920 near full-length and 1,937,461 partial bacterial 16S rRNA sequences, plus 2.14 gigabases from their microbiomes. The results reveal that the human gut microbiome is shared among family members, but that each person’s gut microbial community varies in the specific bacterial lineages present, with a comparable degree of co-variation between adult monozygotic and dizygotic twin pairs. However, there was a wide array of shared microbial genes among sampled individuals, comprising an extensive, identifiable ‘core microbiome’ at the gene, rather than at the organismal lineage, level. Obesity is associated with phylum-level changes in the microbiota, reduced bacterial diversity and altered representation of bacterial genes and metabolic pathways. These results demonstrate that a diversity of organismal assemblages can nonetheless yield a core microbiome at a functional level, and that deviations from this core are associated with different physiological states (obese compared with lean).

We characterized gut microbial communities in 31 monozygotic twin pairs, 23 dizygotic twin pairs and, where available, their mothers (n = 46) (Supplementary Tables 1–5). Monozygotic and dizygotic co-twins and parent–offspring pairs provided an attractive model for assessing the impact of genotype and shared early environmental exposures on the gut microbiome. Moreover, genetically ‘identical’⁹ monozygotic twin pairs gain weight in response to overfeeding in a more reproducible way than unrelated individuals¹⁰ and are more concordant for body mass index (BMI) than dizygotic twin pairs¹¹.

Twin pairs who had been enrolled in the Missouri Adolescent Female Twin Study (MOAFTS¹²) were recruited for this study (mean period of enrolment in MOAFTS, 11.7 ± 1.2 years; range, 4.4–13.0 years). Twins were 21–32 years old, of European or African ancestry, and were generally concordant for obesity (BMI > 30 kg m⁻²) or leanness (BMI = 18.5–24.9 kg m⁻²) (one twin pair was lean/overweight (overweight defined as BMI ≥ 25 and < 30) and six pairs were overweight/obese). They had not taken antibiotics for at least 5.49 ± 0.09 months. Each participant completed a detailed medical, lifestyle and dietary questionnaire: study enrolees were broadly representative of the overall Missouri population for BMI, parity, education and marital status (see Supplementary Results). Although all were born in Missouri, they currently live throughout the USA: 29% live in the same house, but some live more than 800 km apart. Because faecal samples are readily attainable and representative of interpersonal differences in gut microbial ecology⁷, they were collected from each individual and frozen immediately. The collection procedure was repeated again with an average interval between sampling of 57 ± 4 days.

To characterize the bacterial lineages present in the faecal microbiotas of these 154 individuals, we performed 16S rRNA sequencing, targeting the full-length gene with an ABI 3730xl capillary sequencer. Additionally, we performed multiplex pyrosequencing with a 454 FLX instrument to survey the gene’s V2 variable region¹³ and its V6 hypervariable region¹⁴ (Supplementary Tables 1–3).

Complementary phylogenetic and taxon-based methods were used to compare 16S rRNA sequences among faecal communities (see Methods). No matter which region of the gene was examined, individuals from the same family (a twin and her co-twin, or twins and their mother) had a more similar bacterial community structure than unrelated individuals (Fig. 1a and Supplementary Fig. 1a, b), and shared significantly more species-level phylotypes (16S rRNA sequences with ≥97% identity comprise each phylotype) (G = 55.2, P < 10⁻¹² (V2); G = 12.3, P < 0.001 (V6); G = 11.3, P < 0.001 (full-length)). No significant correlation was seen between the degree of physical separation of family members’ current homes and the degree of similarity between their microbial communities (defined by UniFrac¹⁵). The observed familial similarity was not due to an indirect effect of the physiological states of obesity versus leanness; similar results were observed after stratifying twin pairs and their mothers by BMI category (concordant lean or concordant obese individuals; Supplementary Fig. 2). Surprisingly, there was no significant difference in the degree of similarity in the gut microbiotas of adult monozygotic compared with dizygotic twin pairs (Fig. 1a). However, we could not assess whether monozygotic and dizygotic twin pairs had different degrees of similarities at earlier stages of their lives.

Multiplex pyrosequencing of V2 and V6 amplicons allowed higher levels of coverage compared with what was feasible using Sanger sequencing, reaching on average 3,984 ± 232 (V2) and 24,786 ± 1,403 (V6) sequences per sample. To control for differences

¹Center for Genome Sciences. ²Department of Psychiatry, Washington University School of Medicine, St Louis, Missouri 63108, USA. ³Department of Computer Science. ⁴Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA. ⁵CNRS, UMR6098, Marseille, France. ⁶Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts 02543, USA. ⁷Environmental Genomics Core Facility, University of South Carolina, Columbia, South Carolina 29208, USA. ⁸Department of Chemistry and Biochemistry and the Advanced Center for Genome Technology, University of Oklahoma, Norman, Oklahoma 73019, USA. ⁹454 Life Sciences, Branford, Connecticut 06405, USA.

Vol 457 | 22 January 2009 | doi:10.1038/nature07540


Page 88: UC Davis EVE161 Lecture 18 by @phylogenomics

in coverage, all analyses were performed on an equal number of randomly selected sequences (200 full-length, 1,000 V2 and 10,000 V6). At this level of coverage, there was little overlap between the sampled faecal communities. Moreover, the number of 16S rRNA gene sequences belonging to each phylotype varied greatly between faecal microbiotas (Supplementary Tables 6–8).
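Subsampling every sample to the same number of randomly selected sequences, as described above, is commonly called rarefaction. The sketch below is a minimal illustration and not the authors' pipeline: the function name `rarefy`, the fixed seed and the toy counts are assumptions introduced here.

```python
import random

def rarefy(phylotype_counts, depth, seed=0):
    """Draw `depth` sequences at random, without replacement, from one sample.

    phylotype_counts maps a phylotype label to its number of sequences.
    Returns a new dict of counts that sums to `depth`.
    """
    # Expand the counts into one entry per sequence.
    pool = [label for label, n in phylotype_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer sequences than the requested depth")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsample = {}
    for label in rng.sample(pool, depth):
        subsample[label] = subsample.get(label, 0) + 1
    return subsample

# Hypothetical V2 sample with 1,500 sequences, rarefied to the 1,000 used for V2.
toy_sample = {"phylotype_A": 900, "phylotype_B": 450, "phylotype_C": 150}
rarefied = rarefy(toy_sample, depth=1000)
```

Drawing without replacement keeps each subsampled count at or below the original count, so a rare phylotype can drop out entirely at low depth.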

Because this apparent lack of overlap could reflect the level of coverage (Supplementary Tables 1–3), we subsequently searched all hosts for bacterial phylotypes present at high abundance using a sampling model based on a combination of standard Poisson and binomial sampling statistics. The analysis allowed us to conclude that no phylotype was present at more than about 0.5% abundance in all of the samples in this study (see Supplementary Results). Finally, we sub-sampled our data set by randomly selecting 50–3,000 sequences per sample; again, no phylotypes were detectable in all individuals sampled within this range of coverage (Supplementary Fig. 3).
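The authors' sampling model is given in their Supplementary Results and is not reproduced here. As a rough illustration of why binomial and Poisson statistics are relevant, the sketch below computes the chance of missing a phylotype entirely at a given abundance and depth. The function names are hypothetical, and the depth of 1,000 is simply the V2 subsampling depth quoted earlier.

```python
import math

def prob_missed_binomial(abundance, n_sequences):
    """Chance that n independent draws all miss a phylotype at this abundance."""
    return (1.0 - abundance) ** n_sequences

def prob_missed_poisson(abundance, n_sequences):
    """Poisson approximation: probability of zero hits when n * p hits are expected."""
    return math.exp(-abundance * n_sequences)

# A phylotype at 0.5% abundance sampled at 1,000 sequences.
expected_hits = 0.005 * 1000  # about 5 sequences expected
missed_binom = prob_missed_binomial(0.005, 1000)
missed_pois = prob_missed_poisson(0.005, 1000)
```

Under either model the miss probability at this abundance and depth is below 1%, so a phylotype at 0.5% abundance in every host would be very unlikely to go undetected in sample after sample.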

Samples taken from the same individual at the initial collection point and 57 ± 4 days later were consistent with respect to the specific phylotypes found (Supplementary Figs 4 and 5), but showed variations in relative abundance of the major gut bacterial phyla (Supplementary Fig. 6). There was no significant association between UniFrac distance and the time between sample collections. Overall, faecal samples from the same individual were much more similar to one another than samples from family members or unrelated individuals (Fig. 1a), demonstrating that short-term temporal changes in community structure within an individual are minor compared with inter-personal differences.

Analysis of 16S rRNA data sets produced by the three PCR-based methods, plus shotgun sequencing of community DNA (see below), revealed a lower proportion of Bacteroidetes and a higher proportion of Actinobacteria in obese compared with lean individuals of both ancestries (Supplementary Table 9). Combining the individual P values across these independent analyses using Fisher’s method disclosed significantly fewer Bacteroidetes (P = 0.003), more Actinobacteria (P = 0.002) but no significant difference in Firmicutes (P = 0.09). These findings agree with previous work showing comparable differences in both taxa in mice² and a progressive increase in the representation of Bacteroidetes when 12 unrelated, obese humans lost weight after being placed on one of two reduced-calorie diets⁶.
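Fisher's method combines k independent P values through the statistic X = −2 Σ ln pᵢ, which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a minimal standard-library version that uses the closed-form chi-squared tail for even degrees of freedom. The four input P values are invented for illustration; the per-method values are not given in this excerpt.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    Returns (statistic, combined_p). For 2k degrees of freedom the chi-squared
    survival function is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    combined = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return statistic, combined

# Hypothetical P values from four analyses (full-length, V2, V6, shotgun).
illustrative_p = [0.04, 0.10, 0.03, 0.20]
statistic, combined_p = fisher_combined_p(illustrative_p)
```

The method assumes the analyses are independent; with correlated inputs the combined P value will tend to be too small.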

Across all methods, obesity was associated with a significant decrease in the level of diversity (Fig. 1b and Supplementary Fig. 1c–f). This reduced diversity suggests an analogy: the obese gut microbiota is not like a rainforest or reef, which are adapted to high energy flux and are highly diverse; rather, it may be more like a fertilizer runoff where a reduced-diversity microbial community blooms with abnormal energy input¹⁶.

We subsequently characterized the microbial lineage and gene content of the faecal microbiomes of 18 individuals representing six of the families (three lean and three obese European ancestry monozygotic twin pairs and their mothers) through shotgun pyrosequencing (Supplementary Tables 4 and 5) and BLASTX comparisons against several databases (KEGG¹⁷ (version 44) and STRING¹⁸) plus a custom database of 44 reference human gut microbial genomes (Supplementary Figs 7–10 and Supplementary Results). Our analysis parameters were validated using control data sets comprising randomly fragmented microbial genes with annotations in the KEGG database¹⁷ (Supplementary Fig. 11 and Supplementary Methods). We also tested how technical advances that produce longer reads might improve these assignments by sequencing faecal community samples from one twin pair using Titanium pyrosequencing methods (average read length of 341 ± 134 nucleotides (s.d.) versus 208 ± 68 nucleotides for the standard FLX method). Supplementary Fig. 12 shows that the frequency and quality of sequence assignments is improved as read length increases from 200 to 350 nucleotides.

The 18 microbiomes were searched to identify sequences matching domains from experimentally validated carbohydrate-active enzymes (CAZymes). Sequences matching 156 total CAZy families were found within at least one human gut microbiome, including 77 glycoside hydrolase, 21 carbohydrate-binding module, 35 glycosyltransferase, 12 polysaccharide lyase and 11 carbohydrate-esterase families (Supplementary Table 10). On average, 2.62 ± 0.13% of the sequences in the gut microbiome could be assigned to CAZymes (a total of 217,615 sequences), a percentage that is greater than the most abundant KEGG pathway (‘Transporters’; 1.20 ± 0.06% of the filtered sequences generated from each sample) and indicative of the abundant and diverse set of microbial genes directed towards accessing a wide range of polysaccharides.

Category-based clustering of the functions from each microbiome was performed using principal components analysis (PCA) and hierarchical clustering¹⁹. Two distinct clusters of gut microbiomes were identified based on metabolic profile, corresponding to samples with an increased abundance of Firmicutes and Actinobacteria, and samples with a high abundance of Bacteroidetes (Fig. 2a). A linear regression of the first principal component (PC1, explaining 20% of the functional variance) and the relative abundance of the Bacteroidetes showed a highly significant correlation (R² = 0.96, P < 10⁻¹²; Fig. 2b). Functional profiles stabilized within each individual’s microbiome after 20,000 sequences had been accumulated (Supplementary Fig. 13). Family members had more similar profiles than unrelated individuals (Fig. 2c), suggesting that shared bacterial community structure (‘who’s there’ based on 16S rRNA analyses) also translates into shared community-wide relative abundance of metabolic pathways. Accordingly, a direct comparison of functional

[Figure 1 image. Panel a: UniFrac distance (y axis, 0.66–0.82, annotated from ‘More similar’ to ‘More different’) for Self, Twin–twin (monozygotic and dizygotic), Twin–mother (monozygotic and dizygotic) and Unrelated comparisons. Panel b: phylogenetic diversity (y axis, 2–122) against number of sequences (x axis, 0–10,000) for lean and obese individuals.]

Figure 1 | 16S rRNA gene surveys reveal familial similarity and reduced diversity of the gut microbiota in obese individuals. a, Average unweighted UniFrac distance (a measure of differences in bacterial community structure) between individuals over time (self), twin pairs, twins and their mother, and unrelated individuals (1,000 sequences per V2 data set; Student’s t-test with Monte Carlo; *P < 10⁻⁵; **P < 10⁻¹⁴; ***P < 10⁻⁴¹; mean ± s.e.m.). b, Phylogenetic diversity curves for the microbiota of lean and obese individuals (based on 1–10,000 sequences per V6 data set; mean ± 95% confidence intervals shown).


Page 89: UC Davis EVE161 Lecture 18 by @phylogenomics

and taxonomic similarity (see Supplementary Methods) disclosed a significant association: individuals with similar taxonomic profiles also share similar metabolic profiles (P < 0.001; Mantel test).

Functional clustering of phylum-wide sequence bins representing microbiome reads assigned to 23 human gut Firmicutes and 14 Bacteroidetes reference genomes showed discrete clustering by phylum (Supplementary Figs 14a and 15). Bootstrap analyses of the relative abundance of metabolic pathways in the microbiome-derived Firmicutes and Bacteroidetes sequence bins disclosed 26 pathways with a significantly different relative abundance (Supplementary Fig. 14a). The Bacteroidetes bins were enriched for several carbohydrate metabolism pathways, whereas the Firmicutes bins were enriched for transport systems. This finding is consistent with our CAZyme analysis, which revealed a significantly higher relative abundance of glycoside hydrolases, carbohydrate-binding modules, glycosyltransferases, polysaccharide lyases and carbohydrate esterases in the Bacteroidetes sequence bins (Supplementary Fig. 14b).

One of the major goals of the International Human Microbiome Project(s) is to determine whether there is an identifiable ‘core microbiome’ of shared organisms, genes or functional capabilities found in a given body habitat of all or the vast majority of humans¹. Although all of the 18 gut microbiomes surveyed showed a high level of β-diversity with respect to the relative abundance of bacterial phyla (Fig. 3a), analysis of the relative abundance of broad functional categories of genes and metabolic pathways (KEGG) revealed a generally consistent pattern regardless of the sample surveyed (Fig. 3b and Supplementary Table 11): the pattern is also consistent with results we obtained from a meta-analysis of previously published gut microbiome data sets from nine adults²⁰,²¹ (Supplementary Fig. 16). This consistency is not simply due to the broad level of these annotations, as a similar analysis of Bacteroidetes and Firmicutes reference genomes revealed substantial variation in the relative abundance of each category (see Supplementary Fig. 17). Furthermore, pairwise comparisons of metabolic profiles obtained from the 18 microbiomes in this study revealed an average value of R² of 0.97 ± 0.002 (Fig. 2a), indicating a high level of functional similarity.
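A pairwise comparison of metabolic profiles of this kind can be sketched as follows. This assumes that R² means the squared Pearson correlation between two vectors of pathway relative abundances, which this excerpt does not spell out. The sample names and abundance values are invented for illustration.

```python
from itertools import combinations

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length abundance profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of pathways")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical relative abundances (%) of five pathways in three samples.
profiles = {
    "sample_1": [12.0, 8.0, 5.0, 3.0, 1.0],
    "sample_2": [11.5, 8.4, 5.2, 2.8, 1.1],
    "sample_3": [12.3, 7.6, 4.9, 3.4, 0.8],
}

# R² for every unordered pair of samples, then the average across pairs.
pairwise = {
    (a, b): r_squared(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
mean_r2 = sum(pairwise.values()) / len(pairwise)
```

With 18 samples the same loop would yield 153 unordered pairs.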

Overall functional diversity was compared using the Shannon index²², a measurement that combines diversity (the number of different metabolic pathways) and evenness (the relative abundance of each pathway). The human gut microbiomes surveyed had a stable and high Shannon index value (4.63 ± 0.01), close to the maximum possible level of functional diversity (5.54; see Supplementary Methods). Despite the presence of a small number of abundant metabolic pathways (listed in Supplementary Table 11), the overall functional profile of each gut microbiome is quite even (Shannon evenness of 0.84 ± 0.001 on a scale of 0–1), demonstrating that most metabolic pathways are found at a similar level of abundance. Interestingly, the level of functional diversity in each microbiome was significantly linked to the relative abundance of the Bacteroidetes (R² = 0.81, P < 10⁻⁶); microbiomes enriched for Firmicutes/Actinobacteria had a lower level of functional diversity. This observation is consistent with an analysis of simulated metagenomic reads generated from each of 36 Bacteroidetes and Firmicutes genomes (Supplementary Fig. 18): on average, the Bacteroidetes genomes have a significantly higher level of both functional diversity and evenness (Mann–Whitney U-test, P < 0.01).
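The Shannon index is H = −Σ pᵢ ln pᵢ over the relative abundances pᵢ, and evenness is usually taken as H divided by its maximum, ln S, for S categories. The sketch below assumes natural logarithms and that definition of evenness; the exact conventions are in the Supplementary Methods, which are not part of this excerpt. The reported figures are at least consistent with it, since 4.63 / 5.54 rounds to 0.84.

```python
import math

def shannon_index(abundances):
    """Shannon index H = -sum(p * ln p) over categories with non-zero abundance."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)

def shannon_evenness(abundances):
    """Evenness H / ln(S), where S is the number of categories observed."""
    richness = sum(1 for a in abundances if a > 0)
    if richness < 2:
        return 0.0  # evenness is not meaningful for a single category
    return shannon_index(abundances) / math.log(richness)

# Toy pathway counts (hypothetical): one dominant pathway lowers evenness.
even_profile = [25, 25, 25, 25]
skewed_profile = [85, 5, 5, 5]

# Ratio of the reported index to the reported maximum.
reported_ratio = 4.63 / 5.54
```

A perfectly even profile gives an evenness of 1, and the value falls toward 0 as a few pathways come to dominate.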

At a finer level, 26–53% of ‘enzyme’-level functional groups (KEGG/CAZy/STRING) were shared across all 18 microbiomes, whereas 8–22% of the groups were unique to a single microbiome (Supplementary Fig. 19a–c). The ‘core’ functional groups present in all microbiomes were also highly abundant, representing 93–98% of the total sequences. Given the higher relative abundance of these ‘core’ groups, more than 95% were found after 26.11 ± 2.02 megabases of sequence were collected from a given microbiome, whereas the ‘variable’ groups continued to increase substantially with each additional megabase of sequence. Of course, any estimate of the total size of the core microbiome will depend on sequencing effort, especially for

[Figure 2 image. Panel a: matrix of pairwise R² values (shaded for 0.97, 0.98 and 0.99; values below 0.97 not shown) between the functional profiles of the 18 samples (F1T1Le, F1T2Le, F1MOv, F2T1Le, F2T2Le, F2MOb, F3T1Le, F3T2Le, F3MOv, F4T1Ob, F4T2Ob, F4MOb, F5T1Ob, F5T2Ob, F5MOv, F6T1Ob, F6T2Ob, F6MOb), grouped into ‘High Firmicutes/Actinobacteria’ and ‘High Bacteroidetes’ clusters. Panel b: Bacteroidetes (%) (y axis, 0–100) against PC1 (20%) (x axis, −0.6 to 0.6), R² = 0.96. Panel c: functional similarity (R²; y axis, 0.94–1) for twin versus twin, twin versus mother and unrelated pairs.]

Figure 2 | Metabolic-pathway-based clustering and analysis of the human gut microbiome of monozygotic twins. a, Clustering of functional profiles based on the relative abundance of KEGG metabolic pathways. All pairwise comparisons were made of the profiles by calculating each R² value. Sample identifier nomenclature: family number, twin number or mother, and BMI category (Le, lean; Ov, overweight; Ob, obese; for example, F1T1Le stands for family 1, twin 1, lean). b, The relative abundance of Bacteroidetes as a function of the first principal component derived from an analysis of KEGG metabolic profiles. c, Comparisons of functional similarity between twin pairs, between twins and their mother, and between unrelated individuals. Asterisk indicates significant differences (Student’s t-test with Monte Carlo; P < 0.01; mean ± s.e.m.).

[Figure 3 image. Panel a (bacterial phylum): relative abundance (%) (y axis, 0–100) of Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria and Other in each of the 18 samples (F1T1Le through F6MOb). Panel b (COG categories): relative abundance (%) of COG categories [Q], [P], [I], [H], [F], [E], [G], [C], [S], [R], [O], [U], [W], [Z], [N], [M], [T], [V], [Y], [D], [B], [L], [K], [A] and [J] in the same 18 samples.]

Figure 3 | Comparison of taxonomic and functional variations in the human gut microbiome. a, Relative abundance of major phyla across 18 faecal microbiomes from monozygotic twins and their mothers, based on BLASTX comparisons of microbiomes and the National Center for Biotechnology Information non-redundant database. b, Relative abundance of categories of genes across each sampled gut microbiome (letters correspond to categories in the COG database).



Page 91: UC Davis EVE161 Lecture 18 by @phylogenomics

functional groups found at a low abundance. On average, our survey achieved more than 450,000 sequences per faecal sample, which, assuming an even distribution, would allow us to sample groups found at a relative abundance of 10⁻⁴. To estimate the total size of the core microbiome based on the 18 individuals, we randomly sub-sampled each microbiome in 1,000 sequence intervals (Supplementary Fig. 19d). Based on this analysis, the core microbiome is approaching a total of 2,142 total orthologous groups (one site binding (hyperbola) curve fit, R² = 0.9966), indicating that we identified 93% of functional groups (defined by STRING) found within the core microbiome of the 18 individuals surveyed. Of these core groups, 71% (CAZy), 64% (KEGG) and 56% (STRING) were also found in the nine previously published, but much lower coverage, data sets generated by capillary sequencing of adult faecal DNA²⁰,²¹ (average of 78,413 ± 2,044 bidirectional reads per sample; see Supplementary Methods).
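A one-site binding hyperbola has the form y = B_max · x / (K + x), where B_max is the plateau the curve approaches. The sketch below fits that form to synthetic accumulation data. Two caveats apply: the fit uses a double-reciprocal linear regression rather than the nonlinear least squares a curve-fitting package would normally use, and the half-saturation value of 34,000 sequences is invented so that the synthetic curve sits near 93% of its plateau at 450,000 sequences. Only the plateau of 2,142 groups comes from the text.

```python
def hyperbola(x, b_max, k_half):
    """One-site binding curve: groups observed after x sequences."""
    return b_max * x / (k_half + x)

def fit_hyperbola_double_reciprocal(xs, ys):
    """Estimate (b_max, k_half) from the line 1/y = 1/b_max + (k_half/b_max) * (1/x)."""
    u = [1.0 / x for x in xs]
    v = [1.0 / y for y in ys]
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    slope = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / sum(
        (a - mean_u) ** 2 for a in u
    )
    intercept = mean_v - slope * mean_u
    b_max = 1.0 / intercept
    k_half = slope * b_max
    return b_max, k_half

# Synthetic, noise-free curve sampled every 1,000 sequences up to 450,000.
depths = list(range(1000, 451000, 1000))
observed = [hyperbola(x, 2142.0, 34000.0) for x in depths]

fitted_b_max, fitted_k_half = fit_hyperbola_double_reciprocal(depths, observed)
fraction_found = hyperbola(450000, fitted_b_max, fitted_k_half) / fitted_b_max
```

The double-reciprocal transform weights low-depth points heavily, so on real, noisy data a nonlinear fit would generally be preferable.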

Metabolic reconstructions of the ‘core’ microbiome revealed significant enrichment for several expected functional categories, including those involved in transcription and translation (Fig. 4). Metabolic profile-based clustering indicated that the representation of ‘core’ functional groups was highly consistent across samples (Supplementary Fig. 20), and included several pathways that are likely important for life in the gut, such as those for carbohydrate and amino-acid metabolism (for example, fructose/mannose metabolism, amino-sugar metabolism and N-glycan degradation). Variably represented pathways and categories include cell motility (only a subset of Firmicutes produce flagella), secretion systems and membrane transport (for example, phosphotransferase systems involved in the import of nutrients, including sugars; Fig. 4 and Supplementary Fig. 20).

The distribution of CAZy glycoside hydrolase and glycosyltransferase families was compared between each pair of microbiomes (see Supplementary Table 10 for CAZy families with a relative abundance greater than 1%). This analysis revealed that all individuals had a similar profile of glycosyltransferases (R² = 0.96 ± 0.003), whereas the profiles of glycoside hydrolases were significantly more variable, even between family members (R² = 0.80 ± 0.01; P < 10⁻³⁰, paired Student’s t-test). This suggests that the number and spectrum of glycoside hydrolases is affected by ‘external’ factors such as diet more than the glycosyltransferases.

To identify metabolic pathways associated with obesity, only non-core associated (variable) functional groups were included in a comparison of the gut microbiomes of lean versus obese twin pairs. A bootstrap analysis²³ was used to identify metabolic pathways that were enriched or depleted in the variable obese gut microbiome. For example, similar to a mouse model of diet-induced obesity⁴, the obese human gut microbiome was enriched for phosphotransferase systems involved in microbial processing of carbohydrates (Supplementary Table 12). All gut microbiome sequences were compared with the custom database of 44 human gut genomes: an odds ratio analysis revealed 383 genes that were significantly different between the obese and lean gut microbiome (q value < 0.05; 273 enriched and 110 depleted in the obese microbiome; Supplementary Tables 13 and 14). By contrast, only 49 genes were consistently enriched or depleted between all twin pairs (see Supplementary Methods).
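An odds ratio for a single gene compares the odds that a read hits that gene in one group with the odds in the other. The sketch below is a generic illustration and not the authors' procedure, which is described in their Supplementary Methods. The read counts are invented, and the 0.5 pseudocount (a common continuity correction that avoids division by zero) is an assumption added here.

```python
def odds_ratio(hits_obese, total_obese, hits_lean, total_lean, pseudocount=0.5):
    """Odds of a read matching the gene in obese samples relative to lean samples.

    Values above 1 indicate enrichment in the obese group, below 1 depletion.
    """
    a = hits_obese + pseudocount                   # obese reads hitting the gene
    b = (total_obese - hits_obese) + pseudocount   # obese reads hitting anything else
    c = hits_lean + pseudocount                    # lean reads hitting the gene
    d = (total_lean - hits_lean) + pseudocount     # lean reads hitting anything else
    return (a / b) / (c / d)

# Hypothetical counts for one gene in pooled obese and lean reads.
enriched = odds_ratio(hits_obese=120, total_obese=100_000, hits_lean=40, total_lean=100_000)
unchanged = odds_ratio(hits_obese=50, total_obese=100_000, hits_lean=50, total_lean=100_000)
```

An analysis over many genes would still need a significance test per gene and a multiple-testing correction, which is what the q value threshold in the text refers to.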

These obesity-associated genes were representative of the taxonomic differences described above: 75% of the obesity-enriched genes were from Actinobacteria (compared with 0% of lean-enriched genes; the other 25% are from Firmicutes) whereas 42% of the lean-enriched genes were from Bacteroidetes (compared with 0% of the obesity-enriched genes). Their functional annotation indicated that many are involved in carbohydrate, lipid and amino-acid metabolism (Supplementary Tables 13 and 14). Together, they comprise an initial set of microbial biomarkers of the obese gut microbiome.

Our finding that the gut microbial community structures of adult monozygotic twin pairs had a degree of similarity that was comparable to that of dizygotic twin pairs, and only slightly more similar than that of their mothers, is consistent with an earlier fingerprinting study of adult twins²⁴, and with a recent microarray-based analysis, which revealed that gut community assembly during the first year of life followed a more similar pattern in a pair of dizygotic twins than 12 unrelated infants²⁵. Intriguingly, another fingerprinting study of monozygotic and dizygotic twins in childhood showed a slightly reduced similarity profile in dizygotic twins²⁶. Thus, comprehensive time-course studies, comparing monozygotic and dizygotic twin pairs from birth through adulthood, as well as intergenerational analyses of their families’ microbiotas, will be key to determining the relative contributions of host genotype and environmental exposures to (gut) microbial ecology.

The hypothesis that there is a core human gut microbiome, definable by a set of abundant microbial organismal lineages that we all share, may be incorrect: by adulthood, no single bacterial phylotype was detectable at an abundant frequency in the guts of all 154 sampled humans. Instead, it appears that a core gut microbiome exists at the level of shared genes, including an important component involved in various metabolic functions. This conservation suggests a high degree of redundancy in the gut microbiome and supports an ecological view of each individual as an ‘island’ inhabited by unique

[Figure 4 image. Horizontal bar chart of relative abundance (percentage of KEGG assignments; x axis, 0–14) for the core and variable components in each KEGG category: Transcription; Translation; Nucleotide metabolism; Amino-acid metabolism; Biosynthesis of secondary metabolites; Replication and repair; Metabolism of other amino acids; Glycan biosynthesis and metabolism; Carbohydrate metabolism; Lipid metabolism; Biosynthesis of polyketides; Cell growth and death; Metabolism of cofactors and vitamins; Energy metabolism; Xenobiotics biodegradation and metabolism; Genetic information processing protein families; Metabolism protein families; Metabolism unclassified; Membrane transport; Folding, sorting, and degradation; Cellular processes and signalling protein families; Cellular processes and signalling unclassified; Signal transduction; Poorly characterized unclassified; Genetic information processing unclassified; Cell motility; Signalling molecules and interaction. Asterisks mark categories with significant differences.]

Figure 4 | KEGG categories enriched or depleted in the core versus variable components of the gut microbiome. Sequences from each of the 18 faecal microbiomes were binned into the ‘core’ or ‘variable’ microbiome based on the co-occurrence of KEGG orthologous groups (core groups were found in all 18 microbiomes whereas variable groups were present in fewer (<18) microbiomes; see Supplementary Fig. 19a). Asterisks indicate significant differences (Student’s t-test, *P < 0.05, **P < 0.001, ***P < 10⁻⁵; mean ± s.e.m.).
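The core/variable binning described in the caption reduces to set operations: a group is ‘core’ if it occurs in every microbiome and ‘variable’ otherwise. The sketch below uses three hypothetical samples and placeholder group labels (`G1`, `G2`, ...) in place of real KEGG orthologous group identifiers.

```python
def split_core_variable(groups_by_sample):
    """Partition functional groups into core, variable and single-sample sets.

    groups_by_sample maps a sample name to the collection of groups detected in it.
    """
    samples = [set(groups) for groups in groups_by_sample.values()]
    all_groups = set().union(*samples)
    core = set.intersection(*samples)   # present in every sample
    variable = all_groups - core        # missing from at least one sample
    unique = {g for g in variable if sum(g in s for s in samples) == 1}
    return core, variable, unique

# Hypothetical detections in one family (placeholder group labels).
toy_detections = {
    "F1T1Le": {"G1", "G2", "G3"},
    "F1T2Le": {"G1", "G2", "G4"},
    "F1MOv": {"G1", "G2", "G3", "G5"},
}
core, variable, unique = split_core_variable(toy_detections)
```

Because a group leaves the core as soon as one sample fails to detect it, the size of the core depends on sequencing depth as well as on biology.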


Page 92: UC Davis EVE161 Lecture 18 by @phylogenomics

ARTICLEdoi:10.1038/nature11234

Structure, function and diversity of thehealthy human microbiomeThe Human Microbiome Project Consortium*

Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes thatoccupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet,environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize theecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohortand set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’ssignature microbes to vary widely even among healthy subjects, with strong niche specialization both within and amongindividuals. The project encountered an estimated 81–99% of the genera, enzyme families and communityconfigurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways wasstable among individuals despite variation in community structure, and ethnic/racial background proved to be one ofthe strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the rangeof structural and functional configurations normal in the microbial communities of a healthy population, enabling futurecharacterization of the epidemiology, ecology and translational applications of the human microbiome.

A total of 4,788 specimens from 242 screened and phenotyped adults¹ (129 males, 113 females) were available for this study, representing the majority of the target Human Microbiome Project (HMP) cohort of 300 individuals. Adult subjects lacking evidence of disease were recruited based on a lengthy list of exclusion criteria; we will refer to them here as ‘healthy’, as defined by the consortium clinical sampling criteria (K. Aagaard et al., manuscript submitted). Women were sampled at 18 body habitats, men at 15 (excluding three vaginal sites), distributed among five major body areas. Nine specimens were collected from the oral cavity and oropharynx: saliva; buccal mucosa (cheek), keratinized gingiva (gums), palate, tonsils, throat and tongue soft tissues, and supra- and subgingival dental plaque (tooth biofilm above and below the gum). Four skin specimens were collected from the two retroauricular creases (behind each ear) and the two antecubital fossae (inner elbows), and one specimen for the anterior nares (nostrils). A self-collected stool specimen represented the microbiota of the lower gastrointestinal tract, and three vaginal specimens were collected from the vaginal introitus, midpoint and posterior fornix. To evaluate within-subject stability of the microbiome, 131 individuals in these data were sampled at an additional time point (mean 219 days and s.d. 69 days after first sampling, range 35–404 days). After quality control, these specimens were used for 16S rRNA gene analysis via 454 pyrosequencing (abbreviated henceforth as 16S profiling, mean 5,408 and s.d. 4,605 filtered sequences per sample); to assess function, 681 samples were sequenced using paired-end Illumina shotgun metagenomic reads (mean 2.9 gigabases (Gb) and s.d. 2.1 Gb per sample)¹. More details on data generation are provided in related HMP publications¹ and in Supplementary Methods.

Microbial diversity of healthy humans

The diversity of microbes within a given body habitat can be defined as the number and abundance distribution of distinct types of organisms, which has been linked to several human diseases: low diversity in the gut to obesity and inflammatory bowel disease²,³, for example, and high diversity in the vagina to bacterial vaginosis⁴. For this large study involving microbiome samples collected from healthy volunteers at two distinct geographic locations in the United States, we have defined the microbial communities at each body habitat, encountering 81–99% of predicted genera and saturating the range of overall community configurations (Fig. 1, Supplementary Fig. 1 and Supplementary Table 1; see also Fig. 4). Oral and stool communities were especially diverse in terms of community membership, expanding prior observations⁵, and vaginal sites harboured particularly simple communities (Fig. 1a). This study established that these patterns of alpha diversity (within samples) differed markedly from comparisons between samples from the same habitat among subjects (beta diversity, Fig. 1b). For example, the saliva had among the highest median alpha diversities of operational taxonomic units (OTUs, roughly species level classification, see http://hmpdacc.org/HMQCP), but one of the lowest beta diversities, so although each individual’s saliva was ecologically rich, members of the population shared similar organisms. Conversely, the antecubital fossae (skin) had the highest beta diversity but were intermediate in alpha diversity. The vagina had the lowest alpha diversity, with quite low beta diversity at the genus level but very high among OTUs due to the presence of distinct Lactobacillus spp. (Fig. 1b). The primary patterns of variation in community structure followed the major body habitat groups (oral, skin, gut and vaginal), defining as a result the complete range of population-wide between-subject variation in human microbiome habitats (Fig. 1c). Within-subject variation over time was consistently lower than between-subject variation, both in organismal composition and in metabolic function (Fig. 1d). The uniqueness of each individual’s microbial community thus seems to be stable over time (relative to the population as a whole), which may be another feature of the human microbiome specifically associated with health.
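To make "alpha diversity" concrete, here is a minimal Python sketch of the inverse Simpson index, the within-sample measure named in the paper's Figure 1 caption. The paper reports a relative (normalized) version; that normalization is not reproduced here, and the function name and the read counts are made up for illustration.

```python
def inverse_simpson(counts):
    """Inverse Simpson index, 1 / sum(p_i^2), for one sample's taxon counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return 1.0 / sum((c / total) ** 2 for c in counts)


# Hypothetical read counts per taxon (not HMP data).
even_community = [25, 25, 25, 25]     # four equally abundant taxa
dominated_community = [97, 1, 1, 1]   # one taxon dominates

print(inverse_simpson(even_community))                 # 4.0
print(round(inverse_simpson(dominated_community), 2))  # 1.06
```

An even community scores close to its number of taxa, while a community dominated by one taxon scores close to 1. That is the sense in which saliva is described as rich and the vagina as simple.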

No taxa were observed to be universally present among all body habitats and individuals at the sequencing depth employed here, unlike several pathways (Fig. 2 and Supplementary Fig. 2, see below), although several clades demonstrated broad prevalence and relatively abundant carriage patterns⁶,⁷. Instead, as suggested by individually

*Lists of participants and their affiliations appear at the end of the paper.

14 JUNE 2012 | VOL 486 | NATURE | 207

©2012 Macmillan Publishers Limited. All rights reserved

The Human Microbiome Project Consortium

Curtis Huttenhower1,2*, Dirk Gevers2*, Rob Knight3,4, Sahar Abubucker5, Jonathan H. Badger6, Asif T. Chinwalla5, Heather H. Creasy7, Ashlee M. Earl2, Michael G. FitzGerald2, Robert S. Fulton5, Michelle G. Giglio7, Kymberlie Hallsworth-Pepin5, Elizabeth A. Lobos5, Ramana Madupu6, Vincent Magrini5, John C. Martin5, Makedonka Mitreva5, Donna M. Muzny8, Erica J. Sodergren5, James Versalovic9,10, Aye M. Wollam5, Kim C. Worley8, Jennifer R. Wortman7, Sarah K. Young2, Qiandong Zeng2, Kjersti M. Aagaard11, Olukemi O. Abolude7, Emma Allen-Vercoe12, Eric J. Alm13,2, Lucia Alvarado2, Gary L. Andersen14, Scott Anderson2, Elizabeth Appelbaum5, Harindra M. Arachchi2, Gary Armitage15, Cesar A. Arze7, Tulin Ayvaz16, Carl C. Baker17, Lisa Begg18, Tsegahiwot Belachew19, Veena Bhonagiri5, Monika Bihan6, Martin J. Blaser20, Toby Bloom2, Vivien Bonazzi21, J. Paul Brooks22,23, Gregory A. Buck23,24, Christian J. Buhay8, Dana A. Busam6, Joseph L. Campbell21,19, Shane R. Canon25, Brandi L. Cantarel7, Patrick S. G. Chain26,27, I-Min A. Chen28, Lei Chen5, Shaila Chhibba21, Ken Chu28, Dawn M. Ciulla2, Jose C. Clemente3, Sandra W. Clifton5, Sean Conlan79, Jonathan Crabtree7, Mary A. Cutting29, Noam J. Davidovics7, Catherine C. Davis30, Todd Z. DeSantis31, Carolyn Deal19, Kimberley D. Delehaunty5, Floyd E. Dewhirst32,33, Elena Deych34, Yan Ding8, David J. Dooling5, Shannon P. Dugan8, Wm Michael Dunne35,36, A. Scott Durkin6, Robert C. Edgar37, Rachel L. Erlich2, Candace N. Farmer5, Ruth M. Farrell38, Karoline Faust39,40, Michael Feldgarden2, Victor M. Felix7, Sheila Fisher2, Anthony A. Fodor41, Larry J. Forney42, Leslie Foster6, Valentina Di Francesco19, Jonathan Friedman43, Dennis C. Friedrich2, Catrina C. Fronick5, Lucinda L. Fulton5, Hongyu Gao5, Nathalia Garcia44, Georgia Giannoukos2, Christina Giblin19, Maria Y. Giovanni19, Jonathan M. Goldberg2, Johannes Goll6, Antonio Gonzalez45, Allison Griggs2, Sharvari Gujja2, Susan Kinder Haake46, Brian J. Haas2, Holli A. Hamilton29, Emily L. Harris29, Theresa A. Hepburn2, Brandi Herter5, Diane E. Hoffmann47, Michael E. Holder8, Clinton Howarth2, Katherine H. Huang2, Susan M. Huse48, Jacques Izard32,33, Janet K. Jansson49, Huaiyang Jiang8, Catherine Jordan7, Vandita Joshi8, James A. Katancik50, Wendy A. Keitel16, Scott T. Kelley51, Cristyn Kells2, Nicholas B. King52, Dan Knights45, Heidi H. Kong53, Omry Koren54, Sergey Koren55, Karthik C. Kota5, Christie L. Kovar8, Nikos C. Kyrpides27, Patricio S. La Rosa34, Sandra L. Lee8, Katherine P. Lemon32,56, Niall Lennon2, Cecil M. Lewis57, Lora Lewis8, Ruth E. Ley54, Kelvin Li6, Konstantinos Liolios27, Bo Liu55, Yue Liu8, Chien-Chi Lo26, Catherine A. Lozupone3, R. Dwayne Lunsford29, Tessa Madden58, Anup A. Mahurkar7, Peter J. Mannon59, Elaine R. Mardis5, Victor M. Markowitz27,28, Konstantinos Mavromatis27, Jamison M. McCorrison6, Daniel McDonald3, Jean McEwen21, Amy L. McGuire60, Pamela McInnes29, Teena Mehta2, Kathie A. Mihindukulasuriya5, Jason R. Miller6, Patrick J. Minx5, Irene Newsham8, Chad Nusbaum2, Michelle O’Laughlin5, Joshua Orvis7, Ioanna Pagani27, Krishna Palaniappan28, Shital M. Patel61, Matthew Pearson2, Jane Peterson21, Mircea Podar62, Craig Pohl5, Katherine S. Pollard63,64,65, Mihai Pop55,66, Margaret E. Priest2, Lita M. Proctor21, Xiang Qin8, Jeroen Raes39,40, Jacques Ravel7, Jeffrey G. Reid8, Mina Rho67, Rosamond Rhodes68, Kevin P. Riehle69, Maria C. Rivera23,24, Beltran Rodriguez-Mueller51, Yu-Hui Rogers6, Matthew C. Ross16, Carsten Russ2, Ravi K. Sanka6, Pamela Sankar70, J. Fah Sathirapongsasuti1, Jeffery A. Schloss21, Patrick D. Schloss71, Thomas M. Schmidt72, Matthew Scholz26, Lynn Schriml7, Alyxandria M. Schubert71, Nicola Segata1, Julia A. Segre79, William D. Shannon34, Richard R. Sharp38, Thomas J. Sharpton63, Narmada Shenoy2, Nihar U. Sheth23, Gina A. Simone73, Indresh Singh6, Christopher S. Smillie43, Jack D. Sobel74, Daniel D. Sommer55, Paul Spicer57, Granger G. Sutton6, Sean M. Sykes2, Diana G. Tabbaa2, Mathangi Thiagarajan6, Chad M. Tomlinson5, Manolito Torralba6, Todd J. Treangen75, Rebecca M. Truty63, Tatiana A. Vishnivetskaya62, Jason Walker5, Lu Wang21, Zhengyuan Wang5, Doyle V. Ward2, Wesley Warren5, Mark A. Watson35, Christopher Wellington21, Kris A. Wetterstrand21, James R. White7, Katarzyna Wilczek-Boney8, YuanQing Wu8, Kristine M. Wylie5, Todd Wylie5, Chandri Yandava2, Liang Ye5, Yuzhen Ye67, Shibu Yooseph76, Bonnie P. Youmans16, Lan Zhang8, Yanjiao Zhou5, Yiming Zhu8, Laurie Zoloth77, Jeremy D. Zucker2, Bruce W. Birren2, Richard A. Gibbs8, Sarah K. Highlander8,16, Barbara A. Methe6, Karen E. Nelson6, Joseph F. Petrosino8,78,16, George M. Weinstock5, Richard K. Wilson5 & Owen White7

1Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA. 2The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA. 3Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado 80309, USA. 4Howard Hughes Medical Institute, Boulder, Colorado 80309, USA. 5The Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63108, USA. 6J. Craig Venter Institute, Rockville, Maryland 20850, USA. 7Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA. 8Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA. 9Department of Pathology & Immunology, Baylor College of Medicine, Houston, Texas 77030, USA. 10Department of Pathology, Texas Children’s Hospital, Houston, Texas 77030, USA. 11Department of Obstetrics & Gynecology, Division of Maternal-Fetal Medicine, Baylor College of Medicine, Houston, Texas 77030, USA. 12Molecular and Cellular Biology, University of Guelph, Guelph, Ontario N1G 2W1, Canada. 13Department of Civil & Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. 14Center for Environmental Biotechnology, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA. 15School of Dentistry, University of California, San Francisco, San Francisco, California 94143, USA. 16Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas 77030, USA. 17National Institute of Arthritis and Musculoskeletal and Skin, National Institutes of Health, Bethesda, Maryland 20892, USA. 18Office of Research on Women’s Health, National Institutes of Health, Bethesda, Maryland 20892, USA. 19National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA. 20Department of Medicine, New York University Langone Medical Center, New York, New York 10016, USA. 21National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA. 22Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia 23284, USA. 23Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia 23284, USA. 24Department of Biology, Virginia Commonwealth University, Richmond, Virginia 23284, USA. 25Technology Integration Group, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA. 26Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA. 27Joint Genome Institute, Walnut Creek, California 94598, USA. 28Biological Data Management and Technology Center, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA. 29National Institute of Dental and Craniofacial Research (NIDCR), National Institutes of Health, Bethesda, Maryland 20892, USA. 30FemCare Product Safety and Regulatory Affairs, The Procter & Gamble Company, Cincinnati, Ohio 45224, USA. 31Bioinformatics Department, Second Genome, Inc., San Bruno, California 94066, USA. 32Department of Molecular Genetics, Forsyth Institute, Cambridge, Massachusetts 02142, USA. 33Department of Oral Medicine, Infection and Immunity, Harvard School of Dental Medicine, Boston, Massachusetts 02115, USA. 34Department of Medicine, Division of General Medical Science, Washington University School of Medicine, St. Louis, Missouri 63110, USA. 35Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, Missouri 63110, USA. 36bioMerieux, Inc., Durham, South Carolina 27712, USA. 37drive5.com, Tiburon, California 94920, USA. 38Center for Ethics, Humanities and Spiritual Care, Cleveland Clinic, Cleveland, Ohio 44195, USA. 39Department of Structural Biology, VIB, Belgium, 1050 Ixelles, Belgium. 40Department of Applied Biological Sciences (DBIT), Vrije Universiteit Brussel, 1050 Ixelles, Belgium. 41Department of Bioinformatics and Genomics, University of North Carolina - Charlotte, Charlotte, North Carolina 28223, USA. 42Department of Biological Sciences, University of Idaho, Moscow, Idaho 83844, USA. 43Computational and Systems Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. 44Center for Advanced Dental Education, Saint Louis University, St. Louis, Missouri 63104, USA. 45Department of Computer Science, University of Colorado, Boulder, Colorado 80309, USA. 46Division of Associated Clinical Specialties and Dental Research Institute, UCLA School of Dentistry, Los Angeles, California 90095, USA. 47University of Maryland Francis King Carey School of Law, Baltimore, Maryland 21201, USA. 48Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, Massachusetts 02543, USA. 49Ecology Department, Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA. 50Department of Periodontics, University of Texas Health Science Center School of Dentistry, Houston, Texas 77030, USA. 51Department of Biology, San Diego State University, San Diego, California 92182, USA. 52Faculty of Medicine, McGill University, 3647 Peel St, Montreal, Quebec H3A 1X1, Canada. 53Dermatology Branch, CCR, National Cancer Institute, Bethesda, Maryland 20892, USA. 54Department of Microbiology, Cornell University, Ithaca, New York 14853, USA. 55Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA. 56Division of Infectious Diseases, Children’s Hospital Boston, Harvard Medical School, Boston, Massachusetts 02115, USA. 57Department of Anthropology, University of Oklahoma, Norman, Oklahoma 73019, USA. 58Department of Obstetrics and Gynecology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA. 59Division of Gastroenterology and Hepatology, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA. 60Center for Medical Ethics and Health Policy, Baylor College of Medicine, Houston, Texas 77030, USA. 61Medicine-Infectious Disease, Baylor College of Medicine, Houston, Texas 77030, USA. 62Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA. 63Gladstone Institutes, University of California, San Francisco, San Francisco, California 94158, USA. 64Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94158, USA. 65Division of Biostatistics, University of California, San Francisco, San Francisco, California 94158, USA. 66Department of Computer Science, University of Maryland, College Park, Maryland 20742, USA. 67School of Informatics and Computing, Indiana University, Bloomington, Indiana 47405, USA. 68Mount Sinai School of Medicine, New York, New York 10029, USA. 69Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. 70Center for Bioethics and Department of Medical Ethics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA. 71Department of Microbiology & Immunology, University of Michigan, Ann Arbor, Michigan 48109, USA. 72Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan 48824, USA. 73The EMMES Corporation, Rockville, Maryland 20850, USA. 74Harper University Hospital, Wayne State University School of Medicine, Detroit, Michigan 48201, USA. 75McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA. 76J. Craig Venter Institute, San Diego, California 92121, USA. 77Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611, USA. 78Alkek Center for Metagenomics and Microbiome Research, Baylor College of Medicine, Houston, Texas 77030, USA. 79Genetics and Molecular Biology Branch, National Human Genome Research Institute, Bethesda, Maryland 20892, USA.
*These authors contributed equally to this work.

Page 93: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes that occupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize the ecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals. The project encountered an estimated 81–99% of the genera, enzyme families and community configurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways was stable among individuals despite variation in community structure, and ethnic/racial background proved to be one of the strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the range of structural and functional configurations normal in the microbial communities of a healthy population, enabling future characterization of the epidemiology, ecology and translational applications of the human microbiome.

Page 94: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

We did a big big study - bigger than anyone!

Page 95: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

We did a big big study - bigger than anyone!

Lots of variation w/in and between people

Page 96: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

We did a big big study - bigger than anyone!

Lots of variation w/in and between people

We covered a lot of diversity

Page 97: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

We did a big big study - bigger than anyone!

Lots of variation w/in and between people

We covered a lot of diversity

Functions varied less than taxa

Page 98: UC Davis EVE161 Lecture 18 by @phylogenomics

Abstract

We did a big big study - bigger than anyone!

Lots of variation w/in and between people

We covered a lot of diversity

Functions varied less than taxa

Good reference data set

Page 99: UC Davis EVE161 Lecture 18 by @phylogenomics

focused studies²,³,⁵,⁸,⁹, each body habitat in almost every subject was characterized by one or a few signature taxa making up the plurality of the community (Fig. 3). Signature clades at the genus level formed on average anywhere from 17% to 84% of their respective body habitats, completely absent in some communities (0% at this level of detection) and representing the entire population (100%) in others. Notably, less dominant taxa were also highly personalized, both among individuals and body habitats; in the oral cavity, for example, most habitats are dominated by Streptococcus, but these are followed in abundance by Haemophilus in the buccal mucosa, Actinomyces in the supragingival plaque, and Prevotella in the immediately adjacent (but low oxygen) subgingival plaque¹⁰.

Additional taxonomic detail of the human microbiome was provided by identifying unique marker sequences in metagenomic data¹¹ (Fig. 3a) to complement 16S profiling (Fig. 3b). These two profiles were typically in close agreement (Supplementary Fig. 3), with the former in some cases offering more specific information on members of signature genera differentially present within habitats (for example, vaginal Prevotella amnii and gut Prevotella copri) or among individuals (for example, vaginal Lactobacillus spp.). One application of this specificity was to confirm the absence of NIAID (National Institute of Allergy and Infectious Diseases) class A–C pathogens above 0.1% abundance (aside from Staphylococcus aureus and Escherichia coli) from the healthy microbiome, but the near-ubiquity and broad distribution of opportunistic ‘pathogens’ as defined by PATRIC¹². Canonical pathogens including Vibrio cholerae, Mycobacterium avium, Campylobacter jejuni and Salmonella enterica were not detected at this level of sensitivity. Helicobacter pylori was found in only two stool samples, both at <0.01%, and E. coli was present at >0.1% abundance in 15% of stool microbiomes (>0% abundance in 61%). Similar species-level observations were obtained for a small subset of stool samples with 454 pyrosequencing metagenomics data using PhylOTU¹³,¹⁴. In total 56 of 327 PATRIC pathogens were detected in the healthy microbiome (at >1% prevalence of >0.1% abundance, Supplementary Table 2), all opportunistic and, strikingly, typically prevalent both among hosts and habitats. The latter is in contrast to many of the most abundant signature taxa, which were usually more habitat-specific and variable among hosts (Fig. 3a, b). This overall absence of particularly detrimental microbes supports the hypothesis that even given this cohort’s high diversity, the microbiota tend to occupy a range of configurations in health distinct from many of the disease perturbations studied to date³,¹⁵.
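The detection rule quoted above (">1% prevalence of >0.1% abundance") combines two thresholds: a taxon counts as present in a sample when its relative abundance exceeds 0.1%, and as detected in the cohort when that happens in more than 1% of samples. A minimal Python sketch follows. The function names and abundance values are invented for illustration and do not come from the HMP data or from Supplementary Table 2.

```python
def prevalence(abundances, min_abundance=0.001):
    """Fraction of samples whose relative abundance exceeds min_abundance."""
    if not abundances:
        raise ValueError("no samples given")
    return sum(a > min_abundance for a in abundances) / len(abundances)


def detected_taxa(profiles, min_abundance=0.001, min_prevalence=0.01):
    """Taxa exceeding min_abundance in more than min_prevalence of samples."""
    return {
        taxon
        for taxon, abundances in profiles.items()
        if prevalence(abundances, min_abundance) > min_prevalence
    }


# Hypothetical relative abundances (fractions of the community) in four stool samples.
profiles = {
    "Escherichia coli": [0.002, 0.0, 0.0005, 0.0],
    "Helicobacter pylori": [0.00005, 0.0, 0.0, 0.0],
}

print(prevalence(profiles["Escherichia coli"]))       # 0.25 (above 0.1%)
print(prevalence(profiles["Escherichia coli"], 0.0))  # 0.5  (above 0%)
print(detected_taxa(profiles))                        # {'Escherichia coli'}
```

The two calls on the E. coli row mirror the paper's distinction between prevalence at >0.1% abundance and prevalence at >0% abundance. The same taxon can look rare or common depending on the abundance cutoff.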

[Figure 1 image not reproduced in this transcript. Panel titles: a, Within-sample alpha diversity; b, Between-sample beta diversity; panels c and d are described in the caption. Axis labels: log2(relative alpha diversity), log2(relative beta diversity), log2(relative diversity), PC1 (13%), PC2 (4.4%). Measures shown: Phylotypes (16S), OTUs (16S), Reference genomes (WGS), Metabolic modules (WGS), Gene index (WGS). Comparison groups: Technical replicates (16S), Between visits (16S), Between subjects (16S), Between visits (WGS), Between subjects (WGS). Body areas: Oral, Gastrointestinal, Urogenital, Skin, Nasal. Habitats: Vaginal introitus, Posterior fornix, Mid-vagina, Stool, Supragingival plaque, Subgingival plaque, Tongue dorsum, Throat, Saliva, Palatine tonsils, Hard palate, Keratinized gingiva, Buccal mucosa, R and L retroauricular crease, R and L antecubital fossa, Anterior nares.]

Figure 1 | Diversity of the human microbiome is concordant among measures, unique to each individual, and strongly determined by microbial habitat. a, Alpha diversity within subjects by body habitat, grouped by area, as measured using the relative inverse Simpson index of genus-level phylotypes (cyan), 16S rRNA gene OTUs (blue), shotgun metagenomic reads matched to reference genomes (orange), functional modules (dark orange), and enzyme families (yellow). The mouth generally shows high within-subject diversity and the vagina low diversity, with other habitats intermediate; variation among individuals often exceeds variation among body habitats. b, Bray–Curtis beta diversity among subjects by body habitat, colours as for a. Skin differs most between subjects, with oral habitats and vaginal genera more stable. Although alpha- and beta-diversity are not directly comparable, changes in structure among communities (a) occupy a wider dynamic range than do changes within communities among individuals (b). c, Principal coordinates plot showing variation among samples demonstrates that primary clustering is by body area, with the oral, gastrointestinal, skin and urogenital habitats separate; the nares habitat bridges oral and skin habitats. d, Repeated samples from the same subject (blue) are more similar than microbiomes from different subjects (red). Technical replicates (grey) are in turn more similar; these patterns are consistent for all body habitats and for both phylogenetic and metabolic community composition. See previously described sample counts¹ for all comparisons.
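For readers who have not met Bray–Curtis before, the beta diversity in panel b can be sketched as below. This is the standard dissimilarity formula applied to made-up taxon counts. The helper names are hypothetical, and the relative (normalized) scaling used in the figure is not reproduced.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples' taxon counts (0 = identical)."""
    if len(a) != len(b):
        raise ValueError("samples must list the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def mean_pairwise(samples):
    """Average Bray-Curtis dissimilarity over all pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(bray_curtis(a, b) for a, b in pairs) / len(pairs)


# Hypothetical counts for the same three taxa in two subjects (not HMP data).
subject_a = [6, 4, 0]
subject_b = [3, 4, 3]
print(bray_curtis(subject_a, subject_b))  # 0.3
print(bray_curtis(subject_a, subject_a))  # 0.0
```

Averaging `bray_curtis` over all pairs of subjects for one body habitat, as `mean_pairwise` does, gives a single between-subject number per habitat. That is roughly the kind of quantity compared across habitats in panel b and between visits versus between subjects in panel d.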

RESEARCH ARTICLE

208 | NATURE | VOL 486 | 14 JUNE 2012

©2012 Macmillan Publishers Limited. All rights reserved

Page 100: UC Davis EVE161 Lecture 18 by @phylogenomics


focused studies²,³,⁵,⁸,⁹, each body habitat in almost every subject was characterized by one or a few signature taxa making up the plurality of the community (Fig. 3). Signature clades at the genus level formed on average anywhere from 17% to 84% of their respective body habitats, completely absent in some communities (0% at this level of detection) and representing the entire population (100%) in others. Notably, less dominant taxa were also highly personalized, both among individuals and body habitats; in the oral cavity, for example, most habitats are dominated by Streptococcus, but these are followed in abundance by Haemophilus in the buccal mucosa, Actinomyces in the supragingival plaque, and Prevotella in the immediately adjacent (but low oxygen) subgingival plaque¹⁰.

Additional taxonomic detail of the human microbiome was provided by identifying unique marker sequences in metagenomic data¹¹ (Fig. 3a) to complement 16S profiling (Fig. 3b). These two profiles were typically in close agreement (Supplementary Fig. 3), with the former in some cases offering more specific information on members of signature genera differentially present within habitats (for example, vaginal Prevotella amnii and gut Prevotella copri) or among individuals (for example, vaginal Lactobacillus spp.). One application of this specificity was to confirm the absence of NIAID (National Institute of Allergy and Infectious Diseases) class A–C pathogens above 0.1% abundance (aside from Staphylococcus aureus and Escherichia coli) from the healthy microbiome, but the near-ubiquity and broad distribution of opportunistic ‘pathogens’ as defined by PATRIC¹². Canonical pathogens including Vibrio cholerae, Mycobacterium avium, Campylobacter jejuni and Salmonella enterica were not detected at this level of sensitivity. Helicobacter pylori was found in only two stool samples, both at <0.01%, and E. coli was present at >0.1% abundance in 15% of stool microbiomes (>0% abundance in 61%). Similar species-level observations were obtained for a small subset of stool samples with 454 pyrosequencing metagenomics data using PhylOTU¹³,¹⁴. In total 56 of 327 PATRIC pathogens were detected in the healthy microbiome (at >1% prevalence of >0.1% abundance, Supplementary Table 2), all opportunistic and, strikingly, typically prevalent both among hosts and habitats. The latter is in contrast to many of the most abundant signature taxa, which were usually more habitat-specific and variable among hosts (Fig. 3a, b). This overall absence of particularly detrimental microbes supports the hypothesis that even given this cohort’s high diversity, the microbiota tend to occupy a range of configurations in health distinct from many of the disease perturbations studied to date³,¹⁵.
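Statements such as "present at more than 0.1% abundance in 15% of stool microbiomes" combine an abundance threshold with a prevalence across samples. The short Python sketch below shows that calculation under simple assumptions: relative abundances are given as fractions between 0 and 1, one value per sample, and the helper name `prevalence` and the example values are hypothetical rather than taken from the study.

```python
from typing import Sequence


def prevalence(abundances: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose relative abundance is strictly above threshold."""
    if not abundances:
        raise ValueError("no samples given")
    hits = sum(1 for a in abundances if a > threshold)
    return hits / len(abundances)


# Hypothetical relative abundances of one species in five samples.
species_abundance = [0.0, 0.0005, 0.002, 0.0, 0.015]

# Prevalence at any detectable level, and at more than 0.1% abundance.
print(prevalence(species_abundance, 0.0))
print(prevalence(species_abundance, 0.001))
```

Reporting both numbers, as the text does for E. coli, separates how often a taxon is detected at all from how often it reaches an appreciable share of the community.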

[Figure 1, panels a–d. a, Within-sample alpha diversity, y-axis log2(relative alpha diversity). b, Between-sample beta diversity, y-axis log2(relative beta diversity). c, Principal coordinates plot, PC1 (13%) against PC2 (4.4%). d, y-axis log2(relative diversity). Body habitats on the x-axis: vaginal introitus, posterior fornix, mid-vagina, stool, supragingival plaque, subgingival plaque, tongue dorsum, throat, saliva, palatine tonsils, hard palate, keratinized gingiva, buccal mucosa, right and left retroauricular crease, right and left antecubital fossa, anterior nares; grouped by body area as Oral, Gastrointestinal, Urogenital, Skin and Nasal. Series legend: Phylotypes (16S), OTUs (16S), Reference genomes (WGS), Metabolic modules (WGS), Gene index (WGS). Comparison legend: Technical replicates (16S), Between visits (16S), Between subjects (16S), Between visits (WGS), Between subjects (WGS).]



Page 102: UC Davis EVE161 Lecture 18 by @phylogenomics


[Figure 2, panels a and b. Legend, phyla: Firmicutes, Actinobacteria, Bacteroidetes, Proteobacteria, Fusobacteria, Tenericutes, Spirochaetes, Cyanobacteria, Verrucomicrobia, TM7. Legend, metabolic pathways: Central carbohydrate metabolism, Cofactor and vitamin biosynthesis, Oligosaccharide and polyol transport system, Purine metabolism, ATP synthesis, Phosphate and amino acid transport system, Aminoacyl tRNA, Pyrimidine metabolism, Ribosome, Aromatic amino acid metabolism. Body habitats: Anterior nares, RC, Buccal mucosa, Supragingival plaque, Tongue dorsum, Stool, Posterior fornix.]

Figure 2 | Carriage of microbial taxa varies while metabolic pathways remain stable within a healthy population. a, b, Vertical bars represent microbiome samples by body habitat in the seven locations with both shotgun and 16S data; bars indicate relative abundances colored by microbial phyla from binned OTUs (a) and metabolic modules (b). Legend indicates most abundant phyla/pathways by average within one or more body habitats; RC, retroauricular crease. A plurality of most communities’ memberships consists of a single dominant phylum (and often genus; see Supplementary Fig. 2), but this is universal neither to all body habitats nor to all individuals. Conversely, most metabolic pathways are evenly distributed and prevalent across both individuals and body habitats.

[Figure 3, panels a–e. Title for a–c: Mean non-zero abundance (size) and population prevalence (intensity) of microbial clades; scales: Prevalence (%) from 0 to 100, Abundance from 0% to 100%. a, Abundant species (metagenomic data). b, Abundant genera (16S data). c, Abundant PATRIC ‘pathogens’ (metagenomic data). Colour key (phylum|class): Actinobacteria|Actinobacteria, Bacteroidetes|Bacteroidia, Firmicutes|Bacilli, Firmicutes|Negativicutes, Proteobacteria|Gammaproteobacteria. Genera in b: Veillonella, Prevotella, Haemophilus, Moraxella, Staphylococcus, Corynebacterium, Bacteroides, Streptococcus, Propionibacterium, Lactobacillus. Species named in a and c: Rothia mucilaginosa, Gardnerella vaginalis, Bacteroides vulgatus, Alistipes putredinis, Bifidobacterium dentium, Staphylococcus epidermidis, Staphylococcus aureus, Corynebacterium matruchotii, Streptococcus mitis, Propionibacterium acnes, Corynebacterium accolens, Corynebacterium kroppenstedtii, Prevotella copri, Prevotella amnii, Lactobacillus jensenii, Lactobacillus gasseri, Lactobacillus iners, Lactobacillus crispatus. Body habitats with 16S data: anterior nares, antecubital fossa, retroauricular crease, buccal mucosa, keratinized gingiva, hard palate, saliva, throat, tongue dorsum, subgingival plaque, supragingival plaque, stool, mid-vagina, posterior fornix, vaginal introitus. Body habitats with metagenomic data: anterior nares, retroauricular crease, buccal mucosa, tongue dorsum, supragingival plaque, stool, posterior fornix. Title for d and e: Beta-diversity added by sampled microbial communities. d, Enzyme classes (metagenomic data): Diversity (Bray–Curtis), 0.0 to 0.3, against Samples, 0 to 100. e, OTUs (16S data): Diversity (weighted UniFrac), 0.0 to 0.5, against Samples, 0 to 300.]

Figure 3 | Abundant taxa in the human microbiome that have been metagenomically and taxonomically well defined in the HMP population. a–c, Prevalence (intensity, colour denoting phylum/class) and abundance when present (size) of clades in the healthy microbiome. The most abundant metagenomically-identified species (a), 16S-identified genera (b) and PATRIC¹² pathogens (metagenomic) (c) are shown. d, e, The population size and sequencing depths of the HMP have well defined the microbiome at all assayed body sites, as assessed by saturation of added community metabolic configurations (rarefaction of minimum Bray–Curtis beta-diversity of metagenomic enzyme class abundances to nearest neighbour, inter-quartile range over 100 samples) (d) and phylogenetic configurations (minimum 16S OTU weighted UniFrac distance to nearest neighbour) (e).
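The saturation analysis in panels d and e asks how far a newly sampled community is from its nearest already-sampled neighbour as more samples are drawn. The Python sketch below is one possible reading of that idea, not the authors' procedure: it assumes samples are lists of non-negative abundances, uses Bray–Curtis only (weighted UniFrac would need a phylogeny), and reports the median over random draws. The names `nearest_neighbour_curve`, `draws` and `seed` are illustrative.

```python
import random
from statistics import median
from typing import Dict, List, Sequence


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-empty abundance profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def nearest_neighbour_curve(
    samples: List[Sequence[float]],
    sizes: Sequence[int],
    draws: int = 100,
    seed: int = 0,
) -> Dict[int, float]:
    """Median distance from one sample to its nearest neighbour at each subsample size."""
    rng = random.Random(seed)
    curve: Dict[int, float] = {}
    for n in sizes:
        if not 2 <= n <= len(samples):
            raise ValueError("each size must be between 2 and the number of samples")
        values = []
        for _ in range(draws):
            subset = rng.sample(samples, n)
            # Treat the last drawn sample as the "newly added" community.
            new, rest = subset[-1], subset[:-1]
            values.append(min(bray_curtis(new, other) for other in rest))
        curve[n] = median(values)
    return curve


# Hypothetical profiles: two community types, each observed twice.
toy = [[8, 2, 0], [7, 3, 0], [0, 1, 9], [0, 2, 8]]
print(nearest_neighbour_curve(toy, sizes=[2, 3, 4], draws=200))
```

If the curve flattens near zero as the subsample grows, additional samples add little new configuration, which is the sense in which the caption describes the microbiome as well defined at the assayed sites.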

ARTICLE RESEARCH

14 JUNE 2012 | VOL 486 | NATURE | 209

©2012 Macmillan Publishers Limited. All rights reserved




Page 106: UC Davis EVE161 Lecture 18 by @phylogenomics


Carriage of specific microbes

Inter-individual variation in the microbiome proved to be specific, functionally relevant and personalized. One example of this is illustrated by the Streptococcus spp. of the oral cavity. The genus dominates the oropharynx¹⁶, with different species abundant within each sampled body habitat (see http://hmpdacc.org/HMSMCP) and, even at the species level, marked differences in carriage within each habitat among individuals (Fig. 4a). As the ratio of pan- to core-genomes is high in many human-associated microbes¹⁷, this variation in abundance could be due to selective pressures acting on pathways differentially present among Streptococcus species or strains (Fig. 4b). Indeed, we observed extensive strain-level genomic variation within microbial species in this population, enriched for host-specific structural variants around genomic islands (Fig. 4c). Even with respect to the single Streptococcus mitis strain B6, gene losses associated with these events were common, for example differentially eliminating S. mitis carriage of the V-type ATPase or choline binding proteins cbp6 and cbp12 among subsets of the host population (Fig. 4d). These losses were easily observable by comparison to reference isolate genomes, and these initial findings indicate that microbial strain- and host-specific gene gains and polymorphisms may be similarly ubiquitous.

Other examples of functionally relevant inter-individual variation at the species and strain levels occurred throughout the microbiome. In the gut, Bacteroides fragilis has been shown to prime T-cell responses in animal models via the capsular polysaccharide A¹⁸, and in the HMP stool samples this taxon was carried at a level of at least 0.1% in 16% of samples (over 1% abundance in 3%). Bacteroides thetaiotaomicron has been studied for its effect on host gastrointestinal metabolism¹⁹ and was likewise common at 46% prevalence. On the skin, S. aureus, of particular interest as the cause of methicillin-resistant

[Figure 4, panels a–d. a, Relative Streptococcus species abundance (%), 0 to 60, across 127 tongue dorsum samples; species key: S. mitis, S. oralis, S. gordonii, S. sanguinis, S. thermophilus, S. peroris, S. vestibularis, S. australis, S. infantis, S. salivarius, S. parasanguinis, Other; inset: Average relative Streptococcus abundance. b, KEGG modules in the reference genomes S. gordonii Challis, S. mitis B6, S. mutans UA159, S. pneumoniae TIGR4, S. pyogenes SF370, S. sanguinis SK36, S. suis 05ZYH33 and S. thermophilus LMD9; modules: M00283 PTS system, ascorbate-specific II cpnt; M00280 PTS system, glucitol/sorbitol-specific II cpnt; M00279 PTS system, galactitol-specific II cpnt; M00277 PTS system, N-acetylgalactosamine-specific II cpnt; M00274 PTS system, mannitol-specific II cpnt; M00270 PTS system, trehalose-specific II cpnt; M00269 PTS system, sucrose-specific II cpnt; M00265 PTS system, glucose-specific II cpnt; M00159 V-type ATPase, prokaryotes. c, log(RPKM) along the Streptococcus mitis B6 genome, 1 to 2000 kb, with genomic islands marked. d, log(RPKM) across 127 tongue dorsum samples for two Streptococcus mitis islands, V (V-type H⁺ ATPase subunits) and CH (Choline-binding proteins); colour scale from –2 to 1.]

Figure 4 | Microbial carriage varies between subjects down to the species and strain level. Metagenomic reads from 127 tongue samples spanning 90 subjects were processed with MetaPhlAn to determine relative abundances for each species. a, Relative abundances of 11 distinct Streptococcus spp. In addition to variation in broader clades (see Fig. 2), individual species within a single habitat demonstrate a wide range of compositional variation. Inset illustrates average tongue sample composition. b, Metabolic modules present/absent (grey/white) in KEGG²⁴ reference genomes of tongue streptococci denote selected areas of strain-specific functional differentiation. cpnt, component. c, Comparative genomic coverage for the single Streptococcus mitis B6 strain. Grey dots are median reads per kilobase per million reads (RPKM) for 1-kb windows, grey bars are the 25th to 75th percentiles across all samples, red line the LOWESS-smoothed average. Red bars at the bottom highlight predicted genomic islands²⁷. Large, discrete, and highly variable islands are commonly under-represented. d, Two islands are highlighted, V (V-type H⁺ ATPase subunits I, K, E, C, F, A and B) and CH (choline-binding proteins cbp6 and cbp12), indicating functional cohesion of strain-specific gene loss within individual human hosts.
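The per-window coverage statistic in panel c can be illustrated with a small sketch. This is not the authors' pipeline: the function names and the read counts are hypothetical, and the code only shows the standard RPKM definition (reads per kilobase of window per million mapped reads) together with the median and 25th/75th percentiles taken across samples for one window.

```python
from statistics import quantiles


def rpkm(read_count, window_length_bp, total_mapped_reads):
    """Reads per kilobase of window per million mapped reads."""
    kilobases = window_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


def summarize_window(rpkm_per_sample):
    """Median and 25th/75th percentiles of one window's RPKM across samples."""
    q25, q50, q75 = quantiles(rpkm_per_sample, n=4, method="inclusive")
    return {"q25": q25, "median": q50, "q75": q75}


# Hypothetical example: one 1-kb window of a reference genome in five samples.
# Each tuple is (reads mapped to the window, total mapped reads in the sample).
samples = [(0, 2_000_000), (20, 2_000_000), (40, 2_000_000),
           (60, 2_000_000), (80, 2_000_000)]
window_rpkm = [rpkm(count, 1_000, total) for count, total in samples]
summary = summarize_window(window_rpkm)
```

A window whose RPKM stays near zero in a subset of samples while the rest of the genome is well covered is the kind of signal the caption describes as strain-specific gene loss.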

RESEARCH ARTICLE

210 | NATURE | VOL 486 | 14 JUNE 2012

©2012 Macmillan Publishers Limited. All rights reserved

Page 107: UC Davis EVE161 Lecture 18 by @phylogenomics

metadata, and other potentially important factors such as short- and long-term diet, daily cycles, founder effects such as mode of delivery, and host genetics should be considered in future analyses.

Conclusions

This extensive sampling of the human microbiome across many subjects and body habitats provides an initial characterization of the normal microbiota of healthy adults in a Western population. The large sample size and consistent sampling of many sites from the same individuals allows for the first time an understanding of the relationships among microbes, and between the microbiome and clinical parameters, that underpin the basis for individual variation, variation that may ultimately be critical for understanding microbiome-based disorders. Clinical studies of the microbiome will be able to leverage the resulting extensive catalogues of taxa, pathways and genes¹, although they must also still include carefully matched internal controls. The uniqueness of each individual's microbiome even in this reference population argues for future studies to consider prospective within-subjects designs where possible. The HMP's unique combination of organismal and functional data across body habitats, encompassing both 16S and metagenomic profiling, together with detailed characterization of each subject, has allowed us and subsequent studies to move beyond the observation of variability in the human microbiome to ask how and why these microbial communities vary so extensively.

Many details remain for further work to fill in, building on this reference study. How do early colonization and lifelong change vary among body habitats? Do epidemiological patterns of transmission of beneficial or harmless microbes mirror patterns of transmission of pathogens? Which co-occurrences among microbes reflect shared response to the environment, as opposed to competitive or mutualistic interactions? How large a role does host immunity or genetics play in shaping patterns of diversity, and how do the patterns observed in this North American population compare to those around the world? Future studies building on the gene and organism catalogues established by the Human Microbiome Project, including increasingly detailed investigations of metatranscriptomes and metaproteomes, will help to unravel these open questions and allow us to more fully understand the links between the human microbiome, health and disease.

METHODS SUMMARY

Microbiome samples were collected from up to 18 body sites at one or two time points from 242 individuals clinically screened for absence of disease (K. Aagaard et al., manuscript submitted). Samples were subjected to 16S ribosomal RNA gene pyrosequencing (454 Life Sciences), and a subset were shotgun-sequenced for metagenomics using the Illumina GAIIx platform¹. 16S data processing and

[Figure 5 image. Y axis in all panels: normalized relative abundance. Panel a: race/ethnicity (Asian, Black, Mexican, Puerto Rican, White) versus M00028: ornithine biosynthesis, glutamate => ornithine (tongue dorsum); M00026: histidine biosynthesis, PRPP => histidine (tongue dorsum); Proteobacteria|Gammaproteobacteria|Enterobacteriales|Enterobacteriaceae|Klebsiella (anterior nares); Proteobacteria|Gammaproteobacteria|Pseudomonadales (antecubital fossa). Panel b: vaginal pH (posterior fornix) versus M00222: phosphate transport system, posterior fornix; Actinobacteria, mid-vagina. Panel c: age versus M00012: glyoxylate cycle, retroauricular crease; Firmicutes, retroauricular crease. Panel d: BMI versus M00004: pentose phosphate pathway, tongue dorsum; Pseudomonadaceae, throat.]

Figure 5 | Microbial community membership and function correlates with host phenotype and sample metadata. a–d, The pathway and clade abundances most significantly associated (all FDR q < 0.2) using a multivariate linear model with subject race or ethnicity (a), vaginal posterior fornix pH (b), subject age (c) and BMI (d). Scatter plots of samples are shown with lines indicating best simple linear fit. Race/ethnicity and vaginal pH are particularly strong associations; age and BMI are more representative of typically modest phenotypic associations (Supplementary Table 3), suggesting that variation in the healthy microbiota may correspond to other host or environmental factors.
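To make the "FDR q < 0.2" threshold concrete, here is a minimal sketch of one common way to turn per-feature p-values into q-values. The excerpt does not say which FDR procedure was used, so the Benjamini-Hochberg step-up adjustment below is an assumption, and the p-values and function name are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four feature/metadata associations.
p_values = [0.01, 0.04, 0.03, 0.5]
q_values = benjamini_hochberg(p_values)
# Indices of associations that pass the q < 0.2 threshold used in the caption.
significant = [i for i, q in enumerate(q_values) if q < 0.2]
```

A q threshold of 0.2 is permissive: among the associations reported at that level, up to about one in five is expected to be a false discovery, which fits the caption's note that the age and BMI associations are modest.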


Page 108: UC Davis EVE161 Lecture 18 by @phylogenomics
