

METHOD Open Access

Metagenomic biomarker discovery and explanation

Nicola Segata1, Jacques Izard2,3, Levi Waldron1, Dirk Gevers4, Larisa Miropolsky1, Wendy S Garrett5,6,7 and Curtis Huttenhower1*

Abstract

This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.

* Correspondence: [email protected]
1 Department of Biostatistics, 677 Huntington Avenue, Harvard School of Public Health, Boston, MA 02115, USA
Full list of author information is available at the end of the article

Segata et al. Genome Biology 2011, 12:R60
http://genomebiology.com/2011/11/6/R60

© 2011 Segata et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Biomarker discovery has proven to be one of the most broadly applicable and successful means of translating molecular and genomic data into clinical practice. Comparisons between healthy and diseased tissues have highlighted the importance of tasks such as class discovery (detecting novel subtypes of a disease) and class prediction (determining the subtype of a new sample) [1-4], and recent metagenomic assays have shown that human microbial communities can be used as biomarkers for host factors such as lifestyle [5-7] and disease [7-10]. As sequencing technology continues to develop and makes microbial biomarkers increasingly easily detected, this enables clinical diagnostic and microbiological applications through the comparison of microbial communities [11,12].

The human microbiome, consisting of the total microbial complement associated with human hosts, is an important emerging area for metagenomic biomarker discovery [13,14]. Changes in microbial abundances in the gut, oral cavity, and skin have been associated with disease states ranging from obesity [15-17] to psoriasis [18]. More generally, the metagenomic study of microbial communities is an effective approach for identifying the microorganisms or microbial metabolic characteristics of any uncultured sample [19,20]. Analyses of metagenomic data typically seek to identify the specific organisms, clades, operational taxonomic units, or pathways whose relative abundances differ between two or more groups of samples, and several features of microbial communities have been proposed as potential biomarkers for various disease states. For example, single pathogenic organisms can signal disease if present in a community [21,22], and increases and decreases in community complexity have been observed in bacterial vaginosis [23] and Crohn’s disease [8]. Each of these different types of microbial biomarkers is correlated with disease phenotypes, but few bioinformatic methods exist to explain the class comparisons afforded by metagenomic data.

Identifying the most biologically informative features differentiating two or more phenotypes can be challenging in any genomics dataset, and this is particularly true for metagenomic biomarkers. Robust statistical tools are needed to ensure the reproducibility of conclusions drawn from metagenomic data, which is crucial for clinical application of the biological findings. Related challenges are associated with high-dimensional data regardless of the data type or experimental platform; the number of potential biomarkers, for example, is typically much higher than the number of samples [24-26]. Metagenomic analyses additionally present their own specific issues, including sequencing errors, chimeric reads [27,28], and complex underlying biology; many microbial communities have been found to show remarkably high inter-subject variability. For example, large differences are detected even among the gut microbiomes of twins [29], and both human microbiomes and environmental communities are thought to be characterized by the presence of a long tail of rare organisms [30-32]. Moreover, simply identifying potential biomarkers without elucidating their biological consistency and roles is only a precursor to understanding the underlying mechanisms of microbe-microbe or host-microbe interactions [33]. In many cases, it is necessary to explain not just how two biological samples differ, but why. This problem is referred to as class comparison: how can the differences between phenotypes such as tumor subtype or disease state be explained in terms of consistent biological pathways or molecular mechanisms?

A number of methods have been proposed for class discovery or comparison in metagenomic data. MEGAN [34] is a metagenomic analysis tool with recent additions for phylogenetic comparisons [35] and statistical analyses [36]. MEGAN, however, can only compare single pairs of metagenomes, as is also the case with STAMP [37], which does introduce a concept of ‘biological relevance’ in the form of confidence intervals. UniFrac [38] compares sets of metagenomes at a strictly taxonomic level using phylogenetic distance, while MG-RAST [39], ShotgunFunctionalizeR [40], mothur [41], and METAREP [42] all process metagenomic data using standard statistical tests (mainly t-tests with some modifications). Most methods for community analysis from an ecological perspective rely on unsupervised cluster analyses based on principal component analysis [43] or principal coordinate analysis [44]. These can successfully detect groups of related samples, but they fail to include prior knowledge of phenotypes or environmental conditions associated with the groups, and they generally do not identify the biological features responsible for group relationships. Metastats [45] is the only current method that explicitly couples statistical analysis (to assess whether metagenomes differ) with biomarker discovery (to detect features characterizing the differences) based on repeated t statistics and Fisher’s tests on random permutations. However, none of these methods, even those offering nuanced analyses of metagenomic data, provide biological class explanations to establish statistical significance, biological consistency, and effect size estimation of predicted biomarkers.

In this work, we present the linear discriminant analysis (LDA) effect size (LEfSe) method to support high-dimensional class comparisons with a particular focus on metagenomic analyses. LEfSe determines the features (organisms, clades, operational taxonomic units, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance. Class comparison methods typically predict biomarkers consisting of features that violate a null hypothesis of no difference between classes; we additionally detect the subset of features with abundance patterns compatible with an algorithmically encoded biological hypothesis and estimate the sizes of the significant variations. In particular, effect size provides an estimation of the magnitude of the observed phenomenon due to each characterizing feature and it is thus a valuable tool for ranking the relevance of different biological aspects and for addressing further investigations and analyses. The introduction of prior biological knowledge in the method contributes to constrain the analysis and thus to address the challenges traditionally connected with high-dimensional data mining. LEfSe thus aims to support biologists by suggesting biomarkers that explain most of the effect differentiating phenotypes of interest (two or more) in biomarker discovery comparative and hypothesis-driven investigations. The visualization of the discovered biomarkers on taxonomic trees provides an effective means for summarizing the results in a biologically meaningful way, as this both statistically and visually captures the hierarchical relationships inherent in 16S-based taxonomies/phylogenies or in ontologies of pathways and biomolecular functions.

We validated this approach using data from human microbiomes, a mouse model of ulcerative colitis, and environmental samples, in each case predicting groups of organisms or operational taxonomic units that concisely differentiate the classes being compared. We further evaluated LEfSe using synthetic data, observing that it achieves a substantially better false positive rate compared to standard statistical tests, at the price of a moderately increased false negative rate (that can be adjusted as needed by the user). An implementation of LEfSe including a convenient graphical interface incorporated in the Galaxy framework [46,47] is provided online at [48].

Results and discussion

LEfSe is an algorithm for high-dimensional biomarker discovery and explanation that identifies genomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions (or classes) (Figure 1). It emphasizes statistical significance, biological consistency and effect relevance, allowing researchers to identify differentially abundant features that are also consistent with biologically meaningful categories (subclasses; see Materials and methods). LEfSe first robustly identifies features that are statistically different among biological classes. It then performs additional tests to assess whether these differences are consistent with respect to expected biological behavior; for example, given some known population structure within a set of input samples, is a feature more abundant in all population subclasses or in just one? Specifically, we first use the non-parametric factorial Kruskal-Wallis (KW) sum-rank test [49] to detect features with significant differential abundance with respect to the class of interest; biological consistency is subsequently investigated using a set of pairwise tests among subclasses using the (unpaired) Wilcoxon rank-sum test [50,51]. As a last step, LEfSe uses LDA [52] to estimate the effect size of each differentially abundant feature and, if desired by the investigator, to perform dimension reduction.

We have specifically designed LEfSe for biomarker discovery in metagenomic data. We thus summarize our results here from applying the tool to 16S rRNA gene and whole genome shotgun datasets to detect bacterial organisms and functional characteristics differentially abundant between two or more microbial environments. These include body sites within human microbiomes (mucosal surfaces and aerobic/anaerobic environments), adult and infant microbiomes, inflammatory bowel disease status in a mouse model, bacterial and viral environmental communities, and synthetic data for quantitative computational evaluation.
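To make the first step of the algorithm concrete, the following is a minimal sketch of a Kruskal-Wallis test applied to one feature's abundances grouped by class. It is an illustration only, not the LEfSe implementation: the function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`) are hypothetical, the statistic omits the tie correction that a production implementation would normally apply, and the p-value relies on the usual chi-square approximation with one fewer degrees of freedom than the number of classes.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    half = max(x, 0.0) / 2.0
    if df % 2 == 0:
        terms = (half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * sum(terms)
    tail = math.erfc(math.sqrt(half))
    terms = (half ** (i - 0.5) / math.gamma(i + 0.5)
             for i in range(1, df // 2 + 1))
    return tail + math.exp(-half) * sum(terms)


def kruskal_wallis(groups):
    """Return (H, p) for a list of per-class abundance lists (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    return h, chi2_sf(h, len(groups) - 1)
```

For three perfectly separated classes of three samples each, `kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` gives H = 7.2 and a p-value of about 0.027. A feature would pass this step if its p-value fell below the chosen significance level.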

Figure 1 LEfSe mines a wide range of high-throughput genetic data to find biologically relevant features characterizing one or more experimental conditions. The inputs to the system are the specifications of the biological hypothesis under investigation (conditions and inter-condition sample groupings), the high-dimensional data obtained experimentally, and, optionally, prior knowledge from literature or databases used to define known relationships between features (used for meaningful hierarchical organization of the discovered biomarkers) or samples (used for testing biological consistency of potential biomarkers). LEfSe is a three-step algorithm (detailed in Figure 6). (a) LEfSe first provides the list of features that are differential among conditions of interest with statistical and biological significance, ranking them according to the effect size. (b) For problems with known hierarchical structure, either phylogenetic or functional, we then provide a mapping of the differences to taxonomic or functional trees. (c) Finally, the system produces a histogram visualizing the raw data within the specified problem structure for each relevant feature. While LEfSe has been developed primarily for metagenomic data containing taxon or gene abundances, it can be used for biomarker discovery in any setting where prior biological knowledge regarding the structure of a comparison is coupled with statistically significant differences in high-dimensional genomic features. KEGG, Kyoto Encyclopedia of Genes and Genomes; WGS, whole genome shotgun.

Taxa characterizing body sites within the human microbiome

Microbial community organization at multiple human body sites is an area of active current research, since both low- and high-throughput methods have shown both differences and overlaps among the microbiota of multiple body sites [53,54]. We examined these differences in the 16S-based phylometagenomic dataset from 24 individuals enrolled in the Human Microbiome Project [13,55]. A minimum of 5,000 16S rRNA gene sequences were obtained for 301 samples from 24 healthy subjects (12 male, 12 female) covering 18 body sites, including 6 main body site categories: the oral cavity (9 sub-sites sampled), the vagina (3 sub-sites sampled), the skin (2 sub-sites sampled), the retroauricular crease (2 sub-sites sampled), the nasal cavity (1 sample) and the gut (1 sample). We validated LEfSe by contrasting mucosal versus non-mucosal body site classes and by comparing three levels of aerobic environments (anaerobic, microaerobic, and aerobic). In both cases, the sub-sites within each class of body site were used as a biological subclass.

Mucosal surfaces are colonized by diverse bacteria; non-mucosal microbiomes are strongly enriched for Actinobacteria

Our first analysis focused on differences in microbiota composition between mucosal and non-mucosal body sites. The oral cavity, gut, and vaginal sites were classified as sources of mucosal communities and the anterior fossa (skin), nasal cavity, and retroauricular crease as non-mucosal. Mucosal environments differ greatly from the other body sites, characterized primarily by interaction with the human immune system, oxidative challenge, and hydration [56].

LEfSe provides three main outputs (Figure 2), describing the effect sizes of differences observed among mucosal/non-mucosal communities, the phylogenetic distribution of these differences based on the Ribosomal Database Project (RDP) bacterial taxonomy [57], and the raw data driving these effects. LEfSe detected 15 bacterial clades showing statistically significant and biologically consistent differences in non-mucosal body sites (Figure 2a).

The most differentially abundant bacterial taxa in non-mucosal body sites belong to phyla with prevalent aerobic members: Actinobacteria, Firmicutes, and Proteobacteria, including environmental organisms from the Betaproteobacteria and Gammaproteobacteria clades. Non-mucosal overrepresented genera include Propionibacterium, Staphylococcus (found exclusively in non-mucosal samples), Corynebacterium, and Pseudomonas. Also of note is the relevant representation of plastids from plant organisms (chloroplasts), for which the distribution of associated taxa varies, as some are limited to non-mucosal surfaces (environmental exposure and potentially cosmetic products) and others to the digestive tract (ingested food). No clades are consistently present in all mucosal body sites, demonstrating the β-diversity of these communities (that is, differences among their population structure), but many taxa within Actinobacteria, Bacillales, and several other clades are relatively abundant at all non-mucosal sites. The within-subject β-diversity at all phylogenetic levels is highlighted in Additional file 1, quantifying the extent to which distances among different mucosal body sites are larger than the equivalent distances among non-mucosal sites. This leads to a lack of taxa common to all mucosal body sites, and therefore no taxa are determined by LEfSe to be characteristic of the mucosa as a whole.

The Actinomycetales are usually the most abundant phylogenetic unit (order level) in non-mucosal communities, with percentages higher than 90% in several skin samples and at most 20% in the great majority of the oral mucosal samples and substantially lower in the vagina and gut (Figure 2c). From a quantitative viewpoint, the taxonomic order Actinomycetales makes up essentially all of the detected members of the phylum Actinobacteria, except in the vaginal site, which reported a substantial Bifidobacteriales presence. Bifidobacteriales themselves are not detected as differentially abundant between mucosal and non-mucosal body sites, since this is a feature only of the vaginal samples and not of all mucosal body sites. The contrast of many clades’ abundance versus distribution is striking; for example, the genera Alloscardovia, Parascardovia and Scardovia are present in all body sites at very low abundances, while Gardnerella is overrepresented only in vaginal samples, with over three orders of magnitude difference in abundance. A similar commonality of distribution was found for the Bacillales at an even lower abundance. At the genus level, Propionibacterium, Staphylococcus, Corynebacterium and Pseudomonas are differentiated by both distribution and abundance. The Staphylococcus genus in particular is detected by LEfSe with a very high LDA score (more than five orders of magnitude), reflecting marked abundance in non-mucosal sites (mean 10%, 18% and 21% in the skin, retroauricular crease and anterior nares body sites, respectively) and consistently low abundance in mucosal sites (mean less than 0.001%).

Figure 2 LEfSe results on human microbiomes. (a-c) Mucosal body site analysis. Mucosal microbial communities are diverse, while non-mucosal body sites are characterized by several clades, including the Actinobacteria. The analysis reported here is carried out on initial data from the Human Microbiome Project [55,56] assigning the main body sites to mucosal and non-mucosal classes, and using the body sites as subclasses. These graphical outputs were generated by the publicly available LEfSe visualization modules applied on the analysis results and integrating microbial taxonomic prior knowledge [58]. (a) Histogram of the LDA scores computed for features differentially abundant between mucosal and non-mucosal body sites. LEfSe scores can be interpreted as the degree of consistent difference in relative abundance between features in the two classes of analyzed microbial communities. The histogram thus identifies which clades among all those detected as statistically and biologically differential explain the greatest differences between communities. (b) Taxonomic representation of statistically and biologically consistent differences between mucosal and non-mucosal body sites. Differences are represented in the color of the most abundant class (red indicating non-mucosal, yellow non-significant). Each circle’s diameter is proportional to the taxon’s abundance. This representation, here employing the Ribosomal Database Project (RDP) taxonomy [58], simultaneously highlights high-level trends and specific genera - for example, multiple differentially abundant sibling taxa consistent with the variation of the parent clade. (c) Histogram of the Actinomycetales relative abundances (in the [0,1] interval) in mucosal and non-mucosal body sites. Subclasses (specific body sites) are differentially colored and the mean and median relative abundance of the Actinomycetales are indicated with solid and dashed lines, respectively. (d,e) Aerobiosis analysis. The cladograms report the taxa (highlighted by small circles and by shading) showing different abundance values (according to LEfSe) in the three O2-dependent classes as described in Results; for each taxon, the color denotes the class with higher median for both the small circles and the shading. (d) The strict (all classes differential) version of LEfSe detects 13 biomarkers whereas (e) the non-strict (at least one class differential) version of LEfSe detects 60 microbial biomarkers with abundance differential under aerobic, anaerobic, or microaerobic conditions. Additional file 2 reports the non-strict version of LEfSe focused on the Firmicutes phylum, highlighting several low-O2 specific genera within Ruminococcaceae and Lachnospiraceae.

Classes with multiple levels: distinct aerobic, anaerobic, and microaerobic communities in the human microbiome

The roles of anaerobic metabolism in the commensal human microbiota have not yet been fully investigated due to the difficulty of studying these communities in culture. We thus further investigated the aerobicity characteristics of human microbial communities at a high level by grouping body sites into three classes with distinct levels of available molecular oxygen. The high-O2 exposure class includes body sites directly and permanently exposed to oxygen: skin, anterior nares and retroauricular crease. The mid-O2 exposure class includes the oral and vaginal body sites that can be directly, but not permanently, atmospherically exposed, and the low-O2 exposure class (the gut) is mainly anaerobic. The body sites included in the three classes may have other distinguishing features in addition to different oxygen exposure and, in general, these confounding factors can cause features unrelated to aerobiosis to be detected as biomarkers. However, the LEfSe biological consistency step ensures that the detected biomarkers are characteristic of all the subclasses of a given class and with respect to all subclasses of the other classes. For example, the high abundance of a bacterial clade in the mouth due to an oral-specific niche is not detected as a biomarker unless the same niche is also present in the vaginal samples (the other body site in the mid-O2 class) and not present in any high-O2 or low-O2 single body sites. LEfSe will therefore detect biomarkers more confidently connected with the aerobiosis characteristics than traditional methods that do not incorporate subclass information. Moreover, LEfSe is specifically able to analyze ordinal classes with multiple levels, and in agreement with established microbiology, we observed specific microbial clades ubiquitous within and characteristic of each of these three environments, detailed as follows (Figure 2d).

LEfSe allows ordinal classes with more than two levels to be analyzed in two different stringencies. The first requires significant taxa to differ between every pair of class values (that is, aerobicity in this example; see Materials and methods); the discovered biomarkers must accurately distinguish all individual classes (high-, mid-, and low-O2). In this example (Figure 2d; strict version), we detected 13 clades with LDA scores above 2, showing three distinct abundance levels. Alternatively, LEfSe can determine significant taxa differing in at least one (and possibly multiple) class value(s) (non-strict version); in other words, biomarkers that distinguish at least one individual class. Using this method (Figure 2e), we find 60 clades with LDA scores of at least 2.

Using either approach, each oxygen level is broadly characterized by a specific clade. The overall abundances of the Actinobacteria phylum are higher in body sites directly exposed to molecular oxygen with several members of the Actinomycetales order that colonize the skin. Actinomycetales includes the Propionibacterium genus, which is highly abundant on the skin, low in moderate-O2 environments, and absent from the gut. The Lactobacillales (primarily Bacilli) are specific to moderate O2 exposure levels, with conversely lower presences in the high-O2 exposure class, and are again absent from the gut. The Bacteroidaceae (particularly Bacteroides) are ubiquitous in anaerobic samples; interestingly, however, members of this family are more abundant in high oxygen availability conditions (particularly in skin and retroauricular crease) than in medium oxygen availability, showing the niche diversity within the phylogenetic branching. This is in agreement with observations that the microenvironment of many microbial consortia shows extreme biogeographical variation with respect to nutrients, metabolites, and oxygen availability [58,59].
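The subclass consistency requirement and the strict and non-strict settings for multi-level classes can be sketched as follows. This is a simplified reading of the description in this section rather than the LEfSe code: the function names are hypothetical, the rank-sum p-value uses a plain normal approximation without tie or continuity correction (rough for small samples), and the significance level of 0.05 is an assumed default.

```python
import math
from itertools import combinations
from statistics import median


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumed)."""
    n1, n2 = len(x), len(y)
    # U counts pairs where the x value exceeds the y value; ties count 1/2.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def class_pair_direction(class_a, class_b, alpha=0.05):
    """Return +1 or -1 if every subclass of A differs significantly from every
    subclass of B with the same sign of median difference; otherwise 0."""
    signs = set()
    for sub_a in class_a.values():
        for sub_b in class_b.values():
            if rank_sum_p(sub_a, sub_b) >= alpha:
                return 0
            signs.add(1 if median(sub_a) > median(sub_b) else -1)
    return signs.pop() if len(signs) == 1 else 0


def is_consistent_biomarker(classes, strict=True, alpha=0.05):
    """classes maps class name -> {subclass name -> list of abundances}."""
    names = list(classes)
    differs = {
        frozenset(pair): class_pair_direction(
            classes[pair[0]], classes[pair[1]], alpha) != 0
        for pair in combinations(names, 2)
    }
    if strict:
        # Strict: every pair of classes must differ consistently.
        return all(differs.values())
    # Non-strict: at least one class differs from all of the others.
    return any(
        all(differs[frozenset((c, other))] for other in names if other != c)
        for c in names
    )
```

With invented abundances in which the high-O2 and mid-O2 classes overlap but the low-O2 class is clearly lower, for example `{"high": {"skin": [10, 12, 14, 16, 18]}, "mid": {"oral": [11, 13, 15, 17, 19]}, "low": {"gut": [1, 2, 3, 4, 5]}}`, the feature is rejected under `strict=True` and accepted under `strict=False`, mirroring the larger number of clades reported for the non-strict version.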

Bifidobacteria and additional clades are underrepresented in a mouse model of ulcerative colitis

Rodent models have been established to provide a uniquely accurate and tractable model for studying the gut microbiota, including the molecular and cellular mechanisms driving chronic intestinal inflammation [60-63]. In particular, mouse models of inflammatory bowel disease [63] facilitate a mechanistic evaluation of the contribution of the gut microbiota to the initiation and perpetuation of chronic intestinal inflammation, as occurs in human Crohn’s disease and ulcerative colitis [64]. One host molecular mechanism known to maintain the balance between immune regulation and the commensal microflora is T-bet, a transcription factor expressed in many immune cell subsets. Its loss in the absence of an adaptive immune system results in a highly penetrant and aggressive form of ulcerative colitis [65] that is specifically dependent on and transmissible through the gut flora. We thus sought to investigate the characteristics of the fecal microbiota in a mouse model of spontaneous colitis that occurs in a colony of Balb/c T-bet-/- × Rag2-/- mice using 16S rRNA gene metagenomic data [66,67].

LEfSe was applied to the microbiota data of 20 T-bet-/- × Rag2-/- (case) and 10 Rag2-/- (control) mice (dataset provided in Additional File 10), finding 19 differentially abundant taxonomic clades (α = 0.01) with an LDA score higher than 2.0 (Figure 3). These differentially abundant clades were consonant with both our prior 16S rRNA-based sequence analysis using complete linkage hierarchical clustering and quantitative real time PCR-based experiments performed on the same fecal DNA samples [67]. More specifically, the marked loss in Bifidobacteriaceae and Bifidobacterium associated with T-bet-/- × Rag2-/- we observed here may explain the positive responsiveness of this colitis to a Bifidobacterium animalis subsp. lactis fermented milk product validated with low-throughput approaches [67].

At the family level, the Rag2-/- enrichment of Bifidobacteriaceae, Porphyromonadaceae, Staphylococcaceae and the T-bet-/- × Rag2-/- enrichment of Lachnospiraceae confirm our reports in [68] using culture-based and quantitative real time PCR techniques. LEfSe’s LDA score more informatively reorders these taxa relative to the P-values found for these families in our previous work, highlighting the Bifidobacteria and, interestingly, several clades within the Clostridia. These include the Rag2-/--specific Roseburia and Papillibacter genera belonging to T-bet-/- × Rag2-/--specific families (Lachnospiraceae and Ruminococcaceae). The significant presence of Metascardovia (Bifidobacteriaceae) in Rag2-/- mice is also interesting, as it may have a role similar to Bifidobacterium and because Metascardovia has been previously observed primarily in the oral cavity [68]. This analysis both highlights the agreement of LEfSe’s effect size estimation with respect to low-throughput confirmations and suggests additional clades to be further investigated experimentally.

Figure 3 Comparison between Rag2-/- (control) and T-bet-/- × Rag2-/- (case) mice highlighting that, at the phylum level, Firmicutes are enriched in T-bet-/- × Rag2-/- mice, whereas Actinobacteria are enriched in Rag2-/- mice. In agreement with previous culture-based studies, Bifidobacterium species are underabundant in T-bet-/- × Rag2-/- mice [68], and LEfSe highlights several additional genus-level clades, including the specifically depleted Roseburia and Papillibacter within the otherwise overabundant Firmicutes.

A comparison with current metagenomic analysis tools using viral and microbial pathways from environmental data

We applied LEfSe to the environmental data of [69], a dataset with the goal of characterizing the functional roles of viromes (that is, viral metagenomes) versus microbiomes (that is, bacterial metagenomes). This task was used in [45] to characterize the Metastats algorithm on the same raw data. Among the 29 high-level functional roles (including unclassified roles) in the subsystem hierarchy of the SEED [70] and NMPDR [71] frameworks, LEfSe identifies only the ‘Nucleosides and nucleotides’ subsystem to be strictly differentially abundant among all environmental subclasses, specifically with higher levels in viromes than microbiomes. This is an accurate characterization of exactly the protein function most commonly encoded in viral genomes, whereas bacterial genomes of course encode a wide range of less specifically enriched functionality. When LEfSe is relaxed to detect significant variations consistent for at least one, rather than all, environmental subclasses, we additionally determine the ‘Respiration’ subsystem to be significantly enriched in microbiomes with respect to viromes, likely reflecting the uniformly aerobic bacterial metabolism captured by these data.

In addition to the Nucleosides and nucleotides and Respiration subsystems, Metastats [45] reports five other high-level functional roles as differentially abundant (P = 0.001). However, when taking the subclass structure into account across the sampled environments, these additional differences show much less consistent variation. This is demonstrated in Figure 4, which reports histograms of raw data for these cases and the different results of LEfSe, Metastats and the KW test alone. Moreover, since the subsystem framework is hierarchical (three levels), LEfSe’s results include a cladogram showing the significant differences on each level (see Figure 4 for a two-level cladogram, and Additional file 2 for a three-level cladogram).

Considering all three levels of SEED functional specificity, LEfSe reports 59 subsystems to be more abundant in microbial metagenomes and only 7 that are more abundant in viral metagenomes (Additional file 3). Bacterial genomes encode a much greater quantity and diversity of biomolecular functionality than most viral genomes, and these differences are thus to be expected. However, they also highlight a consideration specific to most metagenomic (and, more generally, ecological) analyses, which typically analyze relative abundances. A few very common subsystems in viromes (that is, Nucleosides and nucleotides) will force the relative abundance of all other subsystems to decrease, resulting in apparent under-abundance. The subsystems detected to be virus-specific may thus show this trend in part due to the normalization of abundances in each sample. This issue is specific to neither LEfSe nor Metastats, however, and must be taken into account during interpretation of any relative abundance data, metagenomic or otherwise [72].
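The normalization effect described above can be illustrated with a minimal Python sketch. The counts below are hypothetical and chosen only for illustration (they are not taken from the dataset of [69]), and the function name relative_abundance is likewise an assumption: the absolute count of ‘Respiration’ is identical in both samples, yet its relative abundance drops once a single dominant subsystem grows.

```python
def relative_abundance(counts):
    """Normalize the raw counts of one sample so that they sum to one."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}

# Hypothetical raw counts: only the nucleotide subsystem differs in absolute terms.
microbiome = {"Nucleosides and nucleotides": 100, "Respiration": 100, "Carbohydrates": 100}
virome = {"Nucleosides and nucleotides": 700, "Respiration": 100, "Carbohydrates": 100}

rel_microbiome = relative_abundance(microbiome)
rel_virome = relative_abundance(virome)

# Print the relative abundance of each subsystem in the two samples.
for name in microbiome:
    print(f"{name}: {rel_microbiome[name]:.3f} -> {rel_virome[name]:.3f}")
```

In this toy case, every subsystem other than the dominant one appears depleted in the second sample even though its raw count is unchanged, which is the apparent under-abundance that must be kept in mind when interpreting relative data.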

Figure 4 LEfSe highlights pathways consistently differential between bacterial microbiomes and viromes within diverse environmental subclasses. (a) Using the SEED [71] catalog of functional pathways, LEfSe reports Nucleoside and nucleotide metabolism and Respiration to differ consistently between bacterial microbiomes and viromes across environmental samples described in [70]. The former is significant using the strictest all-subclasses test, the latter in the more lenient one-subclass test. (b) A two-level cladogram reporting the significant pathway differences as visualized using the SEED hierarchy (see Additional file 3 for the three-level cladogram and detailed differences). (c) Metastats [45] reports four additional pathways differential among these data (Carbohydrates, DNA metabolism, Membrane transport and Nitrogen metabolism). Using only the KW test portion of LEfSe (α = 0.05), we obtain results consonant with Metastats (excluding Nitrogen metabolism). However, as shown here, an overview of the abundance histograms of these subsystems demonstrates them to be less consistent across environments (for example, Coral and Hyper-saline subclasses in the Carbohydrates, Membrane transport and Nitrogen metabolism) and to lose significance within individual subclasses (as for the DNA metabolism subsystem).

Functional activity within the infant and adult microbiota indicates post-weaning microbial specialization

Just as LEfSe can determine whether organisms or pathways are differentially abundant among several metagenomic samples, it can also focus on individual enzymes or orthologous groups. Kurokawa et al. [73] analyzed 13 gut metagenomes from nine adults and four unweaned infants in terms of the functions of orthologous gene families. They originally did this by comparing the COGs [74,75] found in each metagenome to a reference database; later, White et al. [45] applied the Metastats algorithm to directly detect differences between infant and adult microbiomes. Using a significance α value of 0.01 due to the low cardinality of the classes (in particular the infant class), LEfSe detected 366 COGs to be enriched in either adult or infant metagenomes, 17 of which have an LDA score higher than 3 (Additional file 4).

Among the 17 COG profiles with LEfSe scores higher than 3, 11 are also detected by Metastats. The six COGs not detected by Metastats (Additional file 5) are Outer membrane protein (COG1538) and Na+-driven multidrug efflux pump (COG0534), enriched in adults, and Transposase and inactivated derivatives (COG2801, COG2963), Transcriptional regulator/sugar kinase (COG1940) and Transcriptional regulator (COG1309), enriched in infants. All six COGs possess abundance profiles that are completely non-overlapping between infant and adult individuals (apart from COG1538, in which the lowest level in adults is slightly lower than the highest in infants) and are thus nominally quite discriminative. On the other hand, among the 192 COGs found by Metastats, 9 are not detected by LEfSe even at the lowest LDA score threshold (Additional file 6). All possess overlapping abundance values between infant and adult classes (at least two, and often more, of the highest samples in the less abundant class overlap the putatively more abundant class). This lack of discriminatory power precludes LEfSe from highlighting the differences as significant between adults and infants, particularly given the low number of infant samples.

Intriguingly, LEfSe’s distinct list of functional activities in the core infant and adult microbiomes is suggestive of ‘generalist’ microbial activity during early life and specialization over time [76]. In fact, inspecting the five differentially abundant COGs with the highest effect sizes for each class, we find for infants very high-level functional groups related to broad transcriptional regulation (COG1609, COG1940, COG1309 and COG3711). In adults, all five represent more specialized orthologous groups, including COG1629 (Outer membrane receptor proteins, mostly Fe transport), COG1595 (DNA-directed RNA polymerase specialized sigma subunit, sigma24 homolog), and COG4771 (Outer membrane receptor for ferrienterochelin and colicins). Since the number of differentially abundant COGs is very high (366), this observation was only highlighted at the top of the candidate biomarker list due to LEfSe’s effect size quantification, which allows the most characteristic differences among classes to emerge. For the same reason, we can easily confirm that sugar metabolism plays a crucial role in the infant gut and iron metabolism in adults, as already stated in [45,73]; the COGs with the highest LDA scores indeed possess sugar and glucose functional activities for infants and iron-related functionality for adults.

LEfSe achieves a very low false positive rate in synthetic data

We further investigated the ability of LEfSe to detect biomarkers using synthetic high-dimensional data (see Materials and methods for the description of the dataset) in comparison with the KW test alone (a non-parametric adaptation of the analysis of variance (ANOVA)) and with Metastats [45]. The LDA effect size step of LEfSe is not considered here for simplicity, and the artificial data are detailed in Figure 5.

Theoretically, the settings of the first two experiments (Figure 5a,b) exactly match the application conditions for the KW test. The false positive rate (mean 2.5%, regardless of the distance between feature means and of the standard deviation of the normal distribution) is in fact consistent with the α value of 0.05, given that the negative features are half of the total (0.05 × 0.5 = 0.025). LEfSe behaved qualitatively very similarly to KW, but with a considerably lower false positive rate (less than 0.5% in the great majority of the cases against a mean value of 2.5%) and a higher false negative rate. In biology, false positives are often perceived as more dramatic than false negatives [77-79]; this is often attributable to the fact that it is undesirable to invest in expensive experimental follow-up of false positives, whereas in high-throughput settings, a few true positives outweigh the false negatives that are left uninvestigated. With this motivation for minimizing false positives, we conclude that LEfSe performs at least as well as KW when no meaningful subclass structure is available. On the other hand, when subclasses can be identified internally to the classes and some of them do not agree with the trend among classes, LEfSe performs qualitatively and quantitatively much better than KW (Figure 5c). The false positives are in fact always substantially lower than KW, whereas the false negatives are higher only for very noisy features. Metastats [45] seems to achieve results very similar to KW (Additional file 7) with the same disadvantages with respect to LEfSe.

Figure 5 Comparison of LEfSe and the KW test alone for false positive and negative rates in synthetic data. Both tests used α = 0.05 in all cases, and the three artificial datasets each comprise 100 samples in two classes, each class with two subclasses of cardinality 25. The samples consist of 1,000 synthetic features taking the place of microbial taxa, pathways, and so on; half are negative (not biomarkers) and the other half positive. (a) LEfSe and KW false positive and negative rates at increasing values of the difference between class means. Negative features are normally distributed with parameters (μ = 10,000, σ = 100) across classes; positive features contain classes with increasingly different means. (b) Performance as standard deviation varies within classes (rather than the difference between means, fixed at 2,000). (c) Performance as standard deviation increases within inconsistent subclasses. Negative features have subclasses sampled from the same normal distribution (and thus not representing consistent biomarkers). Positive features are distributed as in (b). In all cases, LEfSe sacrifices a small number of false negatives in order to achieve a false positive rate near zero, with the goal of ensuring that biomarkers of large effect size will be both reproducible and biologically interpretable.

Conclusions

Gaining insight into the structure, organization, and function of microbial communities has been proposed as one of the major research challenges of the current decade [80], and it will be enabled by both experimental and computational metagenomic analyses. To this end, we have developed the LEfSe algorithm for comparative metagenomic studies, permitting the characterization of microbial taxa specific to an experimental or environmental condition, the detection of pathways and biological mechanisms over- or under-represented in different communities, and the identification of metagenomic biomarkers in mammalian microbiomes. LEfSe is shown here to be effective in detecting differentially abundant features in the human microbiome (characteristically mucosal or aerobic taxa) and in a mouse model of colitis. A comparison with existing statistical methods and state-of-the-art metagenomic analyses of environmental, infant gut microbiome, and synthetic data shows that LEfSe consistently provides lower false positive rates and can effectively aid in explaining the biology underlying differences in microbial communities.

These findings demonstrate that a concept of class explanation including both statistical and biological significance is highly beneficial in tackling the statistical challenges associated with high-dimensional biomarker discovery [28,81,82]. Specifically, LEfSe determines features potentially able to explain the differences among conditions rather than the features that simply possess uneven distributions among classes. This is distinct from most current statistical approaches [45] and akin to the incorporation of biological prior knowledge that has proven highly successful in recent genome-wide association studies [83-85]. Moreover, particularly in (often noisy) metagenomic datasets, effect size can serve as an orthogonal measure to complement ranking biomarkers based on P-values alone. Differences between classes can be very statistically significant (low P-value) but so small that they are unlikely to be biologically responsible for phenotypic differences. On the other hand, a biomarker with a relatively large P-value (for example, 0.01) may correspond to a large effect size, with statistical significance diminished by technical noise. LEfSe investigates both aspects computationally by testing both the consistency and the effect size of differences in feature abundance among classes with respect to the structure of the problem. This is performed subsequently to standard statistical significance tests and is integrated in LEfSe by assessing biologically meaningful groups of samples among subclasses within each condition. This coupling of statistical approaches with biological consistency and effect size estimation alleviates possible artifacts or statistical inhomogeneity known to be common in metagenomic data, for example, extreme variability among subjects or the presence of a long tail of rare organisms [32,86]. Similarly, while multiple hypothesis corrected statistical significance speaks to the potential reproducibility of a result, estimation of effect size in high-dimensional settings is crucial for addressing biological consistency and interpretability.

The biology highlighted by these investigations speaks to the potential of metagenomics for both microbial ecology and translational applications. For example, certain bacterial clades are frequently detected as biomarkers even in diverse environments, suggesting that some species can adapt in surprisingly condition-specific manners. Staphylococcus and the Bacillales, for example, are discriminative for mucosal tissues, aerobic conditions, and murine colitis, whereas no Proteobacteria consistently characterize any of these conditions, even though they always represent a substantial portion of the communities. These observations may reflect extensive microenvironmental heterogeneity and the coexistence of generalist and specialist bacteria [87-89].

In addition to these insights into microbiology, metagenomic biomarkers, including the abundances of specific organisms, abundances of entire clades, or the presence/absence of specific organisms, can serve to describe host phenotypes, lifestyle, diet, and disease as well [5-10]. If the depletion of Bifidobacterium species in ulcerative colitis proves to occur early in human disease etiology, this and comparable shifts in the microbiota have potential applications in the detection of human disorders [90,91], especially as shifts in some bacterial consortia can be detected easily and inexpensively. Oral microbial biomarkers, for example, can be easily acquired and analyzed with microarray chips targeted for bacterial profiling [92]. These appear particularly promising for clinical applications [11], as the microbial communities in the saliva seem to represent one potential proxy for other human microbiota [93]. Other important clinical applications of metagenomic analyses include probiotic treatments [94,95] and microbiome transplantation [96-99] for gastrointestinal diseases.

LEfSe, the computational approach to biomarker class comparisons detailed here, thus contributes to the understanding of microbial communities and guides biologists in detecting novel metagenomic biomarkers. The algorithm’s effectiveness on real and synthetic data has been highlighted by several experiments in which we successfully characterized both host-associated microbiota and environmental microbiomes in multiple contexts. To support ongoing metagenomic analyses, we have implemented LEfSe as a user-friendly web application that can provide both raw data and publication-ready graphical results, including reports of detected microbial variation on taxonomic trees for visual and biological summarization. LEfSe is freely available online in the Galaxy workflow framework [46,47] at the following link [48].

Materials and methods

The LEfSe algorithm is introduced in overview in the Results section, and Figure 6 illustrates in detail the format of the input (a matrix with n rows and m columns) and the three steps performed by the computational tool: the KW rank sum test [49] on classes, the pairwise Wilcoxon test [50,51] between subclasses of different classes, and the LDA [52] on the relevant features.

Each of the n features is represented with a positive-valued vector containing its abundances in the m samples, and each sample is associated with values describing its class and, optionally, subclass and/or originating subject. The factorial KW rank sum test is applied to each feature with respect to the class factor; the subclass and subject information are used as stratifying subgroups when present. Features that, according to the KW rank sum test, do not violate the null hypothesis of identical value distribution among classes (with default P-value, α = 0.05) are not further analyzed. The pairwise Wilcoxon test is applied to retained features belonging to subclasses of different classes. For each feature, the pairwise Wilcoxon test is not satisfied if at least one comparison between subclasses has a P-value higher than the chosen α or if the sign of variation is not equal among all comparisons. For example, if a feature appears in samples from two classes with three subclasses each, all nine comparisons between subclasses in different classes must violate the null hypothesis, and all signs of the differences between medians must be consistent. The features that pass the pairwise Wilcoxon test are considered successful biomarkers. An LDA model is finally built with the class as dependent variable and the remaining feature values, subclass, and subject values as independent variables. This model is used to estimate their effect sizes, which are obtained by averaging the differences between class means (using unmodified feature values) with the differences between class means along the first linear discriminant axis, which equally weights features’ variability and discriminatory power. The LDA score for each biomarker is obtained by computing the logarithm (base 10) of this value after being scaled in the [1, 10^6] interval and, regardless of the absolute values of the LDA score, it induces the ranking of biomarker relevance. For robustness, LDA is additionally supported by bootstrapping (default 30-fold) and subsequent averaging.

LEfSe’s first two steps employ non-parametric tests because of the nature of metagenomic data. Relative abundances will, in most cases, violate the main assumption of typical parametric tests (normal population in each class), whereas non-parametric tests are much more robust to the underlying distribution of the data since they are distribution-free approaches. The only assumption of the Wilcoxon and KW tests is that the distributions in each class are identically shaped with possible differences in the medians. For example, the bimodal or multimodal abundance distribution of an organism violates the assumptions of parametric tests but not those of non-parametric tests, unless the number of peaks in the distribution (or, more generally, the shape of the distribution) also changes among classes. LDA is used for effect size estimation as our experiments determined it to more accurately estimate biological consistency compared to approaches like differences in group means/medians or support vector machines (SVMs) [100]. A comparison between LDA and SVM approaches for effect size estimation on the murine model of ulcerative colitis (for which low-throughput biological validations of biomarkers are available in [67]) is reported in our supplemental material (Additional files 8 and 9) and shows the advantages of LDA with respect to upranking features of potential biological interest. Theoretically, this is motivated by LDA’s ability to find the axis of highest variance and SVM’s focus on features’ combined predictive power rather than single feature relevance. Note that as we are performing class comparison rather than class prediction, it is worth specifying that the effect size estimation accuracy of an algorithm is not directly connected with its predictive ability (for example, SVM approaches are generally considered more accurate than LDA for prediction).
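The first two steps described in this section (the class-level KW filter followed by the pairwise Wilcoxon consistency check) can be sketched for a single feature and two classes as follows. This is a minimal illustration and not LEfSe’s actual implementation: it uses only the Python standard library, approximates P-values with the chi-square and normal distributions without tie corrections, ignores subject stratification, and omits the LDA step. All function names (average_ranks, chi2_sf, kruskal_p, ranksum_p, is_biomarker) and the example data are hypothetical.

```python
import math
from itertools import product
from statistics import median

def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution (integer df)."""
    if x <= 0:
        return 1.0
    nu = 1 if df % 2 else 2
    q = math.erfc(math.sqrt(x / 2)) if nu == 1 else math.exp(-x / 2)
    while nu < df:
        q += (x / 2) ** (nu / 2) * math.exp(-x / 2) / math.gamma(nu / 2 + 1)
        nu += 2
    return q

def kruskal_p(groups):
    """Kruskal-Wallis P-value (chi-square approximation, no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n, h, start = len(pooled), 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    return chi2_sf(h, len(groups) - 1)

def ranksum_p(a, b):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation)."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))

def is_biomarker(classes, alpha=0.05):
    """classes: {class: {subclass: [abundances]}} for one feature, two classes."""
    pooled = [[v for sub in subs.values() for v in sub] for subs in classes.values()]
    if kruskal_p(pooled) > alpha:                      # step 1: KW on classes
        return False
    first, second = classes.values()
    signs = set()
    for a, b in product(first.values(), second.values()):  # step 2: all subclass pairs
        if ranksum_p(a, b) > alpha:
            return False
        signs.add(median(a) > median(b))
    return len(signs) == 1                             # one consistent direction

# Hypothetical feature whose subclasses all agree with the class-level trend
consistent = {
    "case": {"s1": list(range(1, 9)), "s2": list(range(2, 10))},
    "control": {"s1": list(range(21, 29)), "s2": list(range(22, 30))},
}
# Hypothetical feature in which one case subclass matches a control subclass
inconsistent = {
    "case": {"s1": list(range(1, 9)), "s2": list(range(21, 29))},
    "control": {"s1": list(range(21, 29)), "s2": list(range(22, 30))},
}
print(is_biomarker(consistent), is_biomarker(inconsistent))
```

The second feature is rejected because at least one comparison between subclasses of different classes fails, even though most of the case samples are lower than the control samples; this is the behavior that distinguishes the subclass-aware test from a class-level test alone.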

Multiclass strategies

Comparisons with more than two classes require special strategies for applying the Wilcoxon and LDA steps, whereas the factorial KW test is already appropriate for this setting. Our multiclass strategy for the Wilcoxon test depends on the problem-specific strategy chosen by the user to define features differentially distributed among the n classes. In the most stringent strategy, we require that all n abundance profiles of a feature are statistically significantly distinct among all n classes. This strategy, called ‘strict’, is implemented by requiring that all Wilcoxon tests between classes are significant. A more permissive strategy, called ‘non-strict’, considers a feature as a biomarker if at least one class is significantly different from all others. The more permissive strategy thus needs to satisfy only a subset of the Wilcoxon tests. Regardless of the strategy, the LDA step always reports the highest score detected among all pairwise class comparisons.
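The difference between the ‘strict’ and ‘non-strict’ strategies can be made concrete with a short Python sketch. It assumes that the outcome of each pairwise class comparison has already been reduced to a single boolean; the function name passes_multiclass, the class names, and the outcomes are hypothetical.

```python
from itertools import combinations

def passes_multiclass(significant, classes, strict=True):
    """significant maps frozenset({class_a, class_b}) to True or False."""
    if strict:
        # 'strict': every pair of classes must differ significantly
        return all(significant[frozenset(pair)] for pair in combinations(classes, 2))
    # 'non-strict': at least one class differs significantly from all the others
    return any(
        all(significant[frozenset((c, other))] for other in classes if other != c)
        for c in classes
    )

classes = ["oral", "gut", "skin"]  # hypothetical class labels
significant = {
    frozenset(("oral", "gut")): True,
    frozenset(("oral", "skin")): True,
    frozenset(("gut", "skin")): False,
}
strict_result = passes_multiclass(significant, classes, strict=True)
non_strict_result = passes_multiclass(significant, classes, strict=False)
print(strict_result, non_strict_result)
```

Here the feature separates the first class from the other two but does not separate those two from each other, so it would be retained only under the more permissive strategy.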

Figure 6 Schematic representation of the statistical and computational steps implemented in LEfSe. Input data consist of a collection of m samples (columns) each made up of n numerical features (rows, typically normalized per-sample, red representing high values and green low). These samples are labeled with a class (taking two or more possible values) that represents the main biological comparison under investigation; they may also have one or more subclass labels reflecting within-class groupings. (a) Step 1 analyzes all features, testing whether values in different classes are differentially distributed. (b) Features violating the null hypothesis are further analyzed in step 2, which tests whether all pairwise comparisons between subclasses in different classes significantly agree with the class level trend. (c) The resulting subset of vectors is used to build an LDA model from which the relative difference among classes is used to rank the features. The final output thus consists of a list of features that are discriminative with respect to the classes, consistent with the subclass grouping within classes, and ranked according to the effect size with which they differentiate classes.

Subclass structure variants encoding different biological hypotheses

Different interpretations of the biomarker class comparison problem are implemented in LEfSe by modifying the requirements for pairwise Wilcoxon comparisons among subclasses. If classes contain subclasses that represent distinct strata, we test only comparisons within each identical subclass (Figure 4). For example, to assess the effect of a treatment on two sub-types of the same disease, we compare pre- and post-treatment levels within each subclass and require that the trend observed at the class level is significant independently for both subclasses. To implement this variant, LEfSe performs the Wilcoxon step only comparing subclasses with the same name. Alternatively, subclasses may represent covariates within which feature levels may vary but for which the problem does not dictate explicit stratification (Figure 2). In both settings, we explicitly require all the pairwise comparisons to reject the null hypothesis for detecting the biomarker; thus, no multiple testing corrections are needed.
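The two variants differ only in which subclass pairs are submitted to the Wilcoxon step, which a brief Python sketch can enumerate. The function name subclass_comparisons and the subclass labels are hypothetical and serve only as an illustration.

```python
from itertools import product

def subclass_comparisons(class_a, class_b, same_name_only=False):
    """Return the (subclass of class_a, subclass of class_b) pairs to be tested."""
    pairs = product(class_a, class_b)
    if same_name_only:
        # stratified variant: compare only subclasses that share a name
        return [(a, b) for a, b in pairs if a == b]
    # covariate variant: compare every subclass of one class with every subclass of the other
    return list(pairs)

pre_treatment = ["subtype1", "subtype2", "subtype3"]   # hypothetical subclass names
post_treatment = ["subtype1", "subtype2", "subtype3"]

all_pairs = subclass_comparisons(pre_treatment, post_treatment)
stratified = subclass_comparisons(pre_treatment, post_treatment, same_name_only=True)
print(len(all_pairs), len(stratified))
```

With three subclasses per class, the unstratified variant yields the nine comparisons mentioned in the description of the algorithm, whereas the stratified variant tests only the three same-name pairs.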

Subclasses containing few samples

When few samples are available, non-parametric tests like the Wilcoxon have reduced power to detect differences. This can affect LEfSe when subclasses are very small, preventing the overall test from even rejecting the null hypothesis. For this reason, small subclasses should be avoided when possible, for example, by excluding them from the problem or by grouping together all subclasses with small cardinalities. For cases in which removing or grouping subclasses is not possible or disrupts the biological consistency of the analysis, LEfSe substitutes the Wilcoxon test with a test to compare whether subclass medians differ with the expected sign. The user can choose the subclass cardinality threshold at which this median comparison is substituted for the Wilcoxon test.
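The fallback for small subclasses can be sketched in Python as follows. This is an illustrative reading of the rule and not LEfSe’s code: the function name subclass_difference_ok, its parameters, and the convention that the Wilcoxon P-value is supplied by a caller-provided function are all assumptions, and no default threshold is implied.

```python
from statistics import median

def subclass_difference_ok(a, b, expected_sign, min_cardinality, wilcoxon_p, alpha=0.05):
    """Check one comparison between two subclasses of different classes.

    expected_sign: +1 if a is expected to be higher than b, -1 otherwise.
    min_cardinality: user-chosen threshold below which only medians are compared.
    wilcoxon_p: caller-supplied function returning a two-sided P-value for (a, b).
    """
    sign_ok = (median(a) - median(b)) * expected_sign > 0
    if min(len(a), len(b)) < min_cardinality:
        # too few samples for the rank test: rely on the sign of the median difference
        return sign_ok
    return sign_ok and wilcoxon_p(a, b) <= alpha

# Stand-in P-value functions used only to exercise both branches
never_significant = lambda a, b: 1.0
always_significant = lambda a, b: 0.001

small = subclass_difference_ok([5, 6, 7], [1, 2], +1, 10, never_significant)
large = subclass_difference_ok([5, 6, 7] * 4, [1, 2, 3] * 4, +1, 10, never_significant)
print(small, large)
```

In the first call the subclasses fall below the chosen cardinality, so the median comparison alone decides the outcome; in the second call both subclasses are large enough and the rank test result is required as well.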

Parameter settings

Except as stated otherwise in Results, all experiments in this study were run with LEfSe’s α parameter for pairwise tests set to 0.05 for both class normality and subclass tests, and the threshold on the logarithmic score of LDA analysis was set to 2.0. The stringency of these parameters is easily tunable (also through the web interface) and allows the user to detect biomarkers with lower P-values and/or higher effect size in order, for example, to prioritize additional biological experiments and validations. All LDA scores are determined by bootstrapping over 30 cycles, each sampling two-thirds of the data with replacement, with the maximum influence of the LDA coefficients in the LDA score of three orders of magnitude.
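The bootstrap averaging described above (30 cycles, each drawing two-thirds of the samples with replacement) can be sketched as below. This sketch deliberately replaces the LDA-based effect size with a plain absolute difference between class means, so it illustrates the resampling scheme only and not the LDA score itself. The function names, the seed, and the rule of redrawing a resample that misses a class are assumptions.

```python
import random
from statistics import mean

def bootstrap_effect(values, labels, effect, cycles=30, fraction=2 / 3, seed=0):
    """Average an effect-size estimate over bootstrap resamples of the samples."""
    rng = random.Random(seed)
    size = max(2, round(len(values) * fraction))
    estimates = []
    while len(estimates) < cycles:
        idx = [rng.randrange(len(values)) for _ in range(size)]  # with replacement
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < len(set(labels)):
            continue  # assumption: redraw when a class is missing from the resample
        estimates.append(effect([values[i] for i in idx], sample_labels))
    return mean(estimates)

def mean_difference(values, labels):
    """Stand-in effect size (NOT the LDA effect size): |difference of class means|."""
    class_a, class_b = sorted(set(labels))
    mean_a = mean(v for v, l in zip(values, labels) if l == class_a)
    mean_b = mean(v for v, l in zip(values, labels) if l == class_b)
    return abs(mean_a - mean_b)

# Hypothetical abundances of one feature in ten control and ten case samples
values = [10.0] * 10 + [110.0] * 10
labels = ["control"] * 10 + ["case"] * 10
averaged = bootstrap_effect(values, labels, mean_difference)
print(averaged)
```

Averaging over resamples reduces the dependence of the estimate on any individual sample, which is the stated purpose of supporting the LDA step with bootstrapping.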

Data description

Except as stated otherwise, taxonomic abundances for 16S samples were generated from filtered sequence reads using the RDP classifier [101], with confidences below 80% rebinned to ‘uncertain’. For all the datasets described below, the final input for LEfSe is a matrix of relative abundances obtained from the read counts with per-sample normalization to sum to one. Witten-Bell smoothing [102] was used to accommodate rare types, but due to LEfSe’s non-parametric approach, this has minimal effect on the discovered biomarkers and on the LDA score. This also allows our biomarker discovery method to avoid most effects of sequence quality issues as long as any sequencing biases are homogeneous among different conditions, as no specific assumptions on the statistical distribution and noise model are made by the algorithm as is standard for non-parametric approaches.

Human microbiome data

The 16S rRNA-based phylometagenomic dataset of the normal (healthy) human microbiome was made available through the Human Microbiome Project [13], and consists of 454 FLX Titanium sequences spanning the V3 to V5 variable regions obtained for 301 samples from 24 healthy subjects (12 male, 12 female) enrolled at a single clinical site in Houston, TX. These samples cover 18 different body sites, including 6 main body site categories: the oral cavity (9 samples), the gut (1 sample), the vagina (3 samples), the retroauricular crease (2 samples), the nasal cavity (1 sample) and the skin (2 samples). Detailed protocols used for enrollment, sampling, DNA extraction, 16S amplification and sequencing are available on the Human Microbiome Project Data Analysis and Coordination Center website [103], and are also described elsewhere [55,56]. In brief, genomic DNA was isolated using the Mo Bio PowerSoil kit [104] and subjected to 16S amplifications using primers designed incorporating the FLX Titanium adapters and a sample barcode sequence, allowing directional sequencing covering variable regions V5 to partial V3 (primers: 357F 5’-CCTACGGGAGGCAGCAG-3’ and 926R 5’-CCGTCAATTCMTTTRAGT-3’). Resulting sequences were processed using a data curation pipeline implemented in mothur [41], which reduces the sequencing error rate to less than 0.06% as validated on a mock community. As part of the pipeline parameters, to pass the initial quality control step, one unambiguous mismatch to the sample barcode and two mismatches to the PCR amplification primers were allowed. Sequences with an ambiguous base call or a homopolymer longer than eight nucleotides were removed from subsequent analyses, as suggested previously [105]. Based on the supplied quality scores, all sequences were trimmed when a base call with a score below 20 was encountered. All sequences were aligned using a NAST-based sequence aligner to a custom reference based on the SILVA alignment [106,107]. Sequences that were shorter than 200 bp or that did not align to the anticipated region of the reference alignment were removed from further analysis. Chimeric sequences were identified using the mothur implementation of the ChimeraSlayer algorithm [108]. Unique reads were classified with the MSU RDP classifier v2.2 [58] using the taxonomy proposed by [109], maintained at the RDP (RDP 10 database, version 6). The 16S rRNA reads are available in the Sequence Read Archive at [110].
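The read-level filters listed in this section (no ambiguous base calls, no homopolymer longer than eight nucleotides, trimming at the first base call with a score below 20, and a minimum length of 200 bp) can be sketched in Python as follows. This is not the mothur pipeline: the function name filter_read, the assumption of upper-case sequences, and the order in which the checks are applied are illustrative choices, and the barcode, primer, alignment, and chimera steps are omitted.

```python
import re

AMBIGUOUS = re.compile(r"[^ACGT]")            # any base call other than A, C, G, T
LONG_HOMOPOLYMER = re.compile(r"(.)\1{8,}")   # nine or more identical bases in a row

def filter_read(seq, quals, min_quality=20, min_length=200):
    """Return the quality-trimmed read, or None if the read is rejected."""
    # Reject reads with an ambiguous base call or a homopolymer longer than eight
    if AMBIGUOUS.search(seq) or LONG_HOMOPOLYMER.search(seq):
        return None
    # Trim at the first base call whose quality score is below the threshold
    for position, quality in enumerate(quals):
        if quality < min_quality:
            seq = seq[:position]
            break
    # Reject reads that are too short after trimming
    if len(seq) < min_length:
        return None
    return seq

read = "ACGT" * 60                                    # hypothetical 240-bp read
kept = filter_read(read, [30] * 240)
trimmed = filter_read(read, [30] * 220 + [10] * 20)   # quality drops at position 220
rejected = filter_read("N" + read, [30] * 241)        # contains an ambiguous base
print(len(kept), len(trimmed), rejected)
```

Applying the ambiguity and homopolymer checks before trimming follows the order in which the filters are listed in the text; a pipeline that trims first could retain reads whose only defect lies in the discarded low-quality tail.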

T-bet-/- × Rag2-/- and Rag2-/- mouse data

T-bet-/- × Rag2-/- and Rag2-/- mice, their husbandry, and their chow have been described in [67]. The animal studies and experiments were approved and carried out according to Harvard University’s Standing Committee on Animals as well as National Institutes of Health guidelines. Collection, processing, and extraction of DNA from fecal samples were performed as described in [67]. The V5 and V6 regions of the 16S rRNA gene were targeted for amplification and multiplex pyrosequencing with error-correcting barcodes. Sequencing was performed using a Roche FLX Genome Sequencer at DNAVision (Charleroi, Belgium) and data were preprocessed to remove sequences with low-quality scores. There were 7,579 ± 2,379 high-quality 16S reads per sample with a mean read length of 278 bp.

Viral and microbial environmental data

We retrieved from the online supplemental material of [69] the 80 available metagenomes (42 viromes, 38 microbiomes). We identified three environments containing at least seven samples and grouped them into coral, hyper-saline, and marine subclasses; the fourth subclass, other, groups all environments with few samples.

Infant and adult microbiome data

The COG profiles of the nine adult and four unweaned infant microbiomes were obtained from the supplemental material of [73] and used unmodified in this study.

Synthetic datasets

We built three collections of artificial datasets in order to compare LEfSe to KW and Metastats. All datasets have 1,000 features and 100 samples belonging evenly to two classes, and the values are sampled from a Gaussian (normal) distribution. The samples in the two classes are further organized in four subclasses (two per class) with equal cardinality. Of the 1,000 features, 500 features have different means across classes and should thus be detected as biomarkers (positive features); the other 500 features are evenly distributed among classes or among at least one subclass in both classes and should not be detected as discriminative (negative features). The methods are evaluated by assessing the false positive rate (number of erroneously detected biomarkers with respect to the total number of features) and the false negative rate (number of true biomarkers that are not detected with respect to the total number of features, that is, a measure of lost sensitivity). The three collections of datasets (graphically shown in Figure 5) differ in the distribution of values in the subclasses and in the mean/standard deviation of the normal distribution. (a) The subclasses in the same class have the same parameters (thus the subclass organization is meaningless). Negative features all have μ = 10,000 and σ = 100, whereas one class of the positive features has μ = 10,000 - t (σ = 100) and the other μ = 10,000 + t (σ = 100), where t is a parameter ranging from 1 to 150. The performances of all methods are assessed at regular steps of the t parameter. (b) Datasets in this collection are defined in the same way as collection (a) but with t = 1,000 for all datasets and σ ranging from 1,000 to 10,000. (c) The negative features in the third collection have a different subclass distribution. In particular, the second subclass of the first class has the same mean as the first subclass of the second class. The other two subclasses have different means (μ = 10,000 - t and μ = 10,000 + t, t = 1,000), but the feature is not considered differential since the difference is not consistent between subclasses. The positive features are defined in the same way as in collection (b).
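As an illustration of how collection (a) and the two error rates could be computed, the following Python sketch uses only the standard library. It is not the code used to generate the datasets in this study: the function names, the seeding, and the choice of placing the positive features first are assumptions. Subclasses are omitted because they share the same parameters in collection (a), and false negatives are counted as true biomarkers that were not detected, which is assumed to be the intended meaning.

```python
import random

N_FEATURES = 1000     # 500 positive and 500 negative features
N_SAMPLES = 100       # split evenly between the two classes
BASE_MEAN = 10000.0   # mean of all negative features
SIGMA = 100.0         # standard deviation used in collection (a)


def make_collection_a(t, seed=0):
    """Return (data, labels, positives) for one dataset of collection (a).

    data[f][s] is the value of feature f in sample s; the first half of
    the features are positive by construction (an arbitrary choice here).
    """
    rng = random.Random(seed)
    half = N_SAMPLES // 2
    labels = [0] * half + [1] * half
    positives = set(range(N_FEATURES // 2))
    data = []
    for f in range(N_FEATURES):
        row = []
        for c in labels:
            if f in positives:
                # Class means are shifted by -t and +t around the base mean.
                mu = BASE_MEAN - t if c == 0 else BASE_MEAN + t
            else:
                mu = BASE_MEAN
            row.append(rng.gauss(mu, SIGMA))
        data.append(row)
    return data, labels, positives


def error_rates(detected, positives, n_features=N_FEATURES):
    """False positive and false negative rates over the total number of features."""
    false_pos = len(detected - positives)   # detected but not true biomarkers
    false_neg = len(positives - detected)   # true biomarkers left undetected
    return false_pos / n_features, false_neg / n_features
```

Any method under comparison would be run on `data` and `labels` to produce a set of detected feature indices, which is then passed to `error_rates` together with `positives`.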

Implementation and availability of the method

LEfSe is implemented in Python and makes use of R statistical functions in the coin [111] and MASS [112] libraries through the rpy2 library [113] and of the matplotlib [114] library for graphical output. LEfSe is provided with a graphical interface in the Galaxy framework [46,47], which allows the user to select parameters (the primary three stringency parameters, the multiclass setting, and other computational, statistical, and graphical preferences), to pipeline data between modules in a workflow framework, to generate publication-quality graphical outputs, and to combine these results with other statistical and metagenomic analyses. LEfSe is available at [48].

Additional material

Additional file 1: Supplementary Figure S6. Histogram of within-subject β-diversity (community dissimilarity) between different mucosal (red) and non-mucosal (green) body sites.

Additional file 2: Supplementary Figure S1. Cladogram representing the differences between viromes and microbiomes on the subsystem framework.

Additional file 3: Supplementary Figure S2. Histogram of LDA logarithmic scores of biomarkers found by LEfSe comparing microbiomes and viromes within the subsystem framework.

Additional file 4: Supplementary Figure S3. Histogram of LDA logarithmic scores of COG biomarkers found by LEfSe comparing adult and infant microbiomes.

Additional file 5: Supplementary Figure S4. Functional features (COGs) that are discriminative for the comparison between adult and infant microbiomes according to LEfSe but not detected by Metastats, among the discriminant features with LDA score higher than 3. If we consider all the discriminant features without a threshold on LDA score, LEfSe identifies 366 COGs in total, 185 of which are not discriminant for Metastats.

Additional file 6: Supplementary Figure S5. Functional features (COGs) that are discriminative for the comparison between adult and infant microbiomes according to Metastats but not detected by LEfSe. Even if median and variance suggest the differences to be discriminative, there are always some microbiomes (at least two) that are overlapping between classes. This is due to the stringent α-value (0.01) set for the KW test in LEfSe and to the fact that we use non-parametric statistics (differently from Metastats). Notice, however, that even using a low α-value LEfSe detects many more biomarkers than Metastats (366 versus 192).

Additional file 7: Supplementary Figure S9. Comparison between LEfSe and Metastats using the synthetic data described in Figure 5 and in the Materials and methods. LEfSe was applied as detailed in the paper; for Metastats we used the default settings (that is, α = 0.05 and Npermutations = 1,000) and, as for LEfSe and KW, we disabled the per-sample normalization as the features are independent. (a,b) Metastats has a higher false positive rate (average 5%) than LEfSe (average below 0.5%) and lower false negative rate. (c) When the subclass information is meaningful (see Figure 5 for the representation of the dataset), LEfSe performs substantially better than Metastats both in terms of false positives and false negatives. Overall, on these synthetic data, Metastats achieves very similar results compared to KW (Figure 5) and neither of them can make use of additional information regarding the within-class structure, thus achieving poor results compared to LEfSe when such kinds of information are available.

Additional file 8: Supplementary Figure S7. SVM-based effect size estimation for the biomarkers found for the Rag2-/- versus T-bet-/- × Rag2-/- comparison reported in Figure 3 of the manuscript. The LDA-based approach for assessing effect size (Figure 3) is closer to the biological follow-up experiments and is more visually consistent. The reason for LDA superiority over SVM approaches for effect size estimation is theoretically connected with the ability of LDA to find the axis with the highest variance, and the SVM effort on evaluating the combined feature predictive power rather than single feature relevance. It is worth specifying that the effect size estimation accuracy of an algorithm is not directly connected with its predictive ability (SVM approaches are usually considered more accurate than LDA for prediction).

Additional file 9: Supplementary Figure S8. Comparison between the features with the highest SVM-based effect size (Papillibacter, on the left), the highest LDA-based effect size (Bifidobacterium, in the center), and the Actinobacteria phylum (on the right). From a visual analysis, Bifidobacterium shows a larger effect size, which is also evident looking at the ratios between class means, suggesting LDA as a better option for effect size estimation than SVM approaches. As detailed in the manuscript, the relevance of Bifidobacterium has been experimentally validated. Moreover, the large difference in the score given by the SVM approach to Actinobacteria compared to Bifidobacterium and Papillibacter is not consistent.

Additional file 10: T-bet-/- × Rag2-/- - Rag2-/- dataset. Input LEfSe file for the analysis of the ulcerative colitis phenotype in mice.

Abbreviations

bp: base pair; KW: Kruskal-Wallis; LDA: linear discriminant analysis; LEfSe: linear discriminant analysis effect size; PCR: polymerase chain reaction; RDP: Ribosomal Database Project; SVM: support vector machines.

Acknowledgements

We would like to thank the entire Human Microbiome Project consortium, including the four sequencing centers (the Broad Institute, Washington University, Baylor College of Medicine, and the J Craig Venter Institute), associated investigators from many additional institutions, and the NIH Office of the Director Roadmap Initiative. This work was supported in part by grant DE017106 from the National Institute of Dental and Craniofacial Research (JI), NIH grant AI078942 (WSG) and the Burroughs Wellcome Fund (WSG), and was funded by NIH 1R01HG005969 to CH.

Author details

1 Department of Biostatistics, 677 Huntington Avenue, Harvard School of Public Health, Boston, MA 02115, USA. 2 Department of Molecular Genetics, 245 First Street, The Forsyth Institute, Cambridge, MA 02142, USA. 3 Department of Oral Medicine, Infection and Immunity, 188 Longwood Ave, Harvard School of Dental Medicine, Boston, MA 02115, USA. 4 Microbial Sequencing Center, 7 Cambridge Center, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. 5 Department of Immunology and Infectious Diseases, 665 Huntington Avenue, Harvard School of Public Health, Boston, MA 02115, USA. 6 Department of Medicine, 75 Francis Street, Harvard Medical School, Boston, MA 02115, USA. 7 Department of Medical Oncology, 44 Binney Street, Dana-Farber Cancer Institute, Boston, MA 02215, USA.

Authors’ contributions

NS and CH conceived the study; NS and LM implemented the methodology; NS, JI, LW, DG, WSG, and CH analyzed the results; NS, JI, LW, DG, WSG, and CH wrote the manuscript. All authors read and approved the manuscript in its final form.

Received: 4 April 2011 Revised: 31 May 2011 Accepted: 24 June 2011 Published: 24 June 2011

References

1. Golub TR: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.

2. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359:572-577.

3. Tothill RW, Tinker AV, George J, Brown R, Fox SB, Lade S, Johnson DS, Trivett MK, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung JA, Chiew YE, Haviv I, Australian Ovarian Cancer Study Group, Gertig D, DeFazio A, Bowtell DD: Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res 2008, 14:5198-5208.

4. Wei X, Li K-C: Exploring the within- and between-class correlation distributions for tumor classification. Proc Natl Acad Sci USA 2010, 107:6737-6742.

5. De Filippo C, Cavalieri D, Di Paola M, Ramazzotti M, Poullet JB, Massart S, Collini S, Pieraccini G, Lionetti P: Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proc Natl Acad Sci USA 2010, 107:14691-14696.

6. Turnbaugh PJ, Bäckhed F, Fulton L, Gordon JI: Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe 2008, 3:213-223.

7. Ley RE, Peterson DA, Gordon JI: Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell 2006, 124:837-848.

8. Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J: Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut 2006, 55:205-211.

9. Sokol H, Seksik P, Furet JP, Firmesse O, Nion-Larmurier I, Beaugerie L, Cosnes J, Corthier G, Marteau P, Doré J: Low counts of Faecalibacterium prausnitzii in colitis microbiota. Inflamm Bowel Dis 2009, 15:1183-1189.

10. Ordovas JM, Mooser V: Metagenomics: the role of the microbiome in cardiovascular diseases. Curr Opin Lipidol 2006, 17:157-161.

11. Zhang L, Henson BS, Camargo PM, Wong DT: The clinical value of salivary biomarkers for periodontal disease. Periodontology 2000 2009, 51:25-37.

12. Zhang L, Farrell JJ, Zhou H, Elashoff D, Akin D, Park NH, Chia D, Wong DT: Salivary transcriptomic biomarkers for detection of resectable pancreatic cancer. Gastroenterology 2010, 138:949-957, e947.

13. NIH HMP Working Group, Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, Baker CC, Di Francesco V, Howcroft TK, Karp RW, Lunsford RD, Wellington CR, Belachew T, Wright M, Giblin C, David H, Mills M, Salomon R, Mullins C, Akolkar B, Begg L, Davis C, Grandison L, Humble M, Khalsa J, et al: The NIH Human Microbiome Project. Genome Res 2009, 19:2317-2323.

14. Hamady M, Fraser-Liggett CM, Turnbaugh PJ, Ley RE, Knight R, Gordon JI: The Human Microbiome Project. Nature 2007, 449:804-810.

15. Magrini V, Turnbaugh PJ, Ley RE, Mardis ER, Mahowald MA, Gordon JI: An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 2006, 444:1027-1031.

16. Duncan SH, Lobley GE, Holtrop G, Ince J, Johnstone AM, Louis P, Flint HJ: Human colonic microbiota associated with diet, obesity and weight loss. Int J Obesity (Lond) 2008, 32:1720-1724.

17. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, Gordon JI: The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med 2009, 1:6ra14.

18. Gao Z, Tseng C-h, Strober BE, Pei Z, Blaser MJ: Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PLoS ONE 2008, 3:e2719.

19. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science 2005, 308:554-557.

20. Solovyev VV, Allen EE, Ram RJ, Rokhsar DS, Chapman J, Richardson PM, Tyson GW, Rubin EM, Banfield JF, Hugenholtz P: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428:37-43.

21. Lecuit M, Lortholary O: Immunoproliferative small intestinal disease associated with Campylobacter jejuni. Med Mal Infect 2005, 35(Suppl 2):S56-58.

22. Relman DA, Schmidt TM, MacDermott RP, Falkow S: Identification of the uncultured bacillus of Whipple’s disease. N Engl J Med 1992, 327:293-301.

23. Oakley BB, Fiedler TL, Marrazzo JM, Fredricks DN: Diversity of human vaginal bacterial communities and associations with clinically defined bacterial vaginosis. Appl Environ Microbiol 2008, 74:4898-4909.

24. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5121.

25. Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3:Article3.

26. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y: The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 2008, 8:37-49.

27. Swan KA, Curtis DE, McKusick KB, Voinov AV, Mapa FA, Cancilla MR: High-throughput gene mapping in Caenorhabditis elegans. Genome Res 2002, 12:1100-1105.

28. Wooley JC, Ye Y: Metagenomics: facts and artifacts, and computational challenges. J Comput Sci Technol 2009, 25:71-81.

29. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature 2009, 457:480-484.

30. Pedrós-Alió C: Marine microbial diversity: can it be determined? Trends Microbiol 2006, 14:257-263.

31. Sogin ML, Morrison HG, Huber JA, Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA 2006, 103:12115-12120.

32. Gobet A, Quince C, Ramette A: Multivariate Cutoff Level Analysis (MultiCoLA) of large community data sets. Nucleic Acids Res 2010, 38:e155.

33. Dethlefsen L, McFall-Ngai M, Relman DA: An ecological and evolutionary perspective on human-microbe mutualism and disease. Nature 2007, 449:811-818.

34. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17:377-386.

35. Mitra S, Gilbert JA, Field D, Huson DH: Comparison of multiple metagenomes using phylogenetic networks based on ecological indices. ISME J 2010, 4:1236-1242.

36. Mitra S, Klar B, Huson DH: Visual and statistical comparison of metagenomes. Bioinformatics 2009, 25:1849-1855.

37. Parks DH, Beiko RG: Identifying biologically relevant differences between metagenomic communities. Bioinformatics 2010, 26:715-721.

38. Lozupone C, Knight R: UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 2005, 71:8228-8235.

39. Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008, 9:386.

40. Kristiansson E, Hugenholtz P, Dalevi D: ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics 2009, 25:2737-2738.

41. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009, 75:7537-7541.

42. Goll J, Rusch D, Tanenbaum DM, Thiagarajan M, Li K, Methé BA, Yooseph S: METAREP: JCVI Metagenomics Reports - an open source tool for high-performance comparative metagenomics. Bioinformatics 2010, 26:2631-2632.

43. Jolliffe IT: Principal Component Analysis. New York: Springer-Verlag; 1986.

44. Gower JC: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 1966, 53:325-338.

45. White JR, Nagarajan N, Pop M: Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol 2009, 5:e1000352.

46. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010, 11:R86.

47. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010, Chapter 19:Unit 19.10.1-21.

48. LEfSe. [http://huttenhower.sph.harvard.edu/lefse/].

49. Kruskal WH, Wallis WA: Use of ranks in one-criterion variance analysis. J Am Stat Assoc 1952, 47:583-621.

50. Wilcoxon F: Individual comparisons by ranking methods. Biometrics 1945, 1:80-83.

51. Mann HB, Whitney DR: On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 1947, 18:50-60.

52. Fisher RA: The use of multiple measurements in taxonomic problems. Ann Eugenics 1936, 7:179-188.

53. Dal Bello F, Hertel C: Oral cavity as natural reservoir for intestinal lactobacilli. Syst Appl Microbiol 2006, 29:69-76.

54. Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R: Bacterial community variation in human body habitats across space and time. Science 2009, 326:1694-1697.

55. Human Microbiome Project clinical sampling protocol. [http://hmpdacc.org/micro_analysis/microbiome_sampling.php].

56. Turner JR: Intestinal mucosal barrier function in health and disease. Nat Rev Immunol 2009, 9:799-809.

57. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 2009, 37:D141-145.

58. Hilbert F, Scherwitzel M, Paulsen P, Szostak MP: Survival of Campylobacter jejuni under conditions of atmospheric oxygen tension with the support of Pseudomonas spp. Appl Environ Microbiol 2010, 76:5911-5917.

59. Godon J-J, Morinière J, Moletta M, Gaillac M, Bru V, Delgènes J-P: Rarity associated with specific ecological niches in the bacterial world: the ‘Synergistes’ example. Environ Microbiol 2005, 7:213-224.

60. Shah SA, Simpson SJ, Brown LF, Comiskey M, de Jong YP, Allen D, Terhorst C: Development of colonic adenocarcinomas in a mouse model of ulcerative colitis. Inflamm Bowel Dis 1998, 4:196-202.

61. Pizarro T: Mouse models for the study of Crohn’s disease. Trends Mol Med 2003, 9:218-222.

62. Panwala CM, Jones JC, Viney JL: A novel model of inflammatory bowel disease: mice deficient for the multiple drug resistance gene, mdr1a, spontaneously develop colitis. J Immunol 1998, 161:5733-5744.

63. Wirtz S, Neurath MF: Mouse models of inflammatory bowel disease. Adv Drug Delivery Rev 2007, 59:1073-1083.

64. Sartor RB: Mechanisms of disease: pathogenesis of Crohn’s disease and ulcerative colitis. Nat Clin Pract Gastroenterol Hepatol 2006, 3:390-407.

65. Garrett WS, Lord GM, Punit S, Lugo-Villarino G, Mazmanian SK, Ito S, Glickman JN, Glimcher LH: Communicable ulcerative colitis induced by T-bet deficiency in the innate immune system. Cell 2007, 131:33-45.

66. Garrett WS, Gallini CA, Yatsunenko T, Michaud M, DuBois A, Delaney ML, Punit S, Karlsson M, Bry L, Glickman JN, Gordon JI, Onderdonk AB, Glimcher LH: Enterobacteriaceae act in concert with the gut microbiota to induce spontaneous and maternally transmitted colitis. Cell Host Microbe 2010, 8:292-300.

67. Veiga P, Gallini CA, Beal C, Michaud M, Delaney ML, DuBois A, Khlebnikov A, van Hylckama Vlieg JE, Punit S, Glickman JN, Onderdonk A, Glimcher LH, Garrett WS: Bifidobacterium animalis subsp. lactis fermented milk product reduces inflammation by altering a niche for colitogenic microbes. Proc Natl Acad Sci USA 2010, 107:18132-18137.

68. Masaaki O, Yoshimi B, Kai-P L, Nobuko M: Metascardovia criceti Gen. Nov., Sp. Nov., from hamster dental plaque. Microbiol Immunol 2007, 51:747-754.

69. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes. Nature 2008, 452:629-632.

70. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33:5691-5702.

71. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, Sobral B, Stevens R, White O, Di Francesco V: National Institute of Allergy and Infectious Diseases bioinformatics resource centers: new assets for pathogen informatics. Infect Immun 2007, 75:3212-3219.

72. Krebs CJ: Ecology: The Experimental Analysis of Distribution and Abundance. Benjamin Cummings; 2008.

73. Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, Takami H, Morita H, Sharma VK, Srivastava TP, Taylor TD, Noguchi H, Mori H, Ogura Y, Ehrlich DS, Itoh K, Takagi T, Sakaki Y, Hayashi T, Hattori M: Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res 2007, 14:169-181.

74. Tatusov RL: A genomic perspective on protein families. Science 1997, 278:631-637.

75. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29:22-28.

76. Turroni F, Foroni E, Pizzetti P, Giubellini V, Ribbera A, Merusi P, Cagnasso P, Bizzarri B, de’Angelis GL, Shanahan F, van Sinderen D, Ventura M: Exploring the diversity of the bifidobacterial population in the human intestinal tract. Appl Environ Microbiol 2009, 75:1534-1545.

77. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21:3017-3024.

78. Suzuki Y, Nei M: False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol Biol Evol 2004, 21:914-921.

79. Boulesteix A-L: Over-optimism in bioinformatics research. Bioinformatics 2010, 26:437-439.

80. 2020 visions. Nature 2010, 463:26-32.

81. Hamady M, Knight R: Microbial community profiling for human microbiome projects: tools, techniques, and challenges. Genome Res 2009, 19:1141-1152.

82. Wooley JC, Godzik A, Friedberg I: A primer on metagenomics. PLoS Comput Biol 2010, 6:e1000667.

83. Ritchie MD: Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis. Genome Med 2009, 1:65.

84. Tintle N, Lantieri F, Lebrec J, Sohns M, Ballard D, Bickeböller H: Inclusion of a priori information in genome-wide association analysis. Genet Epidemiol 2009, 33(Suppl 1):S74-80.

85. Lin W-Y, Lee W-C: Incorporating prior knowledge to facilitate discoveries in a genome-wide association study on age-related macular degeneration. BMC Res Notes 2010, 3:26.

86. Reeder J, Knight R: The ‘rare biosphere’: a reality check. Nat Methods 2009, 6:636-637.

87. Taylor MW, Schupp PJ, Dahllof I, Kjelleberg S, Steinberg PD: Host specificity in marine sponge-associated bacteria, and potential implications for marine microbial diversity. Environ Microbiol 2003, 6:121-130.

88. Tamames J, Abellán JJ, Pignatelli M, Camacho A, Moya A: Environmental distribution of prokaryotic taxa. BMC Microbiol 2010, 10:85.

89. Kassen R: The experimental evolution of specialists, generalists, and the maintenance of diversity. J Evol Biol 2002, 15:173-190.

90. Frank DN, Pace NR, Peterson DA, Gordon JI: Metagenomic approaches for defining the pathogenesis of inflammatory bowel diseases. Cell Host Microbe 2008, 3:417-427.

91. Young C, Sharma R, Handfield M, Mai V, Neu J: Biomarkers for infants at risk for necrotizing enterocolitis: clues to prevention? Pediatric Res 2009, 65:91R-97R.

92. Asikainen S, Doğan B, Turgut Z, Paster BJ, Bodur A, Oscarsson J: Specified species in gingival crevicular fluid predict bacterial diversity. PLoS ONE 2010, 5:e13589.

93. Wong D, Zhang L, Farrell J, Zhou H, Elashoff D, Gao K, Paster B: Salivary biomarkers for pancreatic cancer detection. J Clin Oncol 2009, 27:4630.

94. Culligan EP, Hill C, Sleator RD: Probiotics and gastrointestinal disease: successes, problems and future prospects. Gut Pathog 2009, 1:19.

95. Preidis GA, Versalovic J: Targeting the human microbiome with antibiotics, probiotics, and prebiotics: gastroenterology enters the metagenomics era. Gastroenterology 2009, 136:2015-2031.

96. Borody TJ, Warren EF, Leis S, Surace R, Ashman O: Treatment of ulcerative colitis using fecal bacteriotherapy. J Clin Gastroenterol 2003, 37:42-47.

97. Khoruts A, Dicksved J, Jansson JK, Sadowsky MJ: Changes in the composition of the human fecal microbiome after bacteriotherapy for recurrent Clostridium difficile-associated diarrhea. J Clin Gastroenterol 2010, 44:354-360.

98. Manichanh C, Reeder J, Gibert P, Varela E, Llopis M, Antolin M, Guigo R, Knight R, Guarner F: Reshaping the gut microbiome with bacterial transplantation and antibiotic intake. Genome Res 2010, 20:1411-1419.

99. You D, Franzos MA: Successful treatment of fulminant Clostridium difficile infection with fecal bacteriotherapy. Ann Intern Med 2008, 148:632-633.

100. Chang Y-w, Lin C-j: Feature ranking using linear SVM. J Machine Learning Res 2008, 3:53-64.

101. Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 2007, 73:5261-5267.

102. Bell TC, Cleary JG, Witten IH: Text Compression. Prentice-Hall, Inc; 1990.

103. HMP Data Analysis and Coordination Center. [http://www.hmpdacc.org/tools_protocols/tools_protocols.php].

104. Mo Bio PowerSoil kit. [http://www.mobio.com/].

105. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 2007, 8:R143.

106. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 2007, 35:7188-7196.

107. Schloss PD: A high-throughput DNA sequence aligner for microbial ecology studies. PLoS ONE 2009, 4:e8230.

108. Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methé B, DeSantis TZ, Human Microbiome Consortium, Petrosino JF, Knight R, Birren BW: Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 2011, 21:494-504.

109. Garrity GM, Lilburn TG, Cole JR, Harrison SH, Euzeby J, Tindall BJ: Taxonomic Outline of the Bacteria and Archaea 2007 [http://www.taxonomicoutline.org/index.php/toba/article/viewFile/190/223].

110. Sequence Read Archive: SRP002012 Human Microbiome Project 454 Clinical Production Pilot (PPS). [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002012#].

111. Hothorn TH, Hornik K, van De Wiel MA, Zeileis A: Implementing a class of permutation tests: the coin package. J Stat Software 2008, 28:1-23.

112. Venables WN, Ripley BD: Modern Applied Statistics with S. 4 edition. Springer; 2002.

113. rpy2. [http://rpy.sourceforge.net/rpy2.html].

114. Hunter JD: Matplotlib: a 2D graphics environment. Computing Sci Eng 2007, 9:90-95.

doi:10.1186/gb-2011-12-6-r60
Cite this article as: Segata et al.: Metagenomic biomarker discovery and explanation. Genome Biology 2011 12:R60.
