an ensemble approach to the structure-function problem in

77
An ensemble approach to the structure-function problem in microbial communities Chandana Gopalakrishnappa, 1,Karna Gowda, 2,3,Kaumudi Prabhakara, 2,3,Seppe Kuehn 2,3* 1 Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA 2 Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA 3 Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL 60637, USA These authors contributed equally to this work. * Correspondence: [email protected] Abstract The metabolic activity of microbes has played an essential role in the evolu- tion and persistence of life on Earth. Microbial metabolism plays a primary role in the flow of carbon, nitrogen and other essential resources through the biosphere on a global scale. Microbes perform these metabolic activities in the context of complex communities comprised of many species that interact in dynamic and spatially-structured environments. Molecular genetics has revealed many of the metabolic pathways microbes utilize to generate en- ergy and biomass. However, most of this knowledge is derived from model organisms, so we have a limited view of role of the massive genomic diver- sity in the wild on metabolic phenotypes. Further, we are only beginning to get a glimpse of the principles governing how the metabolism of a com- munity emerges from the collective action of its constituent members. As a result, one of the biggest challenges in the field is to understand how the metabolic activity of a community emerges from the genomic structure of the constituents. Here we propose an approach to this problem that rests on the quantitative analysis of metabolic activity in ensembles of microbial com- munities. We propose quantifying metabolic fluxes in diverse communities, either in the laboratory or the wild. We suggest that using sequencing data to quantify the genomic, taxonomic or transcriptional variation across an Preprint submitted to iScience November 12, 2021 arXiv:2111.06279v1 [q-bio.PE] 11 Nov 2021

Upload: others

Post on 04-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An ensemble approach to the structure-function problem in

An ensemble approach to the structure-function

problem in microbial communities

Chandana Gopalakrishnappa,1,† Karna Gowda,2,3,† KaumudiPrabhakara,2,3,† Seppe Kuehn2,3∗

1Department of Physics, University of Illinois at Urbana-Champaign,Urbana, IL 61801, USA

2Department of Ecology and Evolution, University of Chicago,Chicago, IL 60637, USA

3Center for the Physics of Evolving Systems, University of Chicago,Chicago, IL 60637, USA

†These authors contributed equally to this work.∗Correspondence: [email protected]

Abstract

The metabolic activity of microbes has played an essential role in the evolu-tion and persistence of life on Earth. Microbial metabolism plays a primaryrole in the flow of carbon, nitrogen and other essential resources through thebiosphere on a global scale. Microbes perform these metabolic activities inthe context of complex communities comprised of many species that interactin dynamic and spatially-structured environments. Molecular genetics hasrevealed many of the metabolic pathways microbes utilize to generate en-ergy and biomass. However, most of this knowledge is derived from modelorganisms, so we have a limited view of role of the massive genomic diver-sity in the wild on metabolic phenotypes. Further, we are only beginningto get a glimpse of the principles governing how the metabolism of a com-munity emerges from the collective action of its constituent members. Asa result, one of the biggest challenges in the field is to understand how themetabolic activity of a community emerges from the genomic structure ofthe constituents. Here we propose an approach to this problem that rests onthe quantitative analysis of metabolic activity in ensembles of microbial com-munities. We propose quantifying metabolic fluxes in diverse communities,either in the laboratory or the wild. We suggest that using sequencing datato quantify the genomic, taxonomic or transcriptional variation across an

Preprint submitted to iScience November 12, 2021

arX

iv:2

111.

0627

9v1

[q-

bio.

PE]

11

Nov

202

1

Page 2: An ensemble approach to the structure-function problem in

ensemble of communities can reveal low-dimensional descriptions of commu-nity structure that can explain or predict their emergent metabolic activity.We motivate this approach broadly, and situate it historically. We surveythe types of communities for which this approach might be best suited andthen review the analytical techniques available for quantifying metabolitedynamics in communities. Finally, we discuss what types of data analysisapproaches might be lucrative for learning the structure-function mappingin communities from these data.

Keywords: microbial communities, metabolism, dimension reduction,machine learning, genomic structure,

Introduction: the structure-function problem in microbial commu-nities.

The evolutionary history of the biosphere is inextricably linked to themetabolic activities of microbes. Since life arose on this planet, microbeshave lived in consortia that saturated nearly every biochemical niche on theplanet, driving global transformations in the chemical composition of thebiosphere via metabolic processes from fermentation to photosynthesis torespiration [1, 2, 3, 4, 5]. As such, microbes and the communities in whichthey reside are the result of an ongoing eco-evolutionary process that couplesthe transformation of metabolites to the complex dynamics of interactingecological systems across many spatial and temporal scales.

Given the importance of the metabolic activity of microbial communitieswe argue that a major goal for the field should be to predict, design, andcontrol the metabolism of microbial communities in complex, natural, andengineered settings. Accomplishing this goal requires understanding how thestructure of a community, in terms of the taxa present and its genomic com-position, determines its metabolic activity in a given environmental context.The sequencing revolution has revealed the structure of microbial communi-ties at the level of the taxa present, the genes they posses and the dynamicsof gene expression. This means that we now have a detailed and dynamic“parts list” for microbial communities in terms of taxa and genomic com-position across a range of environments, from anaerobic digesters [6, 7, 8]to the human gut [9, 10], soils [11], and the ocean [4]. For some metabolicprocesses we can interpret gene content and taxa in terms of the specificmetabolic processes that they are capable of. For example, we know the

2

Page 3: An ensemble approach to the structure-function problem in

dominant taxa that perform processes such as nitrification [12] or polysac-charide degredation [13]. Further, by annotating metagenomic data we canassign specific functional roles for many (but not all) of the genes presentin a given community. As a result, we can measure the prevalence of en-zymes that perform the reactions necessary for specific metabolic processes.Despite the remarkable scale and breadth of these sequencing data, we stilldo not have a predictive, quantitative framework for using these data to un-derstand, predict, and design the metabolism of the communities in complexenvironments.

In this perspective we explore what makes this problem both challengingand important. We propose a specific approach to begin to address thisquestion, and examine what types of communities and associated metabolicprocesses might be amenable to this approach. We review the techniquesthat are relevant to implementing the approach with a focus on methods forquantifying metabolites.

The significance of finding a solution

Microbial communities play an outsized role in driving fluxes of nutrientsthrough the biosphere. Photosynthetic microbes are responsible for nearlyhalf of the carbon fixation on the planet [14]. These phototrophs work inconcert with heterotrophic bacteria that enable primary productivity in ter-restrial, marine and fresh water environments [15, 16]. We are only beginningto glimpse the role of the collective in this nearly 100 gigaton annual carbonflux. Bacteria and archaea in anaerobic environments degrade complex car-bon sources to methane, playing an important role in carbon recycling andclimate change [16].

In the nitrogen cycle, microbes play a key role in nitrogen fixation (dini-trogen gas to ammonia), nitrification (ammonia to nitrate), and denitrifica-tion (nitrate to dinitrogen gas) [17]. These processes are key for wastewatertreatment [18] and human health [19]. A critical challenge is to form a quanti-tative and predictive understanding of how microbial communities drive thesefluxes. To give a concrete example, the process of denitrification, performedby bacterial communities in soils, reduces nitrate to dinitrogen gas. An in-termediate in the conversion of nitrate to dinitrogen is the potent greenhousegas and ozone depleting compound nitrous oxide [20]. Denitrifying commu-nities in some cases (especially in agricultural soils) can leak nitrous oxide,but in other cases fully convert nitrous oxide to harmless dinitrogen gas.The question then becomes: what controls the production of nitrous oxide

3

Page 4: An ensemble approach to the structure-function problem in

from denitrifying communities in soils? Can we manipulate these microbialcommunities to limit nitrous oxide production? To address this we needto understand how the structure of the community and the environmentalcontext determine the flux of metabolites through the system.

Similarly, the essential importance of resident microbiota in host healthis now clear [19], but as yet it is unclear how to rationally manipulate thesecommunities to benefit the host. There exists tantalizing evidence that thiscan be done, for example by altering metabolic phenotypes [21] or treatingpersistent infections [22], but we lack general approaches for developing suchstrategies. Here we focus on environmental microbiomes, but we emphasizethat the strategy proposed here could, and in a few cases has been [10],applied to host associated communities.

Defining structure and function

Before moving forward we take a moment to define community structureand function. We define the structure of the community as the taxa present,as well as the genomic structure of each taxon, which may include everythingfrom the detailed knowledge of the regulatory architecture of each gene, tothe syntenic organization of the genome [23], to even the presence of phage.The structure of the community may, if necessary, include transcriptional orproteomic information at the metagenomic or single-taxon level as well.

We define the function of a community as the collective metabolic ac-tivity of all constituent organisms, which therefore operates in the space ofmetabolites. The dynamic or steady-state flux of metabolites through theconsortium define its metabolic function. Depending on the context, themost important metabolic fluxes may include electron donors (e.g., organiccarbon), acceptors (e.g., oxygen, nitrate), secondary metabolites, biomass,overall catabolic activity, or byproducts. A note about usage: some readersmay find the term “function” teleological, implying some sort of purpose onmicrobial communities. We use the term function to mean the activity oraction of a community without any implication of purpose. Despite this po-tential confusion, we find that using the term function is a useful shorthand.In particular, we would like to invoke a certain symmetry between the ideaspresented here and the problem of sequence, structure and function at thelevel of proteins. The structure-function problem for microbial communitiesis therefore to deduce the mapping from the space of genes, transcripts, pro-teins and taxonomic organization to metabolite fluxes, and to understand theenvironmental dependence and context in which this mapping is relevant.

4

Page 5: An ensemble approach to the structure-function problem in

How should we approach the problem?

Understanding how the metabolic activity of a microbial community emergesfrom the taxa present and their metabolic capabilities is a problem of connect-ing heirarchical scales of biological organization, from genes to phenotypesand interactions in specific environmental contexts. What makes this chal-lenging is the fact that processes at different scales feed back on one another.For example, compounds that mediate interactions between strains can doso by modulating gene expression [24]. Similarly, phenotypic variation in oneorganism can modify interactions [25], and organisms modify their environ-ment with widespread impacts on other members of the consortium [26].

One way to proceed is via the reductionist mode that has motivatedbiology over the last century [27]. In the context of communities, this wouldmean dissecting the mechanistic and physiological metabolic properties ofeach member of the community and understanding how metabolite dynamicsemerge at the level of the collective. The challenge is the complexity ofthese systems, which makes a detailed, mechanistic understanding of thecollective metabolism a massive undertaking. For synthetic communitiescomprising model organisms, where detailed information are available, thistype of approach has had some success [28, 29]. For comparatively simplecommunities of a few strains, such as bacteria cross-feeding amino acids [30]or ciliates consuming bacteria [25], detailed models have been constructedand validated. However, in environmental or host-associated contexts, wherethe number of strains present is enormous and many organisms present arepoorly studied or challenging to work with in the laboratory, this approachfaces huge challenges.

In this scenario what are we to do? On the one hand, building detailedmodels as described above is ill-advised. Even when such detailed models canbe built, the challenge of distilling simple principles from these models in-creases with their complexity (see Borges “On the exactitude of science” [31]).On the other hand, we know from many examples that a huge number of pro-cesses from antibiotic warfare [32] to competition [33], mutualism [34], andstress responses [35] influence interactions and metabolism. So how can wejustify not building models that include these details?

In the spirit of Philip Anderson’s influential essay [36], it may be thatunderstanding communities requires explaining entirely new, and potentiallysimpler properties that are emergent at the level of the collective. In thiscase, due to the scale and complexity of the community, new and distinct

5

Page 6: An ensemble approach to the structure-function problem in

phenomena emerge from the individual parts, and discovering the organiz-ing principles requires an approach that goes beyond dissecting the detailedmetabolic phenotypes of each individual member of the community. To beclear, we are not advocating the idea that reductionist approaches are notuseful. Their utility is clear from many examples [37, 38, 39, 40, 35]. Instead,we are asking whether making headway on the “structure-function problem”as we have defined it might require a complementary approach that focusesnot on the detailed mechanisms and physiology of each organism, but onpatterns that are evident at the level of the collective. Here we discuss suchan approach.

Learning the right variables: the power of statistics across ensem-bles

We find it useful to cast the structure-function question in terms of aprediction problem. In this framing, the goal is to predict metabolite dy-namics or fluxes from community structure for a given environment or set ofenvironmental conditions. The key question then becomes, what structuralelements must we quantify to predict function? Equivalently, we could ask,how predictive of metabolic function is community gene content, taxonomy,transcriptional or proteomic data?

One approach to this problem, which has found success in both physicsand biology, is to look for statistical regularities across replicate systems andallow these patterns to naturally define variables that can be used to makepredictions. In many cases, this approach can reveal the salient, emergent,features of a system and provide deep insight into their function.

Specifically, from proteins [41] to multicellular organisms [42], examin-ing statistical variation across many replicates of a system, or an ensemble,has proven powerful. For example, covariation across ensembles of homolo-gous proteins have been used to reveal which amino acids are in contact inthe folded structure [43], and co-evolving groups of residues that correlatewith enzyme function [41]. Careful analysis of behavioral variation in largenumbers of microbes [44] and flies [45] suggest that apparently very complexbehaviors can in fact be described by a relatively small number of elementarybehavioral features [45, 46]. Statistical analysis of morphological variationacross higher organisms suggests that morphologies adhere to constraintsthat some believe are associated with specific functional capabilities [47, 48].A recent study of variation in patterning in the fly wing shows that a singlemode of variation describes the response of the wing developmental program

6

Page 7: An ensemble approach to the structure-function problem in

to genetic and environmental perturbations [42]. For a recent piece discussingwhy low-dimensionality might be an inherent property of evolved systems seeEckmann and Tlusty [49].

The common feature of all of these examples is that by judiciously study-ing the variation across a carefully chosen ensemble of systems, one can oftendiscover simple, relatively low-dimensional features that enable the predic-tion of the functional properties of the system. Here we advocate a similarapproach to communities.

An ensemble approach to the structure-function problem

In light of these considerations, we propose an ensemble approach to thestructure-function problem in communities. We motivate this approach byanalogy to a similar approach taken at the level of proteins (Fig. 1). Forproteins the structure function problem is to predict the fold and functionof a protein from its amino acid sequence. One way to approach this prob-lem is by performing detailed physico-chemical simulations of a polypeptidechain via molecular dynamics (Fig 1A). While much progress has been madein this approach, it has proven a huge technical challenge. However, a sta-tistical approach, ignoring the mechanistic details and instead consideringonly statistics of a multiple sequence alignment does a remarkably good jobof predicting protein folds [50]. Similar approaches reveal low-dimensionalstructure in proteins which is predictive of function [41, 51]. These studiessuggest that much can be learned by carefully considering variation across asuitably chosen ensemble of systems.

In the context of microbial communities flux balance models of commu-nity metabolism are analogous to molecular dynamics simulations in pro-teins because they attempt a detailed mechanistic accounting of all of thephenomena within a community (Fig. 1B). However, in communities thereis comparatively little work pursuing a statistical approach analogous to theone taken at the level of proteins.

Taking such an approach is precisely what we are advocating here. Wepropose quantifying the structure of a collection, or ensemble, of communi-ties using sequencing while simultaneously measuring metabolite dynamics.Given such data, one can then approach the structure-function problem byasking whether variation in community structure (e.g., across metagenomesor metatransciptomes) permit quantitative insights into the functional prop-erties of these communities. The proposal is then to leverage these insightsto design, predict and control community function. Several studies in the

7

Page 8: An ensemble approach to the structure-function problem in

past few years have begun to explore statistical approaches [52, 10]. How-ever, we suggest that this approach is under explored and that that there is apressing need to collect new data sets that are explicitly designed to pursuea statistical approach to the structure-function problem in communities.

Overview of the paper

We will focus on three main challenges that must be surmounted to applythe ensemble approach to the structure-function problem: (1) choosing anensemble across which one should make comparisons and look for patterns,(2) measuring metabolite dynamics, which requires analytical chemistry tech-niques that are often not standard practice in microbial ecology labs, and (3)using these data to distill the mapping from structure to function.

Our intention is to provide a roadmap for how such an approach might beapplied across communities of interest. We recognize that this roadmap is farfrom complete, and that many pitfalls exist that may render this approachchallenging in various circumstances.

We will not review sequencing technologies, which have been widely andcapably recapped elsewhere [53]. We will focus largely on microbial commu-nities in environmental contexts rather than health-related (human micro-biome) contexts, in part because environmental microbiology is where ourexpertise lies, but also because of the abundance of existing literature on thelatter topic. Finally, we will neglect the many recent advances in theoreticalecology, in particular the renaissance in consumer-resource models appliedto communities, in service of focusing on questions that can be settled em-pirically.

Choosing communities and ensembles

The first challenge is choosing a community and an associated metabolicprocess to interrogate. The choice is far from trivial and there is no sim-ple prescription. Instead we appeal to the intuition of microbial ecologists,experts in physiology and the quantitative considerations of applied mathe-matics and physics. The goal should be to circumscribe a well-defined com-munity and, if possible, an associated metabolic process where quantitativemeasurements in many replicates can be made.

Choosing a metabolic process, and therefore specific metabolite measure-ments to be made, is a significant challenge, and compromises are inevitable.Here we reiterate the point that microbial communities are not “functional”

8

Page 9: An ensemble approach to the structure-function problem in

Figure 1: Sequence, structure and function in proteins and microbial communi-ties. We propose that there exist analogous solutions to the sequence-structure problemin protein folding and the structure-function problem in microbial communities. (A) Themapping from amino acid sequence to 3D protein structure can be accomplished eitherby a simulation approach (e.g., molecular dynamics) or by a statistical approach (e.g.,direct coupling analysis). The former is a computationally-intensive strategy to simulate3D protein structure based on first-principles modeling of atomic interactions. The latterleverages information about residue coevolution from an ensemble of amino acid sequencesto infer which residues are in contact, allowing for an elegant and interpretable statisticalinference of 3D structure. (B) The mapping from genomic and metagenomic sequences tocommunity metabolic activity can be achieved through community flux balance modeling,or, as we propose, a statistical ensembles approach. The former requires genome-levelmetabolic models of each organism to be built, a labor-intensive iterative process that sofar has been successful primarily in a handful of model organisms. The latter leveragesthe diversity and variation in an ensemble of communities to learn an an effective mappingbetween community sequence content metabolic activity.

9

Page 10: An ensemble approach to the structure-function problem in

in the sense that they are designed with some purpose. Despite this fact, aswe will discuss below, the importance of metabolic activity of communities inspecific niches is clear, as is the relevance of these processes for the biospheremore broadly. To make this point clear, we begin by discussing specific com-munities and their associated environments and metabolic processes, with aneye towards how one might apply the ensemble approach to dissecting theirfunction.

Model microbial communities

Structure-function in the wild

Soils

Perhaps no microbial community on Earth is more important than thatwhich inhabits soils. The soil microbiome plays a key role in plant growthand physiology [54], in particular through nitrogen fixing bacteria that pro-vide reduced nitrogen for plant hosts. The storage of carbon, and productionof CO2 via respiration of reduced organic compounds in soils are key compo-nents of the global carbon cycle [55, 55]. As the climate warms, so the rateof microbial respiration of CO2 from soils increases [56, 57], potentially driv-ing a positive feedback loop with dire consequences for the global climate.Similarly, denitrification in agricultural soils is responsible for roughly 80 %of the anthropogenic release of nitrous oxide (N2O) (see https://www.epa.

gov/ghgemissions/overview-greenhouse-gases and [20]). Nitrous oxideis 300-fold more potent than CO2 as a greenhouse gas and responsible for ap-proximately 10 % of the global warming potential from human activity. Forthese reasons, there is keen interest in associating soil microbiome structureto process rates such as CO2 or N2O production and N2 fixation. However,most attempts to find a relationship between soil microbiome structure andthe rates of key metabolic processes in soils have found only marginal suc-cess [58, 59, 60, 61].

Despite these difficulties we see reason for optimism. It is known thata relatively small number of environmental factors are the dominant driversof variation in soil community structure: pH, moisture, carbon and nitrogenavailability, temperature, and redox potential [61]. Moreover, while soils areroutinely cited as very complex microbial communities, much like the humangut they are typically dominated by a handful of taxa (e.g., Acidobacteria,Verrucomicrobia [61, 62]), with most other strains present in relatively low

10

Page 11: An ensemble approach to the structure-function problem in

Figure 2: Community structure and function in the wild. (A) Algal blooms area microbial successional processes that follow from the input of exogenous nutrients toaquatic environments. Reduced carbon fixed from CO2 by algae are consumed, alongwith other nutrients, by heterotrohic bacteria in reproducible successional dynamics. (B)Marine snow particles are aggregates of organic carbon that are formed near the ocean’ssurface and subsequently sink to the ocean floor. Microbial communities can degradethese particles, and the amount of carbon that is mineralized to CO2 versus the amountthat is sequestered on the ocean floor depends strongly on the structure of the microbialcommunity. (C) Microbial mats are layered communities that occur at air-water interfaces,often in extreme thermal environments such as hot springs. The spatial structure ofthese communities follows from exchanges of nutrients governed by redox gradients. (D)Pink berries are microbial aggregates that cryptically (internally) cycle sulfur betweenphotosynthetic purple sulfur bacteria and anaerobic sulfate reducing bacteria.

11

Page 12: An ensemble approach to the structure-function problem in

abundances. Moreover, there are clear patterns in the abundances of bacte-ria and fungi in soils, with high biomass turnover environments like grass-lands dominated by bacteria and low biomass turnover forests dominated byfungi [61].

As acknowledged in recent meta-studies [58], one challenge in associat-ing soil community structure to metabolic function is a lack of high qualitydata sets where process rates (e.g., CO2 production) are measured in a largeensemble of soil communities. Exceptions to this include a recent surveyof global topsoil microbiomes [11], and microcosm studies documenting therole of multiple environmental perturbations applied to soils [59]. Despitethese advances, the consensus remains that predicting metabolic processesin soils from microbial community structure is challenging [61]. However,we note that in some cases this conclusion is derived from meta-studies thataggregate data from different experiments or labs. Such comparisons canbe challenging given systematic errors in sequencing measurements betweenprotocols [63], and the strong dependence of measurements like soil pH onthe technique employed [64].

Given these considerations we propose that one route forward is the ju-dicious collection of data from large ensembles of soil communities followedby careful quantification of process rates and community structure in a con-sistent manner. We acknowledge that even in the presence of such data,relating community composition at the taxonomic, genomic or transcrip-tomic levels to metabolite fluxes may remain a challenge. It may be that soiltaxa are not “good variables” for understanding function, and that chemicalor abiotic properties such as the redox state of available organic carbon arenecessary [65]. Regardless, the importance of understanding the metabolicactivities of soil communities cannot be overstated.

Algal blooms

In aquatic systems, exogenous inputs of nitrogen and phosphorous, of-ten driven by human activity, result in the dramatic successional processof the algal bloom [66] (Fig. 2A), wherein photosynthetic microbes explodein abundance. The rise of phototrophic microbes brings with it a complexcommunity of non-photosynthetic (heterotrophic) bacteria that form tightsymbioses with the phototrophs. These phototroph-heterotroph communi-ties cycle large quantities of carbon, with phototrophs fixing CO2 to reducedorganic carbon, which is in turn consumed by the associated heterotrophicbacteria.

12

Page 13: An ensemble approach to the structure-function problem in

In this system, a natural framing of the structure-function problem wouldbe to ask how the taxonomic composition of the community impacts thefixation of carbon and eventual bacterial biomass production. Teeling etal. [66] have found that specific classes of bacteria grow at different phasesof the bloom, e.g., Alphaproteobacteria dominated the pre-bloom, as bloom-ing commenced, Bacteroidetes increased more rapidly than others, and thenGammaproteobacteria grew much later. Concurrent metagenomic studiesshowed that the abundances of enzymes capable of degrading carbohydratesand sulfatases required to degrade sulfated algal polysaccharides increasedin abundance during the bacterial succession. The degradation of the largerpolysaccharides leads to the production of shorter organic compounds, whichwas revealed by the expression of the relevant transporters. These results,and those of other studies [67, 68, 69, 70, 71], indicate a link between taxon-omy of bacteria associated with certain photosynthetic strains during bloom-ing events.

These results suggest that these associations are at least in part deter-mined by the carbon catabolic activity of the bacteria and the fixed carbonthat phototrophs excrete (see also [72]). Taking an ensemble approach tothis problem could be accomplished by recapitulating bloom dynamics in alaboratory context [73], where the identity of the phototroph and the com-position of the bacterial community could be manipulated and the resultingtotal carbon flux quantified [74]. Such studies might shed insights on how tocontrol blooms in natural settings.

Marine snow

Marine snow is a term used to describe micrometer to millimeter-scaleaggregates in the ocean, which are typically made of detritus containing car-bon and nitrogen in the form of polysaccharides, microbes, and inorganicsubstances. These particles form at the upper levels of the ocean and sinkto the ocean floor, transporting carbon in a “pump” that removes carbonfrom the atmosphere over geologic timescales [75, 76]. Communities of bac-teria and other microbes consume these aggregates as they sink (Fig. 2B).Understanding how the structure of marine snow communities impacts thedegradation of carbon and the production of metabolic byproducts (e.g.,CO2) is a critical question for understanding global carbon fluxes.

Studying these particles in situ is challenging [77]. To circumvent thisdifficulty, some studies use synthetic particles to study community assemblyand function. One study used agar beads [78] to study the colonization by

13

Page 14: An ensemble approach to the structure-function problem in

microbes in raw seawater, revealing successional dynamics underlying parti-cle degradation. The particles were initially colonized by bacteria, and laterby flagellates. Cordero and colleagues have taken a similar approach to un-derstand colonization dynamics [79, 80, 81, 82]. These studies shed light onthe metabolic roles played by different players during particle colonizationdynamics: primary degraders excrete enzymes to break down polysaccharidechains, which creates a niche for secondary degraders capable of consumingmonomers and oligomers of the polysaccharide, which in turn makes way forscavengers that take up metabolic byproducts of the primary and secondarydegraders (ammonia, amino acids).

In terms of the approach proposed here for mapping structure to function,particle degradation offers a powerful model system because so much is knownabout how the communities collectively degrade the particle. For example,it would be interesting to see whether one could take a statistical approachto inferring the key metabolic traits of primary and secondary degradersand scavengers from sequencing data alone. In this case the function of thecommunity to predict would be the fraction of carbon degraded or the totalCO2 respired. The power of this system is the ability to manipulate structure,but the particles make it challenging to measure carbon degradation directly(see the next section for further discussion).

Plastisphere

Closely related to the degradation of polysaccharide particles in the oceanis the recent rise of plastic debris in freshwater and marine environments, andthe microbial communities associated with its degradation [83]. Given thefact that a few million tons of plastic enter the ocean per year [84], this is animportant ecosystem to study, not only to understand how these pollutantsaffect the ecosystem, but also to find potentially efficient plastic degradingcommunities. Further, plastic is a comparatively new environment on evo-lutionary timescales, making it interesting to study from the perspective ofevolution. It has been shown that the community composition on plastic isdifferent from the composition on other substances in the same conditions,and that certain taxa are commonly found on plastics [85, 86]. Diatomshave been observed in high numbers, although a strong succession dynam-ics was observed. Other photoheterotrophs, heterotrophs, ciliates, fungi, andpathogens have also been observed. Although the functional role of these mi-crobes are unclear, it is speculated that chemotaxis, interactions with metalsand degradation of low molecular weight polymers are important factors that

14

Page 15: An ensemble approach to the structure-function problem in

determine community composition [83]. In this system the structure-functionproblem is similar to that outlined for marine snow - how does communitystructure determine the rate of plastic degradation?

Microbial mats

Microbial mats are stratified communities that often form at air-water in-terfaces in extreme environments such as hot springs [87]. Mat communitiesharbor a top layer of photoautotrophic bacterium (typically Synechococcussp.) that use light to fix carbon during the day and often fix nitrogen atnight. Below the top layer are strata of various heterotrophs and anaer-obic autotrophs [88] (Fig. 2C). These communities have been studied fordecades, and much is known about the metabolic roles each strain plays inthe community [89, 90] and metagenomic datasets are available [91]. It is re-markable that similar mat structures form in many different contexts acrossthe globe. In these mats much of the nutrients are fixed from CO2 andN2 [92], heterotrophic community members then consume reduced organiccarbon excreted by the autotrophs. In this context the question of relatingstructure to function falls to mapping the flux of C, N and other metabolitesthrough the mat to the taxonomic and metagenomic structure of the system.For example, what is the simplest community that will stably form a mat?What pathways, for example in the primary autotroph, are essential to matformation and which are dispensable? Further, given that mats support re-markable allelic diversity driven by extensive recombination [93], how is thefunctional genetic repertoire of these communities maintained? Preliminarywork on mats in California [94] showed the presence of a core genome acrosssamples, and various other genes thought to be useful for specialized func-tions, but further metagenomic analysis is likely to be useful in addressingthe structure-function question in mats. The main two challenges in thesecommunities are obtaining axenic isolates from the mats and making highquality metabolite dynamics measurements in situ. Petroff and colleagueshave recently made exquisite quantitative measurements of oxygen dynam-ics in mats [95, 96], addressing the latter challenge, which opens the doorto making quantitative links between community composition and metabolicactivity.

Cryptic sulfur cycling in microbial aggregates

Microbial metabolism drives the global cycling of sulfur through energy-generating oxidation and reduction reactions. It has become increasingly

15

Page 16: An ensemble approach to the structure-function problem in

evident that much of this cycling occurs cryptically [2, 97] (i.e., with lowsteady state metabolites levels but substantial fluxes), often in the context ofcellular aggregates where oxidation and reduction reactions occur in physicalproximity [98, 97]. Inferring the presence and rate of cryptic sulfur cyclingin microbial aggregates has important implications for our understanding ofother elemental cycles, since sulfur cycling is often tightly coupled with thecarbon and iron cycles.

Remarkable examples of cryptic sulfur cycling phenomenon are the so-called “pink berry” consortia [99, 98], which are discrete macroscopic (∼1cm) aggregates that occur on the surface of submerged sediments in theintertidal pools of Sippewissett Salt Marsh (Massachusetts, USA) (Fig. 2D).The bright pink coloration of these aggregates is attributable to the purplesulfur bacteria (PSB) that make up the majority of cellular biomass. ThesePSB oxidize sulfide to sulfate via the process of anoxygenic photosynthesis.Accompanying these PSB are sulfate reducing bacteria (SRB), which deriveenergy from anaerobic respiration by catalyzing the reverse process, reducingsulfate to sulfide. Together these PSB and SRB are capable of locally andcryptically cycling sulfur via the syntrophic exchange of oxidized and reducedsulfur compounds [98].

While culture-based approaches to characterizing and reconstituting thepink berry consortia in the lab remain a challenge, the discrete nature ofthe pink berries presents an opportunity to statistically characterize thestructure-function relationship at the level of individual aggregates harvestedfrom the wild. Bulk metagenomic sequencing of pink berry consortia indi-cates that typically 2-3 phylotypes make up the majority of biomass in theaggregates [98]. Laboratory measurements of sulfide oxidation or sulfate re-duction rates using spatially-explicit microprobe measurements followed by16S metagenomic sequencing on harvested pink berries could provide insightinto how the abundance of these phylotypes quantitatively relates to sulfurcycling. More generally, such an approach could be applied to characterizingcryptic sulfur cycling in other aggregated contexts, such as in marine parti-cles [97]. Predicting sulfur cycling is important to understanding other crit-ical elemental cycles: sulfide oxidation by PSB can contribute substantiallyto carbon fixation [100], and sulfate reduction by SRB can be an importantsink for reduced carbon [101] and iron [102] in environments where electronacceptors are otherwise scarce.

16

Page 17: An ensemble approach to the structure-function problem in

Figure 3: Community structure and function under domestication. (A) Kefirgrains extracellular-polymeric aggregates host to microbial communities that inoculatemilk for the production of kefir. These communities undergo a reproducible successionalprocess that involves the production and consumption of fermentation byproducts, whichultimately give kefir its desired flavor. (B) Anaerobic bioreactors often use granulated mi-crobial communities to remove waste products such as reduced carbon, nitrogen, and phos-phorous from water. Improving the performance and efficiency of these systems throughthe ensembles-informed design of communities would increase their viability as alterna-tives to traditional wastewater treatment approaches that expend significant energy onaeration.

17

Page 18: An ensemble approach to the structure-function problem in

Community structure and function under domestication

Many key insights in the Origin of Species came from studying trait vari-ation under domestication. In the same way, we propose that learning therules for mapping structure to function in communities should leverage themany instances in which communities have been domesticated. Here we ex-plore some of these opportunities.

Microbial communities in the dairy industry

The production of yogurt, cheese and other dairy products relies heavilyon microbes. The work of Dutton and colleagues on cheese rinds [103] is animportant example in this context. This study compared and characterizedmore than 100 cheese rinds from across 10 countries, and found that less than15 bacterial taxa and 10 fungal taxa were present in abundances of more than1%. Further, most of these taxa are not present in the starting cultures, andtheir function is unknown. They found that the community composition isstrongly correlated to the aging process and moisture rather than geographyor milk source. The functional profile of the communities, found using shot-gun metagenomics, was correlated to pH as well, and pathways were shownto correlate as expected with the cheese types. These results point to theidea that the chemical environment is perhaps the strongest determinant ofcommunity structure.

Remarkably, when the dominant taxa were reconstituted in vitro andcultured as a community in media to mimic treatments in different typesof cheese rinds, divergent communities were formed depending on the treat-ments, which showed some properties similar to the original cheese rinds in-cluding their abundance dynamics. This remarkable result shows that thesecommunities can be reconstituted in the laboratory to recapitulate some ofthe basic functional features of the domesticated cheese communities. Itwould be interesting to extend these studies further by looking quantita-tively at the metabolite dynamics. The fact that these communities can beso readily manipulated means that learning the relationship between com-position and metabolic activity is now accessible. What remains unclear,to us, is what the salient metabolic features of these communities are thatshould be explained. One way to approach this question would be to chemi-cally characterize the transformations that occur in specific rinds and to thenask whether subsets of the full community can or cannot recapitulate theseprocesses.

18

Page 19: An ensemble approach to the structure-function problem in

Another promising microbiome in the dairy sector is the kefir commu-nity (Fig. 3A). Kefir grains are used to make a fermented drink like yogurt.They have about 50 bacterial and yeast taxa, which are resilient to stresses,and most of which perform lactic and acetic acid fermentation. A recentstudy [104] found that kefir grains, which are polysaccharide matrices syn-thesized by the bacterial consortium, collected from diverse locations had avery similar core community and differed only in rare species. Kefir grainsare sustained much like sourdough starters and added to milk to initiatea fermentation process. When added to milk the composition of the graincommunity is stable while the community present in the milk exhibits a suc-cession. Metabolite changes during the colonization showed similar succes-sion dynamics. The study dissects the interactions between the communityon the grains and that in the milk, and demonstrates reproducible metabo-lite dynamics in this system. As a result, kefir constitutes another power-ful platform for manipulating community structure (e.g., composition of thecommunity on the grains) and learning the impact of those changes on themetabolite dynamics. Blasche et al. have already made significant progressin this regard, but it remains to investigate in a high-throughput statisticalfashion, how the composition of the grain community confers the remarkablerobustness they observe.

Anaerobic bioreactors

The treatment of wastewater for reuse and release into the environmentrequires the removal of large quantities of organic matter, much of whichis insoluble or otherwise slow to degrade. Anaerobic bioreactors serve animportant role in this industrial process, harnessing microbial metabolismto degrade such organic matter into CO2 and CH4 gases (Fig. 3B). Becausethe bioreactors are fed a range of inputs and are operated under a variety ofconditions, the microbial communities that populate bioreactors are highlyfunctionally and taxonomically diverse [105]. Organisms that excrete extra-cellular enzymes degrade insoluble polymers into soluble monomers, whilefermenters consume these compounds and excrete products including acetateand H2. Methanogenic archaea can then ferment these products to produceCH4 [7]. Many functional attributes are used in practice to characterize theperformance of anaerobic bioreactors, including removal of chemical oxygendemand (COD, a proxy for aerobically metabolizable matter in a water sam-ple) and methanogenic activity. The resilience of a bioreactor community toinput fluctuations has also been of interest [106, 107, 105].

19

Page 20: An ensemble approach to the structure-function problem in

Several studies have explored the statistical relationship between com-munity taxonomic structure and methanogenic bioreactor performance. Inan important early study, Tiedje and colleagues found that, under constantconditions, COD removal in a laboratory bioreactor was stable while commu-nity composition varied substantially over a two year period [108]. This worksuggested the role of functional redundancy [109] in maintaining communityfunction, and implied a degeneracy in the relationship between communitytaxonomy and function. However a more recent study of several industrial-scale bioreactors observed relative stability in both community taxonomy andreactor performance over a year-long period [105]. Notably, variations in re-actor performance were found to be related to community composition. Thisindicated that taxonomy is predictive of community function, though theauthors argued that taxonomy is simply a proxy for functional genomic con-tent due a close correspondence between phylogeny and metabolic functionfor organisms in anaerobic bioreactor systems. The conflict between thesetwo results indicates that much is still unknown about the structure-functionrelationship in anaerobic bioreactors.

Recent work has advanced our understanding of structure and functionin methanogenic bioreactors, leveraging a statistical ensembles approach todiscover a predictive relationship between gene content and methanogenic ac-tivity [6]. The authors generated 49 diverse enrichment cultures by seedinglaboratory bioreactors with inocula taken from a large and eclectic collectionof industrial-scale bioreactors. This ensemble enabled a linear regression ap-proach to mapping methanogen activity (as measured by methane productionrate per unit biomass) to variation in genomic content, specifically the abun-dance of sequence variants of a gene important to methanogenesis (mcrA).Remarkably their approach produced a predictive model with relatively fewvariables, suggesting that only a few key mcrA variants, or strains possessingthese variants, are important for methanogenic activity. A powerful conse-quence of this approach is a prediction for which gene variants would improvethe performance of an underperforming bioreactor.

Given the importance and widespread use of bioreactors to process or-ganic matter, these constitute important model systems. The field faces twoimportant challenges. First, cultivating many of the slow growing taxa inthese communities is difficult and this means that only equipped and experi-enced labs can readily work with these organisms. Second, the complexity ofthese communities makes carefully controlled and reproducible experimentsa challenge, and as a result comparisons from one study to the next can be

20

Page 21: An ensemble approach to the structure-function problem in

difficult. Standardizing conditions and starting inocula would therefore be amajor advance.

Bringing wild communities into the lab: enrichment cultures

Another route to studying communities that is similar to domesticationis to bring complex communities from nature into the laboratory and culturethem in defined conditions. While these experiments allow the experimen-talist to control the growth and incubation conditions, they often result ina drastic loss of diversity. As a result, this approach is likely poorly suitedto understanding the structure of natural systems, but none the less shouldenable important insights into engineering and controlling communities.

Winogradsky columns

One of the most well-known methods for enrichment culture is a Wino-gradsky column (Fig. 4A). The method was developed by Sergei NikolaievichWinogradsky to study (and discover) chemosynthesis, a process where en-ergy is derived from inorganic compounds [110]. Winogradsky columns areglass cylinders (or bottles) loaded with a sediment, supplemented an organiccarbon source (typically paper) and a sulfur source, that is then sealed offand illuminated. With time, different microbes occupy different levels ofthe column based on their metabolic capabilities. Each layer is character-ized by distinct redox reactions (electron donor/acceptor pairs) that supportmetabolism locally (much like the Mats discussed above), and via diffusionto other strata impact the metabolism of the entire column. For example,the top layer supports phototrophs that produce carbon and oxygen, whichdrives a second layer that aerobically respires carbon and oxygen, resultingin a third layer that is anaerobic and typically uses alternate electron accep-tors such as nitrate or sulfate. Recent work has begun to show that thesecomplex communities are amenable to quantitative interrogation in the lab-oratory [111, 112]. Of particular note is a study that used centimeter longglass capillaries to assemble stratified communities cultured in the presenceof dyes to report pH and redox [113]. A simple imaging setup then permittedthe acquisition of quantitative spatiotemporal data on community assemblyin many replicates. These systems were used to simulate communities inthe lung of a cystic fibrosis patient, but an opportunity to extend this workremains and the platform is well-suited to the ensemble approach outlinedhere.

21

Page 22: An ensemble approach to the structure-function problem in

Figure 4: Bringing wild communities into the lab. (A) Winogradsky columns arelaboratory-assembled are lab-assembled communities with distinctive and reproduciblespatially-stratified metabolite fluxes. These fluxes, and consequently community struc-ture, arise from emergent redox gradients. (B) Materially closed ecosystems are com-munities grown in sealed vessels whose only energy input is light. Nutrient cycling inclosed ecosystems arises from phototrophic organisms generating reduced carbon by fix-ing CO2, which can then be consumed by heterotrophic organisms. Predators such asciliates can consume whole cells, facilitating the recycling of macromolecular biomass.(C) Serially-passaged communities enrich a complex environmentally-derived communityon laboratory-controlled nutrient conditions (e.g., a fixed carbon source). The resultingcommunities are typically low-complexity, and demonstrate reproducible trophic roles andpatterns of nutrient exchange.

22

Page 23: An ensemble approach to the structure-function problem in

Given that metabolic niches are spatially separated in Winogradsky cul-ture systems they are ideal systems for understanding how this stratificationself-organizes, their energetics and how this organization depends on theboundary and initial conditions in the system. For example, if the capillarysystem of Quinn et al. [113] could be combined with defined nutrient con-ditions, controlled illumination and quantitative imaging one could ask howthe layers (and therefore the metabolite fluxes) depend on community com-position. Given the timescale of these experiments (months) working withmany replicates in parallel will be key.

Materially closed ecosystems

Closely related to Winogradsky columns are materially closed microbialecosystems (CES, Fig. 4B). Closed ecosystems are hermetically sealed, typ-ically aquatic, microbial communities that have been shown to sustain lifewith only light as an input for decades in some cases. These ‘ecospheres’ areavailable commercially (https://eco-sphere.com/) and have been studiedin an academic setting since the 1960s. CES contain photosynthetic mi-crobes, typically algae or bacteria, and either simple or complex consortiaof heterotrophic bacteria and predators. When provided with only lightthese communities self-organize to sustain nutrient cycles and therefore thecommunity itself. CES act as model biospheres since they require nutrientcycling to persist [114]. In the context of the structure-function problemthe salient metabolic property of CES is nutrient cycling, and one questionis how the composition of the community determines nutrient cycling ratesand persistence.

Work from a number of groups [115, 116, 117, 118, 119, 120] showedthat CES containing primarily microbes were tractable model systems andmethods for quantifying nutrient cycling were developed [115, 120]. Morerecently, Leibler and Hekstra [121] and later Leibler, Frentz and Kuehn [122]studied population dynamics in a synthetic closed ecosystem comprised ofthree species using sophisticated microscopy-based methods. These studiesrevealed remarkably deterministic population dynamics in CES. However,few studies have quantitatively characterized the nutrient cycling capabilitiesof CES with the notable exceptions of Obenhuber [115] and later Taub [120].The early work of Obenhuber showed that complex bacterial communitiesmixed with photoautotrophic algae or bacteria could sustain a carbon cyclefor many months [115]. Inspired by these studies we recently developed ahigher throughput method for quantifying carbon cycling that uses low-cost

23

Page 24: An ensemble approach to the structure-function problem in

microelectromechanical sensors (MEMS) made for mobile devices to measuresmall changes in pressure inside a CES [74]. As appreciated by Obenhuberand Folsome, changes in pressure reflect carbon cycling because oxygen islower solubility in water than CO2. When photosynthetic microbes fix CO2

they produce oxygen and the pressure rises. When bacteria respire carbonthe opposite happens and one can quantify carbon cycling by measuring pres-sure oscillations during light-dark cycles. We constructed CES using bacterialcommunities derived from soil samples combined with a domesticated strainof the alga Chlamydomonas reinhardtii. We found that taxonomically diverseCES stably cycled carbon for as long as six months. Metabolic profiling ofthese communities showed that despite taxonomic variability across repli-cate CES each community exhibited similar metabolic capabilities in termsof the carbon compounds they could utilize. It remains to be understoodhow such taxonomically distinct communities can retain this diversity whileexhibiting similar metabolic capabilities. To address this question using theapproach outlined here would require studying many synthetic communitieswith varying composition while measuring carbon cycling.

We propose that CES constitute powerful model communities for under-standing nutrient cycling. In the context of the approach outlined here, CEScould be used to understand how initial nutrient supply (C, N, P, S etc)controls community structure and cycling. Further, since light is the onlysource of energy CES can be used to explore how energy availability impactsthe structure-function mapping in terms of cycling.

Enrichment in defined media

Recently Sanchez and collaborators performed serial enrichment cultureson complex natural communities in simple defined media containing a singlecarbon source [123, 124] (Fig. 4C). In this case, the function of the com-munity is the conversion of glucose to CO2 and biomass. Over a few tensof generations these communities assembled into relatively simple, and pre-dictable consortia comprised of a few cross feeding strains. This convergentstructure was typified by characteristic ratio of strains from the Enterobacte-riaceae and Pseudomonadaceae families. To explain this conserved structure,strains were isolated from the endpoint of the enrichment experiment. Theywere then assayed for growth on glucose, and Enterobacteriaceae strains werefound to grow faster on glucose than the Pseudomonadaceae. Metabolomicmeasurements (mass spectrometry) revealed that Enterobacteriaceae strainsexcreted intermediates such as acetate, which Pseudomonadaceae were ob-

24

Page 25: An ensemble approach to the structure-function problem in

served to consume rapidly. Therefore, similar to the polysaccharide particleexperiments discussed above [79], these experiments revealed a primary car-bon degrader that rapidly consumed the supplied carbon source but releasedsecondary metabolites (in this case acetate) that supported the growth ofother strains. Very recent work suggests that this type of cross-feeding mayarise from stress induced release of nutrients that arises due to serial dilu-tion [35]. In these communities the salient metabolic property of the systemis carbon degradation. These studies have already revealed much about howthe structure of these communities impacts the carbon degradation (e.g., therole of secondary consumers). It would be interesting to more fully elucidatethe mechanistic basis of this cross-feeding in the service of understandinghow the genotypes of each strain present determine their uptake and releaseof carbon compounds. Hopefully these insights would allow us to understandhow the cross-feeding depends on available carbon compounds and environ-mental parameters such as pH. These generalizations could prove insightfulin natural contexts such as soils where carbon degradation is a critical phe-nomenon for climate change.

Bridging structure and function with synthetic communities

In recent years a number of studies have attempted to bridge structureand function of wild communities by performing experiments on syntheticallyassembled communities of natural isolates. This approach offers an oppor-tunity to capture much of the substantial taxonomic and genomic diversityof natural communities within a setting where environment and compositioncan be controlled, and function can be measured accurately.

A handful of studies have explored how interactions between strains ina community affect carbon utilization. In part, these studies attempt todetermine how prevalent different types of interaction between strains are(e.g., mutualism, parasitism, etc.), and how these interactions vary basedon community composition and the identity of the carbon substrate pro-vided. Using strains isolated from tree-associated environments, Foster andBell [125] assembled synthetic communities and measured CO2 evolution dur-ing growth on a complex medium as a metric for community productivity.The measured values of productivity were compared to a “non-interacting”null prediction obtained by summing the productivities of the constituentstrains grown in monoculture. What was observed in the vast majority ofpair cultures and higher-order communities was consistent with resource com-petition rather than mutualism, suggesting that competitive interactions for

25

Page 26: An ensemble approach to the structure-function problem in

carbon utilization dominate in natural communities. This implies that thecarbon utilization of a community should saturate with the diversity of acommunity, which is precisely what is shown in a more recent study by Yu etal. [126]. In this study a diverse ensemble of communities with varying straincomposition and diversity was generated from a seawater inoculum via serialdilution and dilute-to-extinction approaches. Cell density, protein concentra-tion, and CO2 evolution were measured along with community diversity (via16S metagenomic sequencing). These community function measurementsshow saturating behavior as a function of taxonomic richness. Abundantstrains were isolated to disentangle the effects driven by individuals and ef-fects driven by interactions. Interactions increase and saturate with diversity,suggesting both competition and complementation increase simultaneouslywith diversity.

Additional recent studies have added nuance to the picture of what in-teractions are prevalent in carbon-degrading communities [127, 128], findinga high prevalence of parasitic and mutualistic interactions within communi-ties of isolates that are consistent with cross-feeding. These studies lever-age a high-throughput droplet microfluidics platform to perform combina-torial community assembly and culture in multiple different carbon sourceconditions, and the growth of community constituents was measured usingfluorescent labelling. In the first study [127], a high prevalence of com-petitive interactions was observed, particularly for communities and carbonsources where the constituent strains all showed strong growth on the car-bon source in monoculture. However, positive interactions were frequentlyobserved in cases where one strain grew poorly on a carbon source in mono-culture, consistent with that strain growing in a community due to cross-feeding of metabolic intermediates. The second study [128] broadened thisobservation by focusing on communities comprising strains from two taxo-nomic orders, Enterobacterales and Pseudomonadales. Again it was observedthat positive interactions were common in communities where one of the con-stituent strains grew poorly by itself on a given carbon source. It is likelythat the positive interactions were generated by cross-feeding of overflowmetabolism intermediates [35]. These positive interactions likely arise fromthe same mechanism discussed above in the enrichment culture experimentsof Sanchez and colleagues [123, 124]. Altogether, these results indicate thatcommunity carbon degradation can be a relatively simple function of thetaxonomic structure of a community. Linking these relationships to genomicstructure remains to be accomplished.

26

Page 27: An ensemble approach to the structure-function problem in

Another study by our group [52] set out to explicitly identify the genomicattributes of a community that are predictive of community function. Us-ing a statistical ensemble of isolates that perform denitrification, a processof anaerobic respiration involving the reduction of oxidized nitrogen com-pounds, we first mapped the genotypes of individual isolates to the preciselyquantified kinetics of nitrate and nitrite reduction, which were parameterizedusing a consumer-resource modeling approach. We used a regularized linearregression approach to predict nitrate and nitrite reduction kinetics from thepresence and absence of denitrification-pathway genes possessed by each iso-late. We then assembled communities of these isolates and determined thatresource-competitive interactions are prevalent and predictable from single-strain kinetics via the consumer-resource model. Thus we inferred that theconserved properties of metabolic genes allow the prediction of community-level function. This study shows that synthetic communities comprised ofnatural isolates, combined with statistical approaches, can yield insights intothe mapping from gene content to metabolite dynamics. It remains to beseen if this approach can be applied to more complex communities.

Our work in Gowda et al. [52] points to the idea that focusing on thegenomics and ecology of specific metabolic processes can be a powerful ap-proach. In particular, this study leads us to believe that denitrification offersa remarkable system for the quantitative interrogation structure and func-tion both using the statistical approach outlined here, but also for detailedphysiological studies of specific metabolic processes. The advantages of den-itrification include the fact that the taxa that perform the process are easilyisolated and grow well in the laboratory [129], the metabolites can be quan-tified in high-throughput [52], the molecular genetics of denitrification arewell-understood [38] and the process has been characterized in the wild tosome extent [130]. These opportunities have spawned several compelling re-cent studies [131, 132, 133], including a study of the role of carbon sourceidentity in driving denitrification in wild communities [134].

Experimental advances and opportunities to enable the ensembleapproach

Having reviewed a variety of model communities for undertaking an en-semle approach to the structure function problem we now turn our attentionto experimental methods to study these systems. We focus here on culturingand isolation methods as well as analytical techniques for measuring metabo-

27

Page 28: An ensemble approach to the structure-function problem in

lites. We do not review sequencing approaches that have been discussed indetail elsewhere [135].

Higher throughput/targeted isolation techniques

A vast majority of microbes that are known to exist in nature remainuncultured in the laboratory. The staggering difference between the num-ber of cells counted from microscopy and those obtained on agar plates wasdiscovered in the early 19th century [136] and was dubbed as the ‘the greatplate count anomaly’ in 1985 by Staley and Konopka [137]. Developments insequencing techniques have further widened the gap between the number ofcultured and uncultured bacteria. The causes for microbial uncultivabilityinclude requirement of growth factors present in the natural environment,slow growth, need for interspecies interactions and transitions to dormancy.Development of isolation techniques that overcome these drawbacks is im-portant to capture the high microbial diversity that exists in the wild. Suchtechniques will aid in the construction of synthetic communities with highgenotypic and phenotypic diversity and hence benefit the study of structure-function problem.

Recently developed isolation techniques that offer some advantages overthe conventional approach of plating on agar include culturomics, micro-droplets and diffusion chambers. In culturomics, communities are tested forgrowth in a multitude of media conditions using high-throughput techniques,followed by subjecting the communities to mass spectrometry and sequenc-ing [138, 139]. By performing MALDI-TOF (see below) mass spectrometrydirectly on colonies, bacterial taxa can be identified with high fidelity [139].In case MALDI-TOF fails to identify taxa, 16S sequencing is used. The useof mass spectroscopy combined with sequencing facilitates accurate, rapid,and comprehensive strain identification. In a recent study [140], the cultur-omics approach was shown to be very successful in increasing the number ofspecies isolated from the human gut (by ˜two fold).

A higher-throughput method involves encapsulating cells from naturalcommunities in gel microdroplets (GMDs) made of agar. The GMDs arethen incubated in growth chambers flushed with low nutrient media. Fol-lowing this, GMDs with microcolonies are sorted using flow cytometry andindividual GMDs are subsequently transferred into microtiter plate wells con-taining rich organic medium for biomass enrichment and isolation.[141, 142]In this technique, the porous nature of the GMDs facilitates exchange ofmetabolites between droplets during the incubation. As a result, strains

28

Page 29: An ensemble approach to the structure-function problem in

that require metabolites produced from resource-mediated interspecies inter-actions can be isolated using this technique.

Diffusion chambers work by culturing communities in chambers exposedto their native environments through porous membranes [143, 144, 145].Thus, the setup allows for the growth of microbes that require growth factorspresent in their natural environments and/or produced from native commu-nity interactions. Significant developments improving the throughput of thisisolation method include the isolation chip [146] and the Hollow-Fiber Mem-brane Chamber (HMFC) [147].

In addition to these non-targeted isolation techniques, targeted isolationtechniques have been recently attempted. These involve designing the iso-lation methods to target desired phenotypes (e.g., antibiotic resistance orsporulation [148]). One recent study successfully isolated desired cell typesusing fluorescently labeled antibodies against predicted cell surface proteinscombined with flow cytometry for cell sorting [149].

Overall, both the available targeted and stochastic isolation techniqueshave proven to be useful for isolating previously unculturable bacteria. Hence,these techniques may prove valuable for generating ensembles of bottom-upassembled microbial communities.

High throughput experimental platforms

Our proposed ensemble approach for studying structure-function rela-tionship in microbial communities requires creation of many replicate com-munities. Hence, high throughput culturing platforms are critical for itsimplementation.

A majority of high throughput experimental platforms so far have beendroplet-based or microfluidic-based devices. One such recently developeddevice is ’kChip’, a microfluidic platform that facilitates combinatorial con-struction of microbial communities [150]. A study involving syntheticallyconstructed microbial communities on kChips successfully identified sets ofstrains among 19 soil isolates that promotes growth of model plant sym-biont Herbaspirillum frisingense, by screening ∼100 000 multispecies com-munities [151]. Though kChip is a high throughput platform, it only enablesbottom-up construction of microbial communities that requires isolation ofmicrobes prior to the experiments. Additionally, only metabolic functionswith optical readouts can be assayed, as physical access to the microdropletsat this scale is not feasible.

29

Page 30: An ensemble approach to the structure-function problem in

Another similar microfluidic platform that enables parallel co-cultivationof microbial communities was developed by Park et al. [152]. Their platformwas able to successfully detect pairwise symbiotic interactions in communi-ties when the symbionts were in as low an abundance as 3 percent of the totalpopulation. Here again, only optically detectable metabolic properties can bemeasured, but the device enables top-down construction of microbial consor-tia through random compartmentalisation of community members. Thoughthis was not a structure-function study per se, inclusion of automated dropletsorting and characterization of communities in the retrieved droplets can eas-ily enable structure-function studies. In fact, this was achieved in a morerecent study by Terekhov et al., where microbes conferring antibiotic resis-tance in the oral microbiota of Siberian bears were identified [153]. Thiswas done by functional profiling of the encapsulated communities from theoral microbiome that suppressed the growth of the pathogen Staphylococcusaureus.

From the aforementioned studies, it can be said that the choice of theexperimental platform largely depends on the nature of the study. Someexisting platforms support a top-down approach whereas others support abottom-up approach. Further, the methods of determining structure andfunction can differ across platforms. There is room to improve these meth-ods to incorporate other analytical techniques for measuring metabolites.For example, if large scale culturing platforms could be combined with spec-troscopic or automated mass spectrometry methods, this would enable therapid construction of large quantitative datasets.

Measuring metabolite dynamics

Metabolites, unlike nucleic acids, require distinct analytical techniquesdepending on the metabolite of interest. Here we review the available meth-ods, their applicability and opportunities for improving these methods formicrobial consortia. See Table 1 for the specific strengths and limitations ofeach technique discussed here.

Nuclear magnetic resonance spectroscopy

A number of high quality textbooks describe the fundamental physics [154]and chemistry [155] of nuclear magnetic resonsance (NMR) spectroscopy.Here we give an intuitive explanation of the basis of this technique and goon to the practical applications to measuring metabolites in microbial com-munities.

30

Page 31: An ensemble approach to the structure-function problem in

Method Sensitivity SpecificityRange of

applicability Throughput

NMR low high high lowMass Spectrometry high high high low/moderate

Infrared/Raman moderate high high highUV/Vis moderate low low high

Targeted assays moderate high low high

Table 1: Comparison of methods for measuring metabolites in microbial communities.Sensitivity refers to the minimum detectable concentration. Specificity refers to the abilityof the assay to detect a specific metabolite. Range of applicability refers to the diversityof metabolites that can be detected with the technique. Throughout is the number ofmeasurements that can be made in parallel.

NMR spectroscopy exploits the spin magnetic moment of atomic nucleisuch as hydrogen, carbon, and nitrogen to characterize chemical structure.In an applied magnetic field, nuclei behave as weak magnets, collectivelyaligning with the field. The collective alignment of the nuclear magneticmoments can then be manipulated with applied electromagnetic fields in theradio frequency (MHz) and detected as emitted fields in the same spectralregion. The small magnetic moments of nuclei cause them to emit veryweak radiation meaning that relatively high concentrations of metabolitesof interest are necessary for detection. Nuclei experience minuscule changesin the local magnetic field due to their local chemical context, resulting inwhat are termed “chemical shifts.” For example, a proton in hydrogen onan alkane (e.g., methane) will emit a distinct radio frequency (its resonancefrequency) from one in an aromatic hydrocarbon (e.g., benzene). These smallchanges in emitted radio frequency fields are of the order of parts per million(ppm). Typical resolution of modern instruments are a fraction of a ppmand depend on the field strength of the spectrometer and technical details ofthe detection scheme. State-of-the-art spectrometers (operating at 600MHzand above) are widely available at core facilities.

For metabolomics the two most common types of NMR are proton (1H)and carbon NMR, each of which has both advantages and disadvantages.First, the advantages of proton NMR are (1) rapid acquisition due to therelatively high signal-to-noise ratio in proton spectra, (2) the fact that noisotopic labeling is necessary, and (3) the ability of the technique to detecta broad range of relevant organic compounds. One disadvantage is the fact

31

Page 32: An ensemble approach to the structure-function problem in

that spectra from mixtures of unknown metabolites are complex, often con-taining hundreds of peaks corresponding to the many different compoundspresent. As a result, it can be challenging to detect the presence or absence ofspecific metabolites via proton NMR. Further, water contributes a broad andstrong solvent signal in the middle of the relevant range of chemical shifts.There are two main routes to removing this signal: (1) drying the sampleand replacing the water with D2O and (2) using clever pulse sequences todecouple the water signal from the metabolite signals [156]. The former re-quires specialized equipment and increases the cost and reduces throughput.Therefore, it is recommended to use decoupling. The fundamental physics ofhow this decoupling works is beyond the scope of this review, but it is rec-ommended to use 1-D NOESY (Nuclear Overhauser Effect Spectroscopy) toisolate metabolite signals from water [157]. The approach is robust, widelyapplied, and requires no sample processing to be done. It should be notedthat because of variation in technical specifications between instruments,performing measurements on a single spectrometer across samples is key tomaintaining reproducibility of measurements [156].

Carbon NMR in contrast, detects signals from the magnetic moments ofcarbon nuclei. As with protons, the chemical context of the carbon nucleusgives rise to chemical shifts in the resonance and this permits the disambigua-tion of carbon nuclei in different compounds. One advantage of Carbon NMRis that it does not require suppression of water signals. The main drawbackto carbon NMR is sensitivity. The dominant isotope of carbon (12C, 99 %prevalence) is not NMR active, while 13C is NMR active, but present at about1 % natural abundance. This means that most nuclei do not contribute to theobserved signal. Second, 13C has a magnetic moment that is roughly 4-foldlower than 1H reducing the signal-to-noise ratio. These two considerationsimply that acquiring carbon spectra requires extensive averaging and cantake hours for a single sample. However, the low isotopic abundance of 13Ccan be overcome by using 13C labeled compounds as nutrients, with order ofmagnitude increases in signal-to-noise. Unfortunately, these compounds areexpensive (hundreds of dollars per gram) increasing costs.

Carbon and proton NMR are the two most commonly applied metabolomicprofiling techniques for microbial communities. Typical studies range fromtargeted detection of a single metabolite in a well-defined community [158] tountargeted profiling in very complex consortia such as anaerobic digesters[159].Here we present a few examples of NMR based measurements of metabolitedynamics in communities as case-studies that might be more broadly appli-

32

Page 33: An ensemble approach to the structure-function problem in

cable.One compelling approach taken by Date et al. [160] and Nakanishi et

al. [161] is to combine NMR based metabolite measurements in time withquantification of abundance dynamics. Date et al. use 13C labeled glucose toinitiate growth in fecal microbiota [160] (note that 13C is a stable isotope).The authors then performed time series of abundance dynamics and carbonNMR measurements. Given labeled glucose as the sole carbon source, theauthors could track the dynamic production and consumption of 13C labeledcompounds as the glucose was converted to organic acids in time. The au-thors could then (crudely, given the electrophoresis methods at the time of thestudy) classify the community into primary and secondary degraders. Thecompelling aspect of this study is the potential to statistically correlate largescale variation in the community structure with metabolite dynamics. Onecould imagine a similar experiment with amplicon sequencing-based abun-dance dynamics measurements. This approach would be especially powerfulfor looking at the community level response to carbon fixation by autotrophsin systems like mats or CES. In these situations initiating a community with13C labeled bicarbonate as the sole carbon source would allow the directmeasurement of carbon flux from autotrophs to associated heterotroph com-munities. Since NMR relies on magnetic fields and emitted RF signals, it isnon-invasive and could be applied to ongoing experiments (e.g. CES) withoutinvasive sampling.

We conclude with a brief note on throughput. NMR spectrometers arelarge, expensive machines that rely on superconducting magnets to applylarge magnetic fields. Running parallel experiments on NMR machines istherefore prohibitive. Throughput is achieved by using robotics or fluidicsystems to automatically load samples into the spectrometer. Such experi-ments typically achieve a throughput of order 100 samples per day [162, 163].Increasing NMR throughput by a factor of 10-100 would constitute a majoradvance.

Finally, despite the current limitations, there is a revolution underway inquantum sensing systems from superconducting quantum interference devices(SQUIDs) [164] to spin-based magentic field sensors, impurities in diamonds(nitrogen-vacancy (NV) centers) [165] or force measurements[166]. SQUIDsenable ultra-low field NMR, obviating the need for large expensive magnets,and NV-centers enable high sensitivity magnetic resonance detection at thesingle-molecule level. The applications of these technologies to metabolicfunction in microbial communities await future discovery, but one can imag-

33

Page 34: An ensemble approach to the structure-function problem in

ine massively parallel NMR measurements or in situ detection of metabolitesin complex settings.

Mass spectrometry

Mass spectrometry is the most widely used platform for metabolomicsand several good reviews of the methodology are available [167, 168, 169, 170,171, 172]. Therefore, our discussion of mass spectrometry will be limited, butwe include it as an important point of comparison with the other techniquesdiscussed in this section.

Mass spectrometry ionizes the molecules in a sample, using a varietyof different methods, and then accelerates the charged molecules using anelectric field. The beam of ions is then passed through a magnetic field that(via the Lorentz force law) results in a force on each ion that depends on itsmass to charge ratio. The result is a physical separation of ions in space inproportion to the mass-to-charge ratio. A huge number of variations existon this basic theme, including measurements of time of flight (TOF) andquadrupole mass filters that apply oscillating fields to the ion beam. Thereviews cited above contain detailed discussions of the type of ionization anddetection methods that are best suited for metabolomic applications.

In the context of metabolomics, where samples are often of consider-able chemical complexity, mass spectrometry is almost always preceded byeither gas or liquid chromatography to separate compounds and thereforeincrease the specificity and sensitivity of downstream mass spectrometry.As a result, gas chromatography-mass spectrometry (GC-MS) and liquidchromatography-mass spectrometry (LC-MS) are the two most widely usedforms of mass spectrometry for metabolomics. GC-MS is restricted to volatilecompounds typically of molecular weight less than 600 Da [167], while LC-MSapplies to a broader spectrum of metabolites.

Mass spectrometry has significantly higher sensitivity that NMR, andcan achieve single molecule sensitivity [173], although this is not routine.This advantage is especially crucial for measuring metabolite fluxes in cellsand communities where concentrations are often low (micromolar and be-low) [174, 40]. This combination of sensitivity and the ability to distinguishbroad classes of compounds has contributed to the widespread usage of MS-based metabolomics methods [175].

The throughput of these techniques is significantly lower than the opti-cal methods discussed below and comparable, at present, to NMR. So it is

34

Page 35: An ensemble approach to the structure-function problem in

routine to run ∼100 samples over the course of a day or two (see [172] or a re-view). Some robotic systems have been developed to automate the samplingand analysis process [176]. However, making mass spectrometry measure-ments on thousands or tens of thousands of samples, while feasible, is costlyand slow. Given the remarkable sensitivity, specificity and broad applicabilityof the technique especially for untargeted measurements of metabolite pools,it would represent a major advance if mass spectrometry could be routinelyapplied to thousands of samples in parallel. Here, we mention a handful ofnotable studies that leverage mass spectrometry to quantify metabolite dy-namics in microbial communities. The goal is not to present an exhaustivelist but simply to point the reader towards some representative studies.

Amarnath et al. [35] used untargeted metabolomics to study the metabo-lites that are exchanged between two strains of bacteria in a serial dilutionexperiment. The authors revealed a broad spectrum of metabolites excretedby one strain in response to stress. These excreted compounds facilitatedcross-feeding between the two strains. Shi et al. [177] use LC-MS to studymetabolite exchange in a fungal-bacterial community. A statistical analy-sis of the LC-MS data shows that the metabolites excreted are distinct forco-cultures and mono-cultures.

Mass spectrometry is widely used to study the degradation of compoundsfrom pharmaceuticals to soil contaminants [178, 179]. For example, a com-mon environmental contaminant are polycyclic aromatic hydrocarbons (PAHs),which are routinely degraded by bacterial consortia. The breakdown of thesecompounds in time is typically interrogated by GC- or LC-MS [180].

Mass spectrometry is widely applied to food-related microbial communi-ties. In these cases the untargeted nature of mass-spectrometry is importantas the compounds of interest (e.g., for flavor) are typically unknown. Forexample, GC-MS has been used to identify starting components in tradi-tional Cambodian rice wine [181]. Similarly, it was used in fermentation ofred peppers to investigate changes in bacterial and fungal communities andvolatile flavor compounds [182], and high throughput GCMS has been usedto correlate metabolites with taxonomic structure in kimchi fermentation[183], glutinous rice wine [184], the liquor Daqu [185] and pickled radishes[186].

Infrared spectroscopy

Infrared spectroscopy detects absorption and emission of photons in theinfrared region of the electromagnetic spectrum and characterizes molecular

35

Page 36: An ensemble approach to the structure-function problem in

vibrations. As with NMR, the resonance frequency of a molecular vibrationdepends strongly on molecular structure. This dependence affords IR spec-troscopy its chemical specificity. Infrared spectroscopy offers perhaps thebest combination of sensitivity, specificity, high-dimensional characterizationof complex metabolite pools and throughput. There are two commonly usedmethods for measuring infrared spectra that differ in their fundamental phys-ical mechanisms: (1) IR absorption and (2) Raman spectroscopy.

Infrared absorption involves passing light in the infrared range (∼2.5 µmto 10 µm wavelengths or 1000 cm−1 to 4000 cm−1 wavenumbers) through anaqueous or gas phase sample and measuring the absorption. Compounds con-taining different chemical bonds absorb light at different frequencies and theresulting spectrum can provide extensive information on the chemical compo-sition of the sample. As with proton NMR, a major downside of absorptionspectroscopy is the broad absorption of water in the informative region ofthe spectrum (around 3200 cm−1 and 1600 cm−1), which can limit the infor-mation for aqueous samples without cumbersome drying. Simple dispersivespectrometers that shine a narrow band of wavelengths through a samplehave limits on sensitivity, spectral resolution and the duration of acquisition.These limitations can be overcome using Fourier Transform infrared spec-troscopy (FTIR), which uses broadband excitation and an interferometer torapidly acquire spectra in specific spectral bands, and this is the most com-monly applied technique. Plate readers that perform FTIR measurementsen masse on microtiter plates are available and can acquire data from bothliquid and solid phase samples.

In contrast to FTIR or dispersive IR spectroscopy, Raman spectroscopymeasures molecular vibrations using photons in the visible portion of thespectrum. When visible photons interact with a sample, most are scatteredwith the same energy as the incident photon (Reyleigh scattering). How-ever, with low probability, incident light undergoes inelastic scattering andin the process photons are emitted from the sample with either slightly lower(Stokes) or higher (anti-Stokes) energy than the incident radiation. Thesesmall changes in the emitted photon wavelength correspond to the molecu-lar vibrations in the sample. For reasons beyond the scope of this review,Raman spectroscopy does not suffer from broad band absorption from wa-ter, making it especially attractive for microbial communities in the aqueousphase. However, due to the inefficiency of inelastic photon scattering, Ra-man spectroscopy requires high laser power and the resulting heating canbe a problem for biological samples, a limitation that can be overcome by

36

Page 37: An ensemble approach to the structure-function problem in

techniques such as resonance Raman spectroscopy or surface enhanced Ra-man scattering. However, these methods are not yet routine for metabolomicprofiling. Raman spectroscopy can be performed on bulk samples using platereaders, or integrated with a microscope for localized measurements. Morerecently, Raman spectra can be acquired via flow cytometry [187, 188]. Theseplatforms enable much higher throughput than is now standard by NMR ormass spectrometry.

FTIR and Raman spectroscopy have proven to be powerful methods forinterrogating cellular physiology at the single-cell level when combined withmicroscopy (see [189] for a recent review). Remarkably, Raman spectroscopysignals can be used as fingerprints to demarcate cells of one species in dif-ferent growth states [190], or different taxa at the strain, species and genuslevels [191, 192]. A recent study showed that the global transcriptional profileof yeast and bacteria could be predicted via a linear model from single cellRaman spectra[193]. This success owes to the high-dimensional nature of Ra-man spectra, which are often challenging to interpret in terms of individualpeaks but are rich in information that can be decoded statistically.

Despite the power of infrared spectroscopies for chemical characteriza-tion, they have seen comparatively little use in the context of communitiesof microbes. We regard this as a missed opportunity, and suggest that thesemethods could and should be used more broadly. One of the limitations of thetechnique is the challenge of assigning specific peaks to specific compounds.As Kobayashi and co-workers have shown, this limitation can be overcomeby using simple statistical methods to map infrared spectra to other cellularproperties [193]. The approach is to measure Raman or IR spectra on a setof samples and then use a lower throughput technique such as LC-MS tomeasure the absolute concentration of a metabolite of interest. A combi-nation of dimensionality reduction and regression can then be used to mapLC-MS data to infrared spectra. This approach has been used to track sub-strate concentrations in time in monocultures [194] and phenol degredationin complex communities [195].

The advantages of Raman spectroscopy, and to a lesser extent FTIR, overmass spectrometry and NMR are the potential for the rapid acquisition ofhigh-dimensional characterization of metabolite pools. The complexity of theresulting spectra is similar to proton NMR, and therefore is perhaps mostuseful for statistically characterizing differences between community metabo-lite profiles. Such complex spectra can then be used either to measure specificmetabolites via a calibration approach discussed in the previous paragraph,

37

Page 38: An ensemble approach to the structure-function problem in

or to demarcate global metabolic states of consortia without concern forspecific metabolite levels.

UV-visible spectroscopy

We briefly note that simple ultraviolet and visible spectroscopy (UV-Vis)can be used to characterize electronic transitions in compounds of interestfor metabolic characterization. These methods can be performed with widelyavailable plate readers, particularly those that are equipped with monochro-mators rather than filters, which permit excitation and emission to be ar-bitrarily selected by the user. The main limitation of this technique is thefact that electronic transitions in the visible are restricted primarily to chem-ical species with delocalized electron density (e.g., conjugated rings such asbenzene, tryptophan). As a result, these spectra are low specificity and can-not be used for targeted metabolomics. However, the high throughput ofcommon plate readers facilitates rapid measurements, and the spectra cangive coarse characterization of excreted compounds from autotrophs, for ex-ample [196]. Moreover, targeted UV-Vis measurements can be integratedwith common fluorescence microscopes and therefore offer the possibility ofdetecting metabolites in massively parallelized platforms such as droplet mi-crofluidics [127, 128] or in spatially structured communities.

Targeted assays

In situations where the metabolic function of the community under studyis known apriori and restricted to a specific chemical compound a targetedmetabolite assay can be powerful. Such assays typically utilize chemicalreactions to create an optically active compound in proportion to the con-centration of a metabolite of interest. For example, starch can be degradedto glucose enzymatically and then glucose concentration can be assayed viastandard methods [197] or iodine can be used to stain starch directly and de-tect degredation [198]. Similarly, nitrate and nitrite can be detected via theGriess assay [199], which utilizes a colorometric reporter (dye) generated viaa reaction with nitrite. Such assays can easily be performed in 96-well platesfacilitating high throughput [52]. These measurements can be powerful forstudying specific metabolic processes in communities. However, typically thechemistry involved is not easily automated nor are the conditions of the reac-tion biocompatible. So, such measurements are made offline after samplingand are challenging to automate in situ.

38

Page 39: An ensemble approach to the structure-function problem in

Quantifying structure

We briefly outline the main ways in which community structure can bequantified. As mentioned above a suite of next generation sequencing tech-nologies are capably of quantifying community structure on multiple levels.For example, amplicon sequencing of the 16S rRNA gene uses PCR to am-plify this universally conserved ribosomal subunit and then uses the numberof reads mapping to sequence variants (amplicon sequence variants, ASVs) asa proxy for the relative abundance of each taxon. This widely used methodhas many well documented downsides including variation in the copy num-ber of the 16S gene across taxa, PCR bias and the challenge of associatingtaxonomy with metabolic capabilities of each strain [200]. Despite theseshortcomings, amplicon sequencing does permit rapid and high throughputcharacterization of the community composition, and methods exist for infer-ring metabolic capabilites of strains from 16S gene sequence alone [201].

In contrast, shotgun metagenomic sequencing amplifies all genomic DNAin a sample. These data enable the characterization of the gene contentof an entire community by annotating reads. Metagenomics gives a muchmore complete picture of the genomic structure of a community but severaltechnical hurdles limit this approach. First, annotating reads mapping togenes remains a challenge and roughly 30-50% of the open reading framesare annotated, leaving much of the genomic content unclassified in terms offunction. Second, assigning annotated genes to specific taxa within the com-munity and inferring their relative abundances remains a hard problem. Inparticular, assembling reads into genomes (metagenome-assembled genomes,MAGs) has been performed but the quality of these assemblies remains hardto assess. Applying cutting edge machine learning methods is likely to im-prove this process [202]. Another method to analyze shotgun metagenomicsdata is to use a database of reference genomes as templates to recruit readsfrom a metagenome. The upside of this approach is the ability to reliablydetect variation at the level of single nucleotides [203], and reliably assemblegenomes from metagenomes. The cost of course is that the method missesany diversity that is not present in the reference genome database. Despitethese challenges, metagenomics perhaps gives clearest picture of the genomicstructure of a community as a whole.

Transcriptional profiling of entire communities is also feasible via RNA-sequencing based methods [204, 205]. Despite the potential predictive powerof knowing which genes in a community are transcriptionally active, these

39

Page 40: An ensemble approach to the structure-function problem in

measurements have been applied much less widely than taxonomic ampli-con or shotgun metagenomic methods. However, as the costs of sequencingcontinue to fall, it remains a compelling proposition to use metagenomicsand transcriptional profiling on the same samples. We propose that suchmeasurements could very well lead to deeper insights into the communitystructure and function by potentially simplifying the picture. For exam-ple, transcriptional profiling could reveal which collective components of themetagenome are inactive and therefore could potentially be left out of apredictive framework.

Learning from data: function from structure

Equipped with measurements of metabolites (either dynamically or at afixed point in time) and some characterization of the community structure,we are then left to ask what to do with these data. In reality, the answer tothis question is an empirical one that depends on the structure the data, themodel system under study and the precise question being posed. We offer nopipeline or prescription for how best to proceed, but instead offer a few sug-gestions and examples and point out some important technical pitfalls. Ourintention is to suggest some approaches to learning the structure-functionmapping from these data and to leverage the results for predicting commu-nity metabolism. As with most data analysis tasks, simple is better. Usingthe latest methods in machine learning or dimension reduction may be tempt-ing but it is almost always better to explore the data with methods that aresimple to interpret and straightforward to implement.

Typically any sequencing based characterization of community structurewill be high dimensional. For example, 16S amplicon sequencing will oftenyield 10 to 1000 of taxa per sample, similarly metagenomic data can contain10 000 or more annotated open reading frames depending on the complex-ity of the community. In contrast, assembling an ensemble of more than100 or 1000 communities is a huge challenge even for the highest throughputmethodologies. As a result, we are almost always in the limit of a small num-ber of data points (communities) and large number of variables (taxa, genes,transcripts etc). In this regard, predicting functional properties from thesestructural data requires reducing the dimensionality of the data describingcommunity structure. For the purposes of the discussion below we define thenumber of features (genes, taxa, transcripts) as p and the number of com-munities in a given ensemble n (Fig. 5A). A sequencing data set can then be

40

Page 41: An ensemble approach to the structure-function problem in

described by a matrix X that is n× p with n << p in most cases. The rowsof this matrix, ~xi correspond to sequencing data for the ith community inthe dataset. The entries of this vector are then the number of reads mappingto a specific ASV in a 16S data set or gene in the metagenome of that com-munity. For each of these communities we assume that the data set includessome functional measurement, yi, which may be a dynamic quantity (yi(t)).The goal then is to learn a representation of this functional measurement ofthe community in terms of the columns of X.

Compositional data and zero counts

All of the standard sequencing methods for quantifying community struc-ture result in compositional data - that is they do not report the absoluteabundances of taxa, genes or transcripts in the sample, but only the relativecontribution (e.g., the rows of X are defined only up to an unknown con-stant). Recently, methods using qPCR or the addition of oligos at knownconcentrations have been developed to measure absolute abundances via se-quencing. However, as yet these methods are not widespread. Therefore,in any analysis we must contend with the compositional nature of sequenc-ing data. Much has been written about this problem [206], and membersof the field are now generally aware of the issues that can arise when thecompositional nature of the data are ignored.

Briefly, compositional data can, and should, be log-ratio transformed us-ing log ratios of counts. Log-ratio transformations take compositional datafrom a simplex and map them to real numbers with the properties of a vectorspace. This transformation therefore permits the application of conventionalstatistical approaches to compositional data. Typically, this is done via thecenter-log transform (CLR) or an additive log-transform (ALR) [207, 206].Computing log-ratio is not compatible with zeros (e.g., zero abundances ofan ASV or gene transcript), a problem that has received a significant amountof attention. A host of methods from adding psuedocounts uniformly to allzeros or using Bayesian approaches to replacing zeros [208, 209, 210]. Weurge caution here, as many methods are both ad hoc and can qualitativelyimpact the results of downstream analyses. We will not review the tech-nical details, but readers should to engage carefully with their data ratherthan blindly applying existing pipelines for log-transforms and handling ze-ros. For example, variance decompositions applies to log-transformed datacan be dominated by large numbers of taxa or transcripts with zero counts.In this scenario the details of how zeros are handled (e.g., the magnitude of

41

Page 42: An ensemble approach to the structure-function problem in

the psuedocounts added to all taxa) can have huge impacts on the variancedecomposition.

Dimension reduction: with or without supervision

The goal is to associate structural components or sets of componentswith the metabolic function of a community. As discussed above we arealmost always in the limit of small numbers of data points. We are thereforeforced to consider reducing the dimensionality of the data from p by someform of dimension reduction. Above we suggested that low-dimensionalityis a common feature of biological systems from proteins to higher organismsand behavior. Despite the fact that this observation seems to hold quitebroadly it is important that we not take low-dimensionality as given in anyanalysis of a microbial community. In this sense, we must make a principledsearch for a simpler description of community structure while judiciouslyconsidering the possibility that no such description exists or that we areconsidering the system at the wrong level of organization. For example, inan ensemble of communities with strong functional redundancy, where verydistinct taxa perform similar metabolic functions, it may not be possible tofind a low-dimensional description of the ensemble in terms of taxa presentacross replicates in the ensemble.

There are two ways to go about approaching this problem: supervisedand unsupervised dimension reduction (Fig. 5B-C). We note that for most ofthe techniques described below the statistical learning textbook from Hastie,Tibshirani and Friedman [211] provides an excellent and readable reference.For a more recent review article we recommend the paper of Mehta et al. [212]that covers both supervised and unsupervised methods.

Unsupervised learning: In unsupervised dimension reduction we seeka lower dimensional representation of the matrix X without any explicitconsideration given to y in selecting low-dimensional features (Fig. 5B). Theidea is to reduce the dimensionality of X and then examine the relationshipbetween this lower dimensional representation of community structure andfunction, ~y. The hope is that some lower dimensional representation of thecommunity captures all of the useful information in the matrixX. All of thesemethods change the basis of X in order to recast community structure interms of groups of genes or taxa. The dimension reduction arises when a smallnumber of such groups of taxa provide a good description of all communitiesin the data set. If this is case each community ~x can be represented not interms of the abundances of p taxa, but instead as a combination of m << p

42

Page 43: An ensemble approach to the structure-function problem in

Figure 5: Learning the structure-function map from data. (A) Hypotheticalstructure-function data from an ensemble of n communities. y denotes a metabolite mea-surement, either level or rate that could also be dynamic. Colored dots correspond to datapoints in (B,C). X denotes a matrix of n rows each denoting a single community in anensemble. The columns denote the relative abundances (colored bars) of taxa, genes inthe metagenome or transcripts in the metatranscriptome. (B) An unsupervised approachwhere dimensionality reduction is applied to X yielding a lower dimensional representationof community structure that is then associated with communities of differing function. (C)A supervised approach where the function f(~x) is learned for mapping structural variationto functional variation. Regressors denote independent variables in a lower dimensionalrepresentation of X that provides good predictive power of y.

43

Page 44: An ensemble approach to the structure-function problem in

groups or combinations of taxa. Formally, these methods are based on amatrix decomposition (X = AB, with A an n ×m matrix and B an m × pmatrix) subjected to constraints. Dimension reduction comes about if Xcan be well approximated for m << p. In this case, each community inan ensemble can be represented in terms of just m features. These low-dimensional representations, when they exist, can sometimes dramaticallysimplify our understanding of complex systems.

The canonical unsupervised methods are variance-based decompositionslike principle components analysis (PCA) or the more general version, singular-values decomposition (SVD). These methods find the set of orthogonal di-rections in the p-dimensional feature space that maximize the variance in thedata along each direction (eigenvectors of the covariance matrix of X). Theadvantage here is the clean interpretability of these decompositions. Eachdirection (principle component) is a simple p-dimensional vector represent-ing a direction of high data variance. Each data point (community) in theoriginal data set can be represented in the basis of principle components~xi · ~p1, ..., ~xi · ~pj, ... (with ~pj the principle components). If a small numberof principle components capture a substantial fraction of the data variancethen m is small and each community can be represented as projections on ahandful of ~pj components.

Many modifications to this basic idea exist. Perhaps the two most no-table are independent components analysis (ICA)and non-negative matrixfactorization (NMF). ICA finds a lower dimensional representation of X interms of components that are statistically independent, rather than simplyuncorrelated, through an iterative process. The approach is to represent eachcommunity ~x as a sum of statistically independent components [213]. In thiscase ~xi =

∑j ai,j~sj where the ai,j are the weights that ‘mix’ the independent

components s to form the observed data ~x. ICA is most often applied tosignal separation problems where multiple independent inputs are combined.One can speculate that this perspective may be useful for communities ifthey possess a modular organization where independent modules within thecommunity are responsible for functionally distinct processes. NMF is an-other matrix decomposition method that is applicable whenever the entriesof X are constrained to be positive (as is the case in sequencing data). NMFdecomposes the data as X ≈ WH with all entries of W and H positive [214].

In this case ~xi =∑

j wi,j~hj where the w are weights and the ~h are vectors

of length p. The advantage of this approach is that the columns of H, that

44

Page 45: An ensemble approach to the structure-function problem in

act as ‘eigen-communities’ contain all positive entries and are therefore inter-pretable. In contrast, the eigenvectors of a PCA decomposition can containnegative entries, which is not interpretable in the context of X that containsonly positive values (abundances or relative abundances). NMF has seenlimited application in the microbiome context [215]. We have focused hereon well-established methods with simple interpretations. We are aware thatover the past two decades many new methods have been developed, especiallythose that can learn low-dimensional representations of highly nonlinear data(e.g., autoencoders [216] or stochastic neighbor embedding methods [217]).These methods may be useful in the context discussed here, but we advocatestarting with the simpler approaches discussed above before moving on tothese methods.

Regardless of the method of unsupervised learning applied the result isa new representation of community structure (X) in a new basis (e.g., prin-

ciple components, ~p, ~s, ~h). Ideally, m << p and communities with manyhundreds or thousands of genes or taxa can be represented in a much lowerdimensional space ((Fig. 5B). The task then remains to associate this lowerdimensional representation with metabolic function (yi). A common ap-proach to this problem is simply to ask which basis vectors correlate withspecific metabolic properties of the community. One common approach isto treat the question as a regression problem to predict yi from the decom-posed ~xi for example by using the projections of each ~xi along each principlecomponent as independent variables.

Supervised learning: One major shortcoming of the unsupervised ap-proach is that the low-dimensional representation of X that we learn byunsupervised dimension reduction may not be the best representation of thedata in terms of predicting yi. In essence, PCA or NMF may find a lowdimensional representation of the data, but there is no reason or guaranteethat this representation will be informative of the community metabolic func-tion. Indeed, the unsupervised approach artificially separates the process offinding low-dimensional descriptions of the community and predicting theresponse variable (metabolic function). A supervised approach overcomesthis limitations by performing both dimension reduction and prediction atthe same time.

In the supervised approach we seek some prediction of the metabolicfunction in terms of the structure (Fig. 5C). Concretely, we would like toestimate yi = f(~xi) for our entire data set. This can be posed as a regression

45

Page 46: An ensemble approach to the structure-function problem in

problem where we either posit a functional form for f(~xi) or, using moreflexible but less interpretable methods like neural networks, learn a mappingfrom ~xi to yi without positing a specific functional form.

Before discussing how one can approach this problem we would like toclarify the meaning of f(~xi). We are proposing learning a statistical mapfrom ~xi to yi. We are not proposing fitting an explicit ecological model suchas a consumer-resource or Lotka-Volterra model to the data. We regard fit-ting such a model as a much harder proposition than learning a statisticalmapping from structure to function. Indeed, a statistical approach to learn-ing f necessarily abstracts away these dynamics that relate ~xi to yi. We notethat two of us recently took a hybrid approach to the problem with syn-thetic communities that did explicitly model the ecological dynamics [52].In this case, we used a regression to map gene content to consumer-resourcemodel parameters and then used the consumer-resource model to predictmetabolite dynamics in consortia. Remarkably, the approach worked, butthe downside is that it requires isolating individual taxa and constructingsynthetic communities - a more laborious task than studying communitiesdirectly.

There are many statistical approaches to learning the function f . Thesimplest approaches are linear regression methods that simply posit a modelof the form f(~xi) = β0 +

∑pk=1 βkxi,k, where the β are regression coefficients.

There are two major problems with this approach. First, if n << p we havemany fewer data points than independent variables, which means that anordinary least squares regression will almost certainly overfit and yield poorout of sample predictions (as determined by cross validation). The secondrelated issue is that this approach gives equal weight to each entry (gene,taxon) in ~x and does not provide any dimensional reduction. One way tosolve this problem is via regularization [211] where the model is optimizedwith an additional penalty term seeks to reduce the number of non-zero βcoefficients (see LASSO and Ridge Regression). Regularization provides asolution to the problem of selecting which regressors (entries of ~xi) providethe most predictive power while also avoiding overfitting. In situations wherethe levels of noise are not too high and a sparse solution (small number of non-zero βk) do allow for good predictions, these regularization methods typicallysucceed [218]. However, if the noise levels are high or the underlying processis not sparse, then even these methods will fail. Care must be taken indiagnosing when such a regression works and when it does not, see [218, 52].

A major shortcoming of the simple formulation outlined above it that

46

Page 47: An ensemble approach to the structure-function problem in

it lacks any interaction terms (e.g., xk,ixl,i). Adding these terms to theregression above increases the number of independent variables from p to∼ p + p2/2. Given limited data n, it is typically not advised to take thisapproach. However, including such interaction terms is desireable given theutility of considering pairwise interactions in complex systems [219, 220].One way to proceed is to use linear regression approaches that use groupsof independent variables as regressors. For example, principle componentregression [211] uses principle components as independent variables in a lin-ear regression. This is qualitatively similar to the unsupervised approachoutlined above.

The models above are linear, and this aids in their interpretation. How-ever, explicitly non-linear methods for estimating f are also possible. Whenand why such approaches are more or less appropriate in the microbial con-text is not yet clear. However, decision tree based methods such as randomforests have proven useful for relating taxonomic variation to host pheno-types in the microbiome [9]. Random forests can model complex non-linearrelationships between regressors and response variables [211], while retainingsome interpretability by assigning ‘importance scores’ to each independentvariable. As a result these methods can be used to assess the impact of agiven taxon or gene on the community function (in a statistical, not causal,sense). Random forest regressions fit many decision trees to bootstrappedreplicates of the data and average the result. This averaging procedure, oftencalled ‘bagging’, reduces the variance of the prediction and as a result, highvariance/low bias regression approaches can provide lower variance predic-tions.

Finally, as the amount of high quality data on microbial communitiesincreases, the applicability of more recently developed neural network super-vised prediction methods [212] will become more appropriate. These methodstypically contain many millions or even billions of parameters and thereforerequire reasonably sized data sets to train. Advances in methods like trans-fer learning [221], where existing trained networks are trained to solve a newproblem given some data, mean that the user need not start from scratch.The challenge will be what we can learn from these networks once they areused to approximate f . In many cases neural networks do not generalizewell and are susceptible to small amounts of noise on the input variables.We regard the application and interpretation of these network approaches tocommunities a problem at the forefront. It may be that we need to recon-sider how networks are trained to properly learn the salient features of the

47

Page 48: An ensemble approach to the structure-function problem in

structure function problem in microbial communities [222].

What we do and do not learn from a statistical approach

What can the ensemble approach coupled with a statistical analysis likethe one described above teach us about communities? We contend that bylooking at an ensemble of well-chosen communities, and learning the mainstatistical features of community structure that determine function we canbegin to learn what general properties of communities must be present toadmit their functional properties. A handful of recent studies have begunto show the power of this approach [123, 9, 10, 52]. What we recover fromthese studies is what reproducible features of communities are retained acrossreplicate consortia. These reproducible features can be regarded as ‘goodvariables’ for predicting community function from structure. In some cases,understanding these variables lets us control or predict the functional prop-erties of consortia in synthetic systems [52] and in hosts [9]. Ultimately, wehope this approach can be used to design, predict and control microbial con-sortia in engineered and wild contexts to address the existential threat ofclimate change.

However, even when these statistical approaches succeed in predictingcommunity structure from function we often do not understand why, at amechanistic level, the prediction succeeds. For example, in the case of Gowdaet al. [52] the reason for the success of the regression from gene content tophenotypes is not entirely clear. Nor do we believe that the regression canpredict the impact of gene gain or loss mutations. Similarly, in the case ofBlanton et al. [9] the precise metabolic role of each bacterial taxon in thestunting of the host is unclear. In essence, the statistical approach lets usfind good variables for design and control of communities, but it does not,by itself, tell us why these are good variables.

Future directions: evolutionary rules of the structure-function map-ping

So, while the ensemble approach can help us solve the structure-functionproblem the deeper question of why nature constructs communities the way itdoes remains. We argue that the answer to this question will require consid-ering the eco-evolutionary basis of the observed structure function mapping.Addressing this question is subject enough for a separate manuscript. How-ever, we hope that through the careful application of quantitative methods,

48

Page 49: An ensemble approach to the structure-function problem in

like those discussed here, to some of the model systems discussed above, wecan open the door to understanding how nature constructs dynamic func-tional consortia.

References

[1] P. Falkowski, R. J. Scholes, E. Boyle, J. Canadell, D. Canfield, J. Elser,N. Gruber, K. Hibbard, P. Hogberg, S. Linder, F. T. Mackenzie, B. M.Iii, T. Pedersen, Y. Rosenthal, S. Seitzinger, V. Smetacek, W. Steffen,The Global Carbon Cycle: A Test of Our Knowledge of Earth as aSystem, Science 290 (2000) 291–296. Publisher: American Associationfor the Advancement of Science Section: Review.

[2] D. E. Canfield, F. J. Stewart, B. Thamdrup, L. De Brabandere, T. Dals-gaard, E. F. Delong, N. P. Revsbech, O. Ulloa, A Cryptic Sulfur Cyclein Oxygen-Minimum-Zone Waters off the Chilean Coast, Science (NewYork, N.Y.) 330 (2010) 1375–1378.

[3] M. B. Nelson, A. C. Martiny, J. B. H. Martiny, Global biogeographyof microbial nitrogen-cycling traits in soil, Proceedings of the NationalAcademy of Sciences 113 (2016) 8033–8040.

[4] S. Sunagawa, L. P. Coelho, S. Chaffron, J. R. Kultima, K. Labadie,G. Salazar, B. Djahanschiri, G. Zeller, D. R. Mende, A. Alberti, F. M.Cornejo-Castillo, P. I. Costea, C. Cruaud, F. d’Ovidio, S. Engelen,I. Ferrera, J. M. Gasol, L. Guidi, F. Hildebrand, F. Kokoszka, C. Lep-oivre, G. Lima-Mendez, J. Poulain, B. T. Poulos, M. Royo-Llonch,H. Sarmento, S. Vieira-Silva, C. Dimier, M. Picheral, S. Searson,S. Kandels-Lewis, Tara Oceans coordinators, C. Bowler, C. de Var-gas, G. Gorsky, N. Grimsley, P. Hingamp, D. Iudicone, O. Jaillon,F. Not, H. Ogata, S. Pesant, S. Speich, L. Stemmann, M. B. Sullivan,J. Weissenbach, P. Wincker, E. Karsenti, J. Raes, S. G. Acinas, P. Bork,E. Boss, C. Bowler, M. Follows, L. Karp-Boss, U. Krzic, E. G. Rey-naud, C. Sardet, M. Sieracki, D. Velayoudon, Structure and functionof the global ocean microbiome, Science 348 (2015) 1261359–1261359.

[5] E. J. Zakem, M. F. Polz, M. J. Follows, Redox-informed models ofglobal biogeochemical cycles, Nature Communications 11 (2020) 5680.Number: 1 Publisher: Nature Publishing Group.

49

Page 50: An ensemble approach to the structure-function problem in

[6] B. T. W. Bocher, K. Cherukuri, J. S. Maki, M. Johnson, D. H. Zito-mer, Relating methanogen community structure and anaerobic digesterfunction, Water Research 70 (2015) 425–435.

[7] D. F. Toerien, W. H. J. Hattingh, Anaerobic digestion I. The microbi-ology of anaerobic digestion, Water Research 3 (1969) 385–416.

[8] I. Vanwonterghem, P. D. Jensen, P. G. Dennis, P. Hugenholtz,K. Rabaey, G. W. Tyson, Deterministic processes guide long-term syn-chronised population dynamics in replicate anaerobic digesters, TheISME Journal 8 (2014) 2015–2028. Number: 10 Publisher: NaturePublishing Group.

[9] L. V. Blanton, M. R. Charbonneau, T. Salih, M. J. Barratt,S. Venkatesh, O. Ilkaveya, S. Subramanian, M. J. Manary, I. Trehan,J. M. Jorgensen, Y.-M. Fan, B. Henrissat, S. A. Leyn, D. A. Rodi-onov, A. L. Osterman, K. M. Maleta, C. B. Newgard, P. Ashorn, K. G.Dewey, J. I. Gordon, Gut bacteria that prevent growth impairmentstransmitted by microbiota from malnourished children, Science (NewYork, N.Y.) 351 (2016) aad3311.

[10] A. S. Raman, J. L. Gehrig, S. Venkatesh, H.-W. Chang, M. C. Hib-berd, S. Subramanian, G. Kang, P. O. Bessong, A. A. Lima, M. N.Kosek, W. A. Petri, D. A. Rodionov, A. A. Arzamasov, S. A. Leyn,A. L. Osterman, S. Huq, I. Mostafa, M. Islam, M. Mahfuz, R. Haque,T. Ahmed, M. J. Barratt, J. I. Gordon, A sparse covarying unit thatdescribes healthy and impaired human gut microbiota development,Science 365 (2019) eaau4735. Publisher: American Association for theAdvancement of Science.

[11] M. Bahram, F. Hildebrand, S. K. Forslund, J. L. Anderson, N. A.Soudzilovskaia, P. M. Bodegom, J. Bengtsson-Palme, S. Anslan, L. P.Coelho, H. Harend, J. Huerta-Cepas, M. H. Medema, M. R. Maltz,S. Mundra, P. A. Olsson, M. Pent, S. Polme, S. Sunagawa, M. Ryberg,L. Tedersoo, P. Bork, Structure and function of the global topsoilmicrobiome, Nature 560 (2018) 233–237. Number: 7717 Publisher:Nature Publishing Group.

[12] E. Bock, M. Wagner, Oxidation of Inorganic Nitrogen Compounds as

50

Page 51: An ensemble approach to the structure-function problem in

an Energy Source, Springer Berlin Heidelberg, Berlin, Heidelberg, pp.83–118.

[13] A. Sanchez-Gorostiaga, D. Bajic, M. L. Osborne, J. F. Poyatos,A. Sanchez, High-order interactions distort the functional landscapeof microbial consortia, PLOS Biology 17 (2019) e3000550. Publisher:Public Library of Science.

[14] P. G. Falkowski, The role of phytoplankton photosynthesis in globalbiogeochemical cycles, Photosynthesis Research 39 (1994) 235–258.

[15] D. L. Kirchman, Processes in microbial ecology, Oxford UniversityPress, Oxford ; New York, 2012. OCLC: ocn757930826.

[16] M. T. Madigan, B. , Kelly S, B. , Daniel H, W. M. Sattley, D. A. Stahl,Brock biology of microorganisms, 2018. OCLC: 958205447.

[17] L. Y. Stein, M. G. Klotz, The nitrogen cycle, Current Biology 26(2016) R94–R98.

[18] A. Cydzik-Kwiatkowska, M. Zielinska, Bacterial communities in full-scale wastewater treatment systems, World Journal of Microbiology &Biotechnology 32 (2016) 66.

[19] P. J. Turnbaugh, R. E. Ley, M. Hamady, C. M. Fraser-Liggett,R. Knight, J. I. Gordon, The Human Microbiome Project, Nature449 (2007) 804–810.

[20] H. Tian, R. Xu, J. G. Canadell, R. L. Thompson, W. Wini-warter, P. Suntharalingam, E. A. Davidson, P. Ciais, R. B. Jackson,G. Janssens-Maenhout, M. J. Prather, P. Regnier, N. Pan, S. Pan,G. P. Peters, H. Shi, F. N. Tubiello, S. Zaehle, F. Zhou, A. Arneth,G. Battaglia, S. Berthet, L. Bopp, A. F. Bouwman, E. T. Buiten-huis, J. Chang, M. P. Chipperfield, S. R. S. Dangal, E. Dlugokencky,J. W. Elkins, B. D. Eyre, B. Fu, B. Hall, A. Ito, F. Joos, P. B. Krum-mel, A. Landolfi, G. G. Laruelle, R. Lauerwald, W. Li, S. Lienert,T. Maavara, M. MacLeod, D. B. Millet, S. Olin, P. K. Patra, R. G.Prinn, P. A. Raymond, D. J. Ruiz, G. R. van der Werf, N. Vuichard,J. Wang, R. F. Weiss, K. C. Wells, C. Wilson, J. Yang, Y. Yao, Acomprehensive quantification of global nitrous oxide sources and sinks,

51

Page 52: An ensemble approach to the structure-function problem in

Nature 586 (2020) 248–256. Number: 7828 Publisher: Nature Publish-ing Group.

[21] P. J. Turnbaugh, R. E. Ley, M. A. Mahowald, V. Magrini, E. R. Mardis,J. I. Gordon, An obesity-associated gut microbiome with increasedcapacity for energy harvest, Nature 444 (2006) 1027–1031.

[22] T. D. Lawley, S. Clare, A. W. Walker, M. D. Stares, T. R. Connor,C. Raisen, D. Goulding, R. Rad, F. Schreiber, C. Brandt, L. J. Deakin,D. J. Pickard, S. H. Duncan, H. J. Flint, T. G. Clark, J. Parkhill,G. Dougan, Targeted Restoration of the Intestinal Microbiota with aSimple, Defined Bacteriotherapy Resolves Relapsing Clostridium diffi-cile Disease in Mice, PLOS Pathogens 8 (2012) e1002995. Publisher:Public Library of Science.

[23] I. Junier, P. Fremont, O. Rivoire, Universal and idiosyncratic char-acteristic lengths in bacterial genomes, Physical Biology 15 (2018)035001. Publisher: IOP Publishing.

[24] A. S. Beliaev, M. F. Romine, M. Serres, H. C. Bernstein, B. E. Linggi,L. M. Markillie, N. G. Isern, W. B. Chrisler, L. A. Kucek, E. A. Hill,G. E. Pinchuk, D. A. Bryant, H. Steven Wiley, J. K. Fredrickson,A. Konopka, Inference of interactions in cyanobacterial–heterotrophicco-cultures via transcriptome sequencing, The ISME Journal 8 (2014)2243–2255.

[25] H. Mickalide, S. Kuehn, Higher-Order Interaction between SpeciesInhibits Bacterial Invasion of a Phototroph-Predator Microbial Com-munity, Cell Systems 9 (2019) 521–533.e10.

[26] C. Ratzke, J. Denk, J. Gore, Ecological suicide in microbes, Natureecology & evolution 2 (2018) 867–872.

[27] C. R. Woese, A new biology for a new century, Microbiology andMolecular Biology Reviews 68 (2004) 173–186.

[28] J. D. Orth, I. Thiele, B. Palsson, What is flux balance analysis?,Nature Biotechnology 28 (2010) 245–248. Number: 3 Publisher: NaturePublishing Group.

52

Page 53: An ensemble approach to the structure-function problem in

[29] W. Harcombe, W. Riehl, I. Dukovski, B. Granger, A. Betts, A. Lang,G. Bonilla, A. Kar, N. Leiby, P. Mehta, C. Marx, D. Segre, MetabolicResource Allocation in Individual Microbes Determines Ecosystem In-teractions and Spatial Dynamics, Cell Reports 7 (2014) 1104–1115.

[30] E. H. Wintermute, P. A. Silver, Emergent cooperation in microbialmetabolism, Molecular Systems Biology 6 (2010) 407.

[31] J. L. Borges, A. Hurley, Collected fictions, Viking, New York, NewYork, U.S.A., 1998. OCLC: 38989853.

[32] K. Vetsigian, R. Jajoo, R. Kishony, Structure and Evolution of Strep-tomyces Interaction Networks in Soil and In Silico, PLoS Biology 9(2011) e1001184.

[33] J. Friedman, L. M. Higgins, J. Gore, Community structure followssimple assembly rules in microbial microcosms, Nature Ecology &Evolution 1 (2017) 1034.

[34] E. F. Y. Hom, A. W. Murray, Niche engineering demonstrates a latentcapacity for fungal-algal mutualism, Science 345 (2014) 94–98.

[35] K. Amarnath, A. V. Narla, S. Pontrelli, J. Dong, T. Caglar, B. R. Tay-lor, J. Schwartzman, U. Sauer, O. X. Cordero, T. Hwa, Stress-inducedcross-feeding of internal metabolites provides a dynamic mechanism ofmicrobial cooperation, Technical Report, 2021. Company: Cold SpringHarbor Laboratory Distributor: Cold Spring Harbor Laboratory Label:Cold Spring Harbor Laboratory Section: New Results Type: article.

[36] P. W. Anderson, More Is Different, Science 177 (1972) 393–396. Pub-lisher: American Association for the Advancement of Science Section:Articles.

[37] F. Jacob, J. Monod, Genetic regulatory mechanisms in the synthesisof proteins, Journal of Molecular Biology 3 (1961) 318–356.

[38] W. G. Zumft, Cell biology and molecular basis of denitrification., Mi-crobiology and Molecular Biology Reviews 61 (1997) 533–616.

[39] U. Alon, M. G. Surette, N. Barkai, S. Leibler, Robustness in bacterialchemotaxis, Nature 397 (1999) 168–171. Bandiera abtest: a Cg type:

53

Page 54: An ensemble approach to the structure-function problem in

Nature Research Journals Number: 6715 Primary atype: ResearchPublisher: Nature Publishing Group.

[40] M. Basan, T. Honda, D. Christodoulou, M. Horl, Y.-F. Chang,E. Leoncini, A. Mukherjee, H. Okano, B. R. Taylor, J. M. Sil-verman, C. Sanchez, J. R. Williamson, J. Paulsson, T. Hwa,U. Sauer, A universal trade-off between growth and lag in fluctu-ating environments, Nature 584 (2020) 470–474. Bandiera abtest: aCg type: Nature Research Journals Number: 7821 Primary atype:Research Publisher: Nature Publishing Group Subject term: Bac-terial systems biology;Metabolomics;Systems biology Subject term id:bacterial-systems-biology;metabolomics;systems-biology.

[41] N. Halabi, O. Rivoire, S. Leibler, R. Ranganathan, Protein Sectors:Evolutionary Units of Three-Dimensional Structure, Cell 138 (2009)774–786. Publisher: Elsevier.

[42] V. Alba, J. E. Carthew, R. W. Carthew, M. Mani, Global constraintswithin the developmental program of the Drosophila wing, eLife 10(2021) e66750. Publisher: eLife Sciences Publications, Ltd.

[43] W. P. Russ, D. M. Lowery, P. Mishra, M. B. Yaffe, R. Ranganathan,Natural-like function in artificial WW domains, Nature 437 (2005)579–583. Number: 7058 Publisher: Nature Publishing Group.

[44] D. Jordan, S. Kuehn, E. Katifori, S. Leibler, Behavioral diversity inmicrobes and low-dimensional phenotypic spaces, Proceedings of theNational Academy of Sciences 110 (2013) 14018–14023.

[45] G. J. Berman, D. M. Choi, W. Bialek, J. W. Shaevitz, Mapping thestereotyped behaviour of freely moving fruit flies, Journal of the RoyalSociety, Interface 11 (2014) 20140672.

[46] A. Y. Katsov, L. Freifeld, M. Horowitz, S. Kuehn, T. R. Clandinin,Dynamic structure of locomotor behavior in walking fruit flies, eLife 6(2017) e26410. Publisher: eLife Sciences Publications, Ltd.

[47] D. M. Raup, A. Michelson, Theoretical Morphology of the Coiled Shell,Science, New Series 147 (1965) 1294–1295.

54

Page 55: An ensemble approach to the structure-function problem in

[48] O. Shoval, H. Sheftel, G. Shinar, Y. Hart, O. Ramote, A. Mayo,E. Dekel, K. Kavanagh, U. Alon, Evolutionary Trade-Offs, ParetoOptimality, and the Geometry of Phenotype Space, Science 336 (2012)1157–1160.

[49] J.-P. Eckmann, T. Tlusty, Dimensional Reduction in Complex LivingSystems: Where, Why, and How, BioEssays 43 (2021) 2100062. ArXiv:2103.02436.

[50] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander,R. Zecchina, J. N. Onuchic, T. Hwa, M. Weigt, Direct-coupling anal-ysis of residue coevolution captures native contacts across many pro-tein families, Proceedings of the National Academy of Sciences 108(2011) E1293–E1301. Publisher: National Academy of Sciences Sec-tion: PNAS Plus.

[51] W. P. Russ, M. Figliuzzi, C. Stocker, P. Barrat-Charlaix, M. Socol-ich, P. Kast, D. Hilvert, R. Monasson, S. Cocco, M. Weigt, R. Ran-ganathan, An evolution-based model for designing chorismate mutaseenzymes, Science 369 (2020) 440–445. Publisher: American Associationfor the Advancement of Science.

[52] K. Gowda, D. Ping, M. Mani, S. Kuehn, Genomic structure predictsmetabolite dynamics in microbial communities, biorxiv.org (2021).

[53] R. Knight, A. Vrbanac, B. C. Taylor, A. Aksenov, C. Callewaert,J. Debelius, A. Gonzalez, T. Kosciolek, L.-I. McCall, D. McDon-ald, A. V. Melnik, J. T. Morton, J. Navas, R. A. Quinn, J. G.Sanders, A. D. Swafford, L. R. Thompson, A. Tripathi, Z. Z. Xu,J. R. Zaneveld, Q. Zhu, J. G. Caporaso, P. C. Dorrestein, Bestpractices for analysing microbiomes, Nature Reviews Microbiology16 (2018) 410–422. Bandiera abtest: a Cg type: Nature ResearchJournals Number: 7 Primary atype: Reviews Publisher: NaturePublishing Group Subject term: Bioinformatics;Gene expressionanalysis;Genomic analysis;Metabolomics;Metagenomics;Microbialecology;Microbiome;Sequencing Subject term id:bioinformatics;gene-expression-analysis;genomic-analysis;metabolomics;metagenomics;microbial-ecology;microbiome;sequencing.

55

Page 56: An ensemble approach to the structure-function problem in

[54] M. Saleem, J. Hu, A. Jousset, More Than the Sum of Its Parts: Mi-crobiome Biodiversity as a Driver of Plant Growth and Soil Health,Annual Review of Ecology, Evolution, and Systematics 50 (2019) 145–168. eprint: https://doi.org/10.1146/annurev-ecolsys-110617-062605.

[55] R. Lal, Soil Carbon Sequestration Impacts on Global Climate Changeand Food Security, Science 304 (2004) 1623–1627. Publisher: AmericanAssociation for the Advancement of Science.

[56] M. U. F. Kirschbaum, The temperature dependence of soil organicmatter decomposition, and the effect of global warming on soil organicC storage, Soil Biology and Biochemistry 27 (1995) 753–760.

[57] S. D. Allison, M. D. Wallenstein, M. A. Bradford, Soil-carbon responseto warming dependent on microbial physiology, Nature Geoscience 3(2010) 336–340. Number: 5 Publisher: Nature Publishing Group.

[58] E. B. Graham, J. E. Knelman, A. Schindlbacher, S. Siciliano, M. Breul-mann, A. Yannarell, J. M. Beman, G. Abell, L. Philippot, J. Prosser,A. Foulquier, J. C. Yuste, H. C. Glanville, D. L. Jones, R. Angel,J. Salminen, R. J. Newton, H. Burgmann, L. J. Ingram, U. Hamer,H. M. P. Siljanen, K. Peltoniemi, K. Potthast, L. Baneras, M. Hart-mann, S. Banerjee, R.-Q. Yu, G. Nogaro, A. Richter, M. Koranda, S. C.Castle, M. Goberna, B. Song, A. Chatterjee, O. C. Nunes, A. R. Lopes,Y. Cao, A. Kaisermann, S. Hallin, M. S. Strickland, J. Garcia-Pausas,J. Barba, H. Kang, K. Isobe, S. Papaspyrou, R. Pastorelli, A. Lago-marsino, E. S. Lindstrom, N. Basiliko, D. R. Nemergut, Microbes asEngines of Ecosystem Function: When Does Community Structure En-hance Predictions of Ecosystem Processes?, Frontiers in Microbiology7 (2016) 214.

[59] M. C. Rillig, M. Ryo, A. Lehmann, C. A. Aguilar-Trigueros, S. Buchert,A. Wulf, A. Iwasaki, J. Roy, G. Yang, The role of multiple global changefactors in driving soil functions and microbial biodiversity, Science 366(2019) 886–890. Publisher: American Association for the Advancementof Science Section: Report.

[60] J. D. Rocca, E. K. Hall, J. T. Lennon, S. E. Evans, M. P. Waldrop,J. B. Cotner, D. R. Nemergut, E. B. Graham, M. D. Wallenstein, Rela-tionships between protein-encoding gene abundance and corresponding

56

Page 57: An ensemble approach to the structure-function problem in

process are commonly assumed yet rarely observed, The ISME Journal9 (2015) 1693–1699. Number: 8 Publisher: Nature Publishing Group.

[61] N. Fierer, Embracing the unknown: disentangling the complexities ofthe soil microbiome, Nature Reviews Microbiology 15 (2017) 579–590.Number: 10 Publisher: Nature Publishing Group.

[62] A. Crits-Christoph, S. Diamond, C. N. Butterfield, B. C. Thomas,J. F. Banfield, Novel soil bacteria possess diverse genes forsecondary metabolite biosynthesis, Nature 558 (2018) 440–444.Bandiera abtest: a Cg type: Nature Research Journals Number: 7710Primary atype: Research Publisher: Nature Publishing Group Sub-ject term: Antibiotics;Metagenomics;Microbial ecology;Soil microbiol-ogy Subject term id: antibiotics;metagenomics;microbial-ecology;soil-microbiology.

[63] M. R. McLaren, A. D. Willis, B. J. Callahan, Consistent and cor-rectable bias in metagenomic sequencing experiments, eLife 8 (2019)e46923. Publisher: eLife Sciences Publications, Ltd.

[64] R. O. Miller, D. E. Kissel, Comparison of Soil pHMethods on Soils of North America, Soil Science So-ciety of America Journal 74 (2010) 310–316. eprint:https://onlinelibrary.wiley.com/doi/pdf/10.2136/sssaj2008.0047.

[65] M. Keiluweit, T. Wanzek, M. Kleber, P. Nico, S. Fendorf, Anaerobicmicrosites have an unaccounted role in soil carbon stabilization, NatureCommunications 8 (2017) 1771. Bandiera abtest: a Cc license type:cc by Cg type: Nature Research Journals Number: 1 Primary atype:Research Publisher: Nature Publishing Group Subject term: Carboncycle Subject term id: carbon-cycle.

[66] H. Teeling, B. M. Fuchs, D. Becher, C. Klockow, A. Gardebrecht,C. M. Bennke, M. Kassabgy, S. Huang, A. J. Mann, J. Waldmann,M. Weber, A. Klindworth, A. Otto, J. Lange, J. Bernhardt, C. Rein-sch, M. Hecker, J. Peplies, F. D. Bockelmann, U. Callies, G. Gerdts,A. Wichels, K. H. Wiltshire, F. O. Glockner, T. Schweder, R. Amann,Substrate-controlled succession of marine bacterioplankton populationsinduced by a phytoplankton bloom, Science 336 (2012) 608–611.

57

Page 58: An ensemble approach to the structure-function problem in

[67] J. A. Kimbrel, T. J. Samo, C. Ward, D. Nilson, M. P. Thelen, A. Sic-cardi, P. Zimba, T. W. Lane, X. Mayali, Host selection and stochasticeffects influence bacterial community assembly on the microalgal phy-cosphere, Algal Research 40 (2019) 101489.

[68] I. Louati, N. Pascault, D. Debroas, C. Bernard, J.-F. Humbert,J. Leloup, Structural Diversity of Bacterial Communities Associatedwith Bloom-Forming Freshwater Cyanobacteria Differs According tothe Cyanobacterial Genus, PLOS ONE 10 (2015) e0140614.

[69] G. A. McFeters, S. A. Stuart, S. B. Olson, Growth of HeterotrophicBacteria and Algal Extracellular Products in Oligotrophic Waters, Ap-plied and Environmental Microbiology 35 (1978) 383–391.

[70] N. N. Parulekar, P. Kolekar, A. Jenkins, S. Kleiven, H. Utkilen, A. Jo-hansen, S. Sawant, U. Kulkarni-Kale, M. Kale, M. Sæbø, Character-ization of bacterial community associated with phytoplankton bloomin a eutrophic lake in South Norway using 16S rRNA gene ampliconsequence analysis, PLOS ONE 12 (2017) e0173408.

[71] R. Ramanan, Z. Kang, B.-H. Kim, D.-H. Cho, L. Jin, H.-M. Oh, H.-S.Kim, Phycosphere bacterial diversity in green algae reveals an apparentsimilarity across habitats, Algal Research 8 (2015) 140–144.

[72] A. Buchan, G. R. LeCleir, C. A. Gulvik, J. M. Gonzalez, Master recy-clers: features and functions of bacteria associated with phytoplanktonblooms., 2014.

[73] L. Riemann, G. F. Steward, F. Azam, Dynamics of Bacterial Com-munity Composition and Activity during a Mesocosm Diatom Bloom,APPL. ENVIRON. MICROBIOL. 66 (2000) 11.

[74] L. M. d. J. Astacio, K. H. Prabhakara, Z. Li, H. Mickalide, S. Kuehn,Closed microbial communities self-organize to persistently cycle car-bon, bioRxiv (2020) 2020.05.28.121848. Publisher: Cold Spring HarborLaboratory Section: New Results.

[75] A. L. Alldredge, M. W. Silver, Characteristics, dynamics and signifi-cance of marine snow, Progress in Oceanography 20 (1988) 41–82.

58

Page 59: An ensemble approach to the structure-function problem in

[76] M. Gralka, R. Szabo, R. Stocker, O. X. Cordero, Trophic Interactionsand the Drivers of Microbial Community Assembly, Current Biology30 (2020) R1176–R1188.

[77] T. Kiørboe, The Sea Core Sampler: a simple water sampler that al-lows direct observations of undisturbed plankton, Journal of PlanktonResearch 29 (2007) 545–552.

[78] T. Kiørboe, K. Tang, H. P. Grossart, H. Ploug, Dynamics of micro-bial communities on marine snow aggregates: Colonization, growth,detachment, and grazing mortality of attached bacteria, Applied andEnvironmental Microbiology 69 (2003) 3036–3047.

[79] M. S. Datta, E. Sliwerska, J. Gore, M. F. Polz, O. X. Cordero, Micro-bial interactions lead to rapid micro-scale successions on model marineparticles, Nature Communications 7 (2016) 11965.

[80] A. Ebrahimi, J. Schwartzman, O. X. Cordero, Cooperation and spatialself-organization determine rate and efficiency of particulate organicmatter degradation in marine bacteria, Proceedings of the NationalAcademy of Sciences 116 (2019) 23309–23316. Publisher: NationalAcademy of Sciences Section: PNAS Plus.

[81] T. N. Enke, M. S. Datta, J. Schwartzman, N. Cermak, D. Schmitz,J. Barrere, A. Pascual-Garcıa, O. X. Cordero, Modular Assembly ofPolysaccharide-Degrading Marine Microbial Communities, Current Bi-ology 29 (2019) 1528–1535.e6.

[82] S. Pontrelli, R. Szabo, S. Pollak, J. Schwartzman, D. Ledezma-Tejeida,O. X. Cordero, U. Sauer, Hierarchical control of microbial commu-nity assembly, Technical Report, 2021. Company: Cold Spring HarborLaboratory Distributor: Cold Spring Harbor Laboratory Label: ColdSpring Harbor Laboratory Section: New Results Type: article.

[83] L. A. Amaral-Zettler, E. R. Zettler, T. J. Mincer, Ecology of the plas-tisphere, 2020.

[84] J. R. Jambeck, R. Geyer, C. Wilcox, T. R. Siegler, M. Perryman,A. Andrady, R. Narayan, K. L. Law, Plastic waste inputs from landinto the ocean, Science 347 (2015) 768–771.

59

Page 60: An ensemble approach to the structure-function problem in

[85] C. Dussud, A. L. Meistertzheim, P. Conan, M. Pujo-Pay, M. George,P. Fabre, J. Coudane, P. Higgs, A. Elineau, M. L. Pedrotti, G. Gorsky,J. F. Ghiglione, Evidence of niche partitioning among bacteria living onplastics, organic particles and surrounding seawaters, EnvironmentalPollution 236 (2018) 807–816.

[86] I. V. Kirstein, A. Wichels, E. Gullans, G. Krohne, G. Gerdts, ThePlastisphere – Uncovering tightly attached plastic “specific” microor-ganisms, PLOS ONE 14 (2019) e0215859.

[87] C. G. Klatt, W. P. Inskeep, M. J. Herrgard, Z. J. Jay, D. B. Rusch,S. Tringe, M. Niki Parenteau, D. M. Ward, S. M. Boomer, D. A.Bryant, S. Miller, Community Structure and Function of High-Temperature Chlorophototrophic Microbial Mats Inhabiting DiverseGeothermal Environments, Frontiers in Microbiology 4 (2013) 106.

[88] D. M. Ward, M. J. Ferris, S. C. Nold, M. M. Bateson, A NaturalView of Microbial Biodiversity within Hot Spring Cyanobacterial MatCommunities, Microbiology and Molecular Biology Reviews 62 (1998)1353–1370. Publisher: American Society for Microbiology Section: AR-TICLE.

[89] M. M. Bateson, D. M. Ward, Photoexcretion and Fate of Glycolate ina Hot Spring Cyanobacterial Mat, Applied and Environmental Micro-biology 54 (1988) 1738–1743.

[90] K. L. Anderson, T. A. Tayne, D. M. Ward, Formation and Fate of Fer-mentation Products in Hot Spring Cyanobacterial Mats, Applied andEnvironmental Microbiology 53 (1987) 2343–2352. Publisher: Ameri-can Society for Microbiology Section: General Microbial Ecology.

[91] J. Z. Lee, R. C. Everroad, U. Karaoz, A. M. Detweiler, J. Pett-Ridge,P. K. Weber, L. Prufert-Bebout, B. M. Bebout, Metagenomics revealsniche partitioning within the phototrophic zone of a microbial mat,PLOS ONE 13 (2018) e0202792. Publisher: Public Library of Science.

[92] A.-S. Steunou, D. Bhaya, M. M. Bateson, M. C. Melendrez, D. M.Ward, E. Brecht, J. W. Peters, M. Kuhl, A. R. Grossman, In situanalysis of nitrogen fixation and metabolic switching in unicellular

60

Page 61: An ensemble approach to the structure-function problem in

thermophilic cyanobacteria inhabiting hot spring microbial mats, Pro-ceedings of the National Academy of Sciences 103 (2006) 2398–2403.

[93] M. J. Rosen, M. Davison, D. Bhaya, D. S. Fisher, Fine-scale diver-sity and extensive recombination in a quasisexual bacterial populationoccupying a broad niche, Science 348 (2015) 1019–1023. Publisher:American Association for the Advancement of Science Section: Re-port.

[94] J. Z. Lee, R. C. Everroad, U. Karaoz, A. M. Detweiler, J. Pett-Ridge,P. K. Weber, L. Prufert-Bebout, B. M. Bebout, Metagenomics revealsniche partitioning within the phototrophic zone of a microbial mat,PLOS ONE 13 (2018) e0202792.

[95] A. P. Petroff, F. Tejera, A. Libchaber, Subsurface Microbial Ecosys-tems: A Photon Flux and a Metabolic Cascade, Journal of StatisticalPhysics 167 (2017) 763–776.

[96] F. Tejera, A. Libchaber, A. P. Petroff, Oxygen dynamics in a two-dimensional microbial ecosystem, Physical Review E 98 (2018) 042409.

[97] C. M. Callbeck, G. Lavik, T. G. Ferdelman, B. Fuchs, H. R. Gruber-Vodicka, P. F. Hach, S. Littmann, N. J. Schoffelen, T. Kalvelage,S. Thomsen, H. Schunck, C. R. Loscher, R. A. Schmitz, M. M. M.Kuypers, Oxygen minimum zone cryptic sulfur cycling sustained byoffshore transport of key sulfur oxidizing bacteria, Nature Communi-cations (2018) 1–11.

[98] E. G. Wilbanks, U. Jaekel, V. Salman, P. T. Humphrey, J. A. Eisen,M. T. Facciotti, D. H. Buckley, S. H. Zinder, G. K. Druschel, D. A. Fike,V. J. Orphan, Microscale sulfur cycling in the phototrophic pink berryconsortia of the Sippewissett Salt Marsh., Environmental Microbiology16 (2014) 3398–3415.

[99] A. P. Seitz, T. H. Nielsen, J. Overmann, Physiology of purple sulfurbacteria forming macroscopic aggregates in Great Sippewissett SaltMarsh, Massachusetts 12 (1993) 225–235.

[100] S. Dyksma, K. Bischof, B. M. Fuchs, K. Hoffmann, D. Meier, A. Mey-erdierks, P. Pjevac, D. Probandt, M. Richter, R. Stepanauskas,

61

Page 62: An ensemble approach to the structure-function problem in

M. Mußmann, Ubiquitous Gammaproteobacteria dominate dark car-bon fixation in coastal sediments, The ISME Journal (2016) 1–15.

[101] W. Liamleam, A. P. Annachhatre, Electron donors for biological sulfatereduction., Biotechnology advances 25 (2007) 452–463.

[102] D. Enning, J. Garrelfs, Corrosion of iron by sulfate-reducing bacteria:new views of an old problem., Applied and Environmental Microbiology80 (2014) 1226–1236.

[103] B. E. Wolfe, J. E. Button, M. Santarelli, R. J. Dutton, Cheese rindcommunities provide tractable systems for in situ and in vitro studiesof microbial diversity, Cell 158 (2014) 422–433.

[104] S. Blasche, Y. Kim, R. A. Mars, D. Machado, M. Maansson, E. Kafkia,A. Milanese, G. Zeller, B. Teusink, J. Nielsen, V. Benes, R. Neves,U. Sauer, K. R. Patil, Metabolic cooperation and spatiotemporal nichepartitioning in a kefir microbial community, Nature Microbiology 6(2021) 196–208.

[105] J. J. Werner, D. Knights, M. L. Garcia, N. B. Scalfone, S. Smith,K. Yarasheski, T. A. Cummings, A. R. Beers, R. Knight, L. T. An-genent, Bacterial community structures are unique and resilient infull-scale bioenergy systems., Proceedings of the National Academy ofSciences of the United States of America 108 (2011) 4158–4163.

[106] A. S. Fernandez, S. A. Hashsham, S. L. Dollhopf, L. Raskin, O. Glagol-eva, F. B. Dazzo, R. F. Hickey, C. S. Criddle, J. M. Tiedje, Flexi-ble community structure correlates with stable community function inmethanogenic bioreactor communities perturbed by glucose., Appliedand Environmental Microbiology 66 (2000) 4058–4067.

[107] S. A. Hashsham, A. S. Fernandez, S. L. Dollhopf, F. B. Dazzo, R. F.Hickey, J. M. Tiedje, C. S. Criddle, Parallel processing of substratecorrelates with greater functional stability in methanogenic bioreac-tor communities perturbed by glucose., Applied and EnvironmentalMicrobiology 66 (2000) 4050–4057.

[108] A. Fernandez, S. Huang, S. Seston, J. Xing, R. Hickey, C. Criddle,J. Tiedje, How stable is stable? Function versus community composi-tion., Applied and Environmental Microbiology 65 (1999) 3697–3704.

62

Page 63: An ensemble approach to the structure-function problem in

[109] S. Louca, M. F. Polz, F. Mazel, M. B. N. Albright, J. A. Huber, M. I.O’Connor, M. Ackermann, A. S. Hahn, D. S. Srivastava, S. A. Crowe,M. Doebeli, L. W. Parfrey, Function and functional redundancy inmicrobial systems, Nature Ecology & Evolution 2 (2018) 936–943.

[110] G. A. Zavarzin, Winogradsky and modern microbiology, 2006.

[111] E. Pagaling, F. Strathdee, B. M. Spears, M. E. Cates, R. J. Allen,A. Free, Community history affects the predictability of microbialecosystem development, The ISME Journal 8 (2014) 19–30.

[112] E. Pagaling, K. Vassileva, C. G. Mills, T. Bush, R. A. Blythe,J. Schwarz-Linek, F. Strathdee, R. J. Allen, A. Free, Assem-bly of microbial communities in replicate nutrient-cycling modelecosystems follows divergent trajectories, leading to alternate stablestates, Environmental Microbiology 19 (2017) 3374–3386. eprint:https://sfamjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/1462-2920.13849.

[113] R. A. Quinn, K. Whiteson, Y.-W. Lim, P. Salamon, B. Bailey, S. Mien-ardi, S. E. Sanchez, D. Blake, D. Conrad, F. Rohwer, A Winogradsky-based culture system shows an association between microbial fermen-tation and cystic fibrosis exacerbation, The ISME Journal 9 (2015)1024–1038.

[114] M. C. Rillig, J. Antonovics, Microbial biospherics: The experimentalstudy of ecosystem function and evolution, Proceedings of the NationalAcademy of Sciences 116 (2019) 11093–11098.

[115] C. K. Obenhuber, C. E. Folsome, Carbon recycling in materially closedecological life support systems, BioSystems 21 (1988) 165–173.

[116] C. K. Obenhuber, C. E. Folsome, Eucaryote/procaryote ratio as an in-dicator of stability for closed ecological systems, BioSystems 16 (1984)291–296.

[117] E. A. Kearns, C. E. Folsome, Measurement of biological activity inmaterially closed microbial ecosystems, BioSystems 14 (1981) 205–209.

63

Page 64: An ensemble approach to the structure-function problem in

[118] Z. Kawabata, K. Matsui, K. Okazaki, M. Nasu, N. Nakano, T. Sugai,Synthesis of a species-defined microcosm with protozoa, Journal ofProtozoology Research 5 (1995) 23–26.

[119] F. B. Taub, Closed Ecological Systems, Annual Review of Ecology andSystematics 5 (1974) 139–160.

[120] F. B. Taub, Community metabolism of aquatic Closed Ecological Sys-tems: Effects of nitrogen sources, Advances in Space Research 44(2009) 949–957.

[121] D. R. Hekstra, S. Leibler, Contingency and statistical laws in replicatemicrobial closed ecosystems, Cell 149 (2012) 1164–1173.

[122] Z. Frentz, S. Kuehn, S. Leibler, Strongly Deterministic PopulationDynamics in Closed Microbial Communities, Physical Review X 5(2015) 041014.

[123] J. E. Goldford, N. Lu, D. Bajic, S. Estrela, M. Tikhonov, A. Sanchez-Gorostiaga, D. Segre, P. Mehta, A. Sanchez, Emergent simplicity inmicrobial community assembly, Science (New York, N.Y.) 361 (2018)469–474.

[124] S. Estrela, J. Vila, N. Lu, D. Bajic, M. R.-G. bioRxiv, 2020, Metabolicrules of microbial community assembly, biorxiv.org (????).

[125] K. R. Foster, T. Bell, Competition, Not Cooperation, Dominates In-teractions among Culturable Microbial Species, Current Biology 22(2012) 1845–1850.

[126] X. Yu, M. F. Polz, E. J. Alm, Interactions in self-assembled microbialcommunities saturate with diversity, The ISME Journal (2019) 1–16.

[127] J. Kehe, A. Kulesa, A. Ortiz, C. M. Ackerman, S. G. Thakku, D. Sell-ers, S. Kuehn, J. Gore, J. Friedman, P. C. Blainey, Massively parallelscreening of synthetic microbial communities, Proceedings of the Na-tional Academy of Sciences 116 (2019) 12804–12809.

[128] J. Kehe, A. Ortiz, A. Kulesa, J. Gore, P. C. Blainey, J. Friedman, Pos-itive interactions are common among culturable bacteria, biorxiv.org(2020) 2020.06.24.169474.

64

Page 65: An ensemble approach to the structure-function problem in

[129] P. Lycus, K. Lovise Bøthun, L. Bergaust, J. Peele Shapleigh,L. Reier Bakken, A. Frostegard, Phenotypic and genotypic richnessof denitrifiers revealed by a novel isolation strategy, The ISME Journal11 (2017) 2219–2232.

[130] J. M. Tiedje, A. J. Sexstone, D. D. Myrold, J. A. Robinson, Deni-trification: ecological niches, competition and survival, Antonie vanLeeuwenhoek 48 (1983) 569–583.

[131] E. E. Lilja, D. R. Johnson, Segregating metabolic processes into differ-ent microbial cells accelerates the consumption of inhibitory substrates,The ISME Journal 10 (2016) 1568–1578.

[132] E. E. Lilja, D. R. Johnson, Substrate cross-feeding affects the speedand trajectory of molecular evolution within a synthetic microbial as-semblage, BMC Evolutionary Biology 19 (2019) 129.

[133] F. Goldschmidt, R. R. Regoes, D. R. Johnson, Metabolite toxicityslows local diversity loss during expansion of a microbial cross-feedingcommunity, The ISME Journal 12 (2018) 136–144.

[134] H. K. Carlson, L. M. Lui, M. N. Price, A. E. Kazakov, A. V. Carr, J. V.Kuehl, T. K. Owens, T. Nielsen, A. P. Arkin, A. M. Deutschbauer, Se-lective carbon sources influence the end products of microbial nitraterespiration, The ISME Journal 14 (2020) 2034–2045. Bandiera abtest:a Cc license type: cc by Cg type: Nature Research Journals Num-ber: 8 Primary atype: Research Publisher: Nature Publishing GroupSubject term: Environmental microbiology;Microbial ecology Sub-ject term id: environmental-microbiology;microbial-ecology.

[135] L. W. Hugerth, A. F. Andersson, Analysing Microbial CommunityComposition through Amplicon Sequencing: From Sampling to Hy-pothesis Testing, Frontiers in Microbiology 8 (2017) 1561.

[136] J. AMANN, Die direkte Zahlung der Wasserbakterien mittels des Ul-tramikroskops, Centralbl. f. Bakteriol. 29 (1911) 381–384.

[137] J. T. Staley, A. Konopka, Measurement of in Situ Activities ofNonphotosynthetic Microorganisms in Aquatic and Terrestrial Habi-tats, Annual Review of Microbiology 39 (1985) 321–346. eprint:https://doi.org/10.1146/annurev.mi.39.100185.001541.

65

Page 66: An ensemble approach to the structure-function problem in

[138] J.-C. Lagier, F. Armougom, M. Million, P. Hugon, I. Pagnier,C. Robert, F. Bittar, G. Fournous, G. Gimenez, M. Maraninchi, J.-F.Trape, E. V. Koonin, B. La Scola, D. Raoult, Microbial culturomics:paradigm shift in the human gut microbiome study, Clinical Microbi-ology and Infection: The Official Publication of the European Societyof Clinical Microbiology and Infectious Diseases 18 (2012) 1185–1193.

[139] P. Seng, M. Drancourt, F. Gouriet, B. La Scola, P.-E. Fournier, J. M.Rolain, D. Raoult, Ongoing revolution in bacteriology: routine identi-fication of bacteria by matrix-assisted laser desorption ionization time-of-flight mass spectrometry, Clinical Infectious Diseases: An OfficialPublication of the Infectious Diseases Society of America 49 (2009)543–551.

[140] J.-C. Lagier, S. Khelaifia, M. T. Alou, S. Ndongo, N. Dione, P. Hugon,A. Caputo, F. Cadoret, S. I. Traore, E. H. Seck, G. Dubourg, G. Du-rand, G. Mourembou, E. Guilhot, A. Togo, S. Bellali, D. Bachar,N. Cassir, F. Bittar, J. Delerce, M. Mailhe, D. Ricaboni, M. Bilen,N. P. M. Dangui Nieko, N. M. Dia Badiane, C. Valles, D. Mouelhi,K. Diop, M. Million, D. Musso, J. Abrahao, E. I. Azhar, F. Bibi,M. Yasir, A. Diallo, C. Sokhna, F. Djossou, V. Vitton, C. Robert,J. M. Rolain, B. La Scola, P.-E. Fournier, A. Levasseur, D. Raoult,Culture of previously uncultured members of the human gut micro-biota by culturomics, Nature Microbiology 1 (2016) 1–8. Number: 12Publisher: Nature Publishing Group.

[141] K. Zengler, G. Toledo, M. Rappe, J. Elkins, E. J. Mathur, J. M.Short, M. Keller, Cultivating the uncultured, Proceedings of the Na-tional Academy of Sciences 99 (2002) 15681–15686. Publisher: NationalAcademy of Sciences Section: Biological Sciences.

[142] K. Zengler, M. Walcher, G. Clark, I. Haller, G. Toledo, T. Holland, E. J.Mathur, G. Woodnutt, J. M. Short, M. Keller, High-Throughput Cul-tivation of Microorganisms Using Microcapsules, in: Methods in En-zymology, volume 397 of Environmental Microbiology, Academic Press,2005, pp. 124–130.

[143] T. Kaeberlein, K. Lewis, S. S. Epstein, Isolating ”uncultivable” mi-croorganisms in pure culture in a simulated natural environment, Sci-ence (New York, N.Y.) 296 (2002) 1127–1129.

66

Page 67: An ensemble approach to the structure-function problem in

[144] A. Bollmann, K. Lewis, S. S. Epstein, Incubation of EnvironmentalSamples in a Diffusion Chamber Increases the Diversity of RecoveredIsolates, Applied and Environmental Microbiology 73 (2007) 6386–6390. Publisher: American Society for Microbiology Section: MICRO-BIAL ECOLOGY.

[145] D. K. Chaudhary, A. Khulan, J. Kim, Development of a novel cultiva-tion technique for uncultured soil bacteria, Scientific Reports 9 (2019)6666. Number: 1 Publisher: Nature Publishing Group.

[146] B. Berdy, A. L. Spoering, L. L. Ling, S. S. Epstein, In situ cultivationof previously uncultivable microorganisms using the ichip, Nature Pro-tocols 12 (2017) 2232–2242. Number: 10 Publisher: Nature PublishingGroup.

[147] Y. Aoi, T. Kinoshita, T. Hata, H. Ohta, H. Obokata, S. Tsuneda,Hollow-Fiber Membrane Chamber as a Device for In Situ Environmen-tal Cultivation, Applied and Environmental Microbiology 75 (2009)3826–3833. Publisher: American Society for Microbiology Section:METHODS.

[148] H. P. Browne, S. C. Forster, B. O. Anonye, N. Kumar, B. A. Neville,M. D. Stares, D. Goulding, T. D. Lawley, Culturing of ‘unculturable’human microbiota reveals novel taxa and extensive sporulation, Na-ture 533 (2016) 543–546. Number: 7604 Publisher: Nature PublishingGroup.

[149] K. L. Cross, J. H. Campbell, M. Balachandran, A. G. Campbell, S. J.Cooper, A. Griffen, M. Heaton, S. Joshi, D. Klingeman, E. Leys,Z. Yang, J. M. Parks, M. Podar, Targeted isolation and cultivationof uncultivated bacteria by reverse genomics, Nature biotechnology 37(2019) 1314–1321.

[150] P. Blainey, A. Kulesa, J. Kehe, Massively parallel on-chip coalescenceof microemulsions, 2018.

[151] J. Kehe, A. Kulesa, A. Ortiz, C. M. Ackerman, S. G. Thakku, D. Sell-ers, S. Kuehn, J. Gore, J. Friedman, P. C. Blainey, Massively paral-lel screening of synthetic microbial communities, Proceedings of the

67

Page 68: An ensemble approach to the structure-function problem in

National Academy of Sciences 116 (2019) 12804–12809. Publisher: Na-tional Academy of Sciences Section: Biological Sciences.

[152] J. Park, A. Kerner, M. A. Burns, X. N. Lin, Microdroplet-EnabledHighly Parallel Co-Cultivation of Microbial Communities, PLOS ONE6 (2011) e17019. Publisher: Public Library of Science.

[153] S. S. Terekhov, I. V. Smirnov, M. V. Malakhova, A. E. Samoilov, A. I.Manolov, A. S. Nazarov, D. V. Danilov, S. A. Dubiley, I. A. Oster-man, M. P. Rubtsova, E. S. Kostryukova, R. H. Ziganshin, M. A. Ko-rnienko, A. A. Vanyushkina, O. N. Bukato, E. N. Ilina, V. V. Vlasov,K. V. Severinov, A. G. Gabibov, S. Altman, Ultrahigh-throughputfunctional profiling of microbiota communities, Proceedings of the Na-tional Academy of Sciences 115 (2018) 9551–9556. Publisher: NationalAcademy of Sciences Section: Biological Sciences.

[154] C. P. Slichter, Principles of Magnetic Resonance, Springer Series inSolid-State Sciences, Springer-Verlag, Berlin Heidelberg, 3 edition,1990.

[155] M. H. Levitt, Spin Dynamics: Basics of Nuclear Magnetic Resonance,Wiley, 2008. Google-Books-ID: Bgwo JTgb64C.

[156] R. T. Mckay, How the 1D-NOESY suppresses solvent sig-nal in metabonomics NMR spectroscopy: An examinationof the pulse sequence components and evolution, Conceptsin Magnetic Resonance Part A 38A (2011) 197–220. eprint:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cmr.a.20223.

[157] A.-H. Emwas, R. Roy, R. T. McKay, L. Tenori, E. Saccenti, G. A. N.Gowda, D. Raftery, F. Alahmari, L. Jaremko, M. Jaremko, D. S.Wishart, NMR Spectroscopy for Metabolomics Research, Metabolites9 (2019).

[158] A. Andrade-Domınguez, E. Salazar, M. del Carmen Vargas-Lagunas,R. Kolter, S. Encarnacion, Eco-evolutionary feedbacks drive speciesinteractions, The ISME Journal 8 (2014) 1041–1054. Number: 5 Pub-lisher: Nature Publishing Group.

68

Page 69: An ensemble approach to the structure-function problem in

[159] G. Gonzalez-Gil, L. Thomas, A.-H. Emwas, P. N. L. Lens, P. E. Saikaly,NMR and MALDI-TOF MS based characterization of exopolysaccha-rides in anaerobic microbial aggregates from full-scale reactors, Scien-tific Reports 5 (2015) 14316. Number: 1 Publisher: Nature PublishingGroup.

[160] Y. Date, Y. Nakanishi, S. Fukuda, T. Kato, S. Tsuneda, H. Ohno,J. Kikuchi, New monitoring approach for metabolic dynamics in mi-crobial ecosystems using stable-isotope-labeling technologies, Journalof Bioscience and Bioengineering 110 (2010) 87–93.

[161] Y. Nakanishi, S. Fukuda, E. Chikayama, Y. Kimura, H. Ohno,J. Kikuchi, Dynamic Omics Approach Identifies Nutrition-MediatedMicrobial Interactions, Journal of Proteome Research 10 (2011) 824–836. Publisher: American Chemical Society.

[162] M. A. Macnaughtan, T. Hou, J. Xu, D. Raftery, High-ThroughputNuclear Magnetic Resonance Analysis Using a Multiple Coil FlowProbe, Analytical Chemistry 75 (2003) 5116–5123. Publisher: Ameri-can Chemical Society.

[163] P. Soininen, A. J. Kangas, P. Wurtz, T. Tukiainen, T. Tynkkynen,R. Laatikainen, M.-R. Jarvelin, M. Kahonen, T. Lehtimaki, J. Viikari,O. T. Raitakari, M. J. Savolainen, M. Ala-Korpela, High-throughputserum NMR metabonomics for cost-effective holistic studies on sys-temic metabolism, Analyst 134 (2009) 1781–1785. Publisher: RoyalSociety of Chemistry.

[164] R. McDermott, A. H. Trabesinger, M. Muck, E. L. Hahn, A. Pines,J. Clarke, Liquid-State NMR and Scalar Couplings in Microtesla Mag-netic Fields, Science 295 (2002) 2247–2249. Publisher: American As-sociation for the Advancement of Science Section: Report.

[165] K. S. Cujia, J. M. Boss, K. Herb, J. Zopes, C. L. Degen, Tracking theprecession of single nuclear spins by weak measurements, Nature 571(2019) 230–233. Number: 7764 Publisher: Nature Publishing Group.

[166] S. Kuehn, S. A. Hickman, J. A. Marohn, Advances in mechanicaldetection of magnetic resonance, The Journal of Chemical Physics 128(2008) 052208.

69

Page 70: An ensemble approach to the structure-function problem in

[167] D. J. Beale, F. R. Pinu, K. A. Kouremenos, M. M. Poojary, V. K.Narayana, B. A. Boughton, K. Kanojia, S. Dayalan, O. A. H. Jones,D. A. Dias, Review of recent developments in GC–MS approaches tometabolomics-based research, Metabolomics 14 (2018) 152.

[168] A. Mastrangelo, A. Ferrarini, F. Rey-Stolle, A. Garcıa, C. Barbas,From sample treatment to biomarker discovery: A tutorial for untar-geted metabolomics based on GC-(EI)-Q-MS, Analytica Chimica Acta900 (2015) 21–35.

[169] K. Dettmer, P. A. Aronov, B. D. Hammock, Mass spectrometry-basedmetabolomics, Mass Spectrometry Reviews 26 (2007) 51–78. eprint:https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/pdf/10.1002/mas.20108.

[170] D. Raftery (Ed.), Mass spectrometry in metabolomics: methods andprotocols, number 1198 in Methods in molecular biology, HumanaPress, New York, 2014. OCLC: ocn880461869.

[171] S. Alseekh, A. Aharoni, Y. Brotman, K. Contrepois, J. D’Auria,J. Ewald, J. C. Ewald, P. D. Fraser, P. Giavalisco, R. D. Hall,M. Heinemann, H. Link, J. Luo, S. Neumann, J. Nielsen, L. Perez deSouza, K. Saito, U. Sauer, F. C. Schroeder, S. Schuster, G. Siuz-dak, A. Skirycz, L. W. Sumner, M. P. Snyder, H. Tang, T. Tohge,Y. Wang, W. Wen, S. Wu, G. Xu, N. Zamboni, A. R. Fernie, Massspectrometry-based metabolomics: a guide for annotation, quantifica-tion and best reporting practices, Nature Methods 18 (2021) 747–756. Bandiera abtest: a Cg type: Nature Research Journals Number:7 Primary atype: Reviews Publisher: Nature Publishing Group Sub-ject term: Mass spectrometry;Metabolomics Subject term id: mass-spectrometry;metabolomics.

[172] M. Jemal, High-throughput quantitative bioanalysis by lc/ms/ms,Biomedical Chromatography 14 (2000) 422–429.

[173] J. W. F. Robertson, C. G. Rodrigues, V. M. Stanford, K. A. Rubinson,O. V. Krasilnikov, J. J. Kasianowicz, Single-molecule mass spectrom-etry in solution using a solitary nanopore, Proceedings of the Na-tional Academy of Sciences 104 (2007) 8207–8211. Publisher: NationalAcademy of Sciences Section: Physical Sciences.

70

Page 71: An ensemble approach to the structure-function problem in

[174] B. D. Bennett, E. H. Kimball, M. Gao, R. Osterhout, S. J. Van Dien,J. D. Rabinowitz, Absolute Metabolite Concentrations and ImpliedEnzyme Active Site Occupancy in Escherichia coli, Nature chemicalbiology 5 (2009) 593–599.

[175] P. Aiyar, D. Schaeme, M. Garcıa-Altares, D. Carrasco Flores,H. Dathe, C. Hertweck, S. Sasso, M. Mittag, Antagonistic bacteriadisrupt calcium homeostasis and immobilize algal cells, NatureCommunications 8 (2017) 1756. Bandiera abtest: a Cc license type:cc by Cg type: Nature Research Journals Number: 1 Primary atype:Research Publisher: Nature Publishing Group Subject term: Bac-teria;Microbial ecology;Natural products;Plant signalling Sub-ject term id: bacteria;microbial-ecology;natural-products;plant-signalling.

[176] L. Molstad, P. Dorsch, L. R. Bakken, Robotized incubation system formonitoring gases (O2, NO, N2O N2) in denitrifying cultures, Journalof Microbiological Methods 71 (2007) 202–211.

[177] Y. Shi, C. Pan, K. Wang, X. Chen, X. Wu, C.-T. A. Chen, B. Wu, Syn-thetic multispecies microbial communities reveals shifts in secondarymetabolism and facilitates cryptic natural product discovery, Environ-mental Microbiology 19 (2017) 3606–3618.

[178] C. Pieper, D. Risse, B. Schmidt, B. Braun, U. Szewzyk, W. Rotard,Investigation of the microbial degradation of phenazone-type drugs andtheir metabolites by natural biofilms derived from river water usingliquid chromatography/tandem mass spectrometry (lc-ms/ms), WaterResearch 44 (2010) 4559–4569.

[179] J.-R. Thelusmond, T. J. Strathmann, A. M. Cupples, The identificationof carbamazepine biodegrading phylotypes and phylotypes sensitive tocarbamazepine exposure in two soil microbial communities, Science ofThe Total Environment 571 (2016) 1241–1252.

[180] T. G. Luan, K. S. H. Yu, Y. Zhong, H. W. Zhou, C. Y. Lan, N. F. Y.Tam, Study of metabolites from the degradation of polycyclic aromatichydrocarbons (PAHs) by bacterial consortium enriched from mangrovesediments, Chemosphere 65 (2006) 2289–2296.

71

Page 72: An ensemble approach to the structure-function problem in

[181] S. Ly, H. Mith, C. Tarayre, B. Taminiau, G. Daube, M.-L. Fauconnier,F. Delvigne, Impact of microbial composition of cambodian traditionaldried starters (dombea) on flavor compounds of rice wine: Combiningamplicon sequencing with hp-spme-gcms, Frontiers in Microbiology 9(2018) 894.

[182] X. Xu, B. Wu, W. Zhao, X. Pang, F. Lao, X. Liao, J. Wu, Correla-tion between autochthonous microbial communities and key odorantsduring the fermentation of red pepper (capsicum annuum l.), FoodMicrobiology 91 (2020) 103510.

[183] S.-E. Park, S.-H. Seo, E.-J. Kim, S. Byun, C.-S. Na, H.-S. Son, Changesof microbial community and metabolite in kimchi inoculated with dif-ferent microbial community starters, Food Chemistry 274 (2019) 558–565.

[184] Z.-R. Huang, W.-L. Guo, W.-B. Zhou, L. Li, J.-X. Xu, J.-L. Hong,H.-P. Liu, F. Zeng, W.-D. Bai, B. Liu, L. Ni, P.-F. Rao, X.-C. Lv,Microbial communities and volatile metabolites in different traditionalfermentation starters used for hong qu glutinous rice wine, Food Re-search International 121 (2019) 593–603.

[185] Y. Jin, D. Li, M. Ai, Q. Tang, J. Huang, X. Ding, C. Wu, R. Zhou,Correlation between volatile profiles and microbial communities: Ametabonomic approach to study jiang-flavor liquor daqu, Food Re-search International 121 (2019) 422–432.

[186] Y. Rao, Y. Tao, X. Chen, X. She, Y. Qian, Y. Li, Y. Du, W. Xiang,H. Li, L. Liu, The characteristics and correlation of the microbialcommunities and flavors in traditionally pickled radishes, LWT 118(2020) 108804.

[187] Y. Suzuki, K. Kobayashi, Y. Wakisaka, D. Deng, S. Tanaka, C.-J.Huang, C. Lei, C.-W. Sun, H. Liu, Y. Fujiwaki, S. Lee, A. Isozaki,Y. Kasai, T. Hayakawa, S. Sakuma, F. Arai, K. Koizumi, H. Tezuka,M. Inaba, K. Hiraki, T. Ito, M. Hase, S. Matsusaka, K. Shiba, K. Suga,M. Nishikawa, M. Jona, Y. Yatomi, Y. Yalikun, Y. Tanaka, T. Sug-imura, N. Nitta, K. Goda, Y. Ozeki, Label-free chemical imaging flow

72

Page 73: An ensemble approach to the structure-function problem in

cytometry by high-speed multicolor stimulated Raman scattering, Pro-ceedings of the National Academy of Sciences 116 (2019) 15842–15848.Publisher: National Academy of Sciences Section: PNAS Plus.

[188] N. Nitta, T. Iino, A. Isozaki, M. Yamagishi, Y. Kitahama, S. Sakuma,Y. Suzuki, H. Tezuka, M. Oikawa, F. Arai, T. Asai, D. Deng,H. Fukuzawa, M. Hase, T. Hasunuma, T. Hayakawa, K. Hiraki, K. Hi-ramatsu, Y. Hoshino, M. Inaba, Y. Inoue, T. Ito, M. Kajikawa,H. Karakawa, Y. Kasai, Y. Kato, H. Kobayashi, C. Lei, S. Matsusaka,H. Mikami, A. Nakagawa, K. Numata, T. Ota, T. Sekiya, K. Shiba,Y. Shirasaki, N. Suzuki, S. Tanaka, S. Ueno, H. Watarai, T. Yamano,M. Yazawa, Y. Yonamine, D. Di Carlo, Y. Hosokawa, S. Uemura,T. Sugimura, Y. Ozeki, K. Goda, Raman image-activated cell sort-ing, Nature Communications 11 (2020) 3452. Number: 1 Publisher:Nature Publishing Group.

[189] R. Hatzenpichler, V. Krukenberg, R. L. Spietz, Z. J. Jay, Next-generation physiology approaches to study microbiome function at sin-gle cell level, Nature Reviews Microbiology 18 (2020) 241–256. Number:4 Publisher: Nature Publishing Group.

[190] M. F. Escoriza, J. M. Vanbriesen, S. Stewart, J. Maier, StudyingBacterial Metabolic States Using Raman Spectroscopy, Applied Spec-troscopy 60 (2006) 971–976. Publisher: SAGE Publications Ltd STM.

[191] P. Rosch, M. Harz, M. Schmitt, K.-D. Peschke, O. Ronneberger,H. Burkhardt, H.-W. Motzkus, M. Lankers, S. Hofer, H. Thiele,J. Popp, Chemotaxonomic Identification of Single Bacteria by Micro-Raman Spectroscopy: Application to Clean-Room-Relevant BiologicalContaminations, Applied and Environmental Microbiology 71 (2005)1626–1637.

[192] M. Harz, P. Rosch, K.-D. Peschke, O. Ronneberger, H. Burkhardt,J. Popp, Micro-Raman spectroscopic identification of bacterial cells ofthe genus Staphylococcus and dependence on their cultivation condi-tions, Analyst 130 (2005) 1543–1550. Publisher: The Royal Society ofChemistry.

[193] K. J. Kobayashi-Kirschvink, H. Nakaoka, A. Oda, K.-i. F. Kamei,K. Nosho, H. Fukushima, Y. Kanesaki, S. Yajima, H. Masaki, K. Ohta,

73

Page 74: An ensemble approach to the structure-function problem in

Y. Wakamoto, Linear Regression Links Transcriptomic Data and Cel-lular Raman Spectra, Cell Systems 7 (2018) 104–117.e4.

[194] A. Paul, P. Carl, F. Westad, J.-P. Voss, M. Maiwald, To-wards Process Spectroscopy in Complex Fermentation Samples andMixtures, Chemie Ingenieur Technik 88 (2016) 756–763. eprint:https://onlinelibrary.wiley.com/doi/pdf/10.1002/cite.201500118.

[195] E. S. Wharfe, R. M. Jarvis, C. L. Winder, A. S. Whiteley,R. Goodacre, Fourier transform infrared spectroscopy as ametabolite fingerprinting tool for monitoring the phenotypicchanges in complex bacterial communities capable of degradingphenol, Environmental Microbiology 12 (2010) 3253–3263. eprint:https://sfamjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1462-2920.2010.02300.x.

[196] R. Tenorio, A. C. Fedders, T. J. Strathmann, J. S. Guest, Impact ofgrowth phases on photochemically produced reactive species in the ex-tracellular matrix of algal cultivation systems, Environmental Science:Water Research & Technology 3 (2017) 1095–1108.

[197] J. Holm, I. Bjorck, A. Drews, N.-G. Asp, A Rapid Method forthe Analysis of Starch, Starch - Starke 38 (1986) 224–226. eprint:https://onlinelibrary.wiley.com/doi/pdf/10.1002/star.19860380704.

[198] H. Fuwa, A New Method for Microdetermination Cf Amylase Activityby the Use of Amylose as the Substrate, The Journal of Biochemistry41 (1954) 583–603.

[199] K. M. Miranda, M. G. Espey, D. A. Wink, A rapid, simple spectropho-tometric method for simultaneous detection of nitrate and nitrite, Ni-tric Oxide 5 (2001) 62–71.

[200] B. J. Callahan, P. J. McMurdie, M. J. Rosen, A. W. Han, A. J. A.Johnson, S. P. Holmes, DADA2: High-resolution sample inferencefrom Illumina amplicon data, Nature Methods 13 (2016) 581–583.

[201] G. M. Douglas, V. J. Maffei, J. R. Zaneveld, S. N. Yurgel, J. R. Brown,C. M. Taylor, C. Huttenhower, M. G. I. Langille, PICRUSt2 for pre-diction of metagenome functions, Nature Biotechnology 38 (2020) 685–688. Bandiera abtest: a Cg type: Nature Research Journals Number: 6

74

Page 75: An ensemble approach to the structure-function problem in

Primary atype: Correspondence Publisher: Nature Publishing GroupSubject term: Biomarkers;Computational biology and bioinformat-ics;Ecology;Microbiology Subject term id: biomarkers;computational-biology-and-bioinformatics;ecology;microbiology.

[202] J. N. Nissen, J. Johansen, R. L. Allesøe, C. K. Sønderby, J. J. A. Ar-menteros, C. H. Grønbech, L. J. Jensen, H. B. Nielsen, T. N. Petersen,O. Winther, S. Rasmussen, Improved metagenome binning and as-sembly using deep variational autoencoders, Nature Biotechnology 39(2021) 555–560. Bandiera abtest: a Cg type: Nature Research JournalsNumber: 5 Primary atype: Research Publisher: Nature PublishingGroup Subject term: Genome informatics;Metagenomics;MicrobiomeSubject term id: genome-informatics;metagenomics;microbiome.

[203] N. R. Garud, B. H. Good, O. Hallatschek, K. S. Pollard, Evolutionarydynamics of bacteria in the gut microbiome within and across hosts,PLOS Biology 17 (2019) e3000102. Publisher: Public Library of Sci-ence.

[204] L. P. Antunes, L. F. Martins, R. V. Pereira, A. M. Thomas, D. Bar-bosa, L. N. Lemos, G. M. M. Silva, L. M. S. Moura, G. W. C.Epamino, L. A. Digiampietri, K. C. Lombardi, P. L. Ramos, R. B.Quaggio, J. C. F. de Oliveira, R. C. Pascon, J. B. d. Cruz,A. M. da Silva, J. C. Setubal, Microbial community structureand dynamics in thermophilic composting viewed through metage-nomics and metatranscriptomics, Scientific Reports 6 (2016) 38915.Bandiera abtest: a Cc license type: cc by Cg type: Nature ResearchJournals Number: 1 Primary atype: Research Publisher: NaturePublishing Group Subject term: Metagenomics;Soil microbiology Sub-ject term id: metagenomics;soil-microbiology.

[205] Y. Zhang, K. N. Thompson, T. Branck, Y. Yan, L. H. Nguyen,E. A. Franzosa, C. Huttenhower, Metatranscriptomics for the Hu-man Microbiome and Microbial Community Functional Profiling, An-nual Review of Biomedical Data Science 4 (2021) 279–311. eprint:https://doi.org/10.1146/annurev-biodatasci-031121-103035.

[206] G. B. Gloor, J. M. Macklaim, V. Pawlowsky-Glahn, J. J. Egozcue,Microbiome Datasets Are Compositional: And This Is Not Optional,Frontiers in Microbiology 8 (2017). Publisher: Frontiers.

75

Page 76: An ensemble approach to the structure-function problem in

[207] J. Aitchison, C. Barcelo-Vidal, J. A. Martın-Fernandez, V. Pawlowsky-Glahn, Logratio analysis and compositional distance, MathematicalGeology 32 (2000) 271–275.

[208] J. Aitchison, The Statistical Analysis of Compositional Data, Journalof the Royal Statistical Society: Series B (Methodological) 44 (1982)139–160.

[209] M. I. Love, W. Huber, S. Anders, Moderated estimation of fold changeand dispersion for RNA-seq data with DESeq2, Genome Biology 15(2014) 550.

[210] P. J. McMurdie, S. Holmes, Waste Not, Want Not: Why RarefyingMicrobiome Data Is Inadmissible, PLOS Computational Biology 10(2014) e1003531. Publisher: Public Library of Science.

[211] T. Hastie, R. Tibshirani, J. Friedman, Elements Of Statistical LearningData, Springer, New York, NY, 2nd edition edition, 2016.

[212] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K.Fisher, D. J. Schwab, A high-bias, low-variance introduction to Ma-chine Learning for physicists, Physics Reports 810 (2019) 1–124.

[213] A. Hyvarinen, E. Oja, A Fast Fixed-Point Algorithm for IndependentComponent Analysis, Neural Computation 9 (1997) 1483–1492.

[214] D. D. Lee, H. S. Seung, Learning the parts of objects by non-negativematrix factorization, Nature 401 (1999) 788–791. Bandiera abtest: aCg type: Nature Research Journals Number: 6755 Primary atype: Re-search Publisher: Nature Publishing Group.

[215] Y. Cai, H. Gu, T. Kenney, Learning Microbial Community Structureswith Supervised and Unsupervised Non-negative Matrix Factorization,Microbiome 5 (2017) 110.

[216] M. A. Kramer, Nonlinear principal component analysis using autoas-sociative neural networks, AIChE Journal 37 (1991) 233–243. eprint:https://onlinelibrary.wiley.com/doi/pdf/10.1002/aic.690370209.

[217] G. Hinton, S. Roweis, Stochastic neighbor embedding, in: Proceedingsof the 15th International Conference on Neural Information Processing

76

Page 77: An ensemble approach to the structure-function problem in

Systems, NIPS’02, MIT Press, Cambridge, MA, USA, 2002, pp. 857–864.

[218] D. T. Fraebel, K. Gowda, M. Mani, S. Kuehn, Evolution of generalistsby phenotypic plasticity, iScience (2020) 101678.

[219] W. Bialek, R. Ranganathan, Rediscovering the power of pairwise in-teractions, arXiv:0712.4397 [q-bio] (2007). ArXiv: 0712.4397.

[220] E. Schneidman, M. J. Berry, R. Segev, W. Bialek, Weak pairwise corre-lations imply strongly correlated network states in a neural population,Nature 440 (2006) 1007–1012. Bandiera abtest: a Cg type: NatureResearch Journals Number: 7087 Primary atype: Research Publisher:Nature Publishing Group.

[221] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning,Journal of Big Data 3 (2016) 9.

[222] P. J. Blazek, M. M. Lin, Explainable neural networks thatsimulate reasoning, Nature Computational Science 1 (2021)607–618. Bandiera abtest: a Cg type: Nature Research Jour-nals Number: 9 Primary atype: Research Publisher: NaturePublishing Group Subject term: Computer science;Learning al-gorithms;Network models;Perception Subject term id: computer-science;learning-algorithms;network-models;perception.

77