High-throughput DNA Sequencing in Microbial Ecology Methods and Applications Luisa Warchavchik Hugerth KTH Royal Institute of Technology School of Biotechnology Stockholm 2016



High-throughput DNA Sequencing in Microbial Ecology: Methods and Applications

Luisa Warchavchik Hugerth

KTH Royal Institute of Technology, School of Biotechnology

Stockholm 2016


Luisa Warchavchik Hugerth (2016): High-throughput DNA Sequencing in Microbial Ecology: methods and applications. Division of Gene Technology, School of Biotechnology, KTH Royal Institute of Technology, Stockholm, Sweden

Summary

Microorganisms play central roles in planet Earth’s geochemical cycles, in food production, and in health and disease of humans and livestock. In spite of this, most microbial life forms remain unknown and unnamed, their ecological importance and potential technological applications beyond the realm of speculation. This is due both to the magnitude of microbial diversity and to technological limitations. Of the many advances that have enabled microbiology to reach new depth and breadth in the past decade, one of the most important is affordable high-throughput DNA sequencing. This technology plays a central role in each paper in this thesis.

Papers I and II are focused on developing methods to survey microbial diversity based on marker gene amplification and sequencing. In Paper I we proposed a computational strategy to design primers with the highest coverage among a given set of sequences and applied it to drastically improve one of the most commonly used primer pairs for ecological surveys of prokaryotes. In Paper II this strategy was applied to a eukaryotic marker gene. Despite their importance in the food chain, eukaryotic microbes are surveyed much more seldom than bacteria. Paper II aimed at making this domain of life more amenable to high-throughput surveys.

In Paper III, the primers designed in papers I and II were applied to water samples collected up to twice weekly from 2011 to 2013 at an offshore station in the Baltic proper, the Linnaeus Microbial Observatory. In addition to tracking microbial communities over these three years, we created predictive models for hundreds of microbial populations, based on their co-occurrence with other populations and environmental factors.

In Paper IV we explored the entire metagenomic diversity in the Linnaeus Microbial Observatory. We used computational tools developed in our group to construct draft genomes of abundant bacteria and archaea and described their phylogeny, seasonal dynamics and potential physiology. We were also able to establish that, rather than being a mixture of genomes from fresh and saline water, the Baltic Sea plankton community is composed of brackish specialists which diverged from other aquatic microorganisms thousands of years before the formation of the Baltic itself.

Keywords: Baltic Sea; Microbial ecology; Metagenomics; Bacterioplankton


Luisa Warchavchik Hugerth (2016): High-throughput DNA Sequencing in Microbial Ecology: methods and applications. Division of Gene Technology, School of Biotechnology, Royal Institute of Technology, Stockholm, Sweden

Sammanfattning (Summary in Swedish)

Microorganisms play central roles in the geochemical cycles of the sea and the land, in food production, and in the health and disease of humans and animals. Despite this, most microorganisms remain unknown, and their ecological roles and potential technological uses can only be speculated upon. This is due both to the enormous microbial diversity and to technological limitations. Of the many advances that have allowed microbial ecology research to progress rapidly over the past ten years, perhaps the most important is large-scale DNA sequencing. This technology plays a central role in every paper in this thesis.

Papers I and II focus on method development for surveying microbial diversity through PCR amplification and sequencing of marker genes. In Paper I we present software that computes optimal primer sequences for amplifying as many target sequences as possible, and we use the method to considerably improve one of the most widely used PCR primer pairs for surveying the diversity of prokaryotes (bacteria and archaea). In Paper II we apply this strategy to a eukaryotic marker gene. Despite their important roles in the food chain, eukaryotic microbes are even less explored than bacteria. Paper II aims to make the eukaryotic domain of the tree of life more accessible to large-scale surveys.

In Paper III, the primers developed in Papers I and II are applied to water samples taken up to twice a week from 2011 to 2013 at a sampling station in the central Baltic Sea, the Linnaeus Microbial Observatory. In addition to following the seasonal dynamics of microbial populations over these three years, we created predictive models for hundreds of microbes, based on their co-occurrence with other microbes and with environmental parameters.

In Paper IV we explored the entire metagenomic diversity at the Linnaeus Microbial Observatory station. We used software developed in our group to reconstruct, from metagenomic data, the genomes of abundant bacteria and archaea, and described their phylogeny, seasonal dynamics and physiological potential. We could also establish that, rather than being a mixture of freshwater and marine microorganisms, the Baltic Sea plankton are brackish-water specialists that diverged from other aquatic organisms long before the Baltic Sea was formed.

Keywords: Baltic Sea; Microbial ecology; Metagenomics; Bacterioplankton


“Life on earth is such a good story you cannot afford to miss the beginning... Beneath our superficial differences we are all of us walking communities of bacteria. The world shimmers, a pointillist landscape made of tiny living beings.”

- Lynn Margulis


List of publications

This thesis is based upon the following papers, which are referred to in the text by their corresponding Roman numerals. All papers are included at the end of the thesis.

Paper I: Hugerth LW, Wefer HA, Lundin S, Jakobsson HE, Lindberg M, Rodin S, Engstrand L, Andersson AF (2014) DegePrime, a program for degenerate primer design for broad-taxonomic-range PCR in microbial ecology studies. Appl Environ Microbiol 80 (16): 5116-23.

Paper II: Hugerth LW, Muller EE, Hu YO, Lebrun LA, Roume H, Lundin D, Wilmes P, Andersson AF (2014) Systematic design of 18S rRNA gene primers for determining eukaryotic diversity in microbial consortia. PLoS One 9 (4): e95567.

Paper III: Hugerth LW, Lindh MV, Sjöqvist C, Bunse C, Legrand C, Pinhassi J, Andersson AF. Seasonal dynamics and interactions among Baltic Sea prokaryotic and eukaryotic plankton assemblages. Manuscript.

Paper IV: Hugerth LW, Larsson J, Alneberg J, Lindh MV, Legrand C, Pinhassi J (2015) Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biol 16:279.


Table of Contents

i. The importance of microbes on global and regional scales
ii. An overview of techniques for microbial community characterization
iii. High-throughput DNA sequencing in the determination of microbial community composition
iv. Data processing and statistical analyses of microbiomics
v. Network construction and community dynamics
vi. Metagenomic surveys
vii. Genome reconstruction from metagenomic data
viii. Summary and perspectives
ix. References
x. Acknowledgements


i. The importance of microbes on global and regional scales

Planet Earth is an intrinsically unstable system. Continents rise and fall, oceans open up and are squeezed out again, new land is colonized by life only to be ravaged by fire, and innumerable molecules are constantly breaking down and reassembling everywhere, from deep in the ocean to the upper reaches of the atmosphere. This instability is crucial for the existence of life and is, to a large extent, mediated by life. It is the smallest of life forms that carry out the crucial role of converting gaseous nitrogen into biologically available forms, as well as carrying out about half of the world’s photosynthesis [1]. Single-cell organisms also outnumber larger life forms by several orders of magnitude, in terms of total cell count, and above all in the combined number of biochemical pathways available to them [2]. While microorganisms in soils are crucial for sustaining land plants and agricultural activity, in global terms it is the world’s oceans that host the bulk of geochemical transformations. About half of the global flux of vital elements such as carbon, nitrogen, phosphorus, iron and sulfur is mediated through marine microbes [3, 4]. Understanding the life cycle and metabolism of marine microorganisms is therefore a crucial step in understanding, modelling and managing global planetary cycles as a whole.

The life of microbes in an aquatic system takes place along an ever-mixing water column. Along this water column, levels of light, oxygen and nutrients vary with depth, forming various niches where different metabolic strategies have greatest success: photosynthesis and other light-based strategies can only be successful in depths reached by light, aerobic metabolism requires dissolved oxygen, sulfate reduction cannot happen in the presence of oxygen and reactive nitrogen oxides, etc. In addition to this macro-scale diversity, on the size scale of microbial life every spoonful of marine water is in itself highly patchy [5]. Particles produced as macroorganisms feed, shed cells, defecate, lay eggs and die function as transient oases for heterotrophic bacteria in an otherwise very poor environment. Microorganisms themselves interact with each other. These interactions can be direct, such as the predation of bacteria by protists, or bacterial biofilm formation. They can also happen indirectly, such as through competition for the same resources or through leaching of nutrients that can be taken up by other microorganisms [6].

Besides their role in global geochemical cycles, marine bacteria also play important economic and social roles at the local and regional scale. By fixing carbon, cyanobacteria and phytoplankton form the base of marine food webs, thereby sustaining fish stocks. On the other hand, blooms of neurotoxic cyanobacteria negatively impact fish stocks, as well as water quality and tourism in affected areas. Conversely, human activities affect the chemistry of the water column, such as through high phosphorus and nitrogen runoff from agricultural land or wastewater treatment facilities promoting fast bacterial growth and oxygen depletion, a process known as eutrophication [7, 8]. Further, the effects of selective fishing can propagate down the food web all the way to zooplankton and possibly bacterioplankton [9]. At the same time, a well-managed marine environment can support various established or nascent industries, from tourism to fisheries to algal farming.

The Baltic Sea is an environment where the interplay between human activity, water chemistry and bacterial life is particularly prominent. Due to its narrow connection with the open ocean, water retention time in the Baltic Sea is around 50 years [10]. Agricultural and industrial activities in the surrounding countries have greatly affected the nitrogen and phosphorus loadings in the area, although phosphorus is now much more carefully managed [11]. These nutrient influxes have led to a large increase in the area and duration of total anoxia at the bottom of the Baltic Sea [12], thereby affecting geochemistry both in the water column and in the sediment below. Increased global temperatures have led to a shorter ice-cover period over the water, as well as to an increase in the influx of high nutrient-load freshwater each spring [13]. Some of these impacts have quick and visible effects, such as anoxic zones, larger harmful algal blooms and the collapse of fish stocks. Other effects of a changing water chemistry might take much more time to be noticeable at the human scale, and the link between cause and effect might not be readily appreciated. Nevertheless, the fact remains that human activities are having a deep impact on a delicately balanced system whose pieces and links are still largely unknown.


ii. An overview of techniques for microbial community characterization

While humans have been selectively breeding bacteria and fungi for food fermentation for several centuries, the first observations of microbial organisms were made in the 1670s by Antony van Leeuwenhoek, who observed microbes (“animalcules”) in saliva, and the first purposeful and successful isolation of bacteria for scientific purposes was attained by Robert Koch and Julius Petri in the 1870s. Both direct observation and culturing remain invaluable techniques to this day, although both have limitations.

Culturing is the gold standard of microbial characterization, as it provides large amounts of cells from a clonal population, and allows any number of functional tests on bacterial biochemistry, physiology and genetics to be performed. It was however evident even to Koch that different bacteria grow best in different settings, and by the early 1900s it was accepted that the vast majority of bacteria could not be cultivated with standard techniques, a phenomenon later dubbed “the great plate count anomaly” [14]. Therefore, most of what is known today about bacterial physiology stems from a very small subset of easily culturable bacteria of medical or veterinary importance which grow well in the presence of high nutrient loads [15].

There are many reasons for recalcitrance to culturing. Firstly, in the absence of knowledge of the specific growth requirements of an organism, trial and error is not a feasible way to determine them:

“When it is not clear what facet of the environment is not being properly replicated (nutrients, pH, osmotic conditions, temperature, or many more), attempting to vary all of these conditions at once results in a multidimensional matrix of possibilities that cannot be exhaustively addressed with reasonable time and effort.” - Stewart, 2012 [16]

In particular, many organisms have rather narrow windows of growth, becoming dormant when any of a number of micro- or macronutrients is lacking or in excess, or when physical conditions deviate by a few percentage points from their optima [15]. These microbes might survive in the environment in boom-and-bust cycles, growing rapidly when conditions are optimal but remaining dormant for extended periods of time [17, 18]. Conversely, other microbes sustain long-term continuous growth at a pace so slow as to be nearly indistinguishable, in a lab setting, from a failure to thrive [15, 19]. In addition to not providing the required growth substrates at appropriate rates, laboratory settings might easily generate toxic conditions, such as oxidative stress [20, 21], which in the environment are either absent or mitigated by other strains. Finally, organisms might fail to grow due to missing certain pathways, which can be mitigated by adding intermediates to the medium [22], or be dependent on scavenging molecules such as siderophores produced by other members of their community [23].

Even today, despite the development of high-throughput dilution-to-extinction culturing techniques [24–27], culture chambers that mimic natural environments [19, 28, 29] and co-culturing approaches [30–32], isolating and culturing bacteria is a complex and time-consuming endeavour.

An alternative to culturing is to perform microscopy directly on environmental samples. High-resolution microscopy techniques such as electron microscopy, confocal microscopy and imaging with photoswitchable fluorophores allow a number of specific biological questions to be addressed directly from images of live or fixed bacteria (reviewed in [33]). However, regardless of technology, with observation alone it can be extremely hard to achieve a reasonable functional or taxonomic resolution for the diversity of microbes typically found in an environmental sample. It takes years of training as a taxonomist to excel in the visual identification of microbes, even ones with as much morphological diversity as protists; and even then there are strong observer effects (reviewed in [34, 35]).

To move beyond the difficulties of culturing and the limitations of microscopy, microbial ecologists moved increasingly towards molecular fingerprinting. Starting in 1977, Carl Woese and colleagues established the suitability of the small subunit (SSU) ribosomal RNA (rRNA) gene for inferring phylogenetic relationships between prokaryotic organisms, a property later verified to also apply to eukaryotes [36–38]. Norman Pace and colleagues soon started applying the same technique to natural communities [39, 40]. Together with the ribosomal internal transcribed spacer (ITS), this is still the most commonly used gene for community phylogenetic composition analysis (community fingerprinting). The advantages of using SSU rRNA for community fingerprinting are many: (i) this gene is universal in all cellular life forms; (ii) it is a highly conserved gene, serving to a large degree as a reliable molecular chronometer; (iii) it is seldom, if ever, transferred horizontally; (iv) it possesses both conserved and variable regions, so that the conserved regions can be targeted by molecular approaches and the variable ones be used as identifying markers. A handful of other genes, such as the large subunit (LSU) rRNA, share these properties, but the length of ~1,500 bp of the bacterial SSU rRNA made it amenable to early molecular techniques, and the impressive body of knowledge that has since accumulated with this gene as a basis makes a switch to other markers very impractical, except in certain sub-fields such as mycology, where ITS and LSU are still widely used.
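The distinction between conserved and variable regions can be made concrete with a per-column diversity measure. The following is a minimal sketch, not a method taken from the papers discussed here: it assumes a small, gap-free toy alignment (the sequences are invented and are not real rRNA data) and uses Shannon entropy, one common choice, to label each column. The name `column_entropy` is hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the bases found in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy alignment: invented sequences, identical everywhere except column 4.
alignment = [
    "ACGTACGA",
    "ACGTTCGA",
    "ACGTGCGA",
    "ACGTCCGA",
]

# zip(*alignment) iterates over columns instead of rows.
entropies = [column_entropy(col) for col in zip(*alignment)]
conserved = [i for i, h in enumerate(entropies) if h == 0.0]  # primer-site candidates
variable = [i for i, h in enumerate(entropies) if h > 0.0]    # identifying markers
```

In this toy case only column 4 varies, so it would serve as the identifying marker, while the flanking zero-entropy columns are where a primer could bind across all four sequences.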

The 1990s saw the first high-throughput environmental fingerprinting approaches, also sometimes referred to as microbiomics. It was the decade of techniques such as denaturing gradient gel electrophoresis (DGGE, [41]), terminal restriction fragment length polymorphism (T-RFLP, [42]) and automated ribosomal intergenic spacer analysis (ARISA, [43]), all of which are based on the characteristic travel distance of polymerase chain reaction (PCR) amplified DNA fragments (amplicons) in an electrophoretic chamber. These banding patterns can be used directly to compare broad changes in taxonomic composition of samples in different conditions. Even though, in each of these techniques, different organisms might give rise to identical bands, each band is treated as an operational taxonomic unit (OTU). To assign a tentative phylogeny to each OTU, high abundance tags can be selected for sequencing, or clone libraries be generated directly from the environment, which will most likely contain the most highly abundant tags.

At around the same time, microarrays emerged as sequence-based alternatives to fingerprinting. The downside of microarrays is that identification is restricted to sequences previously known and printed onto the array [44]. While this limits their application as a general environmental survey tool, microarrays can still be valuable tools in focused clinical, industrial or environmental monitoring settings [45–47]. Since microarrays can cover various regions of the genome, they are also useful for distinguishing between closely related species or strains [48–50].

iii. High-throughput DNA sequencing in the determination of microbial community composition

Assigning taxonomy to every OTU in a broad environmental survey requires sequencing each amplified DNA fragment. While for most environments this is still not achievable with current technologies, the first steps in this direction came in the early 2000s, with the rise of high-throughput DNA sequencing. In 2006, the first study using 454 pyrosequencing to assess microbial communities was published, a survey of the microbial diversity in a marine water community [51]. This study, while sequencing relatively shallowly (6,505-22,994 sequences/sample), already presented two of the main characteristics of sequencing-based microbiomics that came to be seen as standards in the field: rarefaction curves very far from reaching saturation, which indicated a much larger microbial diversity than previously suspected, and a highly uneven community, with 3-4 orders of magnitude of difference in abundance between the least and most abundant tags. These previously unknown low abundance organisms were dubbed in the paper the “rare biosphere”, a term still in use and whose biological relevance is much discussed [52].
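To illustrate why an uneven community yields rarefaction curves that stay far from saturation, the sketch below computes the expected number of OTUs seen in a random subsample of n reads drawn without replacement, using the standard analytical rarefaction formula. The read counts are invented for illustration and are not data from [51]; the function name `expected_richness` is hypothetical.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of OTUs observed in a random subsample of n reads
    drawn without replacement from a sample with the given OTU read counts."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must be between 0 and the total read count")
    # An OTU with c reads is missed with probability C(total - c, n) / C(total, n).
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)

# Invented, highly uneven community: one dominant OTU and a tail of rare ones.
counts = [1000, 100, 10, 1, 1, 1]

depths = [10, 100, 500, sum(counts)]
curve = [expected_richness(counts, n) for n in depths]
```

With these counts the curve keeps rising up to the full sequencing depth, because each singleton is only likely to be observed once nearly all reads have been drawn. This is the qualitative behaviour described above for communities containing a long tail of rare taxa.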

As more and more research groups started using high-throughput gene tag sequencing, first with 454 pyrosequencing and later with Illumina technology, it also became increasingly clear that these methods, while less biased than some of their predecessors, do produce a considerable number of artefacts, which can be very hard to detect and separate from true biological signal.

The first source of bias and artefacts in microbiomics is sampling itself. Solid samples such as soil can have extreme short-distance heterogeneity [53]. The amount of material used for extraction, and the definition of the sample (e.g., whether subsamples are homogenised in bulk or kept separate), have to be suited to the research question at hand. As for aquatic samples, long-term studies must contend with the issue of the flowing and mixing of water masses. A stationary sampling scheme, fixed to geographical coordinates, faces the issue that changes observed in the microbial community can be a true change within a community or a replacement of one community by another as the water flows. As an alternative to this stationary (Eulerian) sampling, it is possible to follow a water mass using a buoy and collect samples around it, a strategy termed Lagrangian sampling. This approach, however, cannot be extended for longer than a few weeks, after which the water mass is mixed beyond the point where it can be meaningfully considered coherent with the initial sample. The temporal dimension is crucial regardless of the sampling strategy, since the frequency of sampling should be (but often is not) commensurate with the rate of the biological process of interest.

The next source of bias and artefacts is the DNA extraction method. Extraction relates to sampling, since different methods require and tolerate different amounts of starting material. The physico-chemical characteristics of the environment and of the biological material in it will in turn interact with the extraction method, producing a more or less efficient disruption of cell walls and membranes and removal of contaminants. A failure to appropriately disrupt certain types of cell wall will cause those organisms to be underestimated in the community profile. A failure to remove contaminants such as other biomolecules and organic acids will inhibit the DNA amplification step, leading to amplification biases and, in some cases, even sample loss [54–57]. Finally, for samples of low microbial density, such as patient blood samples, minute amounts of DNA or cellular contamination in any reagent or piece of equipment used in extraction will generate spurious reads [58].

RNA can also be extracted and analysed as complementary DNA (cDNA). DNA is a more stable molecule than RNA, so community signatures are less likely to experience radical change at the DNA level during sample collection [56, 59]. On the other hand, different organisms, especially eukaryotes, can have an enormous range of copies of the rRNA gene in their genomes, which hinders a simple correlation between gene copies and cell numbers [60]. The number of rRNA copies per cell, however, is largely independent of the number of gene copies in prokaryotic cells and is instead correlated with the level of activity in cells, which in turn is often inversely correlated with cell abundance in a system [61–63]. These differences can result in very different community profiles for cDNA and DNA analyses, especially in deep water layers, which at the DNA level are more affected by sinking dead and dying cells [63, 64].

After nucleic acid extraction (and cDNA synthesis when studying RNA), the region of interest must be amplified and prepared for sequencing. This is almost always achieved through polymerase chain reaction (PCR), a method which is robust and cost-effective, but may introduce large biases to the sample [65]. PCR depends on primers, short DNA molecules (usually 15-30 bp) of defined sequence that bind to the ends of the DNA target region on the template strands and allow a DNA polymerase to synthesise a new DNA strand complementary to the template downstream of the primer’s 3’ end. By flanking the region of interest with two primers, its copy number is doubled at every reaction cycle, hence the term “chain reaction”. The amplified region is called an amplicon, but is often referred to as a “tag”, in reference to its role in identifying microbial taxa.
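The doubling per cycle, and the way small differences between templates can turn into large biases, can be shown with a few lines of arithmetic. This is a sketch under stated assumptions: the `efficiency` parameter (the fraction of templates copied in each cycle) and the values 1.0 and 0.9 are illustrative and are not taken from the text or from [65]; the function name `amplified_copies` is hypothetical.

```python
def amplified_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after `cycles` PCR cycles.

    efficiency=1.0 corresponds to perfect doubling in every cycle;
    lower values model templates that are copied less reliably.
    """
    return initial_copies * (1.0 + efficiency) ** cycles

# Perfect doubling: a single template after 10 cycles.
ideal = amplified_copies(1, 10)

# Two templates that start at equal abundance but amplify with different
# (hypothetical) per-cycle efficiencies, e.g. because of a primer mismatch.
well_matched = amplified_copies(100, 30, efficiency=1.0)
poorly_matched = amplified_copies(100, 30, efficiency=0.9)
ratio = well_matched / poorly_matched
```

Under these assumed values, a 10% difference in per-cycle efficiency is enough to skew the ratio of two initially equal templates several-fold after 30 cycles.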

The dependence on primer binding means that a DNA template which does not present complementarity to the primers will not be amplified and its corresponding organism will be a false negative in the microbiome profile. The odds of failing to amplify a tag corresponding to a given taxon decrease if a mixture of primers with base-level variations is used (degenerate primers). On the other hand, this increases the odds of amplifying other DNA regions, creating artificial diversity. Therefore, the exact sequence of the PCR primer must be considered in terms of the community at hand and the acceptability of different biases in the resulting amplicon pool. This was the issue addressed in PAPER I and PAPER II.
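A degenerate primer is usually written with IUPAC ambiguity codes, where one symbol stands for several possible bases. The sketch below shows how such a primer expands into the mixture of concrete oligonucleotides it represents and how its degeneracy (the size of that mixture) is counted. The example primer is invented for illustration, and the helper names `degeneracy`, `expand` and `matches` are hypothetical.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct oligonucleotides represented by a degenerate primer."""
    n = 1
    for symbol in primer:
        n *= len(IUPAC[symbol])
    return n

def expand(primer):
    """List every concrete oligonucleotide in the degenerate mixture."""
    return ["".join(bases) for bases in product(*(IUPAC[s] for s in primer))]

def matches(primer, site):
    """True if the template site is covered by the degenerate primer."""
    return len(primer) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(primer, site)
    )

primer = "ACGYTM"  # invented example: Y = C or T, M = A or C
variants = expand(primer)
```

Each added ambiguity code multiplies the size of the mixture, which is why broader coverage comes at the cost of a lower concentration of any single variant and a higher risk of off-target binding.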


In PAPER I we describe a heuristic, which we named DegePrime (Fig. 1), that attempts to find the primer sequences that match the largest number of sequences in a database, given a multiple sequence alignment, a length l and a maximum degree of degeneracy d. We used this algorithm to improve on a commonly used microbiomics primer pair, extending it from domain Bacteria to domain Archaea; proposed an alternative primer site of interest for short read technologies; and showed that the community profiles produced using the resulting primer pairs are comparable to a high degree to profiles obtained through PCR-free approaches. In the same paper, we presented an experimental procedure for producing amplicons ready for multiplex sequencing in Illumina sequencers using a nested PCR approach (Fig. 2).

Figure 1: An overview of the DegePrime heuristic
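The general shape of the problem stated above (an alignment, a window length l and a degeneracy limit d as inputs; a well-covering degenerate primer as output) can be sketched with a simple greedy procedure. This is not the published DegePrime algorithm, whose details are given in PAPER I; it is a simplified stand-in that assumes a gap-free alignment, and all function names and the toy alignment are hypothetical.

```python
from collections import Counter

def greedy_degenerate_primer(windows, max_degeneracy):
    """Greedily merge the most frequent oligomers of one alignment window.

    Returns (allowed bases per position, number of sequences matched).
    """
    ranked = [oligo for oligo, _ in Counter(windows).most_common()]
    allowed = [{base} for base in ranked[0]]
    for oligo in ranked[1:]:
        # Tentatively add this oligomer's bases to each position.
        candidate = [bases | {b} for bases, b in zip(allowed, oligo)]
        size = 1
        for bases in candidate:
            size *= len(bases)
        # Keep the merge only if the degeneracy limit is respected.
        if size <= max_degeneracy:
            allowed = candidate
    matched = sum(
        all(b in bases for bases, b in zip(allowed, w)) for w in windows
    )
    return allowed, matched

def best_window(alignment, length, max_degeneracy):
    """Scan every window of `length` columns; keep the first best-covered one."""
    best = None
    for start in range(len(alignment[0]) - length + 1):
        windows = [seq[start:start + length] for seq in alignment]
        allowed, matched = greedy_degenerate_primer(windows, max_degeneracy)
        if best is None or matched > best[2]:
            best = (start, allowed, matched)
    return best

# Invented toy alignment, l = 4, d = 2.
alignment = ["ACGTAC", "ACGTAC", "ACGTTC", "ACGAAC", "TCGTAC"]
start, allowed, matched = best_window(alignment, 4, 2)
```

On this toy input, a degeneracy limit of 2 covers four of the five sequences, and raising the limit to 4 covers all five, which illustrates the trade-off between coverage and degeneracy that the heuristic has to balance.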

In PAPER II the technologies developed in PAPER I were used to

develop a primer pair for microbiomics of eukaryotes. Microbial

eukaryotes have fundamental roles in both aquatic and terrestrial

ecosystems. In marine systems, eukaryotic phytoplankton bloom in early

spring, when a combination of increased sunlight and waters rich in

nutrients from winter mixing gives them optimal growth conditions.

These phytoplankton cells, both during their lives and after death by

starvation or lysis, exude a wide variety of organic compounds that feed

bacterial heterotrophs, replenishing a community starved after the

winter months [66]. Bacteria are in turn preyed upon by zooplankton,

reincorporating the carbon and nutrients into the broader food web, a

process known as the microbial loop. In addition to this, carbon fixed by

eukaryotic phytoplankton can enter the trophic cascade directly through

predation by zooplankton or be entirely removed from circulation as

recalcitrant organic matter sinks to the bottom of the ocean, a process

known as microbial carbon sink. Fungi in soil have long been recognised

for their important role in symbiosis with plants and as decomposers of

recalcitrant organic matter. Other small eukaryotes in soil, however,

have been largely neglected and, in general, high-throughput

microbiology often neglects the eukaryotic component of communities

[67], thereby failing to account for crucial parts of the trophic cascade

and element cycling. The goal of PAPER II was to make the eukaryotic

fraction as easily accessible as the prokaryotic.

Even with the best possible primer pairs and PCR procedures,

sequencing itself introduces errors to the data. Three main kinds of

errors can arise from sequencing: substitutions (a base is read in place

of another), insertions (a base is read more times than it was actually

present) and deletions (a base is skipped). Each sequencing platform

has its characteristic error profile and its own suite of tools for

Figure 2: a strategy for generating Illumina-ready amplicons based on 2 PCR reactions

handling them. Error characteristics of the Illumina platform, used for

every paper in this work, have been thoroughly investigated elsewhere

[65].

Specifically in microbiomics, after the initial data filtering, a

standard procedure is to “pick OTUs”. OTU stands for “operational

taxonomic unit” and is a term used in ecological research when exact

taxonomic assignment is impossible or irrelevant. In the specific case of

sequencing, OTUs are defined by clustering sequences according to

similarity. This step is meant to eliminate erroneous sequences, which

should deviate from a true sequence by only a few bases, but also to

reduce sequence diversity into true biological diversity. Since small

variations in tag composition can be observed within a single species

and even within different operons of the same gene within a single cell,

it is assumed that tags differing by only a small percentage of their

bases represent functionally equivalent cells. This is not always true, as

in the well-documented case of Escherichia spp. and Shigella spp.,

which despite having clearly distinct natural histories harbour the exact

same sequence along the full length of their 16S rRNA gene [68].

OTU picking procedures can in very broad terms be divided into

hierarchical clustering (based on single-linkage, average-linkage or

complete-linkage) and heuristic approaches, the most important

of which is the Usearch/Uparse approach. In single-linkage, a

sequence is placed in a cluster if it has a similarity above a threshold to

at least one other sequence in the cluster. This procedure tends to form

very large clusters with a lot of heterogeneity and is rarely used, except

occasionally for very rapidly evolving genes such as the fungal ITS region

[69]. Complete linkage, on the other hand, requires that all sequences in

a cluster have similarity above the threshold to all others. This method

therefore produces more OTUs than the others, and tends to

overestimate measures of community richness, especially for noisy data.

In average-linkage, finally, the average similarity between a sequence

and all others in the same cluster has to be above the threshold. Since it

is computationally very demanding to run an all-against-all comparison

on datasets of millions of reads, as done in hierarchical clustering,

Usearch approximates complete- or average-linkage approaches by only

comparing each sequence to a “centroid” sequence within each cluster

[70]. By selecting a distance cutoff between this centroid sequence and

the candidate sequences that is half of the maximum distance

acceptable between any two sequences within a cluster, a complete-linkage

clustering is approximated. The main disadvantage of this approach is

that the order in which sequences are handled affects the final result.

Therefore, sequences are generally sorted by decreasing abundance

before clustering, since abundant sequences are less likely to be

artifacts. From the description of these methods, it is clear that the

nominal similarity threshold of a cluster can imply a wide range of de

facto distance between sequences in a cluster, depending on the

approach used, a fact that is often glossed over when discussing

microbiomics.
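
The centroid-based strategy described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Usearch algorithm itself: identity is computed as the fraction of matching positions between equal-length sequences (real tools align the sequences and use faster filters), each read joins the first centroid that is similar enough, and the function names are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy identity measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_centroid_clustering(reads, threshold=0.97):
    """Cluster reads around centroids, handling abundant sequences first."""
    abundance = Counter(reads)
    clusters = {}  # centroid sequence -> total number of reads in the cluster
    # Sort by decreasing abundance; ties are broken alphabetically so that
    # the result is reproducible.
    for seq, n in sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += n
                break
        else:
            # No existing centroid is close enough: this sequence founds a new OTU.
            clusters[seq] = n
    return clusters
```

Because a sequence is only ever compared to centroids, two members of the same cluster can differ from each other by up to twice the centroid cutoff, which is why the nominal threshold and the de facto within-cluster distance need not coincide.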

The most popular software packages for OTU picking are

Usearch [70], Mothur [71] and QIIME [72], which runs Uclust in the

background [73]. However, these and other approaches suffer from OTU

instability, that is, the fact that the same sequence might be assigned to

different OTUs depending on the community context [74, 75]. In

general, clusters are selected at 97% similarity [76]. This has several

problems, starting with the fact that 97% similarity over the full length

of a ~1,200 bp gene doesn’t translate directly to 97% similarity over any

given region of the gene [77]. Further, since the methods used are often

heuristic, a 3% distance doesn’t mean exactly the same thing

throughout all packages. Finally, the 97% similarity cutoff is to a large

degree arbitrary, since different taxa might have much less of a distance

between their tags and still represent functionally distinct clades [76,

78]. For the less developed fields of eukaryotic microbiomics (and

increasingly for bacteria as well) higher degrees of similarity are often

used [79, 80]. This stems from an understanding of the different way

taxonomy is applied to eukaryotes as compared to prokaryotes, i.e. clades

with more morphological variety tend to be assigned a finer-grained

classification [81]. The appropriateness of any method is ultimately

dependent on the research question being addressed. OTU-picking at

97% similarity has both been shown to recapitulate natural history well,

when assessed from a global perspective, and to harbour extreme

heterogeneity, when studied at a narrower scale [82, 83]. This issue is

far from being resolved, as the very concept of species is the subject of

much controversy between and within different branches of

microbiology [76].

In an attempt to advance the methodological aspect of OTU

picking, several approaches have been recently published which attempt

to produce biologically meaningful OTUs independently of a predefined

level of similarity. Each of them has a different approach to separating

the noise introduced by PCR and sequencing from true biological

variety. DADA2 (Divisive Amplicon Denoising Algorithm 2) uses the

quality scores of bases, produced by the sequencing platform, to

calculate a substitution error model for the sequencing run at hand. It

then uses this error model to “correct” reads, that is, assign low

frequency reads to higher frequency reads from which they could be

derived by substitution with high probability [84]. Cluster-free filtering

recognises the denoising step in DADA2 as the best available, but its

own error model doesn’t take base quality into account. Instead, it

discards reads with a high probability of error and sorts through the

remaining variation by considering the temporal dynamics of OTUs

[85]. The idea in this case is that while erroneous sequences will have

similar dynamics to their corresponding correct sequence, true

biological variants will react differently to different stimuli. Minimum-

entropy decomposition, in contrast, does not use information across

samples, but neither does it treat each position in a multiple alignment

as equally significant. Instead, it calculates the Shannon entropy for

each position in an alignment and uses positions that are peaks in

entropy to split sequences into smaller clusters. The procedure

continues until no cluster exists which has a significant entropy peak

and contains a given number of sequence reads. Empirically, this

approach has been shown to reveal community dynamics that would

have been obfuscated by 97% OTU clustering [86]. An interesting

approach for saving computing power, which however does not give

differential treatment to more informative positions, has been presented

as Swarm v2, in which the possible 1-base substitution variants of each

sequence are pre-computed, converting a quadratic order all-sequence

comparison into a linear hash comparison [87]. In a small comparison

study performed in-house, however, we came to the conclusion that all

these methods with widely different first principles produce largely

similar ecological results, for instance in terms of alpha and beta

diversity or class-level taxonomic composition, as simply clustering to

97% using Uparse. However, it was also clear that the problem of OTU

instability is not entirely bypassed by any of these heuristics.
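
Two of the ideas above lend themselves to a short illustration. The following Python sketch, with hypothetical function names, first computes the Shannon entropy H = -Σ p log2(p) of each column of a set of equal-length aligned reads, which is the quantity minimum-entropy decomposition inspects when looking for informative positions. It then shows the hashing trick described for Swarm v2: the single-substitution variants of each sequence are enumerated and looked up in a set, rather than comparing every sequence to every other. Neither function reproduces the published tools; they only illustrate the principles as described in the text.

```python
from collections import Counter
from math import log2


def column_entropies(aligned_reads):
    """Shannon entropy (in bits) of every column of equal-length aligned reads."""
    entropies = []
    for column in zip(*aligned_reads):
        total = len(column)
        counts = Counter(column)
        entropies.append(-sum((n / total) * log2(n / total) for n in counts.values()))
    return entropies


def one_substitution_variants(seq, alphabet="ACGT"):
    """All sequences that differ from seq by exactly one substitution."""
    return {
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in alphabet
        if base != seq[i]
    }


def neighbours_by_hashing(unique_seqs):
    """Map each sequence to the known sequences one substitution away.

    Each sequence generates 3 * len(seq) variants that are looked up in a set,
    so the work grows linearly with the number of unique sequences.
    """
    known = set(unique_seqs)
    return {seq: sorted(one_substitution_variants(seq) & known) for seq in unique_seqs}
```

In the entropy function, a column where all reads agree has zero entropy, while a column split evenly between two bases has an entropy of one bit.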

After the OTUs in a study are determined, it is crucial to assign a

taxonomic classification to them, so that the OTU dynamics across

samples can be interpreted in light of what is known about these taxa

from previous studies, and more broadly to allow comparison across

microbiomics studies. Unfortunately, there is also no consensus in the

research community about how to assign taxonomy to OTUs. Certain

workflows, such as QIIME [72] and Mothur [71], include the

classification step. Other software packages are dedicated exclusively to it. For

instance, the Ribosomal Database Project (RDP) Classifier uses a naïve

Bayesian approach to classify sequences based on exact matches of

8-letter words, and performs bootstrapping to give probability estimates

of the correctness of the assignment [88]. Another popular approach to

sequence classification is the Silva Incremental Aligner (SINA) [89].

SINA uses an initial k-mer based search similar to that of the RDP, but

then uses the subset of the reference sequences matched best by the

k-mer search to construct a directed acyclic graph representing the

composite of all unique selected sequences and calculates an exact

alignment between the query sequence and the reference candidates.

Finally, the sequence taxonomy is assigned as the least common

ancestor of the top-scoring alignments. These approaches generally

perform much more poorly for eukaryotes than prokaryotes, due both to

more incomplete databases and to a more elaborate taxonomy.

Therefore, databases and placement strategies for eukaryotic microbes

are still being developed [90, 91]. For well-studied environments of

limited diversity, placing OTUs directly over a phylogenetic tree is a

good strategy for assigning least common ancestor taxonomy to OTUs of

interest, but this approach is computationally demanding and doesn’t

scale well for large datasets with high taxonomic diversity [92].
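
The least common ancestor assignment mentioned above amounts to keeping the taxonomic ranks on which all top-scoring hits agree. A minimal Python sketch follows, assuming that each hit is represented as a list of rank names ordered from domain downwards; the function name and this input format are illustrative assumptions, not the interface of SINA or any other classifier.

```python
def least_common_ancestor(lineages):
    """Return the longest prefix of ranks shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so incomplete lineages are handled.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared
```

For two hits that agree down to family level but belong to different genera, the function returns the lineage down to the family, so the OTU is classified only as far as the hits support.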

Any combination of methods and algorithms chosen to profile

community microbiomes has its own intrinsic and unavoidable

biases. Which method produces the results closest to the underlying

community is difficult to assess and depends on the specific community

under study, but being aware of the biases produced by each method is

crucial both to method selection and to data interpretation and

comparison across studies.

iv. Data processing and statistical analyses of microbiomics

Multisample microbiomics data is generally summarised as a

table of read counts per OTU per sample. These tables are generally

very sparse, especially for OTUs belonging to the rare biosphere. The

interpretation of these counts of 0 is not straightforward, since they may

represent the true absence of an OTU or its presence under the

detection limit. Each of the various steps of processing, from DNA

extraction through library preparation and sequencing, gives a different

yield for each sample, meaning that the detection limit is neither shared

by all samples nor easily estimated. While this can be bypassed by

normalizing counts per sample, that means that observations are no

longer independent, since an increase in the relative abundance of an

OTU induces a perceived reduction in all others.
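
The loss of independence caused by normalisation can be seen in a small numerical example. In the Python sketch below, with a hypothetical function name and made-up counts, only the first OTU changes in absolute terms, yet the relative abundances of the other two drop.

```python
def relative_abundance(counts):
    """Convert the read counts of one sample to proportions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [count / total for count in counts]


# Made-up counts for three OTUs; only the first one grows between samples.
before = relative_abundance([100, 100, 200])
after = relative_abundance([400, 100, 200])
```

Here the second OTU goes from 25% of the reads to about 14% although its count is unchanged, which is the perceived reduction described above.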

Given these caveats, the choice of metric when comparing

samples or OTUs through clustering or ordination analyses can have

important consequences. In addition to true distance metrics, it is

common in microbial ecology to use correlation coefficients, such as

Pearson’s product moment and Spearman’s rank, or measures of

dissimilarity, such as Bray-Curtis. The appropriate metric for a study

might depend on the size of the effect of interest and on the depth of

sampling. Semi-quantitative measures such as Spearman’s require a

larger effect size, while Euclidean distances often require a very large

sample [93]. Bray-Curtis can be appropriate for datasets with many

zeroes, but may also lack sensitivity [94].
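
As an illustration of why Bray-Curtis copes with sparse tables, the dissimilarity between two samples x and y is the sum of |x_i - y_i| over all OTUs divided by the sum of (x_i + y_i), so OTUs absent from both samples contribute nothing. A minimal Python sketch with a hypothetical function name:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("both samples must cover the same OTUs")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(x, y)) / total
```

The value ranges from 0 for identical samples to 1 for samples that share no OTUs.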

An alternative to OTU-based distances is to use phylogenetic

distances. While these approaches also require several non-trivial

choices, such as the underlying phylogenetic tree and the placement of

OTUs on it, phylogenetic distances are still more biologically

meaningful, not least because phylogenetic relatedness is often

associated with trait conservation [95]. As is the case for OTU-based

metrics, using a quantitative or qualitative approach to community

comparison can lead to very different results [96]. This can be

ameliorated through an appropriate weighting procedure, such as

generalised Unifrac [97].

Regardless of the metric chosen, exploratory methods are often

used for an initial assessment of how different communities cluster or

distribute themselves along a gradient. One of the oldest and most

common exploratory methods is principal component analysis, or

PCA. In it, variables are treated as axes in a Euclidean

multidimensional space and the first principal component is by

definition placed on the direction representing the largest variation of

the data. The second component is placed in the direction orthogonal to

this that explains the largest amount of the remaining variance and so

on. The first two or three components often explain a large amount of

the variation, allowing a visual inspection of the distance between

samples in two- or three-dimensional space. Furthermore, the

percentage of the variation explained by each axis indicates whether

there are dominant drivers present or not. To avoid the constraint of the

Euclidean distance, principal coordinate analysis (PCoA) can be used

together with any dissimilarity matrix. Another conceptually similar

strategy is correspondence analysis (CA), where rather than maximising

the percentage of variance explained by each axis, the correspondence

between rows and columns in the matrix is optimised. Unlike these

techniques, in multidimensional scaling (MDS), the number of

dimensions to which the dataset should be reduced is chosen a priori

and the algorithm finds the distribution of objects in the

lower-dimensional space that best corresponds to their distances in the

full-dimensional space, while also calculating a stress function representing the

amount of distortion between spaces. Non-metric MDS, or NMDS, is

an extension of this technique using ranks of distances rather than their

value.
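
To make the notion of stress concrete, the following Python sketch computes one common definition, Kruskal's stress-1, as the square root of the summed squared differences between target and embedded distances, divided by the summed squared embedded distances. As a simplifying assumption, the original distances are used directly as targets, as in metric MDS; NMDS would first replace them by values that only preserve their rank order. The function name and input format are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def stress(original, coords):
    """Kruskal's stress-1 of a low-dimensional configuration.

    original: square, symmetric matrix of distances in the full space.
    coords:   one coordinate tuple per object in the reduced space.
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(coords)), 2):
        embedded = dist(coords[i], coords[j])
        numerator += (original[i][j] - embedded) ** 2
        denominator += embedded ** 2
    if denominator == 0:
        raise ValueError("all objects share the same coordinates")
    return sqrt(numerator / denominator)
```

A stress of zero means that the reduced space reproduces the original distances exactly; squeezing three mutually equidistant objects onto a line, for example, necessarily produces a stress above zero.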

It is often not clear what the main driver of the community over

the gradient is, or even how many overlapping gradients there are.

These are the cases where exploratory methods are most needed, but

also where the biases of each method most affect the biological

interpretation of results. For instance, the horseshoe effect, where

sparse matrices driven by a single dominant gradient assume an

arch-like pattern when submitted to PCA or CA, may mask other, more subtle

gradients, and detrending techniques to eliminate this effect often erase

true patterns [93]. In datasets with many overlapping gradients, an

NMDS will often produce a clearer overview of the data distribution

than methods that don’t limit the number of dimensions [98]. It is

therefore advisable to try a variety of different approaches and

retain not only those which explain the largest proportion of the

variation in the dataset, but also those that propose underlying

biological mechanisms amenable to further investigation.

It is also possible to test the effect of specific parameters on the

community composition, using constrained techniques or discriminant

analysis. These techniques assess the proportion of the variation in the

data which can be explained by specific parameters or combinations of

them. Some of these methods test how well one matrix can be explained

by another symmetrically, and are useful for instance to see whether the

bacterial community in one filter fraction corresponds to that in another

from the same sampling event, or to assess correlations between the

prokaryotic and eukaryotic fractions of a sample. Similarly to PCA,

canonical correlation analysis (CCorA) tries to find the linear

combinations of variables in two datasets that provide the maximum

correlation between them. In Procrustes analysis, the same set of

objects (e.g. samples) placed on different spaces (e.g. biological domains

or metabolites) are moved, rotated and scaled to minimise the square

root of the sum of squared distances between each pair of corresponding

objects.

When it is clear which are the explanatory variables and which

are the response variables, as is typically the case when comparing

environmental parameters and microbial communities, the methods

discussed above can be constrained accordingly. Redundancy analysis

(RDA) is a version of PCA extended to sets of variables and constrained

to the explanatory variables. Likewise, canonical correspondence

analysis (CCA) is CA constrained to the explanatory variables. If one

variable overwhelms the effect of all others, as can be the case in

intervention studies where all treated samples are clustered together

and apart from the non-treated, a principal responses curve (PRC) can

be used [99]. This approach is also useful if an overlap of many

potentially interacting gradients makes the visual interpretation of an

RDA or CCA plot impossible.

Another family of analyses aims at assessing the significance of

the correlation between two matrices. This includes tests such as

analysis of similarities (ANOSIM) and Mantel’s test. The latter

calculates the correlation between corresponding positions in two

matrices and assesses significance by permutation. ANOSIM, similarly

to NMDS, is based on ranks of object distances, and compares the ranks

of distances of objects within classes with those between classes.
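
The permutation logic of Mantel's test can be sketched as follows. This is a minimal one-sided version written for illustration, with hypothetical function names and defaults: it correlates the upper triangles of two distance matrices using Pearson's coefficient, then repeatedly shuffles the object labels of the second matrix to estimate how often a correlation at least as large arises by chance.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test between two square distance matrices."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    v1 = [d1[i][j] for i, j in pairs]
    observed = pearson(v1, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        # Shuffle object labels, i.e. rows and columns together.
        rng.shuffle(order)
        permuted = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(v1, permuted) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual cells would break the structure of the matrix and invalidate the test.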

None of the strategies discussed here can distinguish correlation

from causation, except perhaps in intervention studies. More

importantly, clusters and gradients produced along artificial axes do not

necessarily correspond to any underlying biological effect. From a

mathematical perspective, variables of different types (e.g. metabolomics

versus microbiomics) will often have different variance-to-mean

characteristics, which requires appropriate data transformation [98].

New methods for testing hypotheses based on high-throughput data are

still being developed, and understanding their strengths as well as their

assumptions is a crucial and challenging issue for microbial ecologists.

v. Network construction and community dynamics

Given the overall design and caveats of a microbiomics study, the

sampling strategy will determine which biological questions can be

addressed. A comparison of the microbiome of diseased patients and

healthy controls can show which taxa differentially colonise these

subjects, but will give no insight into whether the altered microbiome is

promoting disease or if the disease state selects different taxa. In an

experiment with animal models or directly in the environment, it is

possible to see how the microbiome reacts to the intervention. In some

cases, it is possible to alter the microbiome (through antibiotics,

prebiotics and/or the introduction of foreign microbial cells) and observe

the impact on the environment or host phenotype. However, there is

mounting evidence that within a given ecosystem, interactions between

taxa play a more important role in driving community dynamics than

environmental forcing [18, 100].

The most basic approach to hypothesising interactions between

microbial populations is through pairwise relationships, either as

presence/absence (“checkerboard patterns”) or through quantitative

measures. The latter generally rely on measures of correlation such as

Spearman’s and Pearson’s, while the hypergeometric distribution is

appropriate for binary data [101, 102]. In either case, the underlying

hypothesis is that, if there is an interaction between two species, and

given similar environments with similar resources, these two species

will co-occur more often than expected by chance if their interaction is

beneficial (mutualism or commensalism) and less often than

expected by chance if their interaction is detrimental (competition or

amensalism). However, two of the most important types of interactions

in natural systems, predation and parasitism, are beneficial to one of the

parties (the predator or parasite) and detrimental to the other (the prey or

host), complicating the ecological interpretation of co-occurrence

patterns. Furthermore, given the intricacies of microbial metabolism, it

is seldom clear if a species is excluded from a niche due to negative

interactions with other organisms or due to unmeasured variables in the

environment. Nevertheless, mapping pairwise correlations can be a

useful first step in developing a network hypothesis. Since in the typical

case thousands of correlations and anticorrelations will be tested,

any association has to be tested for significance and

subjected to multiple testing correction. This is done by randomising the

interaction network and calculating the distribution of scores. It is

however still not clear what the correct randomisation procedure is for

this type of data [94].
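
For presence/absence data, the hypergeometric test mentioned above asks how likely it is that two taxa share at least the observed number of samples if their occurrences were independent. A minimal Python sketch, with a hypothetical function name and argument order:

```python
from math import comb


def cooccurrence_p_value(n_samples, n_a, n_b, n_both):
    """P(X >= n_both) for X ~ Hypergeometric(n_samples, n_a, n_b).

    n_a and n_b are the numbers of samples containing taxon A and taxon B,
    and n_both is the number of samples containing both.
    """
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_samples - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return favourable / comb(n_samples, n_b)
```

With 10 samples, two taxa that are each present in 5 samples and always occur together give a probability of 1/252, or roughly 0.004. Before any such value is interpreted, it still has to be corrected for the number of pairs tested.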

In addition to being a rich representation of interactions between

particular nodes, properties of the network itself can contain

information about the system. For instance, microbial networks are

generally modular, scale-free and have short average path length [94].

How these mathematical properties translate into biological properties

is still open to debate. It is not clear, for instance, whether a node with a

high out-degree represents a keystone clade whose demise would

severely perturb the entirety of the system, or whether the levels of

redundancy and plasticity in biological systems are enough to replace

these hubs without much propagation of perturbation. In the case of

bacteria in particular, not only does the community present a certain

level of plasticity, but single species and even individual cells can

dramatically alter their life strategy in response to disturbances,

decoupling to a large extent a community's taxonomic composition from

its functional profile [103]. Network properties also interact with

community characteristics such as richness and evenness, and often

have opposite effects in the resulting resistance and resilience of the

community to perturbation, so that broad natural laws of community

stability might be impossible to obtain [103].

A powerful approach to gain insight into the internal mechanisms

of a natural microbial community is sampling time-series with

appropriate intervals and length and using techniques such as local

similarity analysis [104, 105] or auto- and cross-correlation [18, 106,

107]. If a system has an intrinsic periodicity, such as annual cycles, a

few full cycles should be included in the study to separate recurring

patterns, random fluctuations and system drift (time decay), as was

done for the Western English Channel and Lake Mendota [18, 108].

It is also important to consider that different processes might take place

at different rates, corresponding to one or more sampling intervals, as

seen in the San Pedro Ocean Time-Series, or, conversely, that

associations that are significant in the short term are irrelevant in

longer time-spans, as described for different stations on the coast of

California [105, 109].
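
Auto- and cross-correlation on a time-series reduce to correlating one series with a shifted copy of itself or of another series. The Python sketch below illustrates this under simplifying assumptions: samples are evenly spaced, the lag is expressed in sampling intervals, and Pearson's coefficient is used; the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    if len(x) != len(y):
        raise ValueError("both series must have the same length")
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must leave at least two paired time points")
    return pearson(x[: len(x) - lag], y[lag:])
```

Passing the same series as both arguments gives the auto-correlation at that lag. A high cross-correlation at lag 1 but not at lag 0 suggests that one population follows the other by one sampling interval, which is why the choice of interval determines which processes can be detected.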

Strong seasonal recurrence has been reported in several sites,

with the rate of interannual decay declining with the length of the time-

series [18, 64]. Seasonal patterns are believed to be formed by a

combination of external forcing (light, temperature, wind and currents)

and intrinsic community mechanisms, such as trade-offs between

resistance to predation and growth efficiency. Temporal decay is a

phenomenon whereby events which are close in time are more related

than those that are further apart. This can happen due to random

chance events which accumulate over time or due to undetected

changes in underlying conditions. Due to seasonality, the comparisons

most relevant for microbial assemblages are those between samples taken

one year apart, i.e. in the same season. Stochastic factors mean that

there is significant loss of signal from one year to the next. However,

environmental and biological constraints maintain community variation

within certain boundaries. Therefore, calculating inter-annual decay

over several years tends to give lower yearly rates than would be

calculated on shorter time-series. In addition to recurrent and linear

(time-decay) patterns, dramatic but rare events can occasionally also be

observed in long time-series, as otherwise rare taxa bloom in response

to changing environmental conditions, which has been reported in the

Western English Channel, in the Bermudas and in the central Baltic Sea

[18, 110, 111].

In PAPER III, the methods developed in PAPER I and PAPER II

were applied to a 3-year-long time-series of samples collected from the

Linnaeus Microbial Observatory, 10 km east of Öland. These samples

were collected during the entire ice-free period, with sampling intervals

of 4-15 days between samples. An initial analysis, focusing on the 2011

bacterial fraction, was published by Lindh et al. [111]. Using 454

pyrosequencing and clustering at 97% identity, 3079 OTUs were formed,

the vast majority of which were rare (<0.1% abundance) and infrequent.

Considering the sequencing depth obtained by 454, of <10,000 reads

per sample, it is expected that rare populations will also be infrequent.

Amongst the abundant OTUs (>1%), some had clear seasonal patterns;

some had less clear peaks, which are either stochastic, driven by

unmeasured environmental parameters or by hydrodynamics; and 8

were abundant in more than half of the samples. Interestingly, these 8

OTUs, which together account for >50% of the total number of

sequencing reads, span three phyla, including three proteobacterial

classes. Concentrations of nitrate, ammonium, silicate and phosphate,

as well as dinoflagellate biovolume, correlated significantly with

community composition. Community richness was significantly

correlated to chlorophyll a concentration, indicating that eukaryotic

phytoplankton play an important role in sustaining a diverse

heterotrophic bacterial community. The communities as a whole formed

significantly distinct clusters by season, with gradual changes in

abundance within season, but sharp changes in composition on seasonal

transitions.

This study was extended in PAPER III by sequencing prokaryotic

tags from three years (2011-2013) as well as eukaryotic tags for the

latter two. Some of the observations made in the Lindh study [111] were

confirmed, such as the temporal clusters of 16S tags and a fuzzier, but

still present, clustering of 18S tags. However, a degree of uncertainty

and inconsistency between years was also observed. Despite the deeper

sequencing obtained with Illumina technology, most OTUs are rare and

infrequent, with only 13% of them present in half the samples or more.

The dynamics of individual OTUs are fairly distinct from year to year,

with their maxima often occurring in different seasons (40% of OTUs)

and in a different period in relation to algal blooms (60% of OTUs). This

unpredictability extends to correlations between OTUs, with only

30-40% of strong links (Spearman’s ρ ≥ 0.7) being predicted by more than

one of the yearly time-series. The dynamics of eukaryotic tags are less

predictable, since few strong links are observed between them, or

between them and bacterial tags. Accordingly, no combination of the

environmental factors measured could explain more than 33% of the

community variation for any of the microbial subsets or seasons

analysed.

Despite this large degree of uncertainty, more than half of the

most frequent OTUs could be modelled with a fair degree of precision.

Using random forests, predictive models were generated for each OTU

based on other OTUs, environmental factors or both sources of data.

309 bacterial tags and 120 eukaryotic tags could be modelled and present

R² > 0.5 when comparing predicted data to actual measurements as

well as statistical significance compared to randomised data. Further,

for 83 bacterial tags, models could be found to predict their abundance

at one time point based on the previous one, 61 could be predicted two

time points ahead and several others (20-40 per interval) could be

predicted at longer time-spans. This has important implications for

environmental monitoring, since it implies that OTUs of interest can

have their relative abundance predicted at least two weeks in advance,

shifting surveillance from a responsive to a predictive and proactive

paradigm.
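
The evaluation of such time-lagged predictions can be sketched as follows. This is a minimal example under stated assumptions: the abundance values are invented, and a naive "persistence" predictor (abundance stays unchanged) stands in for the random forest models, which are not reproduced here. The functions `lagged_pairs` and `r_squared` are hypothetical helpers showing only how predictions made one or two time points ahead would be scored against actual measurements.

```python
def lagged_pairs(series, lag):
    """Pair each value with the value observed `lag` time points later (lag >= 1)."""
    return list(zip(series[:-lag], series[lag:]))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical relative abundances of one OTU at consecutive time points.
abundance = [0.10, 0.12, 0.15, 0.20, 0.26, 0.30, 0.28, 0.22]

for lag in (1, 2):
    pairs = lagged_pairs(abundance, lag)
    # Stand-in model: predict that the abundance does not change.
    predicted = [earlier for earlier, _ in pairs]
    observed = [later for _, later in pairs]
    print(lag, round(r_squared(observed, predicted), 3))
```

Any regression model could replace the stand-in predictor; the scoring step stays the same, and a model would be retained only if it also outperformed models trained on randomised data.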

vi. Metagenomic surveys

While microbiomics can provide an inventory of microbial taxa in

an environment and even propose links between taxa or between these

and environmental parameters, it cannot do a comprehensive functional

profiling of the environment of interest. Assessing the metabolic

potential of a complex environment requires sequencing a

representative set of functional genes in it; this is achievable through

shotgun metagenomics.

The term metagenomics was coined by Handelsman and

colleagues in 1998, when they described “the collective genomes of soil

microflora, which we term the metagenome of the soil” and proposed

discovering new pathways for the synthesis of natural products by

transforming E. coli with genes from soil microbes [112]. The current

understanding of metagenomics, that is, the sequencing of the full gene

inventory of an environment, appeared first in 2004 with the release of

two papers describing whole DNA sequencing from environmental

samples. In one, near complete genomes from a simple acid mine

drainage biofilm were described [113], while the other focused on

genomic fragments from an oligotrophic marine environment, the

Sargasso Sea [114].

While all the considerations above on sample collection and DNA

extraction also apply to metagenomics, the issues of primer selection

and PCR biases are circumvented. The computational demands of

metagenomic analysis are, however, much greater. Some metagenomic

analyses, such as limited phylogenetic profiling, can be done directly

based on short sequencing reads [115], but most of the information in

the genome can only be analysed based on nearly full-length genes or

operons. In fact, the closer to a full chromosome that can be obtained,

the more information on genetic architecture, physiology and evolution

can be obtained. The process of converting short sequencing reads into

much longer contiguous stretches, or contigs, is called assembly and is

still an active research area for isolate genomes as well as for

metagenomes [116].

A naïve approach to genome assembly is to use the ends of reads

that overlap other reads to piece them together. This strategy, named

overlap-layout-consensus (OLC), has been highly successful on the

relatively long sequencing reads produced by technologies such as

Sanger sequencing. However, the large number of short reads (~50-300

bp) produced by Illumina and other massively parallel sequencing

technologies is unsuitable for OLC, as the process would be

simultaneously computationally intractable and extremely error prone.

Instead, most modern assemblers are based on de Bruijn graphs. A de

Bruijn graph is a directed graph which represents overlaps between

short sequences such that each node is a sequence and each edge

represents an overlap between them. For genome assembly, the length

of the short sequence, or k-mer, is typically between 21 and 91, and

each edge joins two k-mers that overlap by k-1 bases (i.e. a (k+1)-mer). Since there are

far fewer unique k-mers of a given length than unique sequencing reads,

the de Bruijn graph structure greatly reduces the amount of memory

required for the assembly, and the search space and hence computation

time. With the graph in place, the problem of assembly is reduced to a

problem of finding the shortest path connecting all nodes in the graph.

If there were no loops in the graph and sequencing quality was perfect,

this problem would be trivial. However, sequencing errors and unequal

sequencing coverage create dead ends in the path which must be

resolved. In addition, there is natural biological repetition in the genome, so

the formation of loops is inevitable. To solve this, most assemblers rely

on heuristics based on the depth of sequencing in each path. Most

errors will be rare, so low coverage paths (“bubbles”) may be discarded.

True repetitive regions will have particularly high coverage, and can

therefore either be present in the assembly multiple times or be left as a

separate contig, in case they cannot safely be connected to the rest of

the assembly.
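
The construction described above can be illustrated with a toy Python sketch. It is a deliberately simplified illustration, not the implementation of any particular assembler: the function names are hypothetical, the reads are invented, k is far smaller than the values used in practice, and reverse-complement strands are ignored. Nodes are k-mers, edges are the (k+1)-mers observed in the reads, and edge counts serve as coverage. The last read carries a single-base error, which here creates a low-coverage dead-end branch that is removed by a coverage threshold.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Count every (k+1)-mer in the reads; each one is an edge between two k-mers."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k):
            kplus1 = read[i:i + k + 1]
            edges[(kplus1[:-1], kplus1[1:])] += 1
    return edges


def prune_low_coverage(edges, min_coverage):
    """Drop edges seen fewer than min_coverage times (likely sequencing errors)."""
    return {edge: n for edge, n in edges.items() if n >= min_coverage}


def walk_unambiguous(edges, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
    contig, node, seen = start, start, {start}
    while len(successors[node]) == 1:
        node = successors[node][0]
        if node in seen:  # stop at a loop rather than cycling forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Toy reads from the sequence ATGGCGTGCA; the last read has an error (T -> A).
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA", "ATGGCGT", "GGCGTGC", "CGTGCA", "GGCGAGC"]
raw_graph = build_de_bruijn(reads, k=3)
graph = prune_low_coverage(raw_graph, min_coverage=2)

print(walk_unambiguous(raw_graph, "ATG"))  # stops where the error creates a branch
print(walk_unambiguous(graph, "ATG"))      # continues once the branch is pruned
```

On the unpruned graph the walk halts at the branch introduced by the erroneous read, whereas after pruning the full toy sequence is recovered. The same threshold would, however, also discard genuine low-abundance variants, which is why coverage heuristics need more care in metagenomes.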

In the case of metagenome assembly, all the issues described

above still hold, with the aggravation that many genes and genetic

elements will be found with small variations in several genomes. In this

case, a low coverage path is not necessarily a sign of a sequencing error

to be collapsed into a contig of higher coverage; rather, it may represent

true biological variation. The use of coverage to differentiate between

different possible paths in a de Bruijn graph is therefore even more

crucial for metagenomic assembly than for single genomes.

Metagenomic assemblies are often highly fragmented, that is,

many more short contigs are generated than the number of full length

chromosomes and plasmids in the biological sample, due to repeated

regions between genomes and genomes of low abundance in the

community. One way to improve this is to combine the strengths of

different k-mer lengths. While short k-mers will generally have higher

coverage, they are less likely to be able to resolve loops in the de Bruijn

graph, which results in a long total length of the assembly, but spread

over several short contigs. Long k-mers have more capacity to bridge

over repeats and resolve loops but, having lower coverage, they will

generally not result in an equally long total assembly since many k-mers

will be missing. Therefore, several assembly strategies exist that

attempt to combine the strengths of short and long k-mers through

procedures such as scaffolding, reassembly with an OLC method or

built-in progressive assembly. Scaffolding does not directly extend

contigs, but places them in the correct order and direction while

estimating the size of assembly gaps (unassembled regions) between

them. This is generally done by taking advantage of mate-pair reads

with long insert sizes that bypass the maximum length of the DNA

fragment that can be sequenced. While in principle useful, this strategy

requires special laboratory procedures and is in practice rarely used in

metagenomic sequencing. On the other hand, producing contigs by de

Bruijn graph assembly and then clipping them into long substrings

(~500-1000 bp) reduces the problem of short-read assembly into that of

long-read assembly, which is tractable by OLC. This approach has been

shown to significantly improve the length of contigs produced with a

very small increase in the rate of misassembly [117–120]. Finally, some

assemblers, such as Megahit and IDBA-UD, use progressively increasing

k-mers in their assemblies, while retaining contigs obtained at previous

iterations [121, 122]. They also implement different thresholds for

removing erroneous k-mers at each iteration, which is a great advantage

when assembling data with highly uneven coverage, as is inevitably the

case for metagenomic datasets.
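
The clipping step of the reassembly strategy mentioned above can be sketched in a few lines of Python. The piece length follows the ~500-1000 bp range given in the text, but the amount of overlap between consecutive pieces and the handling of the contig end are assumptions made for this illustration, and `clip_contig` is a hypothetical helper rather than part of any published pipeline.

```python
def clip_contig(contig, piece_length=1000, overlap=500):
    """Cut a contig into pieces of piece_length that overlap by `overlap` bases."""
    if overlap >= piece_length:
        raise ValueError("overlap must be shorter than piece_length")
    if len(contig) <= piece_length:
        return [contig]
    step = piece_length - overlap
    pieces = []
    start = 0
    while start + piece_length < len(contig):
        pieces.append(contig[start:start + piece_length])
        start += step
    # The final piece is anchored at the contig end so that no bases are lost.
    pieces.append(contig[-piece_length:])
    return pieces


# Small values are used here only to make the result easy to inspect.
print(clip_contig("ABCDEFGHIJ", piece_length=4, overlap=2))
```

Pieces produced in this way from assemblies made with different k-mer lengths would then be pooled and passed to an overlap-layout-consensus assembler, which can merge them wherever they overlap.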

To extract biologically relevant information from assembled

contigs, the next crucial step is usually gene prediction. Predicted genes

are used first and foremost for inferring possible metabolic pathways

present in the community. Although the presence of a gene does not

imply that it is being transcribed and much less that it is enzymatically

active, it is a reasonable assumption that communities presenting a

certain gene should at least have the potential for realising that activity

under appropriate conditions, such as in the presence of substrates and

the absence of inhibitors. While gene prediction and gene functional

assignment are not trivial problems, available software is sufficiently

accurate to allow good estimates of metabolic potential from complex

communities [123]. The quantitative information contained in the

sequencing runs is lost during assembly. Therefore, quantifying the

relative copy number of genes and genomes from a metagenome

requires mapping reads back to the assembly. This quantitative

information can then be related to quantitative measures of

environmental parameters.
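
A minimal sketch of this quantification step is given below. It assumes that read mapping has already been performed with an external tool and summarised as a number of mapped reads per contig; the contig names, counts, lengths and the fixed read length are invented for the example, and the two helper functions are hypothetical.

```python
def mean_coverage(mapped_bases, contig_length):
    """Average number of times each contig position is covered by reads."""
    return mapped_bases / contig_length


def relative_abundance(coverages):
    """Scale coverages so that they sum to 1 across all contigs in a sample."""
    total = sum(coverages.values())
    return {name: cov / total for name, cov in coverages.items()}


# Hypothetical mapping summary: contig -> (number of mapped reads, contig length in bp).
read_length = 100
mapping = {"contig_1": (300, 10_000), "contig_2": (100, 2_000), "contig_3": (20, 1_000)}

coverage = {
    name: mean_coverage(n_reads * read_length, length)
    for name, (n_reads, length) in mapping.items()
}
abundance = relative_abundance(coverage)
print(coverage)
print(abundance)
```

Normalising by contig length matters because a long contig attracts more reads than a short one at equal copy number; in this example `contig_2` has the highest coverage although `contig_1` has three times as many mapped reads.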

Conserved gene families can also be used to infer the

phylogenetic origin of contigs, and therefore of the underlying

community [124, 125]. Other approaches for the phylogenetic

placement of contigs are based on k-mer profiles, alignment of full

contigs to a database or combinations of these approaches. A

fundamental distinction to be made between these approaches is that

marker-gene based strategies will necessarily only classify the subset of

all contigs which contains these genes, while an alignment or k-mer

based approach can be applied to all contigs and even individual reads.

A comprehensive comparison of available software was performed by

Peabody and colleagues [126]. By performing in silico and in vitro clade

exclusion experiments, that is, removing from the database the target

taxon, they found that several methods tend to grossly overestimate the

community diversity. The precision and sensitivity of every method fall

as expected with increasing levels of clade exclusion and increase with

read length. When the query sequences are present in the database, k-

mer based methods outperform those based on marker genes, since they

can analyse all the sequencing information and not only predicted

genes. In the more realistic case where metagenomic reads come from

sequences not already known in databases, however, marker-gene-based

predictions outperform k-mer based approaches by far. This also

highlights the importance of good assembly and gene calling

procedures, to maximise the number of contigs containing a correctly

predicted marker gene.
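
To make the two performance measures concrete, the sketch below shows one common way of computing them for a taxonomic classifier that may leave sequences unassigned. The definitions used here (precision as the fraction of assigned sequences that are correct, sensitivity as the fraction of all sequences that are correctly assigned) are an assumption for this illustration and may differ in detail from those of the cited comparison; the taxon labels are invented.

```python
def precision_and_sensitivity(true_taxa, predicted_taxa):
    """predicted_taxa uses None for sequences the classifier left unassigned."""
    assigned = [(t, p) for t, p in zip(true_taxa, predicted_taxa) if p is not None]
    correct = sum(1 for t, p in assigned if t == p)
    precision = correct / len(assigned) if assigned else 0.0
    sensitivity = correct / len(true_taxa)
    return precision, sensitivity


# Hypothetical true and predicted taxa for five sequences.
true_taxa = ["A", "A", "B", "B", "C"]
predicted_taxa = ["A", "B", None, "B", "C"]
print(precision_and_sensitivity(true_taxa, predicted_taxa))
```

A method that assigns only the sequences it is sure about can thus reach high precision at low sensitivity, which is the trade-off expected of marker-gene-based approaches that classify only a subset of contigs.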

Global oceans were among the first environments to be explored

through metagenomics and are still among the most researched [114,

127–129]. These studies have uncovered much phylogenetic novelty at

all taxonomic levels. Functional novelty was also discovered, such as

novel proteorhodopsin families, novel pigment gene clusters and

ammonia oxidation capacity in Archaea [114, 130]. Crucially, with

metagenomics, hypotheses initially postulated from microbiomics

studies can be assessed at the functional gene level. For instance,

Venter and colleagues confirmed in the Sargasso Sea that

bacterioplankton contigs cluster into defined species, not a continuum

of diversity [114]. Later studies deepened this observation by showing

that even cosmopolitan clades such as SAR11 display microdiversity and

geographical structuring [131]. From a biogeography perspective,

Dupont, Larsson and colleagues confirmed the observation that

bacterioplankton functional genes in the Baltic Sea follow its salinity

gradient, with a typically freshwater profile in the low salinity North

becoming increasingly marine towards the mesohaline southwest [129].

Metagenomic data has the further advantage that, because it is broad

and unbiased by primer choice, it can be combined across studies to

present a more complete picture of a biological phenomenon or

reexamined by other research groups to answer questions not even

considered by those which produced the data.

vii. Genome reconstruction from metagenomic data

Although a great deal can be learnt about a bacterial community

based on gene prediction, in practice, metabolism takes place in

structured pathways, mostly in and on cells, and not as a loose

collection of genes. Therefore, much more information can be obtained

when contigs are put together into the context of a single organism.

While full genomes are the ultimate goal of assembly, this is in practice

not feasible, and other sources of information must be used to bin

contigs into fragmented draft genomes (metagenome-assembled

genomes, or MAGs). Sequence-based parameters, such as GC-content

and tetranucleotide frequencies have been successfully used to produce

MAGs in some cases, but these approaches can generally only

discriminate down to the genus level [132–134]. By including

information on coverage of contigs across multiple related samples it is

possible to obtain a finer level of resolution, at species and sometimes

strain levels [135–137]. For the sake of reproducibility and cross-study

comparison, this process should be fully automated. A few tools have

been put forth that can do that [138–141], although binning genomes

from specific organisms of interest might still require semi-manual

approaches [142].
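
The sequence-based parameters mentioned above are simple to compute, as the following Python sketch illustrates. It is a simplified example with hypothetical function names and an invented contig: it counts tetranucleotides on one strand only, whereas binning tools may additionally merge reverse-complementary tetranucleotides, and it does not include the coverage information that gives finer resolution.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tetranucleotide_frequencies(sequence):
    """Relative frequency of each of the 256 possible 4-mers in a contig."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[t] for t in all_tetramers)  # ignores 4-mers containing N
    return {t: counts[t] / total if total else 0.0 for t in all_tetramers}


contig = "ATGCATGCATGC"  # invented sequence, far shorter than a real contig
profile = tetranucleotide_frequencies(contig)
print(gc_content(contig))
print({t: round(f, 3) for t, f in profile.items() if f > 0})
```

Each contig is thereby reduced to a vector of 256 frequencies (plus GC-content), and contigs with similar vectors can be grouped by a clustering algorithm. Such composition profiles are only reliable for contigs long enough to sample the tetranucleotide distribution well.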

New clades are increasingly being proposed based on MAG data

[137, 142–146]. In addition to providing new genomes from ecologically

important clades, the MAG approach can be used to close outstanding

gaps in the tree of life. By strategically sampling from environments

known to contain phylogenetic novelty and sometimes using semi-

manual approaches, deeply diverging clades have been found and

characterised. These include 35 bacterial phyla from a proposed

candidate phyla radiation [137], several genomes from previously

proposed candidate phyla [145] and an archaeal clade proposed to be a

sister clade to that which gave rise to all eukaryotes [142].

In PAPER IV, the automated contig binning tool CONCOCT [138]

was used to generate high-quality MAGs from water samples from the

Linnaeus Microbial Observatory’s 2012 sampling. The degree of

completeness and purity of bins was estimated based on 36 universal

prokaryotic single-copy genes. The bins considered for

further analyses were those that presented at least 30 of these genes,

no more than two of which in multiple copies. Due to these very stringent

selection criteria, only 89 bins were selected, corresponding to 29

bacterial and 1 archaeal species, each of which was termed BACL, for

BAltic CLuster. With the exception of BACL8, all of these species had

never been genome sequenced before. In a few cases, these were the

first published genomes for bacterial lineages known from 16S

sequencing to be highly abundant in freshwater environments (acIV of

Actinobacteria and LD19 of Verrucomicrobia). The presented genomes

complement the previous 16S data with information on these organisms'

metabolic potentials and may guide future efforts in isolating them.

Mapping sequencing reads from previous metagenomic surveys of lakes

and oceans from around the globe to each of the genomes assembled

gave their spatial distribution, showing that lineages are differentially

abundant in different salinity levels. Actinobacteria, for instance, are

more closely related to freshwater organisms, while Bacteroidetes and

Alphaproteobacteria are closer to their marine counterparts.
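
The bin selection rule described above (at least 30 of the 36 single-copy genes present, no more than two of them in multiple copies) can be written as a short filter. The sketch below is only an illustration: the marker names are placeholders for the actual gene set, the example bins are invented, and `passes_quality_filter` is a hypothetical helper, not part of CONCOCT.

```python
from collections import Counter


def passes_quality_filter(genes_in_bin, min_present=30, max_multicopy=2):
    """genes_in_bin lists every single-copy gene hit found in the bin, with repeats."""
    copies = Counter(genes_in_bin)
    present = len(copies)                                  # proxy for completeness
    multicopy = sum(1 for n in copies.values() if n > 1)   # proxy for contamination
    return present >= min_present and multicopy <= max_multicopy


# Placeholder names standing in for the 36 universal single-copy genes.
markers = [f"scg_{i:02d}" for i in range(1, 37)]

good_bin = markers[:32] + markers[:2]      # 32 genes present, 2 duplicated
incomplete_bin = markers[:25]              # only 25 genes present
contaminated_bin = markers + markers[:5]   # all 36 present, 5 duplicated

for name, genes in [("good", good_bin), ("incomplete", incomplete_bin),
                    ("contaminated", contaminated_bin)]:
    print(name, passes_quality_filter(genes))
```

Only the first bin passes: the second lacks too many marker genes to be considered near-complete, and the third contains too many duplicated markers, suggesting that contigs from more than one genome were binned together.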

Reads from previous surveys were also mapped to the mass of

unbinned contigs that comprised the rest of the LMO community. This

revealed a clear separation of aquatic environments based on salinity.

While it was already known that there is a strong divide between fresh

and marine environments [147], previous studies suggested that

brackish environments were comprised of a mixture of typically fresh

and typically marine OTUs [129, 148]. In this work, we could show that

this community is in fact comprised of brackish water specialists, an

observation later confirmed in a survey of the Caspian Sea [149].

viii. Summary and perspectives

Life on Earth was exclusively microbial for most of its history,

and is still predominantly so. Scientists have been striving to catalogue,

understand and manage this wealth of life for almost 250 years, and yet

have been severely limited by available techniques. Historically, while general

ecology has been based on direct observation combined with

mathematical modelling, breakthroughs in microbial ecology have been

coupled to technological advance. With recent advances in technologies

such as microfluidics and high-throughput DNA sequencing, as well as

the steady growth of computational methods and processing capacity,

the pace of advance in microbial ecology has greatly increased. In

addition to the approaches already discussed in this work,

microbiologists can now use metatranscriptomics, metaproteomics,

single-cell genome sequencing, flow cytometry, cell sorting, high-

throughput image analysis and nanoSIMS (nanoscale mass

spectrometry), together providing a wide array of complementary

techniques for assessing microbial phylogeny and activity in bulk as well

as at the single-cell level.

While the work of mapping and modelling microbial life on Earth

will remain an open field of basic scientific inquiry, it is important to also

consider the potential medical and technological applications of these

studies. From alternative fuel sources to environmental

decontamination, antibiotic resistance to allergy and autoimmunity

prevention and treatment, many of the biggest challenges of our times

may soon find their answers in the myriad strategies microorganisms

adopt to survive, compete, cooperate and thrive on Earth. It is therefore

crucial that the full potential, as well as the caveats and biases, of

established and nascent microbiology approaches are understood. It is

my hope that the work presented in this doctoral thesis will play a

part in this endeavour.

x. References

1. Falkowski P: Ocean Science: The power of plankton. Nature 2012,

483:S17–20.

2. Whitman WB, Coleman DC, Wiebe WJ: Prokaryotes: the unseen majority.

Proc Natl Acad Sci U S A 1998, 95:6578–6583.

3. Falkowski PG, Fenchel T, Delong EF: The microbial engines that drive

Earth’s biogeochemical cycles. Science 2008, 320:1034–1039.

4. Zehr JP, Kudela RM: Nitrogen cycle of the open ocean: from genes to

ecosystems. Ann Rev Mar Sci 2011, 3:197–225.

5. Stocker R: Marine microbes see a sea of gradients. Science 2012,

338:628–633.

6. Azam F, Malfatti F: Microbial structuring of marine ecosystems. Nat Rev

Microbiol 2007, 5:782–791.

7. Glibert PM, Heil CA, Hollander D, Revilla M, Hoare A, Alexander J, Murasko

S: Evidence for dissolved organic nitrogen and phosphorus uptake

during a cyanobacterial bloom in Florida Bay. Mar Ecol Prog Ser 2004,

280:73–83.

8. Conley DJ, Paerl HW, Howarth RW, Boesch DF, Seitzinger SP, Havens KE,

Lancelot C, Likens GE: Ecology. Controlling eutrophication: nitrogen and

phosphorus. Science 2009, 323:1014–1015.

9. Casini M, Hjelm J, Molinero J-C, Lövgren J, Cardinale M, Bartolino V, Belgrano

A, Kornilovs G: Trophic cascades promote threshold-like shifts in pelagic

marine ecosystems. Proc Natl Acad Sci U S A 2009, 106:197–202.

10. Leppäranta M, Myrberg K: Physical Oceanography of the Baltic Sea.

Chichester: Praxis Publishing Ltd; 2009.

11. Elmgren R, Blenckner T, Andersson A: Baltic Sea management:

Successes and failures. Ambio 2015, 44 Suppl 3:335–344.

12. Carstensen J, Andersen JH, Gustafsson BG, Conley DJ: Deoxygenation of

the Baltic Sea during the last century. Proc Natl Acad Sci U S A 2014,

111:5628–5633.

13. Rutgersson A, Jaagus J, Schenk F, Stendel M: Observed changes and

variability of atmospheric parameters in the Baltic Sea region during the

last 200 years. Clim Res 2014, 61:177–190.

14. Staley JT, Konopka A: Measurement of in Situ Activities of Nonphotosynthetic

Microorganisms in Aquatic and Terrestrial Habitats. Annu Rev Microbiol

1985, 39:321–346.

15. Lagier J-C, Edouard S, Pagnier I, Mediannikov O, Drancourt M, Raoult D:

Current and past strategies for bacterial culture in clinical microbiology.

Clin Microbiol Rev 2015, 28:208–236.

16. Stewart EJ: Growing unculturable bacteria. J Bacteriol 2012, 194:4151–

4160.

17. Iluz D, Dishon G, Capuzzo E, Meeder E, Astoreca R, Montecino V, Znachor P,

Ediger D, Marra J: Short-term variability in primary productivity during a

wind-driven diatom bloom in the Gulf of Eilat (Aqaba). Aquat Microb Ecol

2009, 56:205–215.

18. Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B,

Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, Field D:

Defining seasonal marine microbial community dynamics. ISME J 2012,

6:298–308.

19. Zengler K, Toledo G, Rappe M, Elkins J, Mathur EJ, Short JM, Keller M:

Cultivating the uncultured. Proc Natl Acad Sci U S A 2002, 99:15681–15686.

20. Morris JJ, Johnson ZI, Szul MJ, Keller M, Zinser ER: Dependence of the

cyanobacterium Prochlorococcus on hydrogen peroxide scavenging

microbes for growth at the ocean’s surface. PLoS One 2011, 6:e16805.

21. Tanaka T, Kawasaki K, Daimon S, Kitagawa W, Yamamoto K, Tamaki H,

Tanaka M, Nakatsu CH, Kamagata Y: A hidden pitfall in the preparation of

agar media undermines microorganism cultivability. Appl Environ

Microbiol 2014, 80:7659–7666.

22. Nye KJ, Fallon D, Gee B, Messer S, Warren RE, Andrews N: A comparison

of blood agar supplemented with NAD with plain blood agar and

chocolated blood agar in the isolation of Streptococcus pneumoniae and

Haemophilus influenzae from sputum. Bacterial Methods Evaluation

Group. J Med Microbiol 1999, 48:1111–1114.

23. D’Onofrio A, Crawford JM, Stewart EJ, Witt K, Gavrish E, Epstein S, Clardy J,

Lewis K: Siderophores from neighboring organisms promote the growth

of uncultured bacteria. Chem Biol 2010, 17:254–264.

24. Aakra A, Utåker JB, Nes IF, Bakken LR: An evaluated improvement of the

extinction dilution method for isolation of ammonia-oxidizing bacteria. J

Microbiol Methods 1999, 39:23–31.

25. Rappé MS, Connon SA, Vergin KL, Giovannoni SJ: Cultivation of the

ubiquitous SAR11 marine bacterioplankton clade. Nature 2002, 418:630–

633.

26. Aoi Y, Kinoshita T, Hata T, Ohta H, Obokata H, Tsuneda S: Hollow-fiber

membrane chamber as a device for in situ environmental cultivation.

Appl Environ Microbiol 2009, 75:3826–3833.

27. Liu W, Kim HJ, Lucchetta EM, Du W, Ismagilov RF: Isolation, incubation,

and parallel functional testing and identification by FISH of rare

microbial single-copy cells from multi-species mixtures using the

combination of chemistrode and stochastic confinement. Lab Chip 2009,

9:2153–2162.

28. Nichols D, Cahoon N, Trakhtenberg EM, Pham L, Mehta A, Belanger A,

Kanigan T, Lewis K, Epstein SS: Use of ichip for high-throughput in situ

cultivation of “uncultivable” microbial species. Appl Environ Microbiol

2010, 76:2445–2450.

29. Sizova MV, Hohmann T, Hazen A, Paster BJ, Halem SR, Murphy CM, Panikov

NS, Epstein SS: New approaches for isolation of previously uncultivated

oral bacteria. Appl Environ Microbiol 2012, 78:194–203.

30. Kaeberlein T, Lewis K, Epstein SS: Isolating “uncultivable”

microorganisms in pure culture in a simulated natural environment.

Science 2002, 296:1127–1129.

31. Tanaka Y, Hanada S, Manome A, Tsuchida T, Kurane R, Nakamura K,

Kamagata Y: Catellibacterium nectariphilum gen. nov., sp. nov., which

requires a diffusible compound from a strain related to the genus

Sphingomonas for vigorous growth. Int J Syst Evol Microbiol 2004, 54(Pt

3):955–959.

32. Morris JJ, Kirkegaard R, Szul MJ, Johnson ZI, Zinser ER: Facilitation of

Robust Growth of Prochlorococcus Colonies and Dilute Liquid Cultures

by “Helper” Heterotrophic Bacteria. Appl Environ Microbiol 2008, 74:4530–

4534.

33. Coltharp C, Xiao J: Superresolution microscopy for microbiology. Cell

Microbiol 2012, 14:1808–1818.

34. Moreira D, López-García P: The molecular ecology of microbial

eukaryotes unveils a hidden world. Trends Microbiol 2002, 10:31–

35. Silva PC: Historical review of attempts to decrease subjectivity in

species identification, with particular regard to algae. Protist 2008,

159:153–161.

36. Woese CR, Fox GE: Phylogenetic structure of the prokaryotic domain:

the primary kingdoms. Proc Natl Acad Sci U S A 1977, 74:5088–5090.

37. Woese CR, Stackebrandt E, Macke TJ, Fox GE: A phylogenetic definition

of the major eubacterial taxa. Syst Appl Microbiol 1985, 6:143–151.

38. Woese CR: Bacterial evolution. Microbiol Rev 1987, 51:221–271.

39. Stahl DA, Lane DJ, Olsen GJ, Pace NR: Characterization of a Yellowstone

hot spring microbial community by 5S rRNA sequences. Appl Environ

Microbiol 1985, 49:1379–1384.

40. Pace NR, Stahl DA, Lane DJ, Olsen GJ: Analyzing natural microbial

populations by rRNA sequences. ASM News 1985, 51:4–12.

41. Muyzer G, de Waal EC, Uitterlinden AG: Profiling of complex microbial

populations by denaturing gradient gel electrophoresis analysis of

polymerase chain reaction-amplified genes coding for 16S rRNA. Appl

Environ Microbiol 1993, 59:695–700.

42. Liu WT, Marsh TL, Cheng H, Forney LJ: Characterization of microbial

diversity by determining terminal restriction fragment length

polymorphisms of genes encoding 16S rRNA. Appl Environ Microbiol 1997,

63:4516–4522.

43. Fisher MM, Triplett EW: Automated approach for ribosomal intergenic

spacer analysis of microbial diversity and its application to freshwater

bacterial communities. Appl Environ Microbiol 1999, 65:4630–4636.

44. Ehrenreich A: DNA microarray technology for the microbiologist: an

overview. Appl Microbiol Biotechnol 2006, 73:255–273.

45. Humbert JF, Quiblier C, Gugger M: Molecular approaches for monitoring

potentially toxic marine and freshwater phytoplankton species. Anal

Bioanal Chem 2010, 397:1723–1732.

46. Ricke SC, Khatiwara A, Kwon YM: Application of microarray analysis of

foodborne Salmonella in poultry production: a review. Poult Sci 2013,

92:2243–2250.

47. Zumla A, Al-Tawfiq JA, Enne VI, Kidd M, Drosten C, Breuer J, Muller MA, Hui

D, Maeurer M, Bates M, Mwaba P, Al-Hakeem R, Gray G, Gautret P, Al-Rabeeah

AA, Memish ZA, Gant V: Rapid point of care diagnostic tests for viral and

bacterial respiratory tract infections--needs, advances, and future

prospects. Lancet Infect Dis 2014, 14:1123–1135.

48. Lehner A, Loy A, Behr T, Gaenge H, Ludwig W, Wagner M, Schleifer K-H:

Oligonucleotide microarray for identification of Enterococcus species.

FEMS Microbiol Lett 2005, 246:133–142.

49. Singh DV, Mohapatra H: Application of DNA-based methods in typing

Vibrio cholerae strains. Future Microbiol 2008, 3:87–96.

50. Narihiro T, Sekiguchi Y: Oligonucleotide primers, probes and molecular

methods for the environmental monitoring of methanogenic archaea.

Microb Biotechnol 2011, 4:585–602.

51. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR,

Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the

underexplored “rare biosphere.” Proc Natl Acad Sci U S A 2006, 103:12115–

12120.

52. Lynch MDJ, Neufeld JD: Ecology and exploration of the rare biosphere.

Nat Rev Microbiol 2015, 13:217–229.

53. Certini G, Campbell CD, Edwards AC: Rock fragments in soil support a

different microbial community from the fine earth. Soil Biol Biochem 2004,

36:1119–1128.

54. Walker AW, Martin JC, Scott P, Parkhill J, Flint HJ, Scott KP: 16S rRNA

gene-based profiling of the human infant gut microbiota is strongly

influenced by sample processing and PCR primer choice. Microbiome

2015, 3:440.

55. Gorzelak MA, Gill SK, Tasnim N, Ahmadi-Vand Z, Jay M, Gibson DL:

Methods for Improving Human Gut Microbiome Data by Reducing

Variability through Sample Processing and Storage of Stool. 2015.

56. Reck M, Tomasch J, Deng Z, Jarek M, Husemann P, Wagner-Döbler I: Stool

metatranscriptomics: A technical guideline for mRNA stabilisation and

isolation. BMC Genomics 2015, 16:804.

57. Weiss S, Amir A, Hyde ER, Metcalf JL, Song SJ, Knight R: Tracking down

the sources of experimental contamination in microbiome studies.

Genome Biol 2014, 15:1704.

58. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P,

Parkhill J, Loman NJ, Walker AW: Reagent and laboratory contamination can

critically impact sequence-based microbiome analyses. BMC Biol 2014,

12:1–12.

59. Lim YW, Haynes M, Furlan M, Robertson CE, Harris JK, Rohwer F:

Purifying the Impure: Sequencing Metagenomes and

Metatranscriptomes from Complex Animal-associated Samples. J Vis Exp

2014.

60. Gong J, Dong J, Liu X, Massana R: Extremely High Copy Numbers and

Polymorphisms of the rDNA Operon Estimated from Single Cell Analysis

of Oligotrich and Peritrich Ciliates. Protist 2013, 164:369–379.

61. Jones SE, Lennon JT: Dormancy contributes to the maintenance of

microbial diversity. Proc Natl Acad Sci U S A 2010, 107:5881–5886.

62. Campbell BJ, Yu L, Heidelberg JF, Kirchman DL: Activity of abundant and

rare bacteria in a coastal ocean. Proc Natl Acad Sci U S A 2011, 108:12776–

12781.

63. Zhang Y, Zhao Z, Dai M, Jiao N, Herndl GJ: Drivers shaping the diversity

and biogeography of total and active bacterial communities in the South

China Sea. Mol Ecol 2014, 23:2260–2274.

64. Cram JA, Chow C-ET, Sachdeva R, Needham DM, Parada AE, Steele JA,

Fuhrman JA: Seasonal and interannual variability of the marine

bacterioplankton community throughout the water column over ten

years. ISME J 2015, 9:563–580.

65. Schirmer M, Ijaz UZ, D’Amore R, Hall N, Sloan WT, Quince C: Insight into

biases and sequencing errors for amplicon sequencing with the Illumina

MiSeq platform. Nucleic Acids Res 2015, 43:e37.

66. Buchan A, LeCleir GR, Gulvik CA, González JM: Master recyclers: features

and functions of bacteria associated with phytoplankton blooms. Nat Rev

Microbiol 2014, 12:686–698.

67. Caron DA, Worden AZ, Countway PD, Demir E, Heidelberg KB: Protists are

microbes too: a perspective. ISME J 2009, 3:4–12.

68. Zuo G, Xu Z, Hao B: Shigella strains are not clones of Escherichia coli

but sister species in the genus Escherichia. Genomics Proteomics

Bioinformatics 2013, 11:61–65.

69. Lindahl BD, Nilsson RH, Tedersoo L, Abarenkov K, Carlsen T, Kjøller R,

Kõljalg U, Pennanen T, Rosendahl S, Stenlid J, Kauserud H: Fungal community

analysis by high-throughput sequencing of amplified markers--a user’s

guide. New Phytol 2013, 199:288–299.

70. Edgar RC: UPARSE: highly accurate OTU sequences from microbial

amplicon reads. Nat Methods 2013, 10:996–998.

71. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB,

Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger

GG, Van Horn DJ, Weber CF: Introducing Mothur: Open-source, platform-

independent, community-supported software for describing and

comparing microbial communities. Appl Environ Microbiol 2009, 75:7537–

7541.

72. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello

EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D,

Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J,

Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J,

Knight R: QIIME allows analysis of high-throughput community

sequencing data. Nat Methods 2010, 7:335–336.

73. Edgar RC: Search and clustering orders of magnitude faster than

BLAST. Bioinformatics 2010, 26:2460–2461.

74. He Y, Caporaso JG, Jiang X-T, Sheng H-F, Huse SM, Rideout JR, Edgar RC,

Kopylova E, Walters WA, Knight R, Zhou H-W: Stability of operational

taxonomic units: an important but neglected property for analyzing

microbial diversity. Microbiome 2015, 3:20.

75. Schmidt TSB, Matias Rodrigues JF, von Mering C: Limits to robustness

and reproducibility in the demarcation of operational taxonomic units.

Environ Microbiol 2015, 17:1689–1706.

76. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ,

Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J: Re-

evaluating prokaryotic species. Nat Rev Microbiol 2005, 3:733–739.

77. Schloss PD: The effects of alignment quality, distance calculation

method, sequence filtering, and region on the analysis of 16S rRNA

gene-based studies. PLoS Comput Biol 2010, 6:e1000844.

78. Fox GE, Wisotzkey JD, Jurtshuk P Jr: How close is close: 16S rRNA

sequence identity may not be sufficient to guarantee species identity. Int

J Syst Bacteriol 1992, 42:166–170.

79. Not F, del Campo J, Balagué V, de Vargas C, Massana R: New insights into

the diversity of marine picoeukaryotes. PLoS One 2009, 4:e7143.

80. Stoeck T, Bass D, Nebel M, Christen R, Jones MDM, Breiner H-W, Richards TA: Multiple

marker parallel tag environmental DNA sequencing reveals a highly

complex eukaryotic community in marine anoxic water. Mol Ecol 2010,

19(Sup. 1):21–31.

81. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward

automatic reconstruction of a highly resolved tree of life. Science 2006,

311:1283–1287.

82. Koeppel AF, Wu M: Surprisingly extensive mixed phylogenetic and

ecological signals among bacterial Operational Taxonomic Units. Nucleic

Acids Res 2013, 41:5175–5188.

83. Schmidt TSB, Matias Rodrigues JF, von Mering C: Ecological consistency

of SSU rRNA-based operational taxonomic units at a global scale. PLoS

Comput Biol 2014, 10:e1003594.

84. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP:

DADA2: High resolution sample inference from amplicon data. bioRxiv

2015:024034.

85. Tikhonov M, Leach RW, Wingreen NS: Interpreting 16S metagenomic

data without clustering to achieve sub-OTU resolution. ISME J 2015, 9:68–

80.

86. Murat Eren A, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML:

Minimum entropy decomposition: Unsupervised oligotyping for sensitive

partitioning of high-throughput marker gene sequences. ISME J 2014,

9:968–979.

87. Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M: Swarm v2: highly-

scalable and high-resolution amplicon clustering. PeerJ 2015, 3:e1420.

88. Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for

rapid assignment of rRNA sequences into the new bacterial taxonomy.

Appl Environ Microbiol 2007, 73:5261–5267.

89. Pruesse E, Peplies J, Glöckner FO: SINA: accurate high-throughput

multiple sequence alignment of ribosomal RNA genes. Bioinformatics

2012, 28:1823–1829.

90. Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud

G, de Vargas C, Decelle J, Del Campo J, Dolan JR, Dunthorn M, Edvardsen B,

Holzmann M, Kooistra WHCF, Lara E, Le Bescot N, Logares R, Mahé F, Massana

R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R,

Stoeck T, Vaulot D, et al.: The Protist Ribosomal Reference database (PR2):

a catalog of unicellular eukaryote small sub-unit rRNA sequences with

curated taxonomy. Nucleic Acids Res 2013, 41(Database issue):D597–604.

91. Hu YOO, Karlson B, Charvet S, Andersson AF: Diversity of Pico- to

Mesoplankton Along the 2000 km Salinity Gradient of the Baltic Sea.

bioRxiv 2015:035485.

92. Matsen FA, Kodner RB, Armbrust EV: pplacer: linear time maximum-

likelihood and Bayesian phylogenetic placement of sequences onto a

fixed reference tree. BMC Bioinformatics 2010, 11:538.

93. Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, Knight R: Microbial

community resemblance methods differ in their ability to detect

biologically relevant patterns. Nat Methods 2010, 7:813–819.

94. Faust K, Raes J: Microbial interactions: from networks to models. Nat

Rev Microbiol 2012, 10:538–550.

95. Martiny JBH, Jones SE, Lennon JT, Martiny AC: Microbiomes in light of

traits: A phylogenetic perspective. Science 2015, 350:aac9323.

96. Lozupone CA, Hamady M, Kelley ST, Knight R: Quantitative and

qualitative beta diversity measures lead to different insights into factors

that structure microbial communities. Appl Environ Microbiol 2007,

73:1576–1585.

97. Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, Collman RG,

Bushman FD, Li H: Associating microbiome composition with

environmental covariates using generalized UniFrac distances.

Bioinformatics 2012, 28:2106–2113.

98. Paliy O, Shankar V: Application of multivariate statistical techniques in

microbial ecology. Mol Ecol 2016.

99. van den Brink PJ, den Besten PJ, bij de Vaate A, ter Braak CJF: Principal

response curves technique for the analysis of multivariate biomonitoring

time series. Environ Monit Assess 2009, 152:271–281.

100. Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, Chaffron S,

Ignacio-Espinosa JC, Roux S, Vincent F, Bittner L, Darzi Y, Wang J, Audic S,

Berline L, Bontempi G, Cabello AM, Coppola L, Cornejo-Castillo FM, d’Ovidio F,

De Meester L, Ferrera I, Garet-Delmas M-J, Guidi L, Lara E, Pesant S, Royo-

Llonch M, Salazar G, Sánchez P, Sebastian M, et al.: Ocean plankton.

Determinants of community structure in the global plankton

interactome. Science 2015, 348:1262073.

101. Chaffron S, Rehrauer H, Pernthaler J, von Mering C: A global network of

coexisting microbes from environmental and whole-genome sequence

data. Genome Res 2010, 20:947–959.

102. Freilich S, Kreimer A, Meilijson I, Gophna U, Sharan R, Ruppin E: The

large-scale organization of the bacterial network of ecological co-

occurrence interactions. Nucleic Acids Res 2010, 38:3857–3868.

103. Shade A, Peter H, Allison SD, Baho D, Berga M, Buergmann H, Huber DH,

Langenheder S, Lennon JT, Martiny JBH, Matulich KL, Schmidt TM, Handelsman

J: Fundamentals of Microbial Community Resistance and Resilience.

Front Microbiol 2012, 3.

104. Ruan Q, Dutta D, Schwalbach MS, Steele JA, Fuhrman JA, Sun F: Local

similarity analysis reveals unique associations among marine

bacterioplankton species and environmental factors. Bioinformatics 2006,

22:2532–2538.

105. Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Kim DY, Chow C-ET,

Sachdeva R, Jones AC, Schwalbach MS, Rose JM, Hewson I, Patel A, Sun F,

Caron DA, Fuhrman JA: Marine bacterial, archaeal and protistan

association networks reveal ecological linkages. ISME J 2011, 5:1414–

1425.

106. Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S:

Annually reoccurring bacterial communities are predictable from ocean

conditions. Proc Natl Acad Sci U S A 2006, 103:13104–13109.

107. David LA, Materna AC, Friedman J, Campos-Baptista MI, Blackburn MC,

Perrotta A, Erdman SE, Alm EJ: Host lifestyle affects human microbiota on

daily timescales. Genome Biol 2014, 15:R89.

108. Kara EL, Hanson PC, Hu YH, Winslow L, McMahon KD: A decade of

seasonal dynamics and co-occurrences within freshwater

bacterioplankton communities from eutrophic Lake Mendota, WI, USA.

ISME J 2013, 7:680–684.

109. Needham DM, Chow C-ET, Cram JA, Sachdeva R, Parada A, Fuhrman JA:

Short-term observations of marine bacterial and viral communities:

patterns, connections and resilience. ISME J 2013, 7:1274–1285.

110. Vergin KL, Done B, Carlson CA, Giovannoni SJ: Spatiotemporal

distributions of rare bacterioplankton populations indicate adaptive

strategies in the oligotrophic ocean. Aquat Microb Ecol 2013, 71:1–13.

111. Lindh MV, Sjöstedt J, Andersson AF, Baltar F, Hugerth LW, Lundin D,

Muthusamy S, Legrand C, Pinhassi J: Disentangling seasonal

bacterioplankton population dynamics by high frequency sampling.

Environ Microbiol 2015:2459–2476.

112. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular

biological access to the chemistry of unknown soil microbes: a new

frontier for natural products. Chem Biol 1998, 5:R245–9.

113. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM,

Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and

metabolism through reconstruction of microbial genomes from the

environment. Nature 2004, 428:37–43.

114. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu

D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW,

Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H,

Pfannkoch C, Rogers Y, Smith HO: Environmental Genome Shotgun

Sequencing of the Sargasso Sea. Science 2004, 304:66–74.

115. Kopylova E, Noé L, Touzet H: SortMeRNA: fast and accurate filtering of

ribosomal RNAs in metatranscriptomic data. Bioinformatics 2012, 28:3211–

3217.

116. Ekblom R, Wolf JBW: A field guide to whole-genome sequencing,

assembly and annotation. Evol Appl 2014, 7:1026–1042.

117. Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT: Individual genome

assembly from complex community short-read metagenomic datasets.

ISME J 2012, 6:898–901.

118. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT: Direct

comparisons of Illumina vs. Roche 454 sequencing technologies on the

same microbial community DNA sample. PLoS One 2012, 7:e30087.

119. Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL: An

ensemble strategy that significantly improves de novo assembly of

microbial genomes from metagenomic next-generation sequencing data.

Nucleic Acids Res 2015, 43:e46.

120. Hugerth L, Larsson J, Alneberg J, Lindh M, Legrand C, Pinhassi J,

Andersson A: Metagenome-assembled genomes uncover a global brackish

microbiome. Genome Biol 2015, 16:279.

121. Peng Y, Leung HCM, Yiu SM, Chin FYL: IDBA-UD: a de novo assembler

for single-cell and metagenomic sequencing data with highly uneven

depth. Bioinformatics 2012, 28:1420–1428.

122. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W: MEGAHIT: an ultra-fast

single-node solution for large and complex metagenomics assembly via

succinct de Bruijn graph. Bioinformatics 2015, 31:1674–1676.

123. Seemann T: Prokka: rapid prokaryotic genome annotation.

Bioinformatics 2014, 30:2068–2069.

124. Segata N, Börnigen D, Morgan XC, Huttenhower C: PhyloPhlAn is a new

method for improved phylogenetic and taxonomic placement of

microbes. Nat Commun 2013, 4:2304.

125. Darling AE, Jospin G, Lowe E, Matsen FA 4th, Bik HM, Eisen JA: PhyloSift:

phylogenetic analysis of genomes and metagenomes. PeerJ 2014, 2:e243.

126. Peabody MA, Van Rossum T, Lo R, Brinkman FSL: Evaluation of shotgun

metagenomics sequence classification methods using in silico and in

vitro simulated communities. BMC Bioinformatics 2015, 16:363.

127. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S,

Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-

Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JC, Li

K, Kravitz S, Heidelberg JF, Utterback T, Rogers Y-H, Falcón LI, Souza V, Bonilla-

Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al.: The Sorcerer II

Global Ocean Sampling expedition: northwest Atlantic through eastern

tropical Pacific. PLoS Biol 2007, 5:DOI: 10.1371/journal.pbio.0050077.

128. Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, Johnson

J, Montgomery R, Ferriera S, Beeson K, Williamson SJ, Tovchigrechko A, Allen

AE, Zeigler LA, Sutton G, Eisenstadt E, Rogers Y, Friedman R, Frazier M, Venter

JC: Genomic and functional adaptation in surface ocean planktonic

prokaryotes. Nature 2010, 468:60–66.

129. Dupont CL, Larsson J, Yooseph S, Ininbergs K, Goll J, Asplund-Samuelsson J,

McCrow JP, Celepli N, Allen LZ, Ekman M, Lucas AJ, Hagström Å, Thiagarajan

M, Brindefalk B, Richter AR, Andersson AF, Tenney A, Lundin D, Tovchigrechko

A, Nylander J, Brami D, Badger JH, Allen AE, Rusch DB, Hoffman J, Norrby E,

Friedman R, Pinhassi J, Venter JC, Bergman B: Functional Tradeoffs Underpin

Salinity-Driven Divergence in Microbial Community Composition. PLoS

One 2014:DOI: 10.1371/journal.pone.0089549.

130. Larsson J, Celepli N, Ininbergs K, Dupont CL, Yooseph S, Bergman B,

Ekman M: Picocyanobacteria containing a novel pigment gene cluster

dominate the brackish water Baltic Sea. ISME J 2014, 8:1892–1903.

131. Brown MV, Lauro FM, DeMaere MZ, Muir L, Wilkins D, Thomas T, Riddle

MJ, Fuhrman JA, Andrews-Pfannkoch C, Hoffman JM, McQuaid JB, Allen A,

Rintoul SR, Cavicchioli R: Global biogeography of SAR11 marine bacteria.

Mol Syst Biol 2012, 8:595.

132. Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T: Novel

phylogenetic studies of genomic sequence fragments derived from

uncultured microbe mixtures in environmental and clinical samples.

DNA Res 2005, 12:281–290.

133. Chatterji S, Yamazaki I, Bai Z, Eisen JA: CompostBin: A DNA

Composition-Based Algorithm for Binning Environmental Shotgun

Reads. In Research in Computational Molecular Biology. Springer Berlin

Heidelberg; 2008:17–28. [Lecture Notes in Computer Science]

134. Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP,

Banfield JF: Community-wide analysis of microbial genome sequence

signatures. Genome Biol 2009, 10:R85.

135. Sharon I, Morowitz MJ, Thomas BC, Costello EK, Relman DA, Banfield JF:

Time series community genomics analysis reveals rapid shifts in

bacterial species, strains, and phage during infant gut colonization.

Genome Res 2012, 23:111–120.

136. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen

PH: Genome sequences of rare, uncultured bacteria obtained by

differential coverage binning of multiple metagenomes. Nat Biotechnol

2013, 31:533–538.

137. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, Wilkins MJ,

Wrighton KC, Williams KH, Banfield JF: Unusual biology across a group

comprising more than 15% of domain Bacteria. Nature 2015, 523:208–211.

138. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L,

Loman NJ, Andersson AF, Quince C: Binning metagenomic contigs by

coverage and composition. Nat Methods 2014, 11:1144–1146.

139. Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW:

GroopM: an automated tool for the recovery of population genomes from

related metagenomes. PeerJ 2014:e603.

140. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S,

Plichta DR, Gautier L, Pedersen AG, Le Chatelier E, Pelletier E, Bonde I, Nielsen

T, Manichanh C, Arumugam M, Batto J, dos Santos M, Blom N, Borruel N,

Burgdorf KS, Boumezbeur F, Casellas F, Doré J, Dworzynski P, Guarner F,

Hansen T, Hildebrand F, Kaas RS, Kennedy S, Kristiansen K, et al.:

Identification and assembly of genomes and genetic elements in complex

metagenomic samples without using reference genomes. Nat Biotechnol

2014, 32:822–828.

141. Cleary B, Brito IL, Huang K, Gevers D, Shea T, Young S, Alm EJ: Detection

of low-abundance bacterial strains in metagenomic datasets by

eigengenome partitioning. Nat Biotechnol 2015, 33:1053–1060.

142. Spang A, Saw JH, Jørgensen SL, Zaremba-Niedzwiedzka K, Martijn J, Lind

AE, van Eijk R, Schleper C, Guy L, Ettema TJG: Complex archaea that bridge

the gap between prokaryotes and eukaryotes. Nature 2015, 521:173–179.

143. Ghai R, Mizuno CM, Picazo A, Camacho A, Rodriguez-Valera F: Key roles

for freshwater Actinobacteria revealed by deep metagenomic

sequencing. Mol Ecol 2014, 23:6073–6090.

144. Mizuno CM, Rodriguez-Valera F, Ghai R: Genomes of planktonic

Acidimicrobiales: widening horizons for marine Actinobacteria by

metagenomics. MBio 2015, 6.

145. Luef B, Frischkorn KR, Wrighton KC, Holman H-YN, Birarda G, Thomas BC,

Singh A, Williams KH, Siegerist CE, Tringe SG, Downing KH, Comolli LR,

Banfield JF: Diverse uncultivated ultra-small bacterial cells in

groundwater. Nat Commun 2015, 6:6372.

146. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ,

Butterfield CN, Hernsdorf AW, Amano Y, Ise K, Suzuki Y, Dudek N, Relman DA,

Finstad KM, Amundson R, Thomas BC, Banfield JF: A new view of the tree of

life. Nature Microbiology 2016:16048.

147. Logares R, Bråte J, Bertilsson S, Clasen JL, Shalchian-Tabrizi K, Rengefors

K: Infrequent marine–freshwater transitions in the microbial world.

Trends Microbiol 2009, 17:414–422.

148. Herlemann DPR, Labrenz M, Jürgens K, Bertilsson S, Waniek JJ, Andersson

AF: Transitions in bacterial communities along the 2000 km salinity

gradient of the Baltic Sea. ISME J 2011, 5:1571–1579.

149. Mehrshad M, Amoozegar MA, Ghai R, Shahzadeh Fazeli SA, Rodriguez-

Valera F: Genome reconstruction from metagenomic datasets reveals

novel microbes in the brackish waters of the Caspian Sea. Appl Environ

Microbiol 2016.

x. Acknowledgements

First of all, my most heartfelt thanks to Anders, who has been a

wonderful supervisor ever since my master’s. I really wasn’t planning on

staying more than 4 months under your wings, but 5 years later, here I

am. I stayed on because of you. You never lost your temper or

discouraged me even when I was losing patience with myself for all

those silly mistakes. I definitely had no interest in the Baltic Sea before

starting our projects together. Sometimes I claim I still don’t, but that’s

probably a lie.

A large round of thank-yous to everyone that has passed through

the Environmental Genomics lab these past 5 years. To Daniel who was

around to answer my first silly questions about Perl and is still around

for equally silly questions about phylogeny placement. To Ino, who was

so much fun to work with and gave whole new levels of meaning to the

word “Riiiiiiight”. To Johannes, who has come to my rescue so many

times with coke and code. To John, who rescued the Baltic Genomes

when we needed him, and ruined Hermann’s for me. To Yue, who made

sure I wasn’t all alone at the lab bench anymore, and who’s been a great

travel companion. To Hugo, who has been an endless source of

collaboration and friendship. To Nelson, who’s always repaid the

simplest of favours with limitless kindness. To Conny, for never

forgetting that protists are microbes too! And to Jürg, Olov and Kajsa,

for good advice, fika and the odd bit of office gossip.

Another important round of thank-yous goes to the Kalmar gang.

Jarone, my reluctant co-supervisor from whom I learn so much each

time we meet. To Markus for the dedicated work in the LMO time-

series and for your endless emails pushing my R-skills a bit further, and

for the tweets that always make me laugh. To Carina for taking over the

time-series and always keeping me in the loop, and also for the great

times we’ve had at conferences. You’ve also convinced certain people to

attend SAME in Uppsala, and for this I’ll always be deeply grateful. And

to Åke, who claims to be retired but is always around to discuss science

in his soothing tone and make everyone see the bigger picture.

For the friends and colleagues in Alfa 3, Linda, Elin, Amanda,

Pelin, Lumi, Jimmie, Guille, Phil, Erik, Mickan, Fredrik, Sanja,

Anders, Britta, Stefania, Kostas, Kim, Maja, Annelie, Nemo, Ema,

Simon, Carlos, Francesco, Mau, you are all too many to mention, but

you’ve been great company and a pleasure to share a lab with. Extra

special thanks to Kicki who not only keeps us all on our toes to make

sure the lab works, but also found time, time and again, to train me in new

protocols when I needed it. Thank you, Peter, for the last minute help!

And thanks to Valle, who could always be drawn to the lunch room with

a simple “I need to talk math” or “I need to whine”, and is still around

via email to do the same.

For the Alfa 6 crowd, Ani, Walter, Oxana, Per, Axel, Özge.

Thanks for the cake and the music, the political clashes and the history

discussions.

Arne, Erik Lindahl, Afshin, Lars Arvestad, Thijs Ettema: you

have all at some point said something which ended up in my

words_to_never_forget.txt. The fact that you yourselves probably don’t

even know what it was just makes it all the more meaningful.

Tere, Hari, Jojo, Mille, Sus, Johanna: you took the long haul.

Thanks for the tjejfika, the MF parties, the birthdays, the house-

warmings, the football, the movie nights and all those unicorns! I'm

moving away, but don't think I'm leaving you.

To PromenadorQuestern och med Baletten Paletten for

being a constant reminder that happiness is an active choice. YESSSS!

Get hyped! You are all so lovely and I don't have room for so many names,

but everyone can have hugs!

And finally, thanks to my parents, my sister, my grandparents, my

priminhos and my migas, who’ve kept me company from a distance and

made sure I always knew I came from somewhere and had somewhere

to return to.
