microbiome data are compositional - usda ars...triplet femur, tibia, humerus, and seeks the...

MICROBIOME DATA ARE COMPOSITIONAL

Adam R. Rivers, Ph.D.September 2019 https://tinyecology.com

What this means and how to deal with it

Agricultural Research Service

SEQUENCING IS A SAMPLING PROCESS• Sequencing samples n DNA molecules

from a pool of n molecules. n

EVERY TAXA(GENE) AFFECTS ALL OTHERS

• In quantitative metagenomics reads are assigned to taxa (or genes).

• The total number of taxa and the reads assigned to each taxa affect all other taxa.

Gloor et al. Microbiome Datasets Are Compositional

which must be filled. Returning to our tiger and ladybuganalogy, the migration of ladybugs into an area containinga fixed number of slots that are already filled must displaceeither tigers or ladybugs from the occupied slots. This analogyextents, without restriction, to any number of taxa, and to anyfixed capacity instrument (Aitchison, 1986; Lovell et al., 2011;Friedman and Alm, 2012; Fernandes et al., 2013, 2014; Lovellet al., 2015; Mandal et al., 2015; Gloor et al., 2016a,b; Gloor andReid, 2016; Tsilimigras and Fodor, 2016). Thus, the total readcount observed in a HTS run is a fixed-size, random sampleof the relative abundance of the molecules in the underlyingecosystem. Moreover, the count can not be related to the absolutenumber of molecules in the input sample as shown in Figure 1.This is implicitly acknowledged when microbiome datasets areconverted to relative abundance values, or normalized counts,or are rarefied (McMurdie and Holmes, 2014; Weiss et al., 2017)prior to analysis. Thus the number of reads obtained is irrelevant,and contains only information on the precision of the estimate(Fernandes et al., 2013). Data that are naturally described asproportions or probabilities, or with a constant or irrelevantsum, are referred to as compositional data. Compositional datacontains information about the relationships between the parts(Aitchison, 1986; Pawlowsky-Glahn et al., 2015).

Data about a microbiome collected by high throughputsequencing are often examined under the assumption thatsequencing is, in some way, counting the number of moleculesassociated with the bacteria in the population, as illustrated bythe top barplot in Figure 1B. We can see the difference betweencounts and compositions by comparing the data for the actualcounts for three samples in the top barplot with their proportionsin the bottom barplot. Note, that samples 2 and 3 in Figure 1Bhave the same proportional abundances even though they havedifferent absolute counts prior to sequencing. The difference inapparent direction of change is shown in Figure 1C and we canobserve that the relationship between absolute abundance in theenvironment and the relative abundance after sequencing is notpredictable.

2. PROBLEMS WITH CURRENT METHODSOF ANALYSIS

We will briefly outline the problems that arise whencompositional data are examined using a non-compositionalparadigm, stepping through the usual stages of analysis shownin Figure 2. All these issues have been extensively reviewedand debated in both the older and the more recent literature infields as diverse as economics, geology and ecology. Thus, ratherthan present an exhaustive explanation of the problems, we willoutline the major issue and cite a few useful resources.

It is very difficult to collect exactly the same numberof sequence reads for each sample. This can be because ofdifferences in platform (e.g., MiSeq vs. HiSeq) or because oftechnical difficulties in loading the same molar amounts of thesequencing libraries on the instrument, or because of randomvariation. The total number of counts observed (often referred toas read depth) is a major confounder for distance or dissimilarity

FIGURE 1 | High-throughput sequencing data are compositional. (A)

illustrates that the data observed after sequencing a set of nucleic acids from a

bacterial population cannot inform on the absolute abundance of molecules.

The number of counts in a high throughput sequencing (HTS) dataset reflect

the proportion of counts per feature (OTU, gene, etc.) per sample, multiplied by

the sequencing depth. Therefore, only the relative abundances are available.

The bar plots in (B) show the difference between the count of molecules and

the proportion of molecules for two features, A (red) and B (gray) in three

samples. The top bar graphs show the total counts for three samples, and the

height of the color illustrates the total count of the feature. When the three

samples are sequenced we lose the absolute count information and only have

relative abundances, proportions, or “normalized counts” as shown in the

bottom bar graph. Note that features A and B in samples 2 and 3 appear with

the same relative abundances, even though the counts in the environment are

different. The table below in (C) shows real and perceived changes for each

sample if we transition from one sample to another.

calculations for multivariate ordinations derived from thesedistances (McMurdie and Holmes, 2014). Initial attempts in themicrobiome field used “rarefaction” or subsampling of the readcounts of each sample to a common read depth to attempt tocorrect this problem (Lozupone et al., 2011; Wong et al., 2016).The use of subsampling has been questioned since it results ina loss of information and precision (McMurdie and Holmes,2014), and the practice of count normalization from the RNA-seq field has been advocated instead. There are a number ofcount normalization methods used and two, the trimmed meanof M values (TMM) (Robinson and Oshlack, 2010), and themedian method (Anders and Huber, 2010) are similar to a log-ratio transformations, but are less suitable in highly asymmetricalor sparse datasets (Fernandes et al., 2013; Gloor et al., 2016a).These transformations are further undesirable since the numberof counts observed by the instrument, by design, can not containany information on the actual number of molecules in theenvironment, and because the investigator naturally interpretsthe results as counts instead of log-ratios.

One of the first analysis steps in a traditional analysis,following rarefaction or count normalization, is the calculation ofa distance or dissimilarity (DD) matrix from the data that is used

Frontiers in Microbiology | www.frontiersin.org 2 November 2017 | Volume 8 | Article 2224

Gloor et al. 2017

Example 1

EVERY TAXA(GENE) AFFECTS ALL OTHERS

• In quantitative metagenomics reads are assigned to taxa (or genes).

• The total number of taxa and the reads assigned to each taxa affect all other taxa.

Example 2

Morton et al. 2016

https://doi.org/10.3389/fmicb.2017.02224https://doi.org/10.1128/mSystems.00162-16.

EARLY WAYS OF DEALING WITH COMPOSITIONS

Method Issues

Rarefaction loss of information and precision

RNASeq TMM and median ratio

Similar to CLR but bad for asymmetrical data and sparse data

RPKM largely abandoned since it corrects for the wrong thing

All the above Give outputs in “counts”, the wrong mindset

NEGATIVE CORRELATION BIAS• We know that ratio data (including data

normalized by a sum) suffer from negative correlation bias.

• Because a composition must sum to one, some creations must be negative and some must be positive even if no such correlation exists. 490 Prof. Karl Pearson.

exist correlation between u and v. Thus a real danger arises when a statistical biologist attribntes the correlation between two functions like uand vto organic relationship. The particular case that iflikely to occur is when uand v are indices with the same denominator for the correlation of indices seems at first sight a very plausible measure of organic correlation.

The difficulty and dang’er which arise from the use of indices was brought home to me recently in an endeavour to deal with a consider able series of personal equation data. In this case it was convenient to divide the errors made by three observers in estimating a variable quantity by the actual value of the quantity. As a result there appeared a high degree of correlation between three series of abso lutely independent judgments. It was some time before I realised that this correlation had nothing to do with the manner of judging, but was a special case of the above principle due to the use of indices.

A further illustration is of the following kind. Select three num bers within certain ranges at random, say x, y, z, these will be pair and pair uncorrelated. Form the proper fractions xjy and zjy for each triplet, and correlation will be found between these indices.

The application of this idea to biology seems of considerable importance. For example, a quantity of bones are taken from an ossuarium, and are put together in groups, which are asserted to be those of individual skeletons. To test this a biologist takes the triplet femur, tibia, humerus, and seeks the correlation between the indices femur / humerus and tibia / humerus. He might reasonably conclude that this correlation marked organic relationship, and believe that the bones had really been put together substantially in their individual grouping. As a matter of fact, since the coefficients of variation for femur, tibia, and humerus are approximately equal, there would be, as we shall see later, a correlation of about 0'4 to 0'*> between these indices had the bones been sorted absolutely at random. I term this a spurious organic correlation, or simply a spurious correlation. I understand by this phrase the amount of correlation which would still exist between the indices, were the absolute lengths on which they depend disti’ibuted at random.

It has hitherto been usual to measure the organic correlation of the organs of shrimps, prawns, crabs, Ac., by the correlation of indices in which the denominator represents the total body length or total cara pace length. Now suppose a table formed of the absolute lengths and the indices of, say, some thousand individuals. Let an “ imp ” (allied to the Maxwellian demon) redistribute the indices at random, they would then exhibit no correlation; if the corresponding absolute lengths followed along with the indices in the redistribution, they also would exhibit no correlation. Now let us suppose the indices not to have been calculated, but the imp to redistribute the abso-

Pearson, 1896

CORRELATION EXAMPLE

Bacteria195011

Bacteria1B

879362

Bacteria2

62763

Bacteria3

236274

Bacteria4

1219595

Unidentified

1205361373771166881655192528910

19747965891046217575115306759020020110291602928417

7517998807477188167542191681212124802802416994

5420456493956034465749962164249099898595114189388

302106681620188674392966865704480262980378119099

20845431983084106192945143280248776245182224279220214416412829

Correlated 0.998Randomly sampled from log normal distribution logmean = 10 (13 for unidentified) log SD=1

CORRELATION EXAMPLECorrelated 0.998

Randomly sampled from log normal distribution logmean = 10 (13 for unidentified) log SD=1

Divide each sample by the sumBacteria10.00418

Bacteria1B

0.17

Bacteria2

0.00857

Bacteria3

0.00756

Bacteria4

0.182

Unidentified

0.01360.0420.02870.0570.0494

0.008690.1870.01430.005630.1720.008540.06120.02720.05520.0555

0.03310.01910.01020.006020.01120.02160.03710.006110.09660.0332

0.02380.1090.1310.01430.07440.01850.2780.2430.03930.0183

0.01330.1290.2760.02380.1440.06420.02450.007340.0130.0373

0.9170.3840.5610.9430.4170.8740.5570.6880.7390.806

CORRELATION EXAMPLE

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Bacteria1

Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

0.99

−0.04

0.04

0.28

−0.78

−0.03

0.1

0.28

−0.82

−0.18

−0.37

0.14

0.04

−0.52 −0.58

All samples Drop Unidentified

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Bacteria1

Bacteria1B

Bacteria2

Bacteria3

Bacteria4

0.95

−0.12

−0.53

−0.3

−0.05

−0.46

−0.42

−0.3

−0.27 −0.41

CORRELATION EXAMPLECorrelation of Aitchison composition Drop Unidentified

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Bacteria1

Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

0.99

−0.04

0.04

0.28

−0.78

−0.03

0.1

0.28

−0.82

−0.18

−0.37

0.14

0.04

−0.52 −0.58−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Bacteria1

Bacteria1B

Bacteria2

Bacteria3

Bacteria4

0.99

−0.04

0.04

0.28

−0.03

0.1

0.28

−0.18

−0.37 0.04

DISTANCE AND DISSIMILARITY METRICSTypical metrics• Unifrac• Bray-Curtis• Jensen-Shannon

Bray-Curtis Aitchison (Correct metric)

Better metrics• Unifrac -> PhILR• Others -> Aitchison

Bacteria3

Bacteria1B

Bacteria1

Bacteria4

Unidentified

Bacteria2

Bacteria3

Bacteria1B

Bacteria1

Bacteria4

Unidentified

Bacteria2

Unidentified

Bacteria2

Bacteria3

Bacteria4

Bacteria1

Bacteria1B

Unidentified

Bacteria2

Bacteria3

Bacteria4

Bacteria1

Bacteria1B

https://doi.org/10.7554/eLife.21887

PCA ORDINATION ISSUES

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Comp.1

Com

p.2

1

2

3

4

5

6

78

9

10

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Bacteria1Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Comp.1

Com

p.2

1

2

3

4

5

6

78

9

10

−3 −2 −1 0 1 2

−3−2

−10

12

Bacteria1Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

Untransformed Transformed

UNDERSTANDING COMPOSITIONS

SIMPLEX AS A NUMBER SPACE• The sum does not matter

• The components are a proportion of some whole

• The number space is Simplex

• Examples:• Composition of soil [silt, sand, clay]• NGS Sequencing data• Time use surveys

Ternary soil plot

THE SIMPLEX NUMBER SPACE

Key arithmetic operations can be defined:• Closure

• Perturbation

• Powering

Key requirements for simplex operations• Sub-compositionally coherent• Order invariant• Scale invariant

TRANSFORMATIONS FROM SIMPLEXThe simplex can be transformed into real space using Log ratio transformations

Additive Log Ratio (ALR)

Centered Log ratio (CLR)

Isometric Log Ratio (ILR)

Where geometric mean is

Working in transformed space allows us to use many

existing statistical tools

ISOMETRIC LOG RATIOUseful because its isometric

or

DC to Tel Aviv is always 6400 km

log-ratio analysis algebraic-geometric structure C-coordinates balances

circles and ellipses

-2

-1

0

1

2

3

4

-2 -1 0 1 2 3

in S3 coordinate representation

log-ratio analysis algebraic-geometric structure C-coordinates balances

parallel lines

x2

x1

x3

n

-4

-2

0

2

4

-4 -2 0 2 4

in S3 coordinate representation

Useful because points are now independent

ISOMETRIC LOG RATIO

So what is Phi?Binary partitionsCLR X

ISOMETRIC LOG RATIO

So what is Phi?Binary partitions1 -1CLR X

ISOMETRIC LOG RATIO

So what is Phi?Binary partitions1 -1

1 -1

CLR X

ISOMETRIC LOG RATIO

So what is Phi?Binary partitions1 -1

1 -1

CLR X

Dot Product

ILR

ISOMETRIC LOG RATIO

20

40

60

80

100

20

40

60

80

100

20 40 60 80 100

Bacteria1BBacteria1

Bacteria2

Bacteria1B

Bacteria1 Bacteria2

GENERAL CODA SOFTWARE

Microbiome specific software covered later

CoDaPack• From the “Official”

CoDa group in Girona

• Compositions• Compositional• zCompositions• coda.base

Scikit bio • skbio.stats.composition

A COMPOSITIONAL WORKFLOW FOR METAGENOMICS

Normalization Distance Ordination Multivariate comparison CorrelationDifferential abundance

Old Rarefaction DESeqBray-Curtis

UnifracPcoA

(abundance)perMANOVA

ANOSIMPearson

SpearmanEDGER DESeq

CompositionalCLR ALR ILR

Aitchisonphilr PCA (variance)

perMANOVA ANOSIM

SparCCSpiecEasi

propr ANCOM ALDEx2

THERE IS ANOTHER WAY

From Gloor 2017

WORKED EXAMPLEBacteria195011

Bacteria1B

879362

Bacteria2

62763

Bacteria3

236274

Bacteria4

1219595

Unidentified

1205361373771166881655192528910

19747965891046217575115306759020020110291602928417

7517998807477188167542191681212124802802416994

5420456493956034465749962164249099898595114189388

302106681620188674392966865704480262980378119099

20845431983084106192945143280248776245182224279220214416412829

Preprocessing1. Close the dataset (all reads)2. Create a sub-composition

INPUTE ZEROS

So many reasons for a zero…1. Structural zeros, a real zero2. Missing values3. Below the limit of detection

So many ways to handle zeros…1. pseudo-counts ( add one to everything)2. Multiplicative replacement3. Mean value replacement4. Expectation Maximization

It is often helpful to drop components (taxa) that appear in

PCA ORDINATION AND DISTANCEQiime 2 package DeicodeDeicode takes a count matrix with zeros• Calculates CLR on nonzero values• Calculates Aitchson distance• Performs robust PCA using Singular

value decomposition

In RWith compositions package computecar using “clr(acomp())”Distance using “dist, method=“aitchison” function Compute pca using “pca” function

DIVERSITY INDICES AND CURVES

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Comp.1

Com

p.2

1

2

3

4

5

6

78

9

10

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Bacteria1Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

−0.6 −0.4 −0.2 0.0 0.2 0.4

−0.6

−0.4

−0.2

0.0

0.2

0.4

Comp.1

Com

p.2

1

2

3

4

5

6

78

9

10

−3 −2 −1 0 1 2

−3−2

−10

12

Bacteria1Bacteria1B

Bacteria2

Bacteria3

Bacteria4

Unidentified

DIVERSITY INDICES AND DISTANCEBray-Curtis Aitchison

Bacteria3

Bacteria1B

Bacteria1

Bacteria4

Unidentified

Bacteria2

Bacteria3

Bacteria1B

Bacteria1

Bacteria4

Unidentified

Bacteria2Unidentified

Bacteria2

Bacteria3

Bacteria4

Bacteria1

Bacteria1B

Unidentified

Bacteria2

Bacteria3

Bacteria4

Bacteria1

Bacteria1B

ANCOM (Python)• One of the first• Does pairwise tests in

CLR space• In Qiime 2

STATISTICAL INFERENCE METHODSALDEx2 (R)• Best general use• Dirichlet used to

model variance • Walsh and KS test• effect sizes & p-values

MirKat-Coda (R)• Kernel machine

regression• Identifies microbial

subsets that explain multivariate data

MixOmics• A suite of tools • Partial least squares

discriminate analysis

Gneiss• Balance trees

ANCOM (Python)• One of the first• Does pairwise tests in

CLR space• In Qiime 2

STATISTICAL INFERENCE METHODSALDEx2 (R)• Best general use• Dirichlet used to

model variance • Walsh and KS test• effect sizes & p-values

MirKat-Coda (R)• Kernel machine

regression• Identifies microbial

subsets that explain multivariate data

MixOmics• A suite of tools • Partial least squares

discriminate analysis

Gneiss• Balance trees

DIY!Do your own linear modeling on CLR

data

ALDEX2

ALDEx2

ylab="Difference")aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion",

ylab="Difference")

−2 0 2 4 6 8

05

1015

Log−ratio abundance

Diff

eren

ce

1 2 3 4

05

1015

DispersionD

iffer

ence

Figure 1: MA and Effect plots of ALDEx2 output

The left panel is an Bland-Altman or MA plot that shows the relationship between abundance and Di�er-

ence. The right panel is an e�ect plot that shows the relationship between Di�erence and Dispersion. In

both plots features that are not significant are in grey or black. Features that are statistically significant are

in red. The Log-ratio abundance axis is the clr value for the feature.

5 Using ALDEx2 modules

The modular approach exposes the underlying intermediate data so that users can generatetheir own tests. The simple approach outlined above just calls aldex.clr, aldex.ttest,aldex.effect in turn and then merges the data into one object. We will show these modulesin turn, and then examine additional modules.

5.1 The aldex.clr module

The workflow for the modular approach first generates random instances of the centredlog-ratio transformed values. There are three inputs: counts table, a vector of conditions, thenumber of Monte-Carlo instances, a string indicating if iqlr, zero or all feature are used as thedenominator is required, and level of verbosity (TRUE or FALSE). We recommend 128 ormore mc.samples for the t-test, 1000 for a rigorous e�ect size calculation, and at least 16 forANOVA.

This operation is fast.

x

RESOURCESAitchison, J. (John), 2003. A concise guide to compositional data analysis. Lecture notes. url: http://www.compositionaldata.com/material/others/Ait2003_A_concise_guide_to_compositional_data_analysis.pdfAitchison, J. (John), 2003. The statistical analysis of compositional data. Blackburn Press.Calle, M.L., 2019. Statistical Analysis of Metagenomics Data. Genomics Inform. 17, e6. doi:10.5808/

GI.2019.17.1.e6Fernandes, A.D., Vu, M.T.H.Q., Edward, L.-M., Macklaim, J.M., Gloor, G.B., 2018. A reproducible effect

size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets.

Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., Egozcue, J.J., 2017. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 2224. doi:10.3389/fmicb.2017.02224

Gloor, G.B., Reid, G., 2016. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 62, 692–703. doi:10.1139/cjm-2015-0821

Macklaim, J.M., Gloor, G.B., 2018. From RNA-seq to Biological Inference: Using Compositional Data Analysis in Meta-Transcriptomics. Humana Press, New York, NY, pp. 193–213. doi:10.1007/978-1-4939-8728-3_13

Mandal, S., Van Treuren, W., White, R.A., Eggesbø, M., Knight, R., Peddada, S.D., 2015. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663. doi:10.3402/mehd.v26.27663

Pawlowsky-Glahn, V., Buccianti, A., 2011. Compositional data analysis : theory and applications. Wiley.Rivera-Pinto, J., Egozcue, J.J., Pawlowsky–Glahn, V., Paredes, R., Noguera-Julian, M., Calle, M.L., 2017.

Balances: a new perspective for microbiome analysis. bioRxiv 219386. doi:10.1101/219386

MINI REVIEWpublished: 15 November 2017

doi: 10.3389/fmicb.2017.02224

Frontiers in Microbiology | www.frontiersin.org 1 November 2017 | Volume 8 | Article 2224

Edited by:

Jessica Galloway-Pena,

University of Texas MD Anderson

Cancer Center, United States

Reviewed by:

Ionas Erb,

Centre for Genomic Regulation, Spain

Jennifer Stearns,

McMaster University, Canada

*Correspondence:

Gregory B. Gloor

[email protected]

Specialty section:

This article was submitted to

Systems Microbiology,

a section of the journal

Frontiers in Microbiology

Received: 10 July 2017

Accepted: 30 October 2017

Published: 15 November 2017

Citation:

Gloor GB, Macklaim JM,

Pawlowsky-Glahn V and Egozcue JJ

(2017) Microbiome Datasets Are

Compositional: And This Is Not

Optional. Front. Microbiol. 8:2224.

doi: 10.3389/fmicb.2017.02224

Microbiome Datasets AreCompositional: And This Is NotOptionalGregory B. Gloor 1*, Jean M. Macklaim1, Vera Pawlowsky-Glahn2 and Juan J. Egozcue3

1 Department of Biochemistry, University of Western Ontario, London, ON, Canada, 2 Departments of Computer Science,

Applied Mathematics, and Statistics, Universitat de Girona, Girona, Spain, 3 Department of Applied Mathematics, Universitat

Politècnica de Catalunya, Barcelona, Spain

Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers,

metagenomes or metatranscriptomes are commonplace and being used to study human

disease states, ecological differences between sites, and the built environment. There is

increasing awareness that microbiome datasets generated by HTS are compositional

because they have an arbitrary total imposed by the instrument. However, many

investigators are either unaware of this or assume specific properties of the compositional

data. The purpose of this review is to alert investigators to the dangers inherent in

ignoring the compositional nature of the data, and point out that HTS datasets derived

from microbiome studies can and should be treated as compositions at all stages of

analysis. We briefly introduce compositional data, illustrate the pathologies that occur

when compositional data are analyzed inappropriately, and finally give guidance and point

to resources and examples for the analysis of microbiome datasets using compositional

data analysis.

Keywords: microbiota, compositional data, high-throughput sequencing, correlation, Bayesian estimation, count

normalization, relative abundance

1. INTRODUCTION

The collection and analysis of microbiome datasets presents many challenges in the study design,sample collection, storage, and sequencing phases, and these have been well reviewed (Robinsonet al., 2016). Many methods for the analysis of microbiome datasets assume that sequencingdata are equivalent to ecological data where the counts of reads assigned to organisms are oftennormalized to a constant area or volume. Methods applied include count-based strategies such asBray-Curtis dissimilarity, zero-inflated Gaussianmodels and negative binomial models (McMurdieand Holmes, 2014; Weiss et al., 2017).

In an ecological study it is possible for many different species to co-exist, and their absoluteabundance may be important. For example, in an area containing only tigers, it is importantto know if the population size is sufficient to maintain needed genetic diversity for long-termsurvival (Shaffer, 1981). However, the abundance of one species may not influence the abundanceof another; the area may contain both tigers and ladybugs, and the migration of several ladybugsinto the area would not be expected to affect the number of tigers.

The assumption of true independence can not hold in high-throughput sequencing (HTS)experiments because the sequencing instruments can deliver reads only up to the capacity of theinstrument. Thus, it is proper to think of these instruments as containing a fixed number of slots

THANKS

CodaCourse Staff University of Girona (UdG)

Agricultural Research Service

microbiome data are compositional - usda ars...triplet femur, tibia, humerus, and seeks the...

Documents