-
MICROBIOME DATA ARE COMPOSITIONAL
Adam R. Rivers, Ph.D.
September 2019
https://tinyecology.com
What this means and how to deal with it
Agricultural Research Service
-
SEQUENCING IS A SAMPLING PROCESS
• Sequencing samples n DNA molecules from a much larger pool of N molecules (n << N).
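The sampling view can be illustrated with a minimal sketch, assuming reads are drawn at random and with replacement from the pool (a fair approximation only when the pool is much larger than the number of reads). The pool sizes, the taxon names "A" and "B", and the helper `sequence` are hypothetical and chosen only for illustration.

```python
import random


def sequence(pool_counts, n_reads, seed=0):
    """Draw n_reads molecules at random from a pool given as absolute counts.

    Sampling is done with replacement, which is a reasonable approximation
    only when n_reads is much smaller than the pool size.
    """
    rng = random.Random(seed)
    taxa = list(pool_counts)
    weights = [pool_counts[t] for t in taxa]
    reads = rng.choices(taxa, weights=weights, k=n_reads)
    return {t: reads.count(t) for t in taxa}


# Hypothetical pools: same proportions (25% A, 75% B), 100-fold different size.
small_pool = {"A": 1_000_000, "B": 3_000_000}
large_pool = {"A": 100_000_000, "B": 300_000_000}

reads_small = sequence(small_pool, n_reads=10_000, seed=1)
reads_large = sequence(large_pool, n_reads=10_000, seed=2)

print(reads_small)
print(reads_large)
```

Both runs return about 25% A and 75% B. The read counts carry the proportions of the pool but say nothing about its absolute size.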
-
EVERY TAXON (GENE) AFFECTS ALL OTHERS
• In quantitative metagenomics, reads are assigned to taxa (or genes).
• The total number of taxa and the reads assigned to each taxon affect all other taxa.
Gloor et al. Microbiome Datasets Are Compositional
which must be filled. Returning to our tiger and ladybug analogy, the migration of ladybugs into an area containing a fixed number of slots that are already filled must displace either tigers or ladybugs from the occupied slots. This analogy extends, without restriction, to any number of taxa, and to any fixed capacity instrument (Aitchison, 1986; Lovell et al., 2011; Friedman and Alm, 2012; Fernandes et al., 2013, 2014; Lovell et al., 2015; Mandal et al., 2015; Gloor et al., 2016a,b; Gloor and Reid, 2016; Tsilimigras and Fodor, 2016). Thus, the total read count observed in a HTS run is a fixed-size, random sample of the relative abundance of the molecules in the underlying ecosystem. Moreover, the count can not be related to the absolute number of molecules in the input sample as shown in Figure 1. This is implicitly acknowledged when microbiome datasets are converted to relative abundance values, or normalized counts, or are rarefied (McMurdie and Holmes, 2014; Weiss et al., 2017) prior to analysis. Thus the number of reads obtained is irrelevant, and contains only information on the precision of the estimate (Fernandes et al., 2013). Data that are naturally described as proportions or probabilities, or with a constant or irrelevant sum, are referred to as compositional data. Compositional data contains information about the relationships between the parts (Aitchison, 1986; Pawlowsky-Glahn et al., 2015).
Data about a microbiome collected by high throughput sequencing are often examined under the assumption that sequencing is, in some way, counting the number of molecules associated with the bacteria in the population, as illustrated by the top barplot in Figure 1B. We can see the difference between counts and compositions by comparing the data for the actual counts for three samples in the top barplot with their proportions in the bottom barplot. Note that samples 2 and 3 in Figure 1B have the same proportional abundances even though they have different absolute counts prior to sequencing. The difference in apparent direction of change is shown in Figure 1C and we can observe that the relationship between absolute abundance in the environment and the relative abundance after sequencing is not predictable.
2. PROBLEMS WITH CURRENT METHODS OF ANALYSIS
We will briefly outline the problems that arise when compositional data are examined using a non-compositional paradigm, stepping through the usual stages of analysis shown in Figure 2. All these issues have been extensively reviewed and debated in both the older and the more recent literature in fields as diverse as economics, geology and ecology. Thus, rather than present an exhaustive explanation of the problems, we will outline the major issue and cite a few useful resources.
It is very difficult to collect exactly the same number of sequence reads for each sample. This can be because of differences in platform (e.g., MiSeq vs. HiSeq) or because of technical difficulties in loading the same molar amounts of the sequencing libraries on the instrument, or because of random variation. The total number of counts observed (often referred to as read depth) is a major confounder for distance or dissimilarity
FIGURE 1 | High-throughput sequencing data are compositional. (A)
illustrates that the data observed after sequencing a set of nucleic acids from a
bacterial population cannot inform on the absolute abundance of molecules.
The number of counts in a high throughput sequencing (HTS) dataset reflect
the proportion of counts per feature (OTU, gene, etc.) per sample, multiplied by
the sequencing depth. Therefore, only the relative abundances are available.
The bar plots in (B) show the difference between the count of molecules and
the proportion of molecules for two features, A (red) and B (gray) in three
samples. The top bar graphs show the total counts for three samples, and the
height of the color illustrates the total count of the feature. When the three
samples are sequenced we lose the absolute count information and only have
relative abundances, proportions, or “normalized counts” as shown in the
bottom bar graph. Note that features A and B in samples 2 and 3 appear with
the same relative abundances, even though the counts in the environment are
different. The table below in (C) shows real and perceived changes for each
sample if we transition from one sample to another.
calculations for multivariate ordinations derived from these distances (McMurdie and Holmes, 2014). Initial attempts in the microbiome field used “rarefaction” or subsampling of the read counts of each sample to a common read depth to attempt to correct this problem (Lozupone et al., 2011; Wong et al., 2016). The use of subsampling has been questioned since it results in a loss of information and precision (McMurdie and Holmes, 2014), and the practice of count normalization from the RNA-seq field has been advocated instead. There are a number of count normalization methods used and two, the trimmed mean of M values (TMM) (Robinson and Oshlack, 2010), and the median method (Anders and Huber, 2010) are similar to a log-ratio transformation, but are less suitable in highly asymmetrical or sparse datasets (Fernandes et al., 2013; Gloor et al., 2016a). These transformations are further undesirable since the number of counts observed by the instrument, by design, can not contain any information on the actual number of molecules in the environment, and because the investigator naturally interprets the results as counts instead of log-ratios.
One of the first analysis steps in a traditional analysis, following rarefaction or count normalization, is the calculation of a distance or dissimilarity (DD) matrix from the data that is used
Frontiers in Microbiology | www.frontiersin.org 2 November 2017 | Volume 8 | Article 2224
Gloor et al. 2017
Example 1
-
EVERY TAXON (GENE) AFFECTS ALL OTHERS
• In quantitative metagenomics, reads are assigned to taxa (or genes).
• The total number of taxa and the reads assigned to each taxon affect all other taxa.
Example 2
Morton et al. 2016
https://doi.org/10.3389/fmicb.2017.02224
https://doi.org/10.1128/mSystems.00162-16
-
EARLY WAYS OF DEALING WITH COMPOSITIONS
Method                          Issues
Rarefaction                     Loss of information and precision
RNA-seq TMM and median ratio    Similar to CLR but bad for asymmetrical and sparse data
RPKM                            Largely abandoned since it corrects for the wrong thing
All of the above                Give outputs in “counts”, the wrong mindset
-
NEGATIVE CORRELATION BIAS
• We know that ratio data (including data normalized by a sum) suffer from negative correlation bias.
• Because a composition must sum to one, some correlations must be negative and some must be positive, even if no such correlation exists.
exist correlation between u and v. Thus a real danger arises when a statistical biologist attributes the correlation between two functions like u and v to organic relationship. The particular case that is likely to occur is when u and v are indices with the same denominator, for the correlation of indices seems at first sight a very plausible measure of organic correlation.
The difficulty and danger which arise from the use of indices was brought home to me recently in an endeavour to deal with a considerable series of personal equation data. In this case it was convenient to divide the errors made by three observers in estimating a variable quantity by the actual value of the quantity. As a result there appeared a high degree of correlation between three series of absolutely independent judgments. It was some time before I realised that this correlation had nothing to do with the manner of judging, but was a special case of the above principle due to the use of indices.
A further illustration is of the following kind. Select three numbers within certain ranges at random, say x, y, z, these will be pair and pair uncorrelated. Form the proper fractions x/y and z/y for each triplet, and correlation will be found between these indices.
The application of this idea to biology seems of considerable importance. For example, a quantity of bones are taken from an ossuarium, and are put together in groups, which are asserted to be those of individual skeletons. To test this a biologist takes the triplet femur, tibia, humerus, and seeks the correlation between the indices femur / humerus and tibia / humerus. He might reasonably conclude that this correlation marked organic relationship, and believe that the bones had really been put together substantially in their individual grouping. As a matter of fact, since the coefficients of variation for femur, tibia, and humerus are approximately equal, there would be, as we shall see later, a correlation of about 0.4 to 0.5 between these indices had the bones been sorted absolutely at random. I term this a spurious organic correlation, or simply a spurious correlation. I understand by this phrase the amount of correlation which would still exist between the indices, were the absolute lengths on which they depend distributed at random.
It has hitherto been usual to measure the organic correlation of the organs of shrimps, prawns, crabs, &c., by the correlation of indices in which the denominator represents the total body length or total carapace length. Now suppose a table formed of the absolute lengths and the indices of, say, some thousand individuals. Let an “imp” (allied to the Maxwellian demon) redistribute the indices at random, they would then exhibit no correlation; if the corresponding absolute lengths followed along with the indices in the redistribution, they also would exhibit no correlation. Now let us suppose the indices not to have been calculated, but the imp to redistribute the abso-
Pearson, 1896
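Pearson's thought experiment is easy to replay. The sketch below is an illustration under stated assumptions, not Pearson's own calculation: x, y and z are drawn independently from normal distributions with the same coefficient of variation (10%, an arbitrary choice), and the helper `pearson` is written out by hand so the block needs only the standard library.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


rng = random.Random(1896)
n = 5000

# Three independent measurements with equal coefficients of variation.
x = [rng.gauss(100, 10) for _ in range(n)]
y = [rng.gauss(100, 10) for _ in range(n)]
z = [rng.gauss(100, 10) for _ in range(n)]

r_raw = pearson(x, z)                       # absolute values: no correlation
r_index = pearson([a / b for a, b in zip(x, y)],
                  [c / b for c, b in zip(z, y)])  # indices x/y and z/y

print(f"correlation of x and z:     {r_raw:.2f}")
print(f"correlation of x/y and z/y: {r_index:.2f}")
```

The raw values are uncorrelated, yet the two indices that share the denominator y come out clearly correlated, in the range Pearson describes for equal coefficients of variation.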
-
CORRELATION EXAMPLE
Randomly sampled from a log-normal distribution: logmean = 10 (13 for Unidentified), log SD = 1
Bacteria1 and Bacteria1B: correlated 0.998

Counts (rows are taxa, columns are samples 1 to 10)
Taxon               1        2        3        4        5        6        7        8        9       10
Bacteria1        9501    87936     6276    23627   121959    12053    13737    11668    16551    25289
Bacteria1B      19747    96589    10462    17575   115306     7590    20020    11029    16029    28417
Bacteria2       75179     9880     7477    18816     7542    19168    12121     2480    28024    16994
Bacteria3       54204    56493    95603    44657    49962    16424    90998    98595    11418     9388
Bacteria4       30210    66816   201886    74392    96686    57044     8026     2980     3781    19099
Unidentified  2084543   198308   410619  2945143   280248   776245   182224   279220   214416   412829
-
CORRELATION EXAMPLE
Randomly sampled from a log-normal distribution: logmean = 10 (13 for Unidentified), log SD = 1
Bacteria1 and Bacteria1B: correlated 0.998
Divide each sample by the sum

Proportions (rows are taxa, columns are samples 1 to 10)
Taxon               1        2        3        4        5        6        7        8        9       10
Bacteria1     0.00418     0.17  0.00857  0.00756    0.182   0.0136    0.042   0.0287    0.057   0.0494
Bacteria1B    0.00869    0.187   0.0143  0.00563    0.172  0.00854   0.0612   0.0272   0.0552   0.0555
Bacteria2      0.0331   0.0191   0.0102  0.00602   0.0112   0.0216   0.0371  0.00611   0.0966   0.0332
Bacteria3      0.0238    0.109    0.131   0.0143   0.0744   0.0185    0.278    0.243   0.0393   0.0183
Bacteria4      0.0133    0.129    0.276   0.0238    0.144   0.0642   0.0245  0.00734    0.013   0.0373
Unidentified    0.917    0.384    0.561    0.943    0.417    0.874    0.557    0.688    0.739    0.806
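The effect of dividing each sample by its sum can be reproduced with simulated data. This is a sketch that follows the recipe stated on the slide (log-normal draws, logmean 10, or 13 for Unidentified, log SD 1) but uses 2000 simulated samples instead of ten so that the estimate is stable. The sample size, the seed and the helper names are assumptions made for the illustration.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def column(table, j):
    """Return column j of a list-of-rows table."""
    return [row[j] for row in table]


rng = random.Random(42)
taxa = ["Bacteria1", "Bacteria2", "Bacteria3", "Bacteria4", "Unidentified"]
log_means = [10, 10, 10, 10, 13]
n_samples = 2000

# Independent log-normal abundances: no taxon influences any other.
counts = [[rng.lognormvariate(mu, 1.0) for mu in log_means]
          for _ in range(n_samples)]

# Closure: divide each sample (row) by its sum.
props = [[v / sum(row) for v in row] for row in counts]

r_counts = pearson(column(counts, 0), column(counts, 4))
r_props = pearson(column(props, 0), column(props, 4))

print(f"Bacteria1 vs Unidentified, counts:      {r_counts:.2f}")
print(f"Bacteria1 vs Unidentified, proportions: {r_props:.2f}")
```

The abundances are generated independently, so the correlation on the raw values is near zero. After closure, Bacteria1 and Unidentified appear negatively correlated purely because they share a denominator.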
-
CORRELATION EXAMPLE
[Figure: two correlation matrices, color scale from −1 to 1]

All samples
              Bacteria1B  Bacteria2  Bacteria3  Bacteria4  Unidentified
Bacteria1        0.99       −0.04      0.04       0.28       −0.78
Bacteria1B                  −0.03      0.10       0.28       −0.82
Bacteria2                             −0.18      −0.37        0.14
Bacteria3                                         0.04       −0.52
Bacteria4                                                    −0.58

Drop Unidentified
              Bacteria1B  Bacteria2  Bacteria3  Bacteria4
Bacteria1        0.95       −0.12     −0.53      −0.30
Bacteria1B                  −0.05     −0.46      −0.42
Bacteria2                             −0.30      −0.27
Bacteria3                                        −0.41
-
CORRELATION EXAMPLE
[Figure: two correlation matrices, color scale from −1 to 1]

Correlation of Aitchison composition
              Bacteria1B  Bacteria2  Bacteria3  Bacteria4  Unidentified
Bacteria1        0.99       −0.04      0.04       0.28       −0.78
Bacteria1B                  −0.03      0.10       0.28       −0.82
Bacteria2                             −0.18      −0.37        0.14
Bacteria3                                         0.04       −0.52
Bacteria4                                                    −0.58

Drop Unidentified
              Bacteria1B  Bacteria2  Bacteria3  Bacteria4
Bacteria1        0.99       −0.04      0.04       0.28
Bacteria1B                  −0.03      0.10       0.28
Bacteria2                             −0.18      −0.37
Bacteria3                                         0.04
-
DISTANCE AND DISSIMILARITY METRICS
Typical metrics
• Unifrac
• Bray-Curtis
• Jensen-Shannon
Better metrics
• Unifrac -> PhILR
• Others -> Aitchison
[Figure: two panels with the six taxa on both axes, "Bray-Curtis" and "Aitchison (Correct metric)". Taxon order in one panel: Bacteria3, Bacteria1B, Bacteria1, Bacteria4, Unidentified, Bacteria2; in the other: Unidentified, Bacteria2, Bacteria3, Bacteria4, Bacteria1, Bacteria1B]
https://doi.org/10.7554/eLife.21887
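The difference between the two metrics can be sketched in a few lines. The Aitchison distance is the Euclidean distance between CLR-transformed samples, so it ignores sequencing depth, while Bray-Curtis on raw counts does not. The two count vectors below are hypothetical, and the helper functions are written out by hand for the illustration instead of being taken from a library.

```python
from math import log, sqrt


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the CLR coordinates of x and y."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on raw counts."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical samples with no zeros (CLR needs strictly positive parts).
sample_a = [9501, 19747, 75179]
sample_b = [87936, 96589, 9880]

# The same community as sample_a, sequenced ten times deeper.
sample_a_deep = [10 * v for v in sample_a]

d_ait = aitchison_distance(sample_a, sample_b)
d_ait_deep = aitchison_distance(sample_a_deep, sample_b)
d_bc = bray_curtis(sample_a, sample_b)
d_bc_deep = bray_curtis(sample_a_deep, sample_b)

print(f"Aitchison:   {d_ait:.4f} vs {d_ait_deep:.4f}")
print(f"Bray-Curtis: {d_bc:.4f} vs {d_bc_deep:.4f}")
```

Sequencing one sample ten times deeper leaves the Aitchison distance unchanged but shifts the Bray-Curtis value, which is why read depth confounds count-based dissimilarities.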
-
PCA ORDINATION ISSUES
[Figure: PCA biplots (Comp.1 vs Comp.2) of samples 1 to 10 with loadings for Bacteria1, Bacteria1B, Bacteria2, Bacteria3, Bacteria4 and Unidentified. Panels: Untransformed, Transformed]
-
UNDERSTANDING COMPOSITIONS
-
SIMPLEX AS A NUMBER SPACE
• The sum does not matter
• The components are a proportion of some whole
• The number space is the simplex
• Examples:
  • Composition of soil [silt, sand, clay]
  • NGS sequencing data
  • Time use surveys
[Figure: ternary soil plot]
-
THE SIMPLEX NUMBER SPACE
Key arithmetic operations can be defined:
• Closure
• Perturbation
• Powering
Key requirements for simplex operations:
• Sub-compositionally coherent
• Order invariant
• Scale invariant
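The three operations listed above have short standard definitions: closure rescales a vector to sum to one, perturbation multiplies two compositions part by part and recloses, and powering raises each part to a constant and recloses. A minimal sketch, with hypothetical function names and an arbitrary three-part example:

```python
def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    total = sum(x)
    return [v / total for v in x]


def perturb(x, y):
    """Perturbation: the simplex analogue of addition."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Powering: the simplex analogue of multiplication by a scalar."""
    return closure([v ** alpha for v in x])


soil = [2.0, 3.0, 5.0]            # arbitrary example, e.g. [silt, sand, clay]
same_soil = [20.0, 30.0, 50.0]    # same composition on a different scale

closed = closure(soil)
closed_rescaled = closure(same_soil)

# The uniform composition is the neutral element of perturbation.
neutral = closure([1.0, 1.0, 1.0])
unchanged = perturb(closed, neutral)

# Powering by 2 followed by powering by 0.5 returns the original composition.
round_trip = power(power(closed, 2.0), 0.5)

print(closed, closed_rescaled, unchanged, round_trip)
```

The rescaled vector closes to the same composition, which is the scale invariance named in the requirements.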
-
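TRANSFORMATIONS FROM SIMPLEX
The simplex can be transformed into real space using log-ratio transformations

Additive Log Ratio (ALR): $\mathrm{alr}(x) = [\ln(x_1/x_D), \ldots, \ln(x_{D-1}/x_D)]$
Centered Log Ratio (CLR): $\mathrm{clr}(x) = [\ln(x_1/g(x)), \ldots, \ln(x_D/g(x))]$
Isometric Log Ratio (ILR): $\mathrm{ilr}(x) = \mathrm{clr}(x) \cdot \Phi$, with $\Phi$ an orthonormal basis
Where the geometric mean is $g(x) = (x_1 x_2 \cdots x_D)^{1/D}$

Working in transformed space allows us to use many existing statistical tools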
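As a sketch of the first two transformations, using their standard definitions: ALR takes the log of each part relative to one reference part, and CLR takes the log of each part relative to the geometric mean of all parts. The function names, the choice of the last part as the ALR reference, and the example values are assumptions made for illustration.

```python
from math import log


def alr(x, ref=-1):
    """Additive log-ratio: log of every part over a reference part.

    The reference part (the last one by default) is dropped from the output,
    so a D-part composition gives D - 1 coordinates.
    """
    ref_index = ref % len(x)
    return [log(v / x[ref_index]) for i, v in enumerate(x) if i != ref_index]


def clr(x):
    """Centered log-ratio: log of every part over the geometric mean."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean
    return [v - mean_log for v in logs]


composition = [1.0, 2.0, 4.0]            # arbitrary example
rescaled = [100.0, 200.0, 400.0]         # same composition, different total

alr_values = alr(composition)
clr_values = clr(composition)
clr_rescaled = clr(rescaled)

print("ALR:", alr_values)
print("CLR:", clr_values)
```

CLR coordinates always sum to zero and do not change when the input is rescaled. Both transformations require strictly positive parts, so zeros have to be dealt with first.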
-
ISOMETRIC LOG RATIO
Useful because it is isometric: distances are preserved
[Figure: two maps, joined by "or"]
DC to Tel Aviv is always about 9,400 km, whichever map is used
[Figure: two slides on log-ratio analysis (algebraic-geometric structure, coordinates, balances) showing circles and ellipses, and parallel lines with parts x1, x2, x3, each drawn in S3 and in coordinate representation]
Useful because points are now independent
-
ISOMETRIC LOG RATIO
So what is Phi?
• Binary partitions, coded 1 and -1
• ILR = dot product of CLR(X) with Phi
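One common way to build such a basis is from a sequential binary partition, where each row marks one group of parts with 1, the opposing group with -1, and uninvolved parts with 0. The sketch below assumes that construction with the standard balance normalisation. The partition, the helper names and the example compositions are hypothetical, and the slide's own Phi may be laid out differently.

```python
from math import log, sqrt


def clr(x):
    """Centered log-ratio transform."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def balance_basis(sign_rows):
    """Turn rows of 1 / -1 / 0 codes into orthonormal balance vectors."""
    basis = []
    for signs in sign_rows:
        r = sum(1 for s in signs if s > 0)   # parts in the numerator group
        k = sum(1 for s in signs if s < 0)   # parts in the denominator group
        pos = sqrt(k / (r * (r + k)))
        neg = -sqrt(r / (k * (r + k)))
        basis.append([pos if s > 0 else neg if s < 0 else 0.0 for s in signs])
    return basis


def ilr(x, basis):
    """ILR coordinates: dot product of CLR(x) with each basis vector."""
    c = clr(x)
    return [sum(ci * bi for ci, bi in zip(c, row)) for row in basis]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical partition of three parts: first part 1 vs part 2,
# then parts 1 and 2 together vs part 3.
partitions = [[1, -1, 0],
              [1, 1, -1]]
basis = balance_basis(partitions)

a = [0.2, 0.3, 0.5]
b = [0.6, 0.1, 0.3]

ilr_a, ilr_b = ilr(a, basis), ilr(b, basis)
dist_ilr = euclidean(ilr_a, ilr_b)
dist_clr = euclidean(clr(a), clr(b))

print(ilr_a, ilr_b, dist_ilr, dist_clr)
```

Three parts give two ILR coordinates, and the distance between the two samples is the same in ILR and CLR coordinates, which is what "isometric" refers to.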
-
ISOMETRIC LOG RATIO
[Figure: ternary plot of Bacteria1, Bacteria1B and Bacteria2, axes scaled 20 to 100]
-
GENERAL CODA SOFTWARE
Microbiome-specific software covered later
CoDaPack
• From the “official” CoDa group in Girona
R packages
• compositions
• Compositional
• zCompositions
• coda.base
Scikit-bio
• skbio.stats.composition
-
A COMPOSITIONAL WORKFLOW FOR METAGENOMICS
-
THERE IS ANOTHER WAY
Step                      Old                     Compositional
Normalization             Rarefaction, DESeq      CLR, ALR, ILR
Distance                  Bray-Curtis, Unifrac    Aitchison, philr
Ordination                PCoA (abundance)        PCA (variance)
Multivariate comparison   perMANOVA, ANOSIM       perMANOVA, ANOSIM
Correlation               Pearson, Spearman       SparCC, SpiecEasi, propr
Differential abundance    edgeR, DESeq            ANCOM, ALDEx2
From Gloor 2017
-
WORKED EXAMPLE
[Table: the count table from CORRELATION EXAMPLE, six taxa (Bacteria1, Bacteria1B, Bacteria2, Bacteria3, Bacteria4, Unidentified) by ten samples]
Preprocessing
1. Close the dataset (all reads)
2. Create a sub-composition
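The two preprocessing steps can be sketched as follows. The counts are illustrative values for a single sample, and the helper `closure` is a hypothetical name. The point of the example is that dropping a part and reclosing changes every proportion but leaves the ratios between the remaining parts untouched.

```python
def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    total = sum(x)
    return [v / total for v in x]


taxa = ["Bacteria1", "Bacteria1B", "Bacteria2", "Bacteria3", "Bacteria4", "Unidentified"]
counts = [9501, 19747, 75179, 54204, 30210, 2084543]   # illustrative sample

# Step 1: close the dataset using all reads.
closed = closure(counts)

# Step 2: create a sub-composition by dropping Unidentified and reclosing.
keep = [i for i, name in enumerate(taxa) if name != "Unidentified"]
sub = closure([counts[i] for i in keep])

ratio_full = closed[0] / closed[2]   # Bacteria1 / Bacteria2 in the full composition
ratio_sub = sub[0] / sub[2]          # the same ratio in the sub-composition

print(f"Bacteria1 proportion: {closed[0]:.5f} (all reads) vs {sub[0]:.5f} (sub-composition)")
print(f"Bacteria1 / Bacteria2: {ratio_full:.5f} vs {ratio_sub:.5f}")
```

Because only the ratios survive the change of sub-composition, analyses built on log-ratios give the same answer whether or not the Unidentified reads are kept.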
-
IMPUTE ZEROS
So many reasons for a zero…
1. Structural zeros, a real zero
2. Missing values
3. Below the limit of detection
So many ways to handle zeros…
1. Pseudo-counts (add one to everything)
2. Multiplicative replacement
3. Mean value replacement
4. Expectation maximization
It is often helpful to drop components (taxa) that appear in
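Of the options listed, multiplicative replacement is simple enough to sketch directly. Zeros in the closed composition are replaced by a small value delta, and the non-zero parts are shrunk by a common factor so that the total stays at one and the ratios between non-zero parts are unchanged. The value of delta and the example counts are arbitrary choices for the illustration; picking delta well is a separate question that this sketch does not address.

```python
def multiplicative_replacement(x, delta=0.001):
    """Replace zeros with delta and rescale the non-zero parts.

    The input is closed first. Each zero becomes delta, and every non-zero
    proportion is multiplied by (1 - n_zeros * delta), so the result sums
    to 1 and ratios among the non-zero parts are preserved.
    """
    total = sum(x)
    props = [v / total for v in x]
    n_zeros = sum(1 for p in props if p == 0)
    shrink = 1.0 - n_zeros * delta
    return [delta if p == 0 else p * shrink for p in props]


counts = [0, 120, 0, 300, 580]          # hypothetical sample with two zeros
replaced = multiplicative_replacement(counts, delta=0.001)

ratio_before = counts[3] / counts[1]
ratio_after = replaced[3] / replaced[1]

print(replaced)
print(ratio_before, ratio_after)
```

After replacement every part is strictly positive, so log-ratio transformations can be applied.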
-
PCA ORDINATION AND DISTANCE
Qiime 2 package Deicode
Deicode takes a count matrix with zeros
• Calculates CLR on nonzero values
• Calculates Aitchison distance
• Performs robust PCA using singular value decomposition
In R
With the compositions package
• Compute CLR using “clr(acomp())”
• Compute distance using the “dist” function with method=“aitchison”
• Compute PCA using the “pca” function
-
DIVERSITY INDICES AND CURVES
[Figure: PCA biplots (Comp.1 vs Comp.2) of samples 1 to 10 with loadings for Bacteria1, Bacteria1B, Bacteria2, Bacteria3, Bacteria4 and Unidentified]
-
DIVERSITY INDICES AND DISTANCE
[Figure: two panels with the six taxa on both axes, "Bray-Curtis" and "Aitchison"]
-
STATISTICAL INFERENCE METHODS
ANCOM (Python)
• One of the first
• Does pairwise tests in CLR space
• In Qiime 2
ALDEx2 (R)
• Best general use
• Dirichlet used to model variance
• Welch and Kruskal-Wallis tests
• Effect sizes & p-values
MirKat-Coda (R)
• Kernel machine regression
• Identifies microbial subsets that explain multivariate data
MixOmics
• A suite of tools
• Partial least squares discriminant analysis
Gneiss
• Balance trees
DIY!
• Do your own linear modeling on CLR data
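A bare-bones version of the DIY route might look like the sketch below: transform each sample with CLR, then fit an ordinary least-squares line per taxon against a covariate. Everything here is hypothetical (the counts, the taxon names, the 0/1 treatment covariate and the helper functions), and a real analysis would add zero handling, standard errors and multiple-testing correction.

```python
from math import log


def clr(x):
    """Centered log-ratio transform of one sample."""
    logs = [log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


taxa = ["Bacteria1", "Bacteria2", "Bacteria3"]
counts = [                      # hypothetical counts, rows are samples
    [120, 300, 80],
    [150, 280, 95],
    [110, 310, 70],
    [400, 290, 60],
    [380, 305, 75],
    [420, 270, 65],
]
treatment = [0, 0, 0, 1, 1, 1]  # hypothetical covariate

clr_rows = [clr(row) for row in counts]
slopes = {
    name: ols_slope(treatment, [row[j] for row in clr_rows])
    for j, name in enumerate(taxa)
}

for name, slope in slopes.items():
    print(f"{name}: change in CLR with treatment = {slope:+.3f}")
```

The slopes sum to zero across taxa because CLR values within a sample sum to zero. Each slope is therefore a change relative to the geometric mean of the sample, not a change in absolute abundance.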
-
ALDEX2
ylab="Difference")
aldex.plot(x.all, type="MW", test="welch", xlab="Dispersion", ylab="Difference")
[Figure: left panel, Difference vs Log-ratio abundance; right panel, Difference vs Dispersion]
Figure 1: MA and Effect plots of ALDEx2 output
The left panel is a Bland-Altman or MA plot that shows the relationship between abundance and Difference. The right panel is an effect plot that shows the relationship between Difference and Dispersion. In both plots features that are not significant are in grey or black. Features that are statistically significant are in red. The Log-ratio abundance axis is the clr value for the feature.
5 Using ALDEx2 modules
The modular approach exposes the underlying intermediate data so that users can generate their own tests. The simple approach outlined above just calls aldex.clr, aldex.ttest, aldex.effect in turn and then merges the data into one object. We will show these modules in turn, and then examine additional modules.
5.1 The aldex.clr module
The workflow for the modular approach first generates random instances of the centred log-ratio transformed values. The inputs are: a counts table, a vector of conditions, the number of Monte-Carlo instances, a string indicating whether iqlr, zero or all features are used as the denominator, and the level of verbosity (TRUE or FALSE). We recommend 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA.
This operation is fast.
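The idea of "random instances of the centred log-ratio transformed values" can be sketched without ALDEx2 itself. The block below is not ALDEx2's implementation: it assumes that each instance is a draw from a Dirichlet distribution with parameters equal to the counts plus a prior of 0.5, obtained by normalising independent gamma draws, followed by a CLR transform. The prior value, the function name and the example counts are assumptions for the illustration; the default of 128 instances follows the recommendation quoted above.

```python
import random
from math import log


def dirichlet_clr_instances(counts, n_instances=128, prior=0.5, seed=0):
    """Monte-Carlo CLR instances for one sample.

    Each instance is one Dirichlet(counts + prior) draw, built by
    normalising independent gamma draws, and then CLR-transformed.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(n_instances):
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        logs = [log(g / total) for g in gammas]
        mean_log = sum(logs) / len(logs)
        instances.append([v - mean_log for v in logs])
    return instances


sample_counts = [0, 12, 300, 4500]       # hypothetical counts, one zero
instances = dirichlet_clr_instances(sample_counts)

n = len(instances)
mean_clr = [sum(inst[j] for inst in instances) / n for j in range(len(sample_counts))]
spread = [max(inst[j] for inst in instances) - min(inst[j] for inst in instances)
          for j in range(len(sample_counts))]

print("mean CLR per feature:   ", [round(v, 2) for v in mean_clr])
print("range of CLR per feature:", [round(v, 2) for v in spread])
```

Features with few or zero reads vary widely from instance to instance, while well-covered features barely move. This is how the read count enters as precision and not as abundance.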
-
RESOURCES
Aitchison, J. (John), 2003. A concise guide to compositional data analysis. Lecture notes. url: http://www.compositionaldata.com/material/others/Ait2003_A_concise_guide_to_compositional_data_analysis.pdf
Aitchison, J. (John), 2003. The statistical analysis of compositional data. Blackburn Press.
Calle, M.L., 2019. Statistical Analysis of Metagenomics Data. Genomics Inform. 17, e6. doi:10.5808/GI.2019.17.1.e6
Fernandes, A.D., Vu, M.T.H.Q., Edward, L.-M., Macklaim, J.M., Gloor, G.B., 2018. A reproducible effect size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets.
Gloor, G.B., Macklaim, J.M., Pawlowsky-Glahn, V., Egozcue, J.J., 2017. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8, 2224. doi:10.3389/fmicb.2017.02224
Gloor, G.B., Reid, G., 2016. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 62, 692–703. doi:10.1139/cjm-2015-0821
Macklaim, J.M., Gloor, G.B., 2018. From RNA-seq to Biological Inference: Using Compositional Data Analysis in Meta-Transcriptomics. Humana Press, New York, NY, pp. 193–213. doi:10.1007/978-1-4939-8728-3_13
Mandal, S., Van Treuren, W., White, R.A., Eggesbø, M., Knight, R., Peddada, S.D., 2015. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663. doi:10.3402/mehd.v26.27663
Pawlowsky-Glahn, V., Buccianti, A., 2011. Compositional data analysis: theory and applications. Wiley.
Rivera-Pinto, J., Egozcue, J.J., Pawlowsky-Glahn, V., Paredes, R., Noguera-Julian, M., Calle, M.L., 2017. Balances: a new perspective for microbiome analysis. bioRxiv 219386. doi:10.1101/219386
MINI REVIEW
published: 15 November 2017
doi: 10.3389/fmicb.2017.02224
Frontiers in Microbiology | www.frontiersin.org 1 November 2017 | Volume 8 | Article 2224
Edited by:
Jessica Galloway-Pena,
University of Texas MD Anderson
Cancer Center, United States
Reviewed by:
Ionas Erb,
Centre for Genomic Regulation, Spain
Jennifer Stearns,
McMaster University, Canada
*Correspondence:
Gregory B. Gloor
Specialty section:
This article was submitted to
Systems Microbiology,
a section of the journal
Frontiers in Microbiology
Received: 10 July 2017
Accepted: 30 October 2017
Published: 15 November 2017
Citation:
Gloor GB, Macklaim JM,
Pawlowsky-Glahn V and Egozcue JJ
(2017) Microbiome Datasets Are
Compositional: And This Is Not
Optional. Front. Microbiol. 8:2224.
doi: 10.3389/fmicb.2017.02224
Microbiome Datasets Are Compositional: And This Is Not Optional
Gregory B. Gloor 1*, Jean M. Macklaim 1, Vera Pawlowsky-Glahn 2 and Juan J. Egozcue 3
1 Department of Biochemistry, University of Western Ontario, London, ON, Canada
2 Departments of Computer Science, Applied Mathematics, and Statistics, Universitat de Girona, Girona, Spain
3 Department of Applied Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain
Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers,
metagenomes or metatranscriptomes are commonplace and being used to study human
disease states, ecological differences between sites, and the built environment. There is
increasing awareness that microbiome datasets generated by HTS are compositional
because they have an arbitrary total imposed by the instrument. However, many
investigators are either unaware of this or assume specific properties of the compositional
data. The purpose of this review is to alert investigators to the dangers inherent in
ignoring the compositional nature of the data, and point out that HTS datasets derived
from microbiome studies can and should be treated as compositions at all stages of
analysis. We briefly introduce compositional data, illustrate the pathologies that occur
when compositional data are analyzed inappropriately, and finally give guidance and point
to resources and examples for the analysis of microbiome datasets using compositional
data analysis.
Keywords: microbiota, compositional data, high-throughput sequencing, correlation, Bayesian estimation, count
normalization, relative abundance
1. INTRODUCTION
The collection and analysis of microbiome datasets presents many challenges in the study design, sample collection, storage, and sequencing phases, and these have been well reviewed (Robinson et al., 2016). Many methods for the analysis of microbiome datasets assume that sequencing data are equivalent to ecological data where the counts of reads assigned to organisms are often normalized to a constant area or volume. Methods applied include count-based strategies such as Bray-Curtis dissimilarity, zero-inflated Gaussian models and negative binomial models (McMurdie and Holmes, 2014; Weiss et al., 2017).
In an ecological study it is possible for many different species to co-exist, and their absolute abundance may be important. For example, in an area containing only tigers, it is important to know if the population size is sufficient to maintain needed genetic diversity for long-term survival (Shaffer, 1981). However, the abundance of one species may not influence the abundance of another; the area may contain both tigers and ladybugs, and the migration of several ladybugs into the area would not be expected to affect the number of tigers.
The assumption of true independence can not hold in high-throughput sequencing (HTS) experiments because the sequencing instruments can deliver reads only up to the capacity of the instrument. Thus, it is proper to think of these instruments as containing a fixed number of slots
-
THANKS
CodaCourse Staff University of Girona (UdG)
Agricultural Research Service