Deciphering the enigma of undetected species, phylogenetic, and functional diversity based on Good-Turing theory
ANNE CHAO,1,8 CHUN-HUO CHIU,1 ROBERT K. COLWELL,2,3,4 LUIZ FERNANDO S. MAGNAGO,5 ROBIN L. CHAZDON,2,6
AND NICHOLAS J. GOTELLI7
1Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043 Taiwan
2Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut 06269 USA
3University of Colorado Museum of Natural History, Boulder, Colorado 80309 USA
4Departmento de Ecologia, Universidade Federal de Goiás, CP 131, 74.001-970, Goiania, GO, Brasil
5Departamento de Biologia, Setor de Ecologia e Conservação, Universidade Federal de Lavras, Lavras 37200-000 Brasil
6Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Colorado 80309 USA
7Department of Biology, University of Vermont, Burlington, Vermont 05405 USA
Abstract. Estimating the species, phylogenetic, and functional diversity of a community is challenging because rare species are often undetected, even with intensive sampling. The Good-Turing frequency formula, originally developed for cryptography, estimates in an ecological context the true frequencies of rare species in a single assemblage based on an incomplete sample of individuals. Until now, this formula has never been used to estimate undetected species, phylogenetic, and functional diversity. Here, we first generalize the Good-Turing formula to incomplete sampling of two assemblages. The original formula and its two-assemblage generalization provide a novel and unified approach to notation, terminology, and estimation of undetected biological diversity. For species richness, the Good-Turing framework offers an intuitive way to derive the non-parametric estimators of the undetected species richness in a single assemblage, and of the undetected species shared between two assemblages. For phylogenetic diversity, the unified approach leads to an estimator of the undetected Faith’s phylogenetic diversity (PD, the total length of undetected branches of a phylogenetic tree connecting all species), as well as a new estimator of undetected PD shared between two phylogenetic trees. For functional diversity based on species traits, the unified approach yields a new estimator of undetected Walker et al.’s functional attribute diversity (FAD, the total species-pairwise functional distance) in a single assemblage, as well as a new estimator of undetected FAD shared between two assemblages. Although some of the resulting estimators have been previously published (but derived with traditional mathematical inequalities), all taxonomic, phylogenetic, and functional diversity estimators are now derived under the same framework. All the derived estimators are theoretically lower bounds of the corresponding undetected diversities; our approach reveals the sufficient conditions under which the estimators are nearly unbiased, thus offering new insights. Simulation results are reported to numerically verify the performance of the derived estimators. We illustrate all estimators and assess their sampling uncertainty with an empirical dataset for Brazilian rain forest trees. These estimators should be widely applicable to many current problems in ecology, such as the effects of climate change on spatial and temporal beta diversity and the contribution of trait diversity to ecosystem multi-functionality.
Key words: functional attribute diversity; functional diversity; phylogenetic diversity; shared diversity; species diversity; taxonomic diversity.
Manuscript received 21 September 2016; revised 27 April 2017; accepted 24 July 2017. Corresponding Editor: Tom E. X. Miller.
8 E-mail: [email protected]
Ecology, 98(11), 2017, pp. 2914–2929. © 2017 by the Ecological Society of America

INTRODUCTION

Nearly all biodiversity studies and analyses are based on sampling data taken from focal assemblages. However, due to practical limitations, it is virtually impossible to detect all species, especially in hyper-diverse assemblages with many rare species. In almost every biodiversity survey and monitoring project, some proportion of the species that are present fail to be detected. Not only the presence, but also the functional traits of these species remain undetected. Moreover, the placement of undetected species on the phylogenetic tree of the observed species is unknown. Consequently, traditional measures of species, functional, and phylogenetic diversity from sample data typically underestimate the true diversities (observed plus undetected). The magnitude of this negative bias can be substantial.

For species diversity, the estimation of undetected richness based on incomplete samples from a single assemblage has been widely applied, not only in ecology and conservation biology, but also in many other disciplines; see Colwell and Coddington (1994), Chazdon et al. (1998), Magurran (2004), Chao (2005), Hortal et al. (2006), Gotelli and Colwell (2011), Gotelli and Chao (2013) and Chao and Chiu (2016) for various applications. For two assemblages, shared species richness plays an important role in assessing assemblage overlap and forms a basis for constructing various types of beta diversity and (dis)similarity measures, such as the classic Sørensen and Jaccard indices (Colwell and Coddington 1994, Magurran 2004, Jost et al. 2011, Gotelli and Chao 2013). Compared with estimating species richness in a single assemblage, the estimation of shared species richness, taking undetected species into account, has received relatively little attention; see Chao and Chiu (2012) for a review.

In traditional measures of species diversity, all species
(or taxa at some other rank) are considered to be equally distinct from one another. Species differences can be based directly on their evolutionary histories, either in the form of taxonomic classification or well-supported phylogenetic trees. A rapidly growing literature addresses phylogenetic diversity metrics and related (dis)similarity measures; see Cavender-Bares et al. (2012) for a review. A widely used phylogenetic metric is Faith’s (1992) PD (phylogenetic diversity), which is defined as the sum of the branch lengths of a phylogenetic tree connecting all species in the target assemblage. Throughout this paper, PD refers to Faith’s (1992) PD. For most data sets, PD is highly correlated with species richness (e.g., Matos et al. 2017). When a sample fails to detect all species present, the lineages/branches associated with these undetected species are also missing from the phylogenetic tree of the observed species. The undetected PD in an incomplete sample was not discussed until recent years (Cardoso et al. 2014, Chao et al. 2015a).

The phylogenetic version of the Jaccard dissimilarity
index is referred to as the UniFrac measure, developed by Lozupone and Knight (2005). The phylogenetic version of the Sørensen similarity index is referred to as the PhyloSør (phylo-Sørensen) index, developed by Bryant et al. (2008) and Ferrier et al. (2007). All these phylogenetic (dis)similarity measures are based on the shared branch lengths between two phylogenetic trees. However, to our knowledge, the estimation of the undetected shared PD (i.e., the total length of undetected branches shared by two phylogenetic trees) has not previously been discussed in the literature.

When species are described by a set of traits that affect
organismal and/or ecosystem functioning, pairwise species differences within an assemblage can also be measured by the dissimilarity or distances between their trait profiles, which can be weighted or unweighted by their abundances. Functional diversity or trait diversity quantifies the diversity of species’ traits among coexisting species in an assemblage (Tilman et al. 1997, Diaz and Cabido 2001, Swenson et al. 2012). Functional diversity is regarded as key to understanding ecosystem processes and their response to environmental stress or disturbance (Cadotte et al. 2009). A distance-based measure at the assemblage level for quantifying functional
diversity is FAD (functional attribute diversity, as defined by Walker et al. 1999), which is the sum of the species-pairwise functional distances. A modified version called MFAD (modified FAD) was proposed by Schmera et al. (2009), who replaced species with “functional units” (here the collection of all species with identical traits is regarded as a single functional unit). For simplicity, we focus on the estimation of FAD, but a similar approach can be applied if species are replaced by functional units or other clusters of species. For incomplete samples, the traits of undetected species are missing, and thus their pairwise distances are not recorded and cannot be considered in the observed FAD. As far as we are aware, there have been no estimators previously developed for undetected FAD in a single assemblage or undetected FAD shared by two assemblages.

In his famous cryptanalysis to crack German ciphers
during World War II, Alan Turing, regarded as the founder of modern computer sciences, developed novel statistical methods to estimate the true frequencies of rare code elements (including still-undetected code elements), based on the observed frequencies in “samples” of intercepted Nazi code. According to Good (1953, 2000), Turing never published his wartime statistical work, but permitted Good to publish it after the war. The two influential papers by Good (1953) and Good and Toulmin (1956) presented Turing’s wartime statistical work on the frequency formula and related topics. The frequency formula is now referred to as the Good-Turing frequency formula, which has a wide range of applications in biological sciences, statistics, computer sciences, information sciences, and linguistics, among others (McGrayne 2011, p. 100).

In an ecological context, Turing’s statistical problem
can be formulated as an estimation of the true frequencies of rare species when a random sample of individuals is drawn from an assemblage. In Turing’s case, the species abundances are highly heterogeneous, with many rare species, so that all samples have undetected species. The Good-Turing formula answers the following question: given a species that appears r times (r = 0, 1, 2, . . .) in an incomplete sample of n individuals, what is its true relative frequency in the entire assemblage? As will be described below, Turing gave a surprising answer that is contrary to most people’s intuition.

Until now, the Good-Turing formula has never been
applied to estimate undetected species, phylogenetic, and functional diversity. In this paper, we generalize the Good-Turing formula, originally developed for a single assemblage, to two assemblages. We also extend the Good-Turing formula to a phylogenetic version that incorporates evolutionary history among species as well as a functional version that takes into account functional traits associated with each species. We show in this paper that the Good-Turing formula and its generalizations can be used in a unified way to derive estimators of undetected species, phylogenetic, and functional diversities based on incomplete samples.
November 2017 ESTIMATING UNDETECTED DIVERSITY 2915
For species diversity, we apply the Good-Turing formula to intuitively derive an estimator of the number of undetected species in an assemblage. The resulting estimator turns out to be the Chao (1984) non-parametric lower bound. The two-assemblage generalized formula yields Pan et al.’s (2009) lower bound of the number of undetected shared species when a sample of individuals is taken from each of two assemblages. For phylogenetic diversity, the unified approach yields a recently published estimator of undetected PD in a single assemblage (Cardoso et al. 2014, Chao et al. 2015a). The two-assemblage generalized formula yields a new estimator of the undetected PD shared between two assemblages. For functional diversity based on species traits, the formulas yield a new estimator of undetected FAD in a single assemblage as well as a new estimator of undetected FAD shared between two assemblages.

All the resulting Good-Turing estimators of undetected biodiversity measures (including three previously published, derived from traditional mathematical inequalities, and three new ones) are theoretically lower bounds of the corresponding undetected diversities. Good-Turing’s perspectives also reveal the sufficient conditions under which the derived estimators are nearly unbiased. It would be impossible to reveal these conditions with the traditional derivations of the corresponding estimators from inequalities. Thus, Good-Turing perspectives further justify and provide new insights for the three established estimators.

An important advance beyond previous studies is that
Good-Turing’s perspective here provides a novel and unified approach to the notation, terminology, and estimation of undetected diversity. All taxonomic, phylogenetic and functional diversity estimators are now derived and linked under the same framework. Simulation results are reported here to numerically validate Good-Turing theory and examine the performance of the derived estimators. An empirical dataset for tree species, collected in Brazilian rain forest, is used to illustrate all estimators.
SPECIES DIVERSITY: ONE-ASSEMBLAGE GOOD-TURING FORMULA
The original Good-Turing formula
In the one-assemblage model formulation, we assume that there are $S$ species and their true species relative abundances (or generally, species detection probabilities) are denoted by $(p_1, p_2, \ldots, p_S)$, $\sum_{i=1}^{S} p_i = 1$. Assume a sample of $n$ individuals is selected with replacement. Here we follow Good-Turing’s original model (Good 1953) in which the detection probability of each species is simply its relative abundance; all our derivations can be directly extended to a general model in which species detection probability is proportional to the product of its abundance and individual detectability; see Discussion. Let $X_i$ denote the species frequency (abundance) of the $i$-th species in the sample, $i = 1, 2, \ldots, S$, $\sum_{X_i > 0} X_i = n$. Only species with frequency $X > 0$ in the sample are detected. Those species with frequency $X = 0$ in the sample are not detected (although they are present in the assemblage) and are therefore not included in the sample data.

We define the abundance frequency count $f_r$ as the number of species each represented by exactly $r$ individuals in the sample. Good (1953) referred to it as “the frequency of frequency r.” Thus $f_1$ is the number of “singletons” (those species that are represented by exactly 1 individual in the sample), and $f_2$ is the number of “doubletons” (those that are represented by exactly 2 individuals in the sample). Let $S_{obs}$ denote the total number of those species observed in the sample; $S_{obs} = \sum_{i>0} f_i \equiv f_+$. Also, $f_0$ is the number of undetected species: species that are present in the assemblage of $S$ species, but were not detected in the sample of $n$ individuals and $S_{obs}$ species. Therefore, we have $S_{obs} + f_0 = S$.

Turing’s statistical problem can be formulated as follows. Given data, for those species that each appeared exactly $r$ times ($r = 0, 1, 2, \ldots$) in an incomplete sample of $n$ individuals, the mean of their true relative abundances/frequencies in the assemblage, $\alpha_r$, can be mathematically expressed as

$$\alpha_r = \sum_{i=1}^{S} p_i \, I(X_i = r) / f_r, \quad r = 0, 1, 2, \ldots \tag{1a}$$
where $I(A)$ is the indicator function, i.e., $I(A) = 1$ if the event $A$ occurs, and 0 otherwise. The numerator in Eq. (1a) represents the total true relative frequencies of those species that each appeared exactly $r$ times in the sample. Dividing the total by $f_r$, we obtain the mean (per species) of their relative frequencies. Turing and Good focused on the case of small $r$, i.e., rare species (or rare code elements, in Turing’s case). Note that for the special case of $r = 0$, Eq. (1a) implies

$$\alpha_0 f_0 = \sum_{i=1}^{S} p_i \, I(X_i = 0), \tag{1b}$$
which is the “coverage deficit” (Chao and Jost 2012) or the complement of the “sample coverage” defined in Good (1953). The coverage deficit of the sample quantifies the proportion of the total individuals in the assemblage that belong to undetected species; it is also the probability that a new, previously-undetected species would be found if the sample were enlarged by one individual. This is a very important measure in diversity estimation (Chao and Jost 2012).

Turing and Good discovered a surprisingly simple and remarkably effective, although non-intuitive, estimator for $\alpha_r$. The Good-Turing frequency formula states that $\alpha_r$, $r = 0, 1, 2, \ldots$, is not estimated by its sample frequency $r/n$, but rather by

$$\tilde{\alpha}_r = \frac{(r+1)}{n} \frac{f_{r+1}}{f_r}, \quad r = 0, 1, 2, \ldots \tag{1c}$$
2916 ANNE CHAO ET AL. Ecology, Vol. 98, No. 11
In other words, $\alpha_r$ should be estimated by $r^*/n$, where $r^* = (r+1) f_{r+1}/f_r$. The Good-Turing frequency formula is thus contrary to most people’s intuition because the estimator in (1c) depends not only on the sample frequency $r$ of the focal species, but also on the frequency information derived from species in the next frequency class, $r + 1$.

Good (1953) used a fully Bayesian approach to theoretically justify the formula (1c), whereas Robbins (1968) derived it as an empirical Bayes estimator. Good (2000) wrote “when preparing my 1953 article, I had forgotten Turing’s somewhat informal proof in 1940 or 1941, which involved cards or urn models in some way, and I worked out a separate proof [Bayes estimator]. I still don’t recall Turing’s proof.” Nevertheless, Good (1983, p. 28) provided a very intuitive non-Bayesian justification of the Good-Turing frequency formula as follows: Given an original sample of size $n$, consider the probability of the event that the next individual will be a species that had appeared $r$ times in the original sample. (Mathematically, this probability is simply $\sum_{i=1}^{S} p_i I(X_i = r) = \alpha_r f_r$, as defined in Eq. 1a.) If this event occurs, then the species to which the additional individual belongs must appear $r + 1$ times in the enlarged sample of size $n + 1$. Because the order in which individuals were sampled is assumed to be irrelevant, the total number of individuals in the enlarged sample of size $n + 1$ for those species (that appeared in the additional individual and had appeared $r$ times in the original sample) is $(r+1) f_{r+1}$. Thus, the probability of the aforementioned event in the enlarged sample of size $n + 1$ is $(r+1) f_{r+1}/(n+1)$, which can be approximated by $(r+1) f_{r+1}/n$ if $n$ is large enough. Dividing this by the number of such species, $f_r$, we obtain the mean relative abundance of those species, which is the classic Good-Turing frequency formula as given in Eq. (1c). Chiu et al. (2014b) proposed an improved formula $\hat{\alpha}_r$ shown below for $r = 0, 1, 2, \ldots$,

$$\hat{\alpha}_r = \frac{(r+1) f_{r+1}}{(n-r) f_r + (r+1) f_{r+1}} \approx \frac{(r+1) f_{r+1}}{(n-r) f_r}. \tag{1d}$$
This improved estimator generally has smaller mean squared error than the original Good-Turing estimator. In our subsequent derivation, we adopt the rightmost term in Eq. (1d); a simple non-Bayesian proof is provided (in Appendix S1) to facilitate the generalization to the two-assemblage case.
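As a rough illustration of Eqs. (1c) and (1d), the following Python sketch computes the frequency counts $f_r$ from a vector of sample abundances and then the Good-Turing mean relative frequency for species seen $r \ge 1$ times. The function names (`frequency_counts`, `good_turing_mean`, `coverage_deficit`) and the toy sample are hypothetical and chosen only for this example; they do not come from the paper or from any published package.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return a Counter mapping r -> f_r for the detected species (r >= 1)."""
    return Counter(x for x in abundances if x > 0)


def good_turing_mean(abundances, r, improved=True):
    """Estimate alpha_r for r >= 1.

    improved=False uses Eq. (1c); improved=True uses the rightmost term of Eq. (1d).
    """
    if r < 1:
        raise ValueError("f_0 is unobserved; use coverage_deficit() for r = 0")
    n = sum(abundances)
    f = frequency_counts(abundances)
    if f[r] == 0:
        raise ValueError("no species with abundance r in the sample")
    numerator = (r + 1) * f[r + 1]
    denominator = (n - r) * f[r] if improved else n * f[r]
    return numerator / denominator


def coverage_deficit(abundances):
    """Estimate the product alpha_0 * f_0 by the proportion of singletons, f_1 / n."""
    n = sum(abundances)
    return frequency_counts(abundances)[1] / n


# Hypothetical toy sample: n = 24 individuals, f_1 = 5, f_2 = 2.
sample = [10, 5, 2, 2, 1, 1, 1, 1, 1]
naive = 1 / sum(sample)                               # sample frequency r/n for r = 1
classic = good_turing_mean(sample, 1, improved=False)  # Eq. (1c)
improved = good_turing_mean(sample, 1)                 # Eq. (1d), rightmost term
```

In this toy sample the Good-Turing estimate for singletons is smaller than the naive sample frequency $1/n$, because some of the probability mass is reserved for the undetected species.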
Undetected species richness
Statistically, species richness (observed species plus the number of undetected species) is difficult to estimate accurately if there are many almost undetectable species in a hyper-diverse community. Practically, an accurate lower bound for species richness is preferable to an inaccurate point estimator. We now demonstrate that the improved Good-Turing formula can be intuitively used to provide a lower bound for the number of undetected species and to clearly reveal, for the first time, the conditions under which the lower bound is a nearly unbiased point estimator.

For $r = 0$, both Eq. (1c) and Eq. (1d) imply that the mean population relative frequency for those undetected species is approximately $\hat{\alpha}_0 = f_1/(n f_0)$, which is not obtainable from observed data because $f_0$ is unknown. However, this relation implies that the product of $\alpha_0$ and $f_0$, the proportion of the total number of individuals that is due to undetected species, can be well estimated by the proportion of singletons, $f_1/n$. For notational simplicity, let $\widehat{\alpha_0 f_0}$ denote the estimator of the product of $\alpha_0$ and $f_0$. Then we have

$$\widehat{\alpha_0 f_0} = \frac{f_1}{n}. \tag{2a}$$
Equation (1d) also implies that, for those species that appeared as singletons ($r = 1$) in a sample, their mean relative frequency is estimated by

$$\hat{\alpha}_1 = \frac{2 f_2}{(n-1) f_1}. \tag{2b}$$
Intuitively, we expect that the mean relative frequency of all undetected species should be less than the mean relative frequency of all singletons, i.e., $\alpha_0 \le \alpha_1$, and this ordering should be preserved by the corresponding estimates. Combining (2a) and (2b), we readily obtain a lower bound for the number of undetected species:

$$\hat{f}_0 = \frac{\widehat{\alpha_0 f_0}}{\hat{\alpha}_0} \ge \frac{\widehat{\alpha_0 f_0}}{\hat{\alpha}_1} = \frac{f_1/n}{2 f_2/[(n-1) f_1]} = \frac{(n-1)}{n} \frac{f_1^2}{2 f_2}. \tag{2c}$$
This lower bound for $f_0$ is identical to that proved rigorously by Chao (1984, 1987) by means of a Cauchy-Schwarz inequality: $E(f_0) \times 2E(f_2) \ge (1 - 1/n)[E(f_1)]^2$, which may not be intuitively understood by most ecologists. Here Good-Turing’s approach is intuitive and provides a sufficient condition for the resulting estimator being unbiased, as elaborated later. According to Eq. (2c), the estimated number of undetected species depends exclusively on the information on the rarest observed species (the number of singletons and doubletons). The idea behind this lower bound is that detected abundant species carry negligible information about the undetected species; detected rare species carry nearly all such information. From Eq. (2c), the Good-Turing formula leads to the following Chao1 species richness estimator, with a slight modification when $f_2 = 0$ (Colwell and Coddington 1994 gave the name Chao1 to this estimator):
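$$\hat{S}_{Chao1} = \begin{cases} S_{obs} + \dfrac{(n-1)}{n} \dfrac{f_1^2}{2 f_2}, & \text{if } f_2 > 0; \\[2ex] S_{obs} + \dfrac{(n-1)}{n} \dfrac{f_1 (f_1 - 1)}{2}, & \text{if } f_2 = 0. \end{cases} \tag{3a}$$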
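The Chao1 estimator in Eq. (3a) needs only the sample size, the observed richness, and the singleton and doubleton counts. A minimal Python sketch follows; the function name `chao1` and the toy abundance vector are hypothetical and used only for illustration.

```python
def chao1(abundances):
    """Return (S_hat, f0_hat): the Chao1 richness estimate of Eq. (3a) and its undetected part."""
    n = sum(abundances)
    if n == 0:
        raise ValueError("empty sample")
    s_obs = sum(1 for x in abundances if x > 0)
    f1 = sum(1 for x in abundances if x == 1)  # singletons
    f2 = sum(1 for x in abundances if x == 2)  # doubletons
    k = (n - 1) / n
    if f2 > 0:
        f0_hat = k * f1 ** 2 / (2 * f2)
    else:
        # modified form of Eq. (3a) that avoids a zero divisor
        f0_hat = k * f1 * (f1 - 1) / 2
    return s_obs + f0_hat, f0_hat


# Hypothetical toy sample: n = 24, S_obs = 9, f_1 = 5, f_2 = 2.
s_hat, f0_hat = chao1([10, 5, 2, 2, 1, 1, 1, 1, 1])
```

Note that abundant species enter only through $n$ and $S_{obs}$; the estimated number of undetected species is driven entirely by $f_1$ and $f_2$.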
Notice that, in the above derivation, if $\alpha_0 \approx \alpha_1$ (i.e., undetected species and singletons have approximately identical mean relative abundances), then the inequality sign in Eq. (2c) becomes an approximate equality, implying that the lower bound becomes a nearly unbiased point estimator. Only through the Good-Turing perspectives can this condition be revealed.

A simple sufficient condition for the Chao1 lower bound being nearly unbiased is that rare species (specifically, singletons and undetected species) have approximately homogeneous abundances (because this implies $\alpha_0 \approx \alpha_1$); in this case, the abundant species could be highly heterogeneous without affecting the estimator. However, if rare species are heterogeneous and sample size is not sufficiently large, then at best we can provide only a lower bound, because the data provide insufficient information to accurately estimate diversity when rare species abundances are substantially heterogeneous. A simulation study examining the performance of the Chao1 estimator is described in Discussion.

Chao (1987) applied a standard approximation method and obtained the following variance estimator:
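$$\widehat{\mathrm{var}}(\hat{S}_{Chao1}) = f_2 \left[ \frac{1}{4} \left( \frac{n-1}{n} \right)^2 \left( \frac{f_1}{f_2} \right)^4 + \left( \frac{n-1}{n} \right)^2 \left( \frac{f_1}{f_2} \right)^3 + \frac{1}{2} \left( \frac{n-1}{n} \right) \left( \frac{f_1}{f_2} \right)^2 \right]. \tag{3b}$$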
A confidence interval for species richness can then be obtained using a log-transformation, so that the lower confidence bound is always greater than the observed species richness if sampling is incomplete (i.e., there are singletons in the sample). When $f_2 = 0$, a slight modification of the variance formula is needed; see Chao and Chiu (2012).
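The variance in Eq. (3b) and a log-transformed interval can be sketched in Python as below. The interval form used here, $[S_{obs} + \hat{f}_0/C,\ S_{obs} + \hat{f}_0 C]$ with $C = \exp\{z [\log(1 + \widehat{\mathrm{var}}/\hat{f}_0^2)]^{1/2}\}$, is the commonly used log-normal construction and is an assumption of this example, since the text above does not spell it out. The function name is hypothetical, and the sketch covers only the case $f_1 > 0$ and $f_2 > 0$.

```python
import math


def chao1_variance_and_ci(abundances, z=1.96):
    """Return (S_hat, var_hat, (lower, upper)) for the Chao1 estimator when f1 > 0 and f2 > 0."""
    n = sum(abundances)
    s_obs = sum(1 for x in abundances if x > 0)
    f1 = sum(1 for x in abundances if x == 1)
    f2 = sum(1 for x in abundances if x == 2)
    if f1 == 0 or f2 == 0:
        raise ValueError("this sketch requires f1 > 0 and f2 > 0")
    k = (n - 1) / n
    ratio = f1 / f2
    f0_hat = k * f1 ** 2 / (2 * f2)  # Eq. (2c)
    s_hat = s_obs + f0_hat           # Eq. (3a)
    # Eq. (3b)
    var_hat = f2 * (0.25 * k ** 2 * ratio ** 4 + k ** 2 * ratio ** 3 + 0.5 * k * ratio ** 2)
    # Assumed log-normal interval for the undetected part, shifted by S_obs.
    c = math.exp(z * math.sqrt(math.log(1 + var_hat / f0_hat ** 2)))
    return s_hat, var_hat, (s_obs + f0_hat / c, s_obs + f0_hat * c)


# Hypothetical toy sample: n = 24, S_obs = 9, f_1 = 5, f_2 = 2.
s_hat, var_hat, (lower, upper) = chao1_variance_and_ci([10, 5, 2, 2, 1, 1, 1, 1, 1])
```

Because the interval is built on the undetected part $\hat{f}_0$ rather than on $\hat{S}_{Chao1}$ itself, the lower bound cannot fall below the observed richness.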
SPECIES DIVERSITY: TWO-ASSEMBLAGE GOOD-TURING FORMULA
Two-assemblage Good-Turing formulas
We now extend the one-assemblage model formulation and data framework to two assemblages (I and II), which can differ not only in their species richness, but also in their species composition. Assume that there are $S$ species in the pooled assemblage. The true relative species abundances or frequencies in Assemblages I and II are denoted by $(p_{11}, p_{21}, \ldots, p_{S1})$ and $(p_{12}, p_{22}, \ldots, p_{S2})$, respectively, $p_{i1}, p_{i2} \ge 0$, $i = 1, 2, \ldots, S$. Let the number of shared species between the two assemblages be $S_{shared}$ (or $S_{12}$ for notational simplicity). Without loss of generality, we assume that the first $S_{shared}$ species in the pooled assemblage are the shared species.

A random sample is taken from each of the two assemblages (Sample I with size $n_1$ from Assemblage I and Sample II with size $n_2$ from Assemblage II). Denote the number of observed shared species by $S_{shared,obs}$, and the observed species abundances in the two assemblages, respectively, by $(X_{11}, X_{21}, \ldots, X_{S1})$ and $(X_{12}, X_{22}, \ldots, X_{S2})$. For any two non-negative integers $r$ and $v$, define

$$f_{rv} = \sum_{i=1}^{S_{12}} I(X_{i1} = r,\ X_{i2} = v), \quad r, v = 0, 1, 2, \ldots \tag{4a}$$
That is, $f_{rv}$ denotes the number of shared species that are observed $r$ times in Sample I and $v$ times in Sample II. In particular, $f_{11}$ denotes the number of shared species that are singletons in both samples, and $f_{00}$ denotes the number of shared species that are undetected in both samples. Also, let $f_{r+}$ denote the number of shared species that are observed $r$ times in Sample I and are observed at least once (using a “+” sign to replace the index $v$) in Sample II, with a similar symmetric definition for $f_{+v}$. Thus, $f_{++}$ becomes the number of observed species shared between the two samples. Mathematically, we have the following expressions:
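$$f_{r+} = \sum_{i=1}^{S_{12}} I(X_{i1} = r,\ X_{i2} > 0) = \sum_{v>0} f_{rv}, \tag{4b}$$

$$f_{+v} = \sum_{i=1}^{S_{12}} I(X_{i1} > 0,\ X_{i2} = v) = \sum_{r>0} f_{rv}, \tag{4c}$$

$$f_{++} = \sum_{i=1}^{S_{12}} I(X_{i1} > 0,\ X_{i2} > 0) = \sum_{r,v>0} f_{rv} = S_{shared,obs}. \tag{4d}$$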
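The observable counts in Eqs. (4a)–(4d) can be tabulated directly from two aligned abundance vectors, as in the Python sketch below. It assumes the two vectors list the same pooled species in the same order (a zero marks a species not seen in that sample); the function name and the toy data are hypothetical. Only counts with $r, v > 0$ are computable from data, because a species missing from one sample cannot be identified as shared.

```python
from collections import Counter


def shared_frequency_counts(x1, x2):
    """Return (f_rv, f_r_plus, f_plus_v, f_plus_plus) for species observed in both samples."""
    if len(x1) != len(x2):
        raise ValueError("abundance vectors must be aligned by species")
    # Eq. (4a), restricted to r, v > 0
    f_rv = Counter((a, b) for a, b in zip(x1, x2) if a > 0 and b > 0)
    f_r_plus = Counter()  # Eq. (4b): sum of f_rv over v > 0
    f_plus_v = Counter()  # Eq. (4c): sum of f_rv over r > 0
    for (r, v), count in f_rv.items():
        f_r_plus[r] += count
        f_plus_v[v] += count
    f_plus_plus = sum(f_rv.values())  # Eq. (4d): observed shared richness
    return f_rv, f_r_plus, f_plus_v, f_plus_plus


# Hypothetical toy data: seven pooled species, five of them seen in both samples.
sample_1 = [3, 1, 1, 2, 0, 5, 1]
sample_2 = [1, 1, 2, 2, 4, 0, 1]
f_rv, f_r_plus, f_plus_v, s_shared_obs = shared_frequency_counts(sample_1, sample_2)
```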
Here we generalize the original Good-Turing formula to two assemblages. There are two parts to the generalization. The first part, below, is a direct generalization of the original formula; the second part is proved in Appendix S1 by an argument parallel to that applied in developing the original formula.

(1) Given two-sample data, let $\alpha_{r+} = \sum_{i=1}^{S_{12}} p_{i1} I(X_{i1} = r,\ X_{i2} > 0)/f_{r+}$ be the mean of the true relative frequencies in Assemblage I for those shared species that each appeared exactly $r$ times in Sample I and appeared at least once in Sample II. A direct generalization of the original Good-Turing formula in Eq. (1d) leads to

$$\hat{\alpha}_{r+} = \frac{(r+1) f_{r+1,+}}{(n_1 - r) f_{r+}}, \quad r = 0, 1, 2, \ldots \tag{5a}$$
Similarly, we have a symmetric formula for the mean of the true relative frequencies in Assemblage II for those shared species that appeared at least once in Sample I and appeared $v$ times in Sample II:

$$\hat{\alpha}_{+v} = \frac{(v+1) f_{+,v+1}}{(n_2 - v) f_{+v}}, \quad v = 0, 1, 2, \ldots \tag{5b}$$
(2) For any shared species that appeared exactly $r$ times in each sample, consider calculating the product of its true relative abundances in the two assemblages; the mean of the products among all such shared species (there are $f_{rr}$ such shared species) can be expressed as $\alpha_{rr} = \sum_{i=1}^{S_{12}} p_{i1} p_{i2} I(X_{i1} = r,\ X_{i2} = r)/f_{rr}$, $r = 0, 1, 2, \ldots$ The following generalized two-assemblage Good-Turing formula provides an estimator for $\alpha_{rr}$ (see Appendix S1 for a proof):

$$\hat{\alpha}_{rr} = \frac{(r+1)^2 f_{r+1,r+1}}{(n_1 - r)(n_2 - r) f_{rr}}, \quad r = 0, 1, 2, \ldots \tag{5c}$$
The generalized formulas in Eqs. (5a)–(5c) lead elegantly to an estimator of undetected shared species richness between two assemblages, as shown below.
Undetected shared species richness between two assemblages
Under the two-assemblage model formulation and data framework, the true number of shared species can be expressed as the sum of four terms:

$$S_{shared} = S_{shared,obs} + f_{0+} + f_{+0} + f_{00}. \tag{6}$$
The four terms in the right hand side of Eq. (6), as defined earlier, represent, respectively, the number of shared species observed in both samples ($f_{++}$), the number of shared species observed only in Sample II ($f_{0+}$), the number of shared species observed only in Sample I ($f_{+0}$), and the number of shared species undetected in both samples ($f_{00}$), but present in both assemblages. Only the first term is observable. The sum of the last three terms represents the total undetected shared species richness. Applying Eqs. (5a) and (5b) and using notation and derivation similar to Eq. (2c), we have

$$\hat{f}_{0+} = \frac{\widehat{\alpha_{0+} f_{0+}}}{\hat{\alpha}_{0+}} \ge \frac{\widehat{\alpha_{0+} f_{0+}}}{\hat{\alpha}_{1+}} = \frac{f_{1+}/n_1}{2 f_{2+}/[(n_1 - 1) f_{1+}]} = \frac{(n_1 - 1)}{n_1} \frac{f_{1+}^2}{2 f_{2+}}, \tag{7a}$$
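and a symmetric expression

$$\hat{f}_{+0} = \frac{\widehat{\alpha_{+0} f_{+0}}}{\hat{\alpha}_{+0}} \ge \frac{\widehat{\alpha_{+0} f_{+0}}}{\hat{\alpha}_{+1}} = \frac{f_{+1}/n_2}{2 f_{+2}/[(n_2 - 1) f_{+1}]} = \frac{(n_2 - 1)}{n_2} \frac{f_{+1}^2}{2 f_{+2}}. \tag{7b}$$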
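Furthermore, we expect that $\alpha_{00} \le \alpha_{11}$, and this ordering is preserved by the corresponding estimates, implying the following inequality, from Eq. (5c):

$$\hat{f}_{00} = \frac{\widehat{\alpha_{00} f_{00}}}{\hat{\alpha}_{00}} \ge \frac{\widehat{\alpha_{00} f_{00}}}{\hat{\alpha}_{11}} = \frac{f_{11}/(n_1 n_2)}{4 f_{22}/[(n_1 - 1)(n_2 - 1) f_{11}]} = \frac{(n_1 - 1)}{n_1} \frac{(n_2 - 1)}{n_2} \frac{f_{11}^2}{4 f_{22}}. \tag{7c}$$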
Combining (6) and (7a)–(7c), we obtain the following estimator of shared species richness:

$$\hat{S}_{Chao1,shared} = S_{shared,obs} + k_1 \frac{f_{1+}^2}{2 f_{2+}} + k_2 \frac{f_{+1}^2}{2 f_{+2}} + k_1 k_2 \frac{f_{11}^2}{4 f_{22}}, \tag{7d}$$
where $k_i = (n_i - 1)/n_i$, $i = 1, 2$. A modification similar to that in Eq. (3a) can be applied to each term to avoid a zero divisor. The sum of the last three terms estimates the undetected shared species richness. The resulting estimator in Eq. (7d) is identical to that derived by Pan et al. (2009), who derived it by means of complicated mathematical inequalities. Here we show that Good-Turing’s framework provides a simple, unified approach to inferring undetected diversity, not only for one assemblage but also for two assemblages. The estimator in Eq. (7d) is referred to as the Chao1-shared estimator because it can be regarded as an extension of the single-assemblage Chao1 estimator (Eq. 3a) to the case of two assemblages. Pan et al. (2009) also derived a variance estimator by using a standard approximation theory, allowing construction of a confidence interval for true shared species richness. From the above derivation, the Chao1-shared lower bound becomes a nearly unbiased point estimator provided $\alpha_{0+} \approx \alpha_{1+}$, $\alpha_{+0} \approx \alpha_{+1}$, and $\alpha_{00} \approx \alpha_{11}$ in the derivations. Considering only shared species, we see that a simple sufficient condition is that the undetected species and the detected singletons have approximately the same abundances in each of the two assemblages.
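A minimal Python sketch of the Chao1-shared estimator in Eq. (7d) is given below. It assumes two abundance vectors aligned by species over the pooled species list. The zero-divisor fallback, which replaces $f^2/(c \cdot f')$ with $f(f-1)/c$ when $f' = 0$, is one possible reading of "a modification similar to that in Eq. (3a)" and is an assumption of this example, as are the function names and the toy data.

```python
def _bound_term(f_low, f_high, divisor):
    """f_low^2 / (divisor * f_high), or f_low * (f_low - 1) / divisor when f_high == 0 (assumed fallback)."""
    if f_high > 0:
        return f_low ** 2 / (divisor * f_high)
    return f_low * (f_low - 1) / divisor


def chao1_shared(x1, x2):
    """Return (S_hat_shared, undetected_shared) following Eq. (7d)."""
    if len(x1) != len(x2):
        raise ValueError("abundance vectors must be aligned by species")
    n1, n2 = sum(x1), sum(x2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must contain individuals")
    pairs = [(a, b) for a, b in zip(x1, x2) if a > 0 and b > 0]
    s_shared_obs = len(pairs)                          # f_{++}
    f1p = sum(1 for a, _ in pairs if a == 1)           # f_{1+}
    f2p = sum(1 for a, _ in pairs if a == 2)           # f_{2+}
    fp1 = sum(1 for _, b in pairs if b == 1)           # f_{+1}
    fp2 = sum(1 for _, b in pairs if b == 2)           # f_{+2}
    f11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    f22 = sum(1 for a, b in pairs if a == 2 and b == 2)
    k1, k2 = (n1 - 1) / n1, (n2 - 1) / n2
    undetected = (k1 * _bound_term(f1p, f2p, 2)          # Eq. (7a)
                  + k2 * _bound_term(fp1, fp2, 2)        # Eq. (7b)
                  + k1 * k2 * _bound_term(f11, f22, 4))  # Eq. (7c)
    return s_shared_obs + undetected, undetected


# Hypothetical toy data: seven pooled species, five of them seen in both samples.
s_hat_shared, undetected_shared = chao1_shared([3, 1, 1, 2, 0, 5, 1], [1, 1, 2, 2, 4, 0, 1])
```

Species seen in only one sample contribute to $n_1$ or $n_2$ but not to any of the frequency counts, mirroring the restriction of Eqs. (4a)–(4d) to shared species.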
PHYLOGENETIC DIVERSITY
Undetected Faith’s PD in a single assemblage
To formulate phylogenetic diversity, we assume that all $S$ species in an assemblage are connected by a rooted ultrametric or non-ultrametric phylogenetic tree, with all species, observed and unobserved, as tip nodes. In this paper, all phylogenetic diversity measures and estimators are computed from a given fixed reference point that is ancestral to all taxa considered in the study. The choice of the reference point is thus independent of the sampling data. Assume that there are $B$ branch segments and $B$ corresponding nodes, $B \ge S$. Let $a_i$ denote the total relative abundance of the species descended from the $i$th node/branch, $i = 1, 2, \ldots, B$, and $L_i$ denote the length of branch $i$. Therefore, the set of species relative abundances $(p_1, p_2, \ldots, p_S)$ is expanded to a larger relative abundance set $\{a_i;\ i = 1, 2, \ldots, B\}$ with $(p_1, p_2, \ldots, p_S)$ as its first $S$ elements. For simplicity, we refer to $a_i$ as the node/branch relative abundance of the $i$th node/branch, although $\sum_{i=1}^{B} a_i$ is not necessarily equal to unity; see Fig. 1 of Chao et al. (2015a) for an illustrative example. Faith’s (1992) PD is expressed as $PD = \sum_{i=1}^{B} L_i$.

We assume an empirical sample of $n$ individuals with sample species abundances $(X_1, X_2, \ldots, X_S)$ is taken from the assemblage. Define $X_i^*$ as the sum of the observed species abundances for those species in the sample that are descended from branch $i$. Then we can expand the set of observed species abundances to a larger branch abundance set $\{X_i^*;\ i = 1, 2, \ldots, B\}$ with $(X_1, X_2, \ldots, X_S)$ as its first $S$ elements. We refer to $X_i^*$, $i = 1, 2, \ldots, B$, as the sample node/branch abundance of node/branch $i$. Let

$$g_r = \sum_{i=1}^{B} L_i \, I(X_i^* = r), \quad r = 0, 1, 2, \ldots, \tag{8a}$$
be the total length of those branches with sample abundance $r$ in the set $\{X_i^*;\ i = 1, 2, \ldots, B\}$, where the indicator function $I(\cdot)$ is defined in Eq. (1a). The undetected PD in the sample is $g_0$, which is the total length of undetected branches; $g_0$ is unknown but $\{g_1, g_2, \ldots\}$ can be computed from the sample and the observed tree (the tree spanned by the observed species). For $r > 0$, $g_r$ is identical to the total length of the branches with sample abundance $r$ in the observed tree. For example, $g_1$ denotes the total length of those nodes/branches with sample abundance = 1 in the observed tree; $g_2$ denotes the total length of those branches with sample abundance = 2 in the observed tree. Let $PD_{obs}$ denote the observed PD. Then we have $PD_{obs} = \sum_{i>0} g_i \equiv g_+$ and $PD = PD_{obs} + g_0$. A simple example is provided in Appendix S2: Fig. S1 to illustrate these measures.

We now extend the Good-Turing frequency formula to its phylogenetic version. For those nodes/branches that each have a sample abundance/frequency of $r$, the branch-length-weighted mean (per unit length) of their true abundances in the entire assemblage, $\lambda_r$, can be mathematically expressed as

$$\lambda_r = \sum_{i=1}^{B} L_i a_i \, I(X_i^* = r) / g_r, \quad r = 0, 1, 2, \ldots$$
Derivation steps parallel to those used for the species diversity lead to the following phylogenetic version of the single-assemblage Good-Turing frequency formula (Appendix S2):

$$\hat{\lambda}_r = \frac{(r+1)\, g_{r+1}}{(n-r)\, g_r}, \quad r = 0, 1, 2, \ldots \tag{8b}$$

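Intuitively, we expect that λ0 ≤ λ1, and this ordering is preserved by the corresponding estimates. From the above formula, a lower bound for the undetected PD is obtained:

$$\hat{g}_0 = \frac{\widehat{\lambda_0 g_0}}{\hat{\lambda}_0} \ge \frac{\widehat{\lambda_0 g_0}}{\hat{\lambda}_1} = \frac{g_1/n}{2 g_2 / [(n-1)\, g_1]} = \frac{(n-1)}{n}\, \frac{g_1^2}{2 g_2}. \tag{8c}$$
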
The lower bound in Eq. (8c) is identical to that proposed in Chao et al. (2015a) via the Cauchy-Schwarz inequality. Here, instead, we prove it under a unified framework by means of a phylogenetic version of the original Good-Turing formula. Cardoso et al. (2014) adapted the Chao1 estimator (Eq. 3a) to obtain the same estimator.
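As a numerical check on Eq. (8c), the following Python sketch evaluates the lower bound using the values of n, g1, and g2 reported for the two habitats in Table 3(a). The function name is ours and is not taken from the authors' R code; it is a minimal illustration rather than the published implementation.

```python
def undetected_pd_lower_bound(g1: float, g2: float, n: int) -> float:
    """Lower bound for the undetected PD, Eq. (8c): ((n - 1) / n) * g1^2 / (2 * g2)."""
    if g2 <= 0:
        # Eq. (8c) divides by g2; the modified estimator of Eq. (8d) covers g2 = 0.
        raise ValueError("Eq. (8c) is undefined when g2 = 0")
    return (n - 1) / n * g1 ** 2 / (2 * g2)


# n, g1, g2 as reported in Table 3(a) for the Edge and Interior habitats.
edge = undetected_pd_lower_bound(g1=6578, g2=2885, n=1794)
interior = undetected_pd_lower_bound(g1=7065, g2=3656, n=2074)
print(round(edge), round(interior))
```

Rounded to whole numbers, the two results agree with the undetected PD values listed in Table 3(a).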
When g2 is relatively small, including the case of g2 = 0, the above estimator may yield an extremely large value and thus exhibit a large variance. To cope with such cases, Chao et al. (2015a) and Hsieh and Chao (2017) proposed the following modified Chao1-PD estimator:

$$\widehat{PD}_{Chao1} = \begin{cases} PD_{obs} + \dfrac{(n-1)}{n}\, \dfrac{g_1^2}{2 g_2}, & \text{if } g_2 > \dfrac{g_1 f_2^*}{2 f_1^*}, \\[2ex] PD_{obs} + \dfrac{(n-1)}{n}\, \dfrac{g_1 (f_1^* - 1)}{2 (f_2^* + 1)}, & \text{if } g_2 \le \dfrac{g_1 f_2^*}{2 f_1^*}, \end{cases} \tag{8d}$$

where $f_1^*$ and $f_2^*$ denote, respectively, the number of nodes/branches with abundance = 1 and abundance = 2 in the observed tree. Since any node/branch with abundance = 1 can occur only at the tip nodes, we have $f_1^* = f_1$, i.e., $f_1^*$ is identical to the number of singletons at the tip nodes. However, $f_2^*$ is not necessarily equal to f2; see Appendix S2: Fig. S1 for an example.

Based on Eq. (8c), the Chao1-PD estimator is nearly unbiased when λ1 ≈ λ0, which means that the singletons (which occur only at the tip nodes) and the undetected nodes (which can be tip or interior nodes) have the same length-weighted mean abundances. A simple sufficient condition is that all rare nodes (including the singletons and the undetected) have approximately the same abundances. The variance of the Chao1-PD estimator can be obtained using Eq. (3b) with {f1, f2} being replaced by {g1, g2}. The construction of the confidence interval for Faith's PD based on the Chao1-PD estimator can be similarly obtained.

From the derivations above, we see that all estimation procedures and the derivation steps for developing the phylogenetic version of the Good-Turing formula and PD estimators are parallel to those for the original formula and species richness estimators. A summary of formulas and descriptions for estimating species richness and Faith's PD is provided in Appendix S2: Table S1, where the analogy between the two estimation frameworks is transparently displayed. The analogy was first proposed by Faith (1992). From Faith's perspective, each unit-length branch is regarded as a "feature" in phylogenetic diversity (like a "species" in species diversity). Chao et al. (2014a) subsequently referred to each unit-length branch segment as a phylogenetic entity. For example, a branch of 8-unit length is counted as 8 phylogenetic entities. All entities are phylogenetically equally distinct, just as all species are assumed taxonomically equally distinct in computing species richness. Instead of species, for PD we are measuring the total number of phylogenetic entities, or equivalently, the total branch length (because each entity has length of unity). Using this perspective, the measures of branch lengths {gk; k = 0, 1, . . .} in estimating PD play the same role as the frequency counts {fk; k = 0, 1, . . .} in estimating species richness. This analogy to counting up species means that most ecological indices defined at the species level can be converted to PD equivalents (by counting phylogenetic entities rather than species).
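To make the bookkeeping behind Eqs. (8a) and (8d) concrete, the sketch below tallies the branch-length totals g_r and the branch counts f*_r from a list of branches and then applies the two cases of Eq. (8d). The four-species tree, its branch lengths, and the function name `chao1_pd` are hypothetical and chosen only for illustration; the authors' R code may organize the computation differently.

```python
from collections import defaultdict


def chao1_pd(lengths, abundances, n):
    """Return (PD_obs, estimated undetected PD) following Eq. (8d).

    lengths[i]    -- length L_i of branch i in the observed tree
    abundances[i] -- sample node/branch abundance X*_i of that branch
    n             -- number of individuals in the sample
    """
    g = defaultdict(float)  # g[r]: total length of branches with abundance r
    f = defaultdict(int)    # f[r]: number of branches with abundance r (f*_r)
    for length, x in zip(lengths, abundances):
        g[x] += length
        f[x] += 1
    pd_obs = sum(length for length, x in zip(lengths, abundances) if x > 0)
    g1, g2, f1, f2 = g[1], g[2], f[1], f[2]
    if f1 == 0:
        return pd_obs, 0.0  # g1 = 0, so both cases of Eq. (8d) add nothing
    k = (n - 1) / n
    if g2 > g1 * f2 / (2 * f1):
        undetected = k * g1 ** 2 / (2 * g2)              # first case of Eq. (8d)
    else:
        undetected = k * g1 * (f1 - 1) / (2 * (f2 + 1))  # second case of Eq. (8d)
    return pd_obs, undetected


# Hypothetical observed tree ((A,B),(C,D)) with sample abundances A=1, B=1, C=2, D=6.
# Branches are listed as tips A, B, C, D followed by the interior branches AB and CD.
lengths = [2.0, 2.0, 3.0, 3.0, 4.0, 3.0]
abundances = [1, 1, 2, 6, 2, 8]  # an interior branch sums the tips below it
pd_obs, undetected = chao1_pd(lengths, abundances, n=10)
print(pd_obs, undetected)
```

In this toy tree the interior branch AB has abundance 2, so it contributes to g2 and to f*_2 even though no species is a doubleton, which is the situation described above where f*_2 differs from f2.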
2920 ANNE CHAO ET AL. Ecology, Vol. 98, No. 11
Undetected shared PD between two assemblages
Following the approach to the two-assemblage model formulation and the data framework described in the section Two-assemblage Good-Turing Formulas, we assume that all S species of the pooled assemblage are indexed by 1, 2, . . ., S. Assume these S species are connected by a rooted ultrametric or non-ultrametric phylogenetic tree, with the S species as tip nodes. Given a fixed reference point that is ancestral to all taxa considered in the study, we assume that there are B branch segments and B corresponding nodes in the pooled tree, and let Li denote the length of branch i. Because the phylogenetic tree of each individual assemblage is a sub-tree of the pooled tree, the diversity for each individual assemblage can be computed from the pooled tree structure with only the node (or branch) abundances varying between assemblages, as illustrated by Chiu et al. (2014a, Figure 2). Assume that there are B12 branches shared by the two assemblages for the given reference point, and without loss of generality, these shared branches are indexed by 1, 2, . . ., B12 for notational simplicity. Denote the true shared PD (the total length of branch segments shared by the two individual sub-trees) by PDshared and the shared PD between two observed trees by PDshared,obs.

As in the preceding section, the two sets of relative species abundances in Assemblages I and II are, respectively, expanded to the node/branch relative abundance (or frequency) sets {ai1; i = 1, 2, . . ., B} and {ai2; i = 1, 2, . . ., B}. We also expand the set of observed species frequencies (X11, X21, . . ., XS1) in Assemblage I to a larger sample node/branch frequency set $\{X_{i1}^*;\ i = 1, 2, \ldots, B\}$. Similarly, we extend the set of observed species frequencies (X12, X22, . . ., XS2) in Assemblage II to a larger set $\{X_{i2}^*;\ i = 1, 2, \ldots, B\}$. The frequency counts of shared species, frv (Eq. 4a), are also extended to their phylogenetic versions:

$$g_{rv} = \sum_{i=1}^{B_{12}} L_i\, I(X_{i1}^* = r,\ X_{i2}^* = v), \quad r, v = 0, 1, 2, \ldots \tag{9}$$

Here grv measures the total length of those shared branches with node/branch abundance r in Sample I and node/branch abundance v in Sample II. We can similarly define gr+, g+v and g++, as we did in Eqs. (4b)–(4d). Here gr+ denotes the total length of those shared branches that have abundance r in Sample I and non-zero node/branch abundance in Sample II, with a similar interpretation for g+v. Now g++ becomes the observed PD shared by the two samples, i.e., g++ = PDshared,obs. Thus {grv, gr+, g+v, g++; r, v = 0, 1, 2, . . .} are analogous to {frv, fr+, f+v, f++; r, v = 0, 1, 2, . . .} as defined in Eqs. (4a)–(4d). The only difference is that each member of the former set measures the lengths of shared branches, whereas each member of the latter set counts shared species.

The true shared PD can be expressed as the sum of four terms:

$$PD_{shared} = PD_{shared,obs} + g_{0+} + g_{+0} + g_{00}. \tag{10a}$$

The four terms on the right-hand side of the above equation represent, respectively, the observed shared PD, the shared PD that is missed only in Sample I, the shared PD that is missed only in Sample II, and the shared PD that is missed by both samples. For any shared branch segment with node/branch abundance r in each observed branch set, consider calculating the product of its true branch relative abundances in the two assemblages. The branch-length-weighted mean (per unit length) of the products among all such shared branches (their total length is grr) can be expressed as

$$\lambda_{rr} = \sum_{i=1}^{B_{12}} L_i a_{i1} a_{i2}\, I(X_{i1}^* = r,\ X_{i2}^* = r) \Big/ g_{rr}, \quad r = 0, 1, 2, \ldots$$

The two-assemblage phylogenetic version of the Good-Turing formula (see Appendix S2 for a proof) provides the following estimator of λrr, which is similar to that for shared species richness in Eq. (5c):

$$\hat{\lambda}_{rr} = \frac{(r+1)^2\, g_{r+1,r+1}}{(n_1 - r)(n_2 - r)\, g_{rr}}, \quad r = 0, 1, 2, \ldots \tag{10b}$$

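Under the intuitive expectation that λ00 ≤ λ11 (and this ordering is preserved by the corresponding estimates), we obtain the following lower bound of g00:

$$\hat{g}_{00} = \frac{\widehat{\lambda_{00} g_{00}}}{\hat{\lambda}_{00}} \ge \frac{\widehat{\lambda_{00} g_{00}}}{\hat{\lambda}_{11}} = \frac{g_{11}/(n_1 n_2)}{4 g_{22} / [(n_1 - 1)(n_2 - 1)\, g_{11}]} = \frac{(n_1 - 1)}{n_1}\, \frac{(n_2 - 1)}{n_2}\, \frac{g_{11}^2}{4 g_{22}}. \tag{10c}$$
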
Then the same estimation procedures we used to develop the Chao1-shared estimator in Eq. (7d) lead to the following Chao1-PD-shared estimator (Appendix S2):

$$\widehat{PD}_{Chao1} = PD_{shared,obs} + k_1 \frac{g_{1+}^2}{2 g_{2+}} + k_2 \frac{g_{+1}^2}{2 g_{+2}} + k_1 k_2 \frac{g_{11}^2}{4 g_{22}}, \tag{10d}$$

where ki = (ni − 1)/ni, i = 1, 2. The sum of the last three terms estimates the total length of undetected branches shared by two individual sub-trees. A modification similar to that proposed in Eq. (8d) can be applied to each of the three terms. The sufficient conditions for the Chao1-PD-shared estimator being nearly unbiased can be formulated in the same way as those for the Chao1-shared estimator, simply by replacing "shared species" with "shared nodes/branches."

The variance and confidence interval associated with this estimator follow directly from those for the Chao1-shared estimator, by replacing the counts {frv, fr+, f+v, f++; r, v = 0, 1, 2, . . .} with {grv, gr+, g+v, g++; r, v = 0, 1, 2, . . .} in the formulas. This correspondence also reflects the analogy between the estimation of shared species and shared PD between two assemblages. A comparison of estimation formulas used in estimating shared species richness and in estimating shared PD is provided in Appendix S2: Table S2, where any unit-length branch shared by two assemblages plays the same role as a "shared species" in the estimation of shared species richness. The measures of shared branch lengths {grv; r, v = 0, 1, 2, . . .} used in estimating shared PD play a role corresponding to the frequency counts {frv; r, v = 0, 1, 2, . . .} in estimating shared species diversity. Hence, not only within-assemblage but also between-assemblage measures defined at the species level can be converted to PD equivalents.
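The three undetected terms of Eq. (10d) can be evaluated directly from the shared branch-length summaries. The sketch below uses the values reported in Table 3(b) and takes Edge as Sample I (n1 = 1794) and Interior as Sample II (n2 = 2074); that assignment is our reading of the table rather than something stated in the text, and the function and argument names are ours.

```python
def chao1_pd_shared(pd_shared_obs, g1p, g2p, gp1, gp2, g11, g22, n1, n2):
    """Chao1-PD-shared estimator, Eq. (10d).

    g1p, g2p stand for g_{1+}, g_{2+}; gp1, gp2 stand for g_{+1}, g_{+2}.
    Returns the point estimate and the three estimated undetected terms.
    """
    k1 = (n1 - 1) / n1
    k2 = (n2 - 1) / n2
    g0p = k1 * g1p ** 2 / (2 * g2p)        # shared PD missed only in Sample I
    gp0 = k2 * gp1 ** 2 / (2 * gp2)        # shared PD missed only in Sample II
    g00 = k1 * k2 * g11 ** 2 / (4 * g22)   # shared PD missed by both samples
    return pd_shared_obs + g0p + gp0 + g00, (g0p, gp0, g00)


# Inputs as reported in Table 3(b).
estimate, (g0p, gp0, g00) = chao1_pd_shared(
    pd_shared_obs=20680, g1p=3929, g2p=2125, gp1=3888, gp2=2177,
    g11=1711, g22=463, n1=1794, n2=2074,
)
print(round(g0p), round(gp0), round(g00), round(estimate))
```

With these inputs the three terms and their sum agree, after rounding, with the g0+, g+0, g00, and Chao1-PD-shared entries of Table 3(b).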
FUNCTIONAL DIVERSITY
Undetected FAD in a single assemblage
Consider an assemblage in which all species are characterized by a set of functional traits, which can be categorical or continuous variables. The pairwise distances are calculated by some distance metric (e.g., Euclidean distance or Gower distance) based on the values of species traits. Let dij be the functional distance between the ith and jth species, with dij = dji > 0. Walker et al.'s (1999) FAD in the entire assemblage is expressed as $FAD = \sum_{i,j=1}^{S} d_{ij}$. In our estimation procedures, we do not require zero distance between any two species. That is, intraspecific variability in traits is allowed, so that dii may be greater than zero. For example, when trait values are available at the individual level, we can simply define the distance between two species as the mean of all distances between any pair of individuals, one chosen from each of the two species.

In our approach, one unit length of distance is analogous to a pair of species shared between two identical assemblages. We first extend all frequency counts of shared species, frv, defined in Eqs. (4a)–(4d) to the corresponding measures of pairwise distance based on a sample of n individuals, with a species sample abundance set (X1, X2, . . ., XS). Let Frv denote the total distance of those species pairs with sample abundances r and v respectively for the two species in each pair. That is,

$$F_{rv} = \sum_{i,j=1}^{S} d_{ij}\, I(X_i = r,\ X_j = v). \tag{11}$$

We can similarly define Fr+, F+v and F++, as we did in Eqs. (4b)–(4d). Here Fr+ denotes the total distance of those species pairs with sample abundances r and k, where k > 0 can be any positive integer, with a similar definition for F+v, while F++ becomes the observed FAD in the sample, i.e., F++ = FADobs.
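The quantities Frv, Fr+, F+v, and F++ are all sums of entries of the distance matrix selected by sample abundance, so one small function covers them. The sketch below is a plain-Python illustration on a hypothetical four-species assemblage; the distances, the abundances, and the convention of passing "+" for "any positive abundance" are our own choices. The fourth species is given abundance 0 so that terms involving undetected species can be shown, although with real data those distances are unknown.

```python
def f_rv(dist, x, r, v):
    """F_rv of Eq. (11); pass "+" for r or v to mean any positive abundance."""
    def matches(abundance, target):
        return abundance > 0 if target == "+" else abundance == target

    return sum(
        dist[i][j]
        for i in range(len(x))
        for j in range(len(x))
        if matches(x[i], r) and matches(x[j], v)
    )


# Hypothetical assemblage of four species; the last one is undetected (abundance 0).
x = [1, 1, 2, 0]
dist = [
    [0.0, 0.2, 0.5, 0.4],
    [0.2, 0.0, 0.3, 0.6],
    [0.5, 0.3, 0.0, 0.1],
    [0.4, 0.6, 0.1, 0.0],
]
fad_obs = f_rv(dist, x, "+", "+")   # F_++, the observed FAD
f11 = f_rv(dist, x, 1, 1)           # pairs in which both species are singletons
f1_plus = f_rv(dist, x, 1, "+")     # F_1+
fad_true = sum(sum(row) for row in dist)
print(fad_obs, f11, f1_plus, fad_true)
```

Because every ordered pair (i, j) falls into exactly one detection class, F++ + F0+ + F+0 + F00 recovers the total of the distance matrix in this toy example.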
Given one-sample data (data for a single assemblage) and the species-pairwise distance matrix for all species, the true FAD can be expressed as the sum of four terms:

$$FAD = FAD_{obs} + F_{0+} + F_{+0} + F_{00}. \tag{12a}$$

The four terms on the right-hand side of the above equation denote, respectively, the observed FAD, the undetected FAD for species pairs in which one species is missing from the sample (F0+ = F+0), and the undetected FAD for species pairs in which both species are missing from the sample (F00). This expression is analogous to Eq. (6) for shared species and to Eq. (10a) for shared PD. Out of all the species that appeared exactly r times in the sample, consider choosing any two to form a pair and calculating the product of their true relative abundances in the assemblage. The distance-weighted mean (per unit distance) of the products among all such pairs (their total pairwise distance is Frr) can be expressed as

$$\theta_{rr} = \sum_{i,j=1}^{S} d_{ij}\, p_i p_j\, I(X_i = r,\ X_j = r) \Big/ F_{rr}, \quad r = 0, 1, 2, \ldots$$

The functional version of the Good-Turing frequency formula provides the following estimator of θrr (see Appendix S3 for a proof):

$$\hat{\theta}_{rr} = \frac{(r+1)^2\, F_{r+1,r+1}}{(n - 2r)(n - 2r - 1)\, F_{rr}}, \quad r = 0, 1, 2, \ldots \tag{12b}$$

The above formula is slightly different from the two-sample Good-Turing formulas for estimating shared species (Eq. 5c) and shared PD (Eq. 10b), because, here, we only have one-sample data taken from a single assemblage, whereas the data for the previous formulas are taken from two different assemblages. Under the intuitive expectation that θ00 ≤ θ11 (and this ordering is preserved by the corresponding estimates), we obtain the following lower bound for F00:

$$\hat{F}_{00} = \frac{\widehat{\theta_{00} F_{00}}}{\hat{\theta}_{00}} \ge \frac{\widehat{\theta_{00} F_{00}}}{\hat{\theta}_{11}} = \frac{F_{11}/[n(n-1)]}{4 F_{22} / [(n-2)(n-3)\, F_{11}]} = \frac{(n-2)}{n}\, \frac{(n-3)}{(n-1)}\, \frac{F_{11}^2}{4 F_{22}}. \tag{12c}$$

Then the same estimation procedure that we used for Chao1-shared in Eq. (7d) leads to the following Chao1-FAD estimator:

$$\widehat{FAD}_{Chao1} = FAD_{obs} + k\, \frac{F_{1+}^2}{2 F_{2+}} + k\, \frac{F_{+1}^2}{2 F_{+2}} + \frac{(n-2)(n-3)}{n(n-1)}\, \frac{F_{11}^2}{4 F_{22}}, \tag{12d}$$

where k = (n − 1)/n. The sum of the last three terms estimates the undetected FAD. A modification similar to that proposed in Eq. (8d) can be applied to each of the three terms. The variance and confidence interval associated with this estimator follow directly from those for the Chao1-shared estimator. Under the condition that undetected species and singletons have approximately homogeneous abundances, the Chao1-FAD estimator is nearly unbiased for any given species-pairwise distance matrix.

A summary of formulas and descriptions for estimating shared species richness and FAD is given in Appendix S3: Table S1, where the analogy between the estimation procedures of the two measures can be seen. Chao et al. (2014a) define a "functional entity" as a species pair with one unit of distance between the two species. In FAD, a functional entity plays the same role as a "shared species" between two assemblages. For example, a species pair with distance dij = 5 is counted as 5 "shared species" (i.e., 5 functional entities). Thus the measures of total distances of species pairs, {Frv, Fr+, F+v, F++; r, v = 0, 1, 2, . . .}, play the same roles as the counts of shared species richness {frv, fr+, f+v, f++; r, v = 0, 1, 2, . . .} (defined in Eqs. 4a–4d).
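Eq. (12d) differs from its two-sample counterparts only in the factor applied to the last term, which the following sketch makes explicit. The inputs are the Edge and Interior entries as we read them from Table 4(a); since those entries are rounded, the results can be expected to match the tabulated estimates only to within a unit or two. The function name is ours and the code is an illustration, not the authors' implementation.

```python
def chao1_fad(fad_obs, f1p, f2p, f11, f22, n):
    """Chao1-FAD estimator, Eq. (12d), using the symmetry F_{1+} = F_{+1}, F_{2+} = F_{+2}.

    Returns the point estimate and the estimated undetected FAD.
    """
    k = (n - 1) / n
    f0p = k * f1p ** 2 / (2 * f2p)   # estimate of F_{0+}, equal to that of F_{+0}
    f00 = (n - 2) * (n - 3) / (n * (n - 1)) * f11 ** 2 / (4 * f22)
    undetected = 2 * f0p + f00       # F_{0+} + F_{+0} + F_{00}
    return fad_obs + undetected, undetected


edge_fad, edge_undetected = chao1_fad(36603, 12452, 5572, 4200, 837, n=1794)
interior_fad, interior_undetected = chao1_fad(43438, 14769, 6059, 4940, 825, n=2074)
print(round(edge_fad), round(interior_fad))
```

In both habitats the estimated undetected FAD is close to the observed FAD, so the estimated total is roughly twice the observed value.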
Undetected shared FAD between two assemblages
Under the two-assemblage model formulation and data framework described in the section Two-assemblage Good-Turing Formulas, we further assume that the functional distance between the ith and jth species for all S species of the pooled assemblage is denoted as dij, with dij = dji > 0. The FAD of each individual assemblage can be computed from a sub-matrix of the distance matrix of all S species in the pooled assemblage. Denote the number of shared species by S12, the shared FAD (i.e., the total pairwise distance among species that are each shared by the two assemblages) by FADshared, and the FAD based on the observed shared species by FADshared,obs.

Given two-sample data with species sample abundances (X11, X21, . . ., XS1) and (X12, X22, . . ., XS2), as defined earlier, we extend all frequency counts Frv defined in Eq. (11) to the two-assemblage case. That is, we define for r, v, k, m = 0, 1, 2, . . .

$$F_{(rv)(km)} = \sum_{i,j=1}^{S_{12}} d_{ij}\, I(X_{i1} = r,\ X_{j1} = v) \times I(X_{i2} = k,\ X_{j2} = m). \tag{13}$$

Here F(rv)(km) measures the total functional distance of those pairs of shared species with abundances (r, v) in Sample I and abundances (k, m) in Sample II for the two species in each pair. Where any index is replaced by a "+" sign, it means that those species occur at least once in the corresponding sample. Thus, we can define F(+v)(km), F(+v)(+m), F(++)(km), and F(rv)(+m), etc. Then shared FAD can be expressed as the sum of 16 terms (see Appendix S3: Table S2 for illustrative details):

$$FAD_{shared} = \sum_{r,v = +,0}\ \sum_{k,m = +,0} F_{(rv)(km)} = F_{(++)(++)} + F_{(++)(+0)} + F_{(++)(0+)} + \ldots + F_{(00)(00)}. \tag{14a}$$

Here F(++)(++) = FADshared,obs denotes the observed shared FAD; the other 15 terms are unknown. For each term, we can apply similar-type Good-Turing formulas to obtain an estimator as given in Appendix S3: Table S3. Then we have the following Chao1-FAD-shared estimator:

$$\widehat{FAD}_{Chao1,shared} = F_{(++)(++)} + \hat{F}_{(++)(+0)} + \hat{F}_{(++)(0+)} + \ldots + \hat{F}_{(00)(00)}. \tag{14b}$$

A bootstrap method can be used to assess the sampling variance of the above estimator and to obtain the associated confidence interval.
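The 16-term decomposition in Eq. (14a) amounts to classifying every ordered pair of shared species by whether each member was detected in each sample. The sketch below illustrates that classification on a hypothetical set of four shared species whose distances are all known; in real data only F(++)(++) is observable and the other terms must be estimated. The function name, the data, and the use of tuple keys (r, v, k, m) for F(rv)(km) are our own conventions.

```python
from itertools import product


def shared_fad_terms(dist, x1, x2):
    """Split pairwise distances among shared species into the 16 classes of Eq. (14a).

    The key (r, v, k, m) holds F_(rv)(km): r, v are the detection statuses of
    species i and j in Sample I, and k, m their statuses in Sample II.
    """
    def status(abundance):
        return "+" if abundance > 0 else "0"

    terms = {key: 0.0 for key in product("+0", repeat=4)}
    for i in range(len(x1)):
        for j in range(len(x1)):
            key = (status(x1[i]), status(x1[j]), status(x2[i]), status(x2[j]))
            terms[key] += dist[i][j]
    return terms


# Hypothetical shared species: 0 and 1 are detected in both samples,
# 2 is detected only in Sample II, and 3 is detected only in Sample I.
x1 = [2, 1, 0, 1]
x2 = [1, 3, 2, 0]
dist = [
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 3.0, 5.0],
    [2.0, 3.0, 0.0, 6.0],
    [4.0, 5.0, 6.0, 0.0],
]
terms = shared_fad_terms(dist, x1, x2)
fad_shared_obs = terms[("+", "+", "+", "+")]   # F_(++)(++)
print(fad_shared_obs, sum(terms.values()))
```

Summing all 16 entries returns the total of the distance matrix, which is the shared FAD of this toy assemblage, while the first entry alone is what a survey would actually observe.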
EXAMPLE
We apply all the estimators derived from the Good-Turing frequency formula and its generalizations to the rain-forest tree data described and discussed in Magnago et al. (2014). The tree species abundance data were collected between January 2011 and January 2012 from 11 forest fragments in Espírito Santo State, in south-eastern Brazil. Sampling data for 10 fragments included one Edge and one Interior transect, whereas in one fragment sampling data included two Edge and two Interior transects (Magnago et al. 2014), with a distance of 5.7 ± 2.4 km between transects. In total, the study comprised 11 fragments and 24 transects. Each Edge transect was placed about 5 m inside the fragment and parallel to the forest edge, and each Interior transect was located at least 300 m from the nearest edge. One of the goals of the original study was to compare functional diversity between Edge habitat and Interior habitat. Within each transect, every living tree with a diameter at breast height (DBH) > 4.8 cm and 1.3 m height was recorded.

In the original data (Table S2 of Magnago et al. 2014), pooled from all 24 transect samples in 11 fragments, there were 443 species among 4140 individual trees. However, in order to construct the phylogenetic tree for the observed species, using Phylomatic (http://www.phylodiversity.net/phylomatic; Webb and Donoghue 2005), we had to exclude 18 species, including 5 species not identified in the species list of Magnago et al. (2014) and 13 species not included in Phylomatic. Among the 18 species excluded, 12 species were found in both habitats, 3 species were unique to the Edge habitat and 3 species were unique to the Interior habitat. The data considered in our analysis thus include 425 species from a total of 3868 individuals; species abundance data are available in GitHub (https://github.com/AnneChao). The species abundance frequency counts of these 425 species are summarized in Table 1. There were 319 species (including 110 singletons and 48 doubletons) among 1794 individuals in the data from the Edge habitat, and 356 species (including 123 singletons and 48 doubletons) among 2074 individuals in the data from the Interior habitat. There were 250 species in common between the two habitats in the data. The sample coverage estimates for these two habitats are nearly equal (93.9% for the Edge habitat and 94.1% for the Interior habitat), in spite of different sample sizes.
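The two coverage values can be reproduced from the singleton and doubleton counts alone. This section does not restate the coverage formula, so the sketch below assumes the estimator of Chao and Jost (2012), which refines Turing's 1 − f1/n; the function name is ours.

```python
def sample_coverage(f1: int, f2: int, n: int) -> float:
    """Sample coverage estimate, assuming the Chao and Jost (2012) form:
    1 - (f1 / n) * [(n - 1) f1 / ((n - 1) f1 + 2 f2)].
    """
    if f1 == 0:
        return 1.0  # no singletons: the sample is estimated to be complete
    return 1 - (f1 / n) * ((n - 1) * f1 / ((n - 1) * f1 + 2 * f2))


edge_cov = sample_coverage(f1=110, f2=48, n=1794)
interior_cov = sample_coverage(f1=123, f2=48, n=2074)
print(round(100 * edge_cov, 1), round(100 * interior_cov, 1))
```

Under this assumption the Edge and Interior values round to the percentages quoted above, and the simpler estimate 1 − f1/n gives the same figures to one decimal place.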
To illustrate our estimators, we focus only on the diversity analysis for the pooled samples over all fragments. That is, we regard the pooled tree records as sampling data from the entire study area represented by the 11 fragments and aim to infer and compare the estimated true or asymptotic diversity (observed plus undetected) of the two habitats for the whole study area. We could also apply our estimation to the Edge and Interior transect data within each fragment to compare the diversity of the two habitats; in this case, the target becomes the estimated true diversity of each fragment. Comparison of compositional, phylogenetic, and functional differentiation among the 11 fragments, i.e., beta diversity with adjustment for under-sampling bias, will be reported elsewhere.

Estimates for species richness and shared species richness between the two habitats appear in Table 2. The undetected species richness estimates (Eq. 2c) for the Edge and Interior habitats are, respectively, 126 and 158 species, implying that the total richness estimate (Eq. 3a) for the Edge habitat is 445 species, with a 95% confidence interval of (396, 525), and for the Interior habitat 514 species, with a 95% confidence interval of (455, 609). These results show that the Interior habitat has higher estimated species richness; however, the difference is not statistically significant at a level of 5%. The undetected shared species richness between the two habitats is estimated to be 139, leading to an estimate of shared richness (Eq. 7d) of 389 species, with a 95% confidence interval of (347, 450). Although the true species richness and shared species richness between the two habitats are unknown and our estimates theoretically represent lower bounds, the data collector among us (L. F. S. Magnago) believes these estimates provide reasonable adjustments based on his field experience, the CVRD (Herbarium of Natural Vale Reserve) herbarium collection, and other floristic studies for the region (Rolim and Nascimento 1997, Jesus and Rolim 2005, Paula and Soares 2011). We also assume that this conclusion can be extended to the following estimates of phylogenetic diversity and functional diversity. The biological processes that may give rise to detection failure are elaborated in the Discussion section.

The phylogenetic tree (in Newick format) of the 425 observed species, which was constructed using the software Phylomatic (Webb and Donoghue 2005), is given in GitHub (https://github.com/AnneChao). The observed Faith's PD values are, respectively, 24516 and 27727 Myr in the Edge and Interior habitats, with a shared PD of 20680 Myr. Table 3 shows the estimates for Faith's PD in each habitat and shared PD between the two habitats. The PD estimates (Eq. 8d) for the Edge and Interior habitats are, respectively, 32011 with a 95% confidence interval of (31542, 32511), and 34550 with a 95% confidence interval of (34143, 34983). The two intervals do not overlap, implying that the Interior habitat has significantly higher PD than the Edge habitat. The shared PD between the two habitats is estimated (Eq. 10d) to be 29360 Myr, with a 95% confidence interval of (28976, 29761).
TABLE 1. The species frequency counts of the tree abundance data from two habitats (Edge and Interior) in south-eastern Brazil (Magnago et al. 2014), where fi denotes the number of species represented by exactly i individuals in the sample.

Each row gives the number of individuals in the sample, i, followed by the number of species, fi (Edge, then Interior):

1: 110, 123
2: 48, 48
3: 38, 41
4: 28, 32
5: 13, 19
6: 11, 17
7: 12, 6
8: 5, 7
9: 4, 6
10: 6, 7
11: 3, 6
12: 4, 3
13: 3, 3
14: 5, 3
15: 2, 3
16: 4, 4
17: 2, 2
18: 2, 1
19: 2, 3
20: 2, 4
21: 1, 5
23: 2, 2
25: 1, 1
27: 1, 2
28: 1, 1
30: 1
32: 1, 2
34: 1
35: 1
36: 2
37: 1
41: 1
45: 1
46: 1
49: 1
52: 1
110: 1
123: 1
140: 1

Notes: Detailed species abundances are taken from Table S2 of Magnago et al. (2014). For the Edge habitat, Sobs = 319, n = 1794, sample coverage estimate = 93.9%; for the Interior habitat, Sobs = 356, n = 2074, sample coverage estimate = 94.1%.
The functional traits for these tree data were based primarily on morphological and physical characteristics of trees, their roles as trophic resources, their dispersal modes, and their roles in carbon storage and forest structure. All observed species were described by a set of six functional traits, including five categorical variables: fruit size (size categories), seed size (size categories), fruit type, fruit dispersal syndrome, and successional group, together with one quantitative variable: wood density. Based on these six traits, the species pairwise distance matrix for the data (pooled across replicates) was calculated by a Gower mixed-variables coefficient of distance with equal weights for all traits; the Gower distance matrix of the 425 observed species is provided in GitHub (https://github.com/AnneChao). Table 4 shows the estimates for FAD in each habitat and shared FAD between the two habitats. In the Edge and Interior habitats, the observed FAD (total pairwise Gower distances) are 36603 and 43438, respectively. Based on Eq. (12d), the FAD estimate is 69670 in the Edge habitat, with a 95% confidence interval of (68764, 70602), and 86802 in the Interior habitat, with a 95% confidence interval of (85659, 87975). Here each estimate of true FAD is almost double the observed FAD. We can conclude that the FAD in the Interior habitat is significantly higher than the FAD in the Edge habitat because the two intervals do not overlap. The shared FAD between the two habitats is estimated (Eq. 14b) to be 49061 with a 95% bootstrap confidence interval of (42034, 58562) based on 200 bootstrap replications. This wide confidence interval reflects the large sampling variance that arises because 15 parameters must be estimated and each estimate is subject to some degree of sampling uncertainty.

Our analysis suggests that the pooled Interior habitat has significantly higher phylogenetic diversity and significantly higher functional diversity than the pooled Edge habitat, but no such significance can be concluded for species diversity. These results are generally consistent with previous findings (Magnago et al. 2014, Matos et al. 2017) based on empirical data and model expectations. The previous authors found that the Edge habitat characteristics (microclimate, light, etc.) do not change with increasing fragment size whereas the Interior habitat characteristics change with fragment size. Consequently, fragment size usually does not produce effects on species richness and composition at fragment edges, but strongly influences the diversity in the interior of fragments. Because all fragment sizes are considered together in this paper, we thus expect greater diversity values for Interior than for Edge habitats.
TABLE 2. Summary of species data and richness estimates based on tree species abundance data for the Edge and Interior habitats in forest fragments of south-eastern Brazil (Magnago et al. 2014), showing (a) undetected species richness and Chao1 richness point and interval estimates for each habitat (see Eq. 3a) and (b) undetected shared species richness between the two habitats and the corresponding Chao1-shared richness point and interval estimates (see Eq. 7d).

(a)
Habitat   Sample size  f1   f2  Observed richness  Undetected richness  Chao1 richness  95% conf. interval
Edge      1794         110  48  319                126                  445             396, 525
Interior  2074         123  48  356                158                  514             455, 609

(b)
Observed shared richness  f+1  f+2  f1+  f2+  f11  f22  f+0  f0+  f00  Undetected shared richness  Chao1-shared richness  95% conf. interval
250                       64   30   60   37   25   7    68   49   22   139                         389                    347, 450
TABLE 3. Summary of phylogenetic data and diversity estimates for the Edge and Interior habitats in forest fragments of south-eastern Brazil (Magnago et al. 2014), showing (a) undetected PD and Chao1-PD point and interval estimates for each habitat (see Eq. 8d), and (b) undetected shared PD between the two habitats and the corresponding Chao1-PD-shared point and interval estimates (see Eq. 10d).

(a)
Habitat   Sample size  g1    g2    Observed PD  Undetected PD  Chao1-PD  95% conf. interval
Edge      1794         6578  2885  24516        7495           32011     31542, 32511
Interior  2074         7065  3656  27727        6823           34550     34143, 34983

(b)
Observed shared PD  g+1   g+2   g1+   g2+   g11   g22  g+0   g0+   g00   Undetected shared PD  Chao1-PD shared  95% conf. interval
20680               3888  2177  3929  2125  1711  463  3470  3630  1579  8680                  29360            28976, 29761

Note: The phylogenetic tree for all observed species is based on Phylomatic (Webb and Donoghue 2005).
CONCLUSION AND DISCUSSION
We have generalized the original one-assemblage Good-Turing frequency formula (Eqs. 1c and 1d) to the case of two assemblages (Eq. 5c), and also extended it to a phylogenetic version (Eqs. 8b and 10b) as well as a functional trait version (Eq. 12b and Appendix S3: Table S3). We have also applied the original and generalized formulas to obtain various estimators of undetected species, phylogenetic, and functional diversity, as summarized below.

1. For species diversity, the estimator of species richness derived from the Good-Turing frequency formula is identical to the Chao1 estimator (Chao 1984, 1987) in Eq. (3a). The estimator of shared species between two assemblages in Eq. (7d) is identical to the Chao1-shared species estimator proposed in Pan et al. (2009).
2. For phylogenetic diversity, the resulting estimator (Eq. 8d) of Faith's PD is identical to the Chao1-PD estimator proposed recently by Chao et al. (2015a), but the estimator of the shared Faith's PD (Eq. 10d) is new.
3. For functional trait diversity, the estimators of FAD in a single assemblage (Eq. 12d) and of shared FAD between two assemblages (Eq. 14b) are both new; see Appendix S3: Tables S1–S3 for a summary.

The R code "Good-Turing" for computing all the estimators discussed in this paper is available in GitHub (https://github.com/AnneChao) along with a description of the running procedures. As an alternative, readers without a background in R can use the online software "GoodTuring", made available from https://chao.shinyapps.io/GoodTuring/, to facilitate all computations.

We have proved that each of the derived estimators is
theoretically a lower bound of the corresponding diver-sity. Good-Turing’s perspectives reveal the sufficientconditions under which the resulting estimator is nearlyunbiased. For example, a simple sufficient condition forthe Chao1 species richness estimator being nearly unbi-ased is that rare species (specifically, undetected speciesand singletons in sample) have approximately homoge-nous abundances. Similar conditions for other estima-tors are also clearly specified in each subsection. See thenext paragraph for more relaxed conditions.Although, in our derivations, we follow the Good-Tur-
ing original model by assuming that the detection proba-bility of each species is simply its relative abundance, allour derivations can be directly extended to a generalmodel as discussed in Chao and Chiu (2016). The generalmodel assumes that detection probability is proportionalto the product of abundance and individual detectability,which may vary among species. Based on samples ofindividuals, individual detectability for mobile organismsis determined by many possible factors such as individualmovement patterns, color, size, habitat, life cycle andvocalizations; for assemblages of sessile organisms,such as trees, that are surveyed from selected samplingunits (e.g., transects, plots or quadrats), individualT
TABLE 4. Summary of functional data and diversity estimates based on tree species abundance data, as well as species traits collected from the Edge and Interior habitats of forest fragments in south-eastern Brazil (Magnago et al. 2014), showing (a) undetected FAD and Chao1-FAD point and interval estimates for each habitat (see Eq. 12d), and (b) undetected shared FAD between the two habitats and the corresponding Chao1-FAD-shared point and interval estimates (see Eq. 14b and Appendix S3).

(a)
Habitat | Observed FAD | F_{+1} (= F_{1+}) | F_{+2} (= F_{2+}) | F_{11} | F_{22} | F_{+0} = F_{0+} | F_{00} | Undetected FAD | Chao1-FAD | 95% conf. interval
Edge | 36603 | 12452 | 5572 | 4200 | 837 | 13906 | 5256 | 33068 | 69670 | 68764, 70602
Interior | 43438 | 14769 | 6059 | 4940 | 825 | 17992 | 7380 | 43364 | 86802 | 85659, 87975

(b)
Quantity | Value
Observed shared FAD | 22079
F_{(++)(00)} | 727
F_{(++)(0+)} = F_{(++)(+0)} | 5793
F_{(00)(++)} | 671
F_{(+0)(++)} = F_{(0+)(++)} | 3906
F_{(+0)(0+)} = F_{(0+)(+0)} | 1028
F_{(+0)(00)} = F_{(0+)(00)} | 166
F_{(+0)(+0)} = F_{(0+)(0+)} | 1753
F_{(00)(+0)} = F_{(00)(0+)} | 128
F_{(00)(00)} | 34
Undetected shared FAD | 26981
Chao1-FAD-shared | 49061
95% conf. interval | 42034, 58562

Note: The functional distance matrix between any two observed species in the pooled assemblage is obtained by Gower distance.
detectability may depend on the area and topography of selected sampling units, species spatial/temporal aggregation or clustering, life history stage, as well as other factors. Consequently, our estimators are actually valid under more relaxed conditions. For example, the Chao1 species richness estimator is nearly unbiased when hard-to-detect species (specifically, undetected species and species detected by one individual) have approximately homogeneous detection probabilities. Similar relaxed conditions can be formulated for other estimators. Under the special case that all individuals have the same detectability, the detection probability of each species reduces to its relative abundance. In other applications in which species detection probability may vary with time, our method can also be applied when the probability of detecting a species is modeled as the average detection probability over the sampling-time interval.

When rare or hard-to-detect species are highly heterogeneous in detection probabilities, such as in microbial assemblages or DNA sequencing data, all estimators derived in this paper provide non-parametric lower bounds which are valid for species, phylogenetic and functional diversities. In such assemblages, sample data do not provide sufficient information to accurately estimate asymptotic diversities due to heterogeneity of rare species; no statistical methods can produce reliable estimates unless strong assumptions are made. In such cases, from our perspective, an accurate lower bound is more practically useful than an imprecise point estimate. In Appendix S4, some representative simulation results are reported to validate the estimators derived from Good-Turing theory and also to demonstrate that our estimators provide useful and informative lower bounds when accurate point estimators are not attainable.

For cases that fail to meet or may not be assured to meet the criteria for nearly-unbiased point estimation, how can we make fair comparison of diversities across studies? We suggest using a non-asymptotic approach via sample-size- and coverage-based rarefaction and extrapolation on the basis of standardized sample size or sample completeness (as measured by sample coverage). This non-asymptotic approach facilitates fair comparison of diversities for equally-large or equally-complete samples across multiple assemblages. For species richness, sample-size-based rarefaction and extrapolation methods were developed by Colwell et al. (2012), and corresponding coverage-based methods were proposed by Chao and Jost (2012). For Faith's PD, a similar non-asymptotic method was recently presented by Chao et al. (2015a) and Hsieh and Chao (2017). Rarefaction and extrapolation methods for functional diversity are still under development by the authors.

For the analysis of multiple-assemblage phylogenetic structures, Webb et al. (2002) and Webb et al. (2008) proposed using a null model randomization test based on PD (or other metrics) to assess whether a specific assemblage has a higher PD (more even or uniform) or a lower PD (phylogenetic aggregation or clustering) than expected from an assemblage randomly sampled from the observed species pool. That is, the observed species pool is the composite list of species from all assemblages; PD is calculated from the phylogenetic tree spanned by these detected species. In Webb et al.'s approach and in most null model or randomization tests (Gotelli et al. 2010 is an exception), it is assumed that (1) sampling is complete for all assemblages, and (2) the aggregation of observed species for all assemblages constitutes the "complete" species pool. However, for the Brazilian rain forest data analyzed here, if we treat each fragment as an assemblage, then neither of these assumptions is satisfied. Within each fragment, some species were undetected by the transect data, and some species remained undetected even in the pooled transect data for the entire study area represented by the 11 fragments. In the following, we use the Brazilian data to illustrate our suggested modifications.
1. Based on the pooled transect data from all 11 fragments, the Chao1 richness estimator yields the estimated number of species for the complete pool, which comprises all species recorded in the fragment transects plus undetected species in the study area represented by these fragments. Following the approach of Chao et al. (2015b), we can construct a complete species-rank abundance distribution (RAD) by separately adjusting the sample relative abundances for the set of species detected in the pooled, observed data and estimating the relative abundances for the set of species undetected in the pooled data, but inferred to be present in the area represented by all fragments. The combined RAD then fully characterizes the taxonomic assemblage structure. Thus, not only species richness but also species relative abundances can be estimated for all species in the estimated complete pool. For this analysis, it is not necessary to know the identities of the undetected species, but we do need to estimate their number and relative abundances in the complete pool.
2. Next, a random assemblage of the same size as each specific assemblage is sampled from the complete pool. If the sampled assemblage includes species that belong to the group of undetected species in the pooled data, then we treat those species as unresolved in their phylogenetic placement and locate those undetected species on the observed tree in some random manner (e.g., Rangel et al. 2015). How to optimally locate undetected species is still under investigation. The Chao1-PD estimate (Eq. 8d) can thus be calculated not only for the data of the specific assemblage but also for each random assemblage sampled from the complete pool.
3. After many assemblages of the same size have been randomly selected from the complete species pool as in Step (2), the mean and standard error of the resulting PD values can be computed to obtain the standardized effect size (Gotelli and McCabe 2002), adjusted for the under-sampling biases for both the specific assemblage and the pooled assemblage.
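The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' software: the function name `standardized_effect_size`, the toy pool, and the use of observed richness as a stand-in for the Chao1-PD metric are all hypothetical, and individuals are drawn with replacement according to the estimated relative abundances. The standardized effect size is computed as the observed value minus the null mean, divided by the standard deviation of the null values.

```python
import random
import statistics

def standardized_effect_size(observed_value, pool_species, pool_rel_abund,
                             n_individuals, metric, n_random=999, seed=1):
    """Return (observed - mean(null)) / sd(null) for assemblages drawn from a pool.

    pool_species / pool_rel_abund: the estimated complete pool (detected plus
    inferred undetected species) and its relative abundances (Step 1).
    metric: hypothetical user-supplied callable mapping a list of sampled
    species labels to a diversity value (e.g. a Chao1-PD routine).
    """
    rng = random.Random(seed)
    null_values = []
    for _ in range(n_random):
        # Step 2: draw one random assemblage of the same size from the pool.
        draw = rng.choices(pool_species, weights=pool_rel_abund, k=n_individuals)
        null_values.append(metric(draw))
    # Step 3: standardize the observed value by the null mean and spread.
    null_mean = statistics.mean(null_values)
    null_sd = statistics.stdev(null_values)
    if null_sd == 0:
        raise ValueError("null distribution has zero variance")
    return (observed_value - null_mean) / null_sd

# Toy usage: richness of the draw stands in for the diversity metric.
pool = [f"sp{i}" for i in range(20)]
weights = [1.0 / (i + 1) for i in range(20)]  # hypothetical complete RAD
ses = standardized_effect_size(3, pool, weights, n_individuals=15,
                               metric=lambda draw: len(set(draw)),
                               n_random=200)
```

A negative value indicates an observed diversity below the null expectation; in a real analysis the metric would also have to handle the random phylogenetic placement of undetected species described in Step 2.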
A similar procedure can be applied to FAD and other metrics, although this procedure should first be benchmarked with simulated and empirical data sets to assess its performance and to see if this procedure is an effective remedy for the under-sampling problem that is pervasive in the analysis of community structure based on standardized biodiversity sampling of multiple samples.

This paper is restricted to the estimation of diversity in one assemblage and the estimation of shared diversity between two assemblages. When there are more than two assemblages, our approach can be further extended to such cases. All the derivations are nearly parallel. For example, we can directly obtain an estimator of the species shared by multiple assemblages based on sampling data from each assemblage, and the resulting estimator is identical to the one proposed by Pan et al. (2009).

Our derivation in this paper is limited to individual-based abundance data for individuals sampled randomly from assemblages. In many ecological field studies, the sampling unit is not an individual, but a trap, net, quadrat, plot, or timed survey. For such studies, the sampling units, not the individuals, are sampled randomly and independently. In these cases, estimation is usually based on a set of sampling units in which only the incidence (detection or non-detection) of each species is recorded. This type of data is referred to as (multiple) incidence data. The sampling model and estimation were developed in Colwell et al. (2012) and Chao et al. (2014b). Chao and Colwell (2017) recently extended the Good-Turing frequency formula and its generalizations to incidence data. All estimators can be computed from the online software "GoodTuring". We expect that Good-Turing's theory will find wide applications in biodiversity studies; see Chao et al. (2017) for a recent application.
ACKNOWLEDGMENTS
The authors thank a Subject Matter Editor (Tom Miller), Joaquín Hortal, and an anonymous reviewer for very thoughtful and helpful comments and suggestions. This work was supported by the Taiwan Ministry of Science and Technology under Contracts 104-2628-M007-003 and 105-2628-M007-001 (for AC) and 104-2118-M-002-008-MY3 (for CHC). RKC and RLC were supported by CAPES Ciência sem Fronteiras (Brazil). LFSM was supported by CAPES/PNPD. NJG was supported by U.S. NSF DEB 1257625, NSF DEB 1144055, and NSF DEB 1136644. The fieldwork survey was supported by Reserva Natural Vale, Fibria Celulose S.A., Marcos Daniel Institute, Pro-Tapir project and Reserva Biológica de Sooretama.
LITERATURE CITED
Bryant, J. A., C. Lamanna, H. Morlon, A. J. Kerkhoff, B. J. Enquist, and J. L. Green. 2008. Microbes on mountainsides: contrasting elevational patterns of bacterial and plant diversity. Proceedings of the National Academy of Sciences of the United States of America 105:11505–11511.
Cadotte, M. W., J. Cavender-Bares, D. Tilman, and T. H. Oakley. 2009. Using phylogenetic, functional and trait diversity to understand patterns of plant community productivity. PLoS ONE 4:e5695.
Cardoso, P., F. Rigal, P. A. Borges, and J. C. Carvalho. 2014. A new frontier in biodiversity inventory: a proposal for estimators of phylogenetic and functional diversity. Methods in Ecology and Evolution 5:452–461.
Cavender-Bares, J., D. D. Ackerly, and K. H. Kozak. 2012. Integrating ecology and phylogenetics: the footprint of history in modern-day communities. Ecology 93:S1–S3.
Chao, A. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11:265–270.
Chao, A. 1987. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791.
Chao, A. 2005. Species estimation and applications. Pages 7907–7916 in N. Balakrishnan, C. Read, and B. Vidakovic, editors. Encyclopedia of statistical sciences. Wiley, New York.
Chao, A., and C.-H. Chiu. 2012. Estimation of species richness and shared species richness. Pages 76–111 in N. Balakrishnan, editor. Methods and applications of statistics in the atmospheric and earth sciences. Wiley, New York, USA.
Chao, A., and C.-H. Chiu. 2016. Species richness: estimation and comparison. Wiley StatsRef: Statistics Reference Online. 1–26.
Chao, A., and R. K. Colwell. 2017. Thirty years of progeny from Chao's inequality: estimating and comparing richness with incidence data and incomplete sampling (invited article). Statistics and Operations Research Transactions 41:3–54.
Chao, A., and L. Jost. 2012. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93:2533–2547.
Chao, A., C.-H. Chiu, and L. Jost. 2014a. Unifying species diversity, phylogenetic diversity, functional diversity, and related similarity and differentiation measures through Hill numbers. Annual Review of Ecology, Evolution, and Systematics 45:297–324.
Chao, A., N. J. Gotelli, T. C. Hsieh, E. L. Sander, K. Ma, R. K. Colwell, and A. M. Ellison. 2014b. Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecological Monographs 84:45–67.
Chao, A., C.-H. Chiu, T. C. Hsieh, T. Davis, D. A. Nipperess, and D. P. Faith. 2015a. Rarefaction and extrapolation of phylogenetic diversity. Methods in Ecology and Evolution 6:380–388.
Chao, A., T. C. Hsieh, R. L. Chazdon, R. K. Colwell, and N. J. Gotelli. 2015b. Unveiling the species-rank abundance distribution by generalizing the Good-Turing sample coverage theory. Ecology 96:1189–1201.
Chao, A., R. K. Colwell, C.-H. Chiu, and D. Townsend. 2017. Seen once or more than once: using Good-Turing theory to estimate species richness using only unique observations and a species list. Methods in Ecology and Evolution DOI: 10.1111/2041-210X.12768.
Chazdon, R. L., R. K. Colwell, J. S. Denslow, and M. R. Guariguata. 1998. Statistical methods for estimating species richness of woody regeneration in primary and secondary rain forests of Northeastern Costa Rica. Pages 285–309 in F. Dallmeier, and J. A. Comiskey, editors. Forest biodiversity research, monitoring and modeling: conceptual background and old world case studies. Parthenon Publishing, Paris.
Chiu, C.-H., L. Jost, and A. Chao. 2014a. Phylogenetic beta diversity, similarity, and differentiation measures based on Hill numbers. Ecological Monographs 84:21–44.
Chiu, C.-H., Y. T. Wang, B. A. Walther, and A. Chao. 2014b. An improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula. Biometrics 70:671–682.
Colwell, R. K., and J. A. Coddington. 1994. Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society of London B - Biological Sciences 345:101–118.
Colwell, R. K., A. Chao, N. J. Gotelli, S.-Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino. 2012. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology 5:3–21.
Diaz, S., and M. Cabido. 2001. Vive la différence: plant functional diversity matters to ecosystem processes. Trends in Ecology and Evolution 16:646–655.
Faith, D. P. 1992. Conservation evaluation and phylogenetic diversity. Biological Conservation 61:1–10.
Ferrier, S., G. Manion, J. Elith, and K. Richardson. 2007. Using generalized dissimilarity modelling to analyse and predict patterns of beta diversity in regional biodiversity assessment. Diversity and Distributions 13:252–264.
Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40:237–264.
Good, I. J. 1983. Good thinking: the foundations of probability and its applications. University of Minnesota Press, Minneapolis, USA.
Good, I. J. 2000. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval enigma. Journal of Statistical Computation and Simulation 66:101–111.
Good, I. J., and G. Toulmin. 1956. The number of new species and the increase of population coverage when a sample is increased. Biometrika 43:45–63.
Gotelli, N. J., and A. Chao. 2013. Measuring and estimating species richness, species diversity, and biotic similarity from sampling data. Pages 195–211 in S. A. Levin, editor. Encyclopedia of biodiversity. Second edition. Volume 5. Academic Press, Waltham, MA, USA.
Gotelli, N. J., and R. K. Colwell. 2011. Estimating species richness. Pages 39–54 in A. Magurran, and B. McGill, editors. Biological diversity: frontiers in measurement and assessment. Oxford University Press, Oxford.
Gotelli, N. J., and D. J. McCabe. 2002. Species co-occurrence: a meta-analysis of J.M. Diamond's assembly rules model. Ecology 83:2091–2096.
Gotelli, N. J., R. M. Dorazio, A. M. Ellison, and G. D. Grossman. 2010. Detecting temporal trends in species assemblages with bootstrapping procedures and hierarchical models. Philosophical Transactions of the Royal Society B 365:3621–3631.
Hortal, J., P. A. V. Borges, and C. Gaspar. 2006. Evaluating the performance of species richness estimators: sensitivity to sample grain size. Journal of Animal Ecology 75:274–287.
Hsieh, T. C., and A. Chao. 2017. Rarefaction and extrapolation: making fair comparison of abundance-sensitive phylogenetic diversity among multiple assemblages. Systematic Biology 66:100–111.
Jesus, R. M., and S. G. Rolim. 2005. Fitossociologia da floresta atlântica de tabuleiro em Linhares (ES). Boletim Técnico SIF 19:1–149.
Jost, L., A. Chao, and R. Chazdon. 2011. Compositional similarity and beta diversity. Pages 66–84 in A. Magurran, and B. McGill, editors. Biological diversity: frontiers in measurement and assessment. Oxford University Press, Oxford, UK.
Lozupone, C., and R. Knight. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71:8228–8235.
Magnago, L. F. S., D. P. Edwards, F. A. Edwards, A. Magrach, S. V. Martins, and W. F. Laurance. 2014. Functional attributes change but functional richness is unchanged after fragmentation of Brazilian Atlantic forests. Journal of Ecology 102:475–485.
Magurran, A. E. 2004. Measuring biological diversity. Blackwell, Oxford, UK.
Matos, F. A. R., L. F. S. Magnago, M. Gastauer, J. M. B. Carreiras, M. Simonelli, J. A. A. Meira-Neto, and D. P. Edwards. 2017. Effects of landscape configuration and composition on phylogenetic diversity of trees in a highly fragmented tropical forest. Journal of Ecology 105:265–276.
McGrayne, S. B. 2011. The theory that would not die: how Bayes' rule cracked the enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy. Yale University Press, New Haven, USA.
Pan, H. Y., A. Chao, and W. Foissner. 2009. A non-parametric lower bound for the number of species shared by multiple communities. Journal of Agricultural, Biological and Environmental Statistics 14:452–468.
Paula, A., and J. J. Soares. 2011. Estrutura horizontal de um trecho de Floresta Ombrófila Densa das Terras Baixas na Reserva Biológica de Sooretama, Linhares, ES. Revista Floresta 41:321–334.
Rangel, T. F., R. K. Colwell, G. R. Graves, K. Fučíková, C. Rahbek, and J. A. F. Diniz-Filho. 2015. Phylogenetic uncertainty revisited: Implications for ecological analyses. Evolution 69:1301–1312.
Robbins, H. E. 1968. Estimating the total probability of the unobserved outcomes of an experiment. The Annals of Mathematical Statistics 39:256–257.
Rolim, S. G., and H. E. M. Nascimento. 1997. Análise da riqueza, diversidade e relação espécie-abundância de uma comunidade arbórea tropical em diferentes intensidades amostrais. Scientia Forestalis 52:7–16.
Schmera, D., T. Erős, and J. Podani. 2009. A measure for assessing functional diversity in ecological communities. Aquatic Ecology 43:157–167.
Swenson, N. G., D. L. Erickson, X. Mi, N. A. Bourg, J. Forero-Montaña, X. Ge, R. Howe, J. K. Lake, X. Liu, and K. Ma. 2012. Phylogenetic and functional alpha and beta diversity in temperate and tropical tree communities. Ecology 93:S112–S125.
Tilman, D., J. Knops, D. Wedin, P. Reich, M. Ritchie, and E. Siemann. 1997. The influence of functional diversity and composition on ecosystem processes. Science 277:1300–1302.
Walker, B., A. Kinzig, and J. Langridge. 1999. Plant attribute diversity, resilience, and ecosystem function: the nature and significance of dominant and minor species. Ecosystems 2:95–113.
Webb, C. O., and M. J. Donoghue. 2005. Phylomatic: tree assembly for applied phylogenetics. Molecular Ecology Notes 5:181–183.
Webb, C. O., D. D. Ackerly, M. A. McPeek, and M. J. Donoghue. 2002. Phylogenies and community ecology. Annual Review of Ecology and Systematics 33:475–505.
Webb, C. O., D. D. Ackerly, and S. W. Kembel. 2008. Phylocom: software for the analysis of phylogenetic community structure and trait evolution. Bioinformatics 24:2098–2100.
SUPPORTING INFORMATION
Additional supporting information may be found in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/ecy.2000/suppinfo
Appendix S1
Deciphering the Enigma of Undetected Species, Phylogenetic, and
Functional Diversity Based on Good-Turing Theory
Anne Chao, Chun-Huo Chiu, Robert K. Colwell, Luiz Fernando S. Magnago, Robin L.
Chazdon, and Nicholas J. Gotelli
To make the contents of this appendix self-contained, some definitions introduced in the main
text are repeated here.
APPENDIX S1: Species diversity (mathematical details)
Proof of Eq. (1d) of the main text
Assume that a random sample of $n$ individuals is taken from an assemblage with $S$ species and relative abundance set $(p_1, p_2, \ldots, p_S)$. Let $X_i$ denote the frequency (abundance) of the $i$-th species in the sample, $i = 1, 2, \ldots, S$, with $\sum_{X_i \ge 1} X_i = n$, and let $f_r$ denote the number of species each represented by exactly $r$ individuals in the sample, i.e.,

$$f_r = \sum_{i=1}^{S} I(X_i = r), \quad r = 0, 1, 2, \ldots,$$

where $I(A)$ is the indicator function, i.e., $I(A) = 1$ if the event $A$ occurs, and 0 otherwise. Given data, for those species that each appeared exactly $r$ times in an incomplete sample of $n$ individuals, the mean of their true relative abundances/frequencies in the assemblage, $\alpha_r$, can be mathematically expressed as

$$\alpha_r = \sum_{i=1}^{S} p_i I(X_i = r) \Big/ f_r, \quad r = 0, 1, 2, \ldots$$
In the main text, we reviewed a very intuitive justification by Good (1983, p. 28) for the Good-Turing frequency formula. Here we present another simple proof for the following improved Good-Turing formula that we adopted in our derivations:

$$\hat{\alpha}_r = \frac{(r+1) f_{r+1}}{(n-r) f_r}, \quad r = 0, 1, 2, \ldots$$
Assume that the sample frequencies $(X_1, X_2, \ldots, X_S)$ follow a multinomial distribution with cell total $n$ and cell probabilities $(p_1, p_2, \ldots, p_S)$. Then each $X_i$, $i = 1, 2, \ldots, S$, follows a binomial distribution $B(n, p_i)$, leading to the following formula for the expected value of $f_r$:

$$E(f_r) = \sum_{i=1}^{S} E[I(X_i = r)] = \sum_{i=1}^{S} \binom{n}{r} p_i^{r} (1-p_i)^{n-r}, \quad r = 0, 1, 2, \ldots$$
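Based on the above formula, we can derive an approximation formula for the expected value of $\sum_{i=1}^{S} p_i I(X_i = r)$ (the numerator of $\alpha_r$) as follows.

$$
\begin{aligned}
E\left[\sum_{i=1}^{S} p_i I(X_i = r)\right]
&= \sum_{i=1}^{S} p_i \binom{n}{r} p_i^{r} (1-p_i)^{n-r}
 = \sum_{i=1}^{S} \binom{n}{r} p_i^{r+1} (1-p_i)^{n-r} \\
&\approx \frac{\binom{n}{r}}{\binom{n}{r+1}} \sum_{i=1}^{S} \binom{n}{r+1} p_i^{r+1} (1-p_i)^{n-(r+1)}
 = \frac{r+1}{n-r}\, E(f_{r+1}).
\end{aligned}
$$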
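The only approximate step in this derivation is dropping one factor of $(1-p_i)$ from each term, so the exact expectation can never exceed the approximation and the two are close when the species contributing to the sum are rare. A short numerical check, using a made-up assemblage (the abundance vector `p` and sample size `n` below are hypothetical, chosen only for illustration), makes this concrete.

```python
from math import comb

def expected_f(p, n, r):
    """E(f_r) = sum_i C(n, r) p_i^r (1 - p_i)^(n - r)."""
    return sum(comb(n, r) * pi**r * (1 - pi)**(n - r) for pi in p)

def expected_numerator(p, n, r):
    """E[sum_i p_i I(X_i = r)] = sum_i p_i C(n, r) p_i^r (1 - p_i)^(n - r)."""
    return sum(pi * comb(n, r) * pi**r * (1 - pi)**(n - r) for pi in p)

# Hypothetical assemblage: 3 common species plus 40 equally rare ones.
p = [0.3, 0.2, 0.1] + [0.01] * 40
n = 100

ratios = []
for r in (0, 1, 2):
    exact = expected_numerator(p, n, r)
    approx = (r + 1) / (n - r) * expected_f(p, n, r + 1)
    ratios.append(exact / approx)  # at most 1; near 1 when rare species dominate
```

For this assemblage each ratio is slightly below 1, which is the sense in which the Good-Turing numerator estimate is nearly unbiased for small $r$.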
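This implies that the numerator of $\alpha_r$ can be estimated by $(r+1) f_{r+1}/(n-r)$, and thus Eq. (1d) is proved. For the special cases of $r = 0$ and $r = 1$, we have

$$\widehat{\alpha_0 f_0} = \frac{f_1}{n}; \qquad \hat{\alpha}_1 = \frac{2 f_2}{(n-1) f_1},$$

where $\widehat{\alpha_0 f_0}$ denotes the estimator of the product of $\alpha_0$ and $f_0$. Then we obtain a lower bound (Eq. 2c in the main text) for the number of undetected species by the following inequality:

$$f_0 = \frac{\widehat{\alpha_0 f_0}}{\hat{\alpha}_0} \ge \frac{\widehat{\alpha_0 f_0}}{\hat{\alpha}_1} = \frac{f_1/n}{2 f_2/[(n-1) f_1]} = \frac{(n-1)}{n} \frac{f_1^2}{2 f_2}.$$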
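As a minimal sketch of how the two formulas above can be evaluated from a vector of sample abundances, the following Python functions compute $\hat{\alpha}_r$ for $r \ge 1$ and the lower bound for $f_0$. The function names are hypothetical, and raising an error when $f_2 = 0$ is a choice made here for illustration, not a treatment taken from this appendix. The example vector is the one used in Appendix S2: Fig. S1.

```python
from collections import Counter

def good_turing_alpha(abundances, r):
    """alpha_hat_r = (r + 1) f_{r+1} / ((n - r) f_r), for r >= 1."""
    if r < 1:
        raise ValueError("f_0 is not observable; use r >= 1")
    n = sum(abundances)
    f = Counter(abundances)  # f[r] = number of species seen exactly r times
    if f[r] == 0:
        raise ValueError("no species observed exactly r times")
    return (r + 1) * f[r + 1] / ((n - r) * f[r])

def undetected_richness_lower_bound(abundances):
    """Lower bound ((n - 1) / n) * f_1^2 / (2 f_2) for undetected richness."""
    n = sum(abundances)
    f = Counter(abundances)
    f1, f2 = f[1], f[2]
    if f2 == 0:
        raise ValueError("f_2 = 0: the ratio form is undefined")
    return (n - 1) / n * f1**2 / (2 * f2)

# Sample abundances from Fig. S1: n = 11, f_1 = 3, f_2 = 1.
x = [6, 1, 1, 0, 2, 1, 0, 0, 0]
alpha1_hat = good_turing_alpha(x, 1)          # 2 * 1 / (10 * 3)
f0_hat = undetected_richness_lower_bound(x)   # (10 / 11) * 9 / 2
```

Zero entries in the input are ignored by both functions, since only $f_1$, $f_2$ and $n$ enter the formulas.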
Proof of Eq. (5c) of the main text
Consider the case of two assemblages (I and II), which can differ not only in their species richness, but also in their species composition. Assume that there are $S$ species in the pooled assemblage. Following the two-assemblage model formulation and data framework discussed in the main text, we assume that the species sample frequencies $(X_{11}, X_{21}, \ldots, X_{S1})$ of Sample I taken from Assemblage I follow a multinomial distribution with cell total $n_1$ and cell probabilities $(p_{11}, p_{21}, \ldots, p_{S1})$, and the species sample frequencies $(X_{12}, X_{22}, \ldots, X_{S2})$ of Sample II taken from Assemblage II follow a multinomial distribution with cell total $n_2$ and cell probabilities $(p_{12}, p_{22}, \ldots, p_{S2})$. As defined in the main text, let $f_{rv}$ denote the number of shared species that are observed $r$ times in Sample I and $v$ times in Sample II. That is,

$$f_{rv} = \sum_{i=1}^{S_{12}} I(X_{i1} = r, X_{i2} = v), \quad r, v = 0, 1, 2, \ldots,$$

where $S_{12}$ denotes the number of species shared by the two focal assemblages. Assuming, further, that the two samples are statistically independent, we have the following general formula for the expected value of $f_{rv}$ for $r, v = 0, 1, 2, \ldots$:

$$E(f_{rv}) = \sum_{i=1}^{S_{12}} P(X_{i1} = r, X_{i2} = v) = \sum_{i=1}^{S_{12}} \binom{n_1}{r} p_{i1}^{r} (1-p_{i1})^{n_1-r} \binom{n_2}{v} p_{i2}^{v} (1-p_{i2})^{n_2-v}.$$
For any shared species that appeared exactly $r$ times in each sample (there are $f_{rr}$ such shared species in the two-sample data), consider calculating the product of its true relative abundances in the two assemblages; the mean of the products among all such shared species can be mathematically expressed as

$$\alpha_{rr} = \sum_{i=1}^{S_{12}} p_{i1} p_{i2} I(X_{i1} = r, X_{i2} = r) \Big/ f_{rr}, \quad r = 0, 1, 2, \ldots$$
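As with the one-sample Good-Turing formula, we first approximate the expected value of the numerator of $\alpha_{rr}$. Note that we can write

$$
\begin{aligned}
E\left[\sum_{i=1}^{S_{12}} p_{i1} p_{i2} I(X_{i1} = r, X_{i2} = r)\right]
&= \sum_{i=1}^{S_{12}} p_{i1} p_{i2} \binom{n_1}{r} p_{i1}^{r} (1-p_{i1})^{n_1-r} \binom{n_2}{r} p_{i2}^{r} (1-p_{i2})^{n_2-r} \\
&= \binom{n_1}{r} \binom{n_2}{r} \sum_{i=1}^{S_{12}} p_{i1}^{r+1} (1-p_{i1})^{n_1-r}\, p_{i2}^{r+1} (1-p_{i2})^{n_2-r} \\
&\approx \frac{\binom{n_1}{r} \binom{n_2}{r}}{\binom{n_1}{r+1} \binom{n_2}{r+1}}\, E(f_{r+1,r+1})
 = \frac{(r+1)^2\, E(f_{r+1,r+1})}{(n_1-r)(n_2-r)}.
\end{aligned}
$$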
This implies that the numerator of $\alpha_{rr}$ can be estimated by $(r+1)^2 f_{r+1,r+1}/[(n_1-r)(n_2-r)]$. We then obtain the generalized two-assemblage Good-Turing formula:

$$\hat{\alpha}_{rr} = \frac{(r+1)^2 f_{r+1,r+1}}{(n_1-r)(n_2-r) f_{rr}}, \quad r = 0, 1, 2, \ldots$$
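In particular, we have

$$\widehat{\alpha_{00} f_{00}} = \frac{f_{11}}{n_1 n_2}; \qquad \hat{\alpha}_{11} = \frac{4 f_{22}}{(n_1-1)(n_2-1) f_{11}}.$$

Thus, we have a lower bound for $f_{00}$ (Eq. 7c in the main text) by the following inequality:

$$f_{00} = \frac{\widehat{\alpha_{00} f_{00}}}{\hat{\alpha}_{00}} \ge \frac{\widehat{\alpha_{00} f_{00}}}{\hat{\alpha}_{11}} = \frac{f_{11}/(n_1 n_2)}{4 f_{22}/[(n_1-1)(n_2-1) f_{11}]} = \frac{(n_1-1)}{n_1} \frac{(n_2-1)}{n_2} \frac{f_{11}^2}{4 f_{22}}.$$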
This extends the original Good-Turing one-assemblage formula to the two-assemblage case.
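The two-assemblage lower bound for $f_{00}$ can be evaluated in the same way from paired abundance vectors. The sketch below is illustrative only: the function name and the example data are hypothetical, both vectors are assumed to be indexed by the same species list, and the case $f_{22} = 0$ is simply rejected. It covers only the $f_{00}$ term derived above, i.e., shared species undetected in both samples.

```python
def shared_undetected_lower_bound(x1, x2):
    """Lower bound for f_00 from paired abundance vectors of Samples I and II."""
    if len(x1) != len(x2):
        raise ValueError("both samples must use the same species list")
    n1, n2 = sum(x1), sum(x2)
    # f_11: species that are singletons in both samples;
    # f_22: species that are doubletons in both samples.
    f11 = sum(1 for a, b in zip(x1, x2) if a == 1 and b == 1)
    f22 = sum(1 for a, b in zip(x1, x2) if a == 2 and b == 2)
    if f22 == 0:
        raise ValueError("f_22 = 0: the ratio form is undefined")
    return (n1 - 1) / n1 * (n2 - 1) / n2 * f11**2 / (4 * f22)

# Hypothetical paired samples: n1 = 13, n2 = 11, f_11 = 2, f_22 = 1.
sample_1 = [5, 1, 1, 2, 1, 0, 3]
sample_2 = [4, 1, 1, 2, 0, 1, 2]
f00_hat = shared_undetected_lower_bound(sample_1, sample_2)
```

With these toy data the bound equals $(12/13)(10/11)(4/4) = 120/143$.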
Reference for Appendix S1
Good, I. J. 1983. Good thinking: The foundations of probability and its applications.
University of Minnesota Press, Minneapolis, USA.
Appendix S2
Deciphering the Enigma of Undetected Species, Phylogenetic, and
Functional Diversity Based on Good-Turing Theory
Anne Chao, Chun-Huo Chiu, Robert K. Colwell, Luiz Fernando S. Magnago, Robin L.
Chazdon, and Nicholas J. Gotelli
To make the contents of this appendix self-contained, some definitions introduced in the main
text are repeated here.
APPENDIX S2: Phylogenetic diversity (mathematical details)
Proof of Eq. (8b) of the main text
As defined in the main text, let $a_i$ denote the node/branch relative abundance, i.e., $a_i$ represents the total relative abundance of the species descended from the $i$th node/branch, $i = 1, 2, \ldots, B$. The set of species relative abundances $(p_1, p_2, \ldots, p_S)$ for tip nodes is expanded to a larger relative abundance set $\{a_i, i = 1, 2, \ldots, B\}$ with $(p_1, p_2, \ldots, p_S)$ as its first $S$ elements. For a sample of $n$ individuals with observed species abundances $(X_1, X_2, \ldots, X_S)$, let $X_i^*$ be the sum of the sample species abundances for those species descended from branch $i$. Then we can expand the set of observed species abundances or frequencies $(X_1, X_2, \ldots, X_S)$ to a larger branch abundance set $\{X_i^*, i = 1, 2, \ldots, B\}$ with $(X_1, X_2, \ldots, X_S)$ as its first $S$ elements. As introduced in the main text, we refer to $X_i^*$, $i = 1, 2, \ldots, B$, as the sample node/branch abundance of node/branch $i$. With $L_i$ denoting the length of branch $i$, let

$$g_r = \sum_{i=1}^{B} L_i I(X_i^* = r), \quad r = 0, 1, 2, \ldots,$$

be the total length of the branches with sample branch abundance $r$. A simple example is provided in Appendix S2: Fig. S1 to illustrate these measures.
Here we first extend the Good-Turing frequency formula to its phylogenetic version for a single assemblage. For those nodes/branches that each have a sample abundance of $r$, the branch-length-weighted mean (per unit length) of their true abundances in the entire assemblage, $\lambda_r$, can be expressed as

$$\lambda_r = \sum_{i=1}^{B} L_i a_i I(X_i^* = r) \Big/ g_r, \quad r = 0, 1, 2, \ldots$$

We now prove the corresponding phylogenetic version of the Good-Turing one-assemblage frequency formula:

$$\hat{\lambda}_r = \frac{(r+1) g_{r+1}}{(n-r) g_r}, \quad r = 0, 1, 2, \ldots$$
All the following derivation steps are parallel to those in the original Good-Turing formula (Appendix S1) for species richness. Under the model that the sample branch frequency $X_i^*$ follows a binomial distribution $B(n, a_i)$, we have a general form for the expected value of $g_r$:

$$E(g_r) = E\left[\sum_{i=1}^{B} L_i I(X_i^* = r)\right] = \sum_{i=1}^{B} L_i \binom{n}{r} a_i^{r} (1-a_i)^{n-r}, \quad r = 0, 1, 2, \ldots$$
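Then we can estimate the expected value of the numerator of $\lambda_r$ by the following:

$$
\begin{aligned}
E\left[\sum_{i=1}^{B} L_i a_i I(X_i^* = r)\right]
&= \sum_{i=1}^{B} L_i a_i \binom{n}{r} a_i^{r} (1-a_i)^{n-r}
 = \sum_{i=1}^{B} L_i \binom{n}{r} a_i^{r+1} (1-a_i)^{n-r} \\
&\approx \frac{\binom{n}{r}}{\binom{n}{r+1}} \sum_{i=1}^{B} L_i \binom{n}{r+1} a_i^{r+1} (1-a_i)^{n-(r+1)}
 = \frac{r+1}{n-r}\, E(g_{r+1}).
\end{aligned}
$$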
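This implies that the numerator of $\lambda_r$ can be estimated by $(r+1) g_{r+1}/(n-r)$, and we obtain the phylogenetic version of the Good-Turing one-assemblage frequency formula. For the special cases of $r = 0$ and $r = 1$, we have

$$\widehat{\lambda_0 g_0} = \frac{g_1}{n}; \qquad \hat{\lambda}_1 = \frac{2 g_2}{(n-1) g_1}.$$

Therefore, a lower bound for the undetected PD is obtained:

$$\hat{g}_0 = \frac{\widehat{\lambda_0 g_0}}{\hat{\lambda}_0} \ge \frac{\widehat{\lambda_0 g_0}}{\hat{\lambda}_1} = \frac{g_1/n}{2 g_2/[(n-1) g_1]} = \frac{(n-1)}{n} \frac{g_1^2}{2 g_2}.$$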
Appendix S2: Table S1 summarizes the analogy between species richness estimation and PD
estimation. This analogy means that most ecological indices defined at the species level can
be converted to PD equivalents; see Faith (1992) and Chao et al. (2014).
Fig. S1. (a) A hypothetical tree spanned by $S = 9$ species (tip nodes) indexed by 1, 2, …, 9 in an assemblage. Here the root of the entire assemblage is selected as the reference point for illustration. (b) A sub-tree spanned by the observed 5 species (1, 2, 3, 5 and 6, black tip dots) in a sample from the assemblage; Species 4, 7, 8 and 9 are not detected in the sample (grey tip dots), so only a portion of the tree (solid branches in the left panels) is observed, as shown in Panel (b). The sample abundance/frequency vector for 9 species is $(X_1, X_2, \ldots, X_9) = (6, 1, 1, 0, 2, 1, 0, 0, 0)$, with $f_1 = 3$, $f_2 = 1$; only non-zero frequencies are observed in the sample. Black dots in Panel (a) are nodes with sample frequency > 0; grey dots are nodes with frequency = 0. The branch set in the assemblage includes $B = 15$ branches/nodes (indexed from 1 to 15) with branch lengths $(L_1, L_2, \ldots, L_{15})$. The corresponding node/branch frequencies are denoted by $(X_1^*, X_2^*, \ldots, X_9^*, X_{10}^*, X_{11}^*, \ldots, X_{15}^*) = (6, 1, 1, 0, 2, 1, 0, 0, 0, 2, 8, 2, 1, 0, 1)$ with $(X_1, X_2, \ldots, X_9)$ as its first 9 elements. The dotted branches in Panel (a) are not detected in the sample, and the total length of the undetected branches is $g_0 = L_4 + L_7 + L_8 + L_9 + L_{14}$. Black dots in Panel (b) are nodes in the observed tree; grey dots are not observed in the tree. The total length of those branches with sample node/branch frequency $X_i^* = 1$ in the observed tree is $g_1 = \sum_{i=1}^{B} L_i I(X_i^* = 1) = L_2 + L_3 + L_6 + L_{13} + L_{15}$ (as shown by green lines in the observed tree in Panel (b)); the total length with sample node/branch frequency $X_i^* = 2$ in the observed tree is $g_2 = \sum_{i=1}^{B} L_i I(X_i^* = 2) = L_5 + L_{10} + L_{12}$ (as shown by red lines in the observed tree in Panel (b)). The number of nodes/branches with abundance = 1 is $f_1^* = 3$; the number of nodes/branches with abundance = 2 is $f_2^* = 2$.
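The quantities $g_1$ and $g_2$ in the Fig. S1 example, and the resulting lower bound $\frac{(n-1)}{n}\frac{g_1^2}{2 g_2}$ for the undetected PD, can be computed directly from the node/branch frequency vector. The sketch below uses the vector $(X_1^*, \ldots, X_{15}^*)$ from the caption; because the caption gives no numerical branch lengths, every $L_i$ is set to 1.0 here purely as a placeholder assumption, and the function names are hypothetical.

```python
def branch_length_total(branch_abundances, branch_lengths, r):
    """g_r: total length of branches whose sample branch abundance equals r."""
    return sum(L for x, L in zip(branch_abundances, branch_lengths) if x == r)

def undetected_pd_lower_bound(branch_abundances, branch_lengths, n):
    """Lower bound ((n - 1) / n) * g_1^2 / (2 g_2) for the undetected PD."""
    g1 = branch_length_total(branch_abundances, branch_lengths, 1)
    g2 = branch_length_total(branch_abundances, branch_lengths, 2)
    if g2 == 0:
        raise ValueError("g_2 = 0: the ratio form is undefined")
    return (n - 1) / n * g1**2 / (2 * g2)

# Node/branch frequencies from Fig. S1; the first 9 entries are the species.
x_star = [6, 1, 1, 0, 2, 1, 0, 0, 0, 2, 8, 2, 1, 0, 1]
lengths = [1.0] * 15          # placeholder: unit length for every branch
n = sum(x_star[:9])           # sample size, 11 individuals

g1 = branch_length_total(x_star, lengths, 1)   # L2 + L3 + L6 + L13 + L15
g2 = branch_length_total(x_star, lengths, 2)   # L5 + L10 + L12
g0_hat = undetected_pd_lower_bound(x_star, lengths, n)
```

With unit branch lengths, $g_1 = 5$ and $g_2 = 3$, so the bound is $(10/11)(25/6)$; with real branch lengths only the `lengths` vector changes.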
Table S1. A summary of the analogy between species richness estimation and PD estimation. A phylo-entity (= phylogenetic entity) is defined as a branch of unit length. A phylogenetic entity in estimating Faith's PD plays exactly the same role as a species in estimating species richness. $f_r$: the number of species each represented by exactly $r$ individuals in the sample. $g_r$: the total length of the branches with sample branch abundance/frequency $r$.

Species richness: description | Species richness: formula | Faith's phylogenetic diversity (PD): formula | Faith's PD: description
Species richness | $S$ | $PD = \sum_{i=1}^{B} L_i$ | Phylo-entity richness
Observed species richness | $S_{obs} = \sum_{i>0} f_i = \sum_{i=1}^{S} I(X_i > 0)$ | $PD_{obs} = \sum_{i>0} g_i = \sum_{i=1}^{B} L_i I(X_i^* > 0)$ | Observed phylo-entity richness
Undetected species richness | $f_0 = \sum_{i=1}^{S} I(X_i = 0)$ | $g_0 = \sum_{i=1}^{B} L_i I(X_i^* = 0)$ | Undetected phylo-entity richness
Good-Turing estimating target | $\alpha_r = \sum_{i=1}^{S} p_i I(X_i = r) / f_r$ | $\lambda_r = \sum_{i=1}^{B} L_i a_i I(X_i^* = r) / g_r$ | Phylogenetic Good-Turing estimating target
Good-Turing frequency formula | $\hat{\alpha}_r = \frac{(r+1) f_{r+1}}{(n-r) f_r}$ | $\hat{\lambda}_r = \frac{(r+1) g_{r+1}}{(n-r) g_r}$ | Phylogenetic Good-Turing frequency formula
Estimator of undetected species richness | $\hat{f}_0 = \frac{(n-1)}{n} \frac{f_1^2}{2 f_2}$ | $\hat{g}_0 = \frac{(n-1)}{n} \frac{g_1^2}{2 g_2}$ | Estimator of undetected phylo-entity richness
Chao1 estimator | $\hat{S}_{Chao1} = S_{obs} + \hat{f}_0$ | $\widehat{PD}_{Chao1} = PD_{obs} + \hat{g}_0$ | Chao1-PD estimator
Proof of Eq. (10b) of the main text
Under the two-assemblage model formulation and data framework discussed in the main text,
the set of species relative abundances ),...,,( 12111 Sppp of Assemblage I is expanded to a
larger relative abundance set }...,,2,1,{ 1 Biai . Similarly, the set of species relative
abundances
),...,,( 22212 Sppp of Assemblage II is expanded to a larger relative abundance
set }...,,2,1,{ 2 Biai . Assume that there are B12 branches shared by two assemblages, and
without loss of generality, these shared branches are indexed from 1, 2, …, B12 for notational
simplicity. Denote the true shared PD (the total length of branch segments shared by the two
individual trees) by PDshared.
As described in the main text, the set of the observed species frequencies in Assemblage
I, $(X_{11}, X_{21}, \ldots, X_{S1})$, is expanded to a larger sample node/branch frequency set
$\{X_{i1}^{*}, i = 1, 2, \ldots, B\}$. Similarly, we expand the set of observed species frequencies in
Assemblage II, $(X_{12}, X_{22}, \ldots, X_{S2})$, to a larger sample branch abundance set
$\{X_{i2}^{*}, i = 1, 2, \ldots, B\}$. As in the one-assemblage case, $X_{i1}^{*}$ follows a binomial distribution
$B(n_1, a_{i1})$, and $X_{i2}^{*}$ follows a binomial distribution $B(n_2, a_{i2})$. Let $g_{rv}$ be the total length of those
shared branches with sample branch abundance r in Sample I and sample branch abundance v
in Sample II. That is,
\[
g_{rv} = \sum_{i=1}^{B_{12}} L_i \, I(X_{i1}^{*} = r, X_{i2}^{*} = v), \quad r, v = 0, 1, 2, \ldots
\]
Assuming further that the two samples are statistically independent, we have a general
formula for the expected value of $g_{rv}$ for r, v = 0, 1, 2, …:
\[
E(g_{rv}) = \sum_{i=1}^{B_{12}} L_i \, P(X_{i1}^{*} = r, X_{i2}^{*} = v)
= \sum_{i=1}^{B_{12}} L_i \binom{n_1}{r} a_{i1}^{r} (1 - a_{i1})^{n_1 - r} \binom{n_2}{v} a_{i2}^{v} (1 - a_{i2})^{n_2 - v}.
\]
For any shared branch segment with node/branch abundance r in each observed branch set,
consider calculating the product of its true branch relative abundances in the two assemblages.
The branch-length-weighted mean (per unit-length) of the products among all such shared
branches (their total length is $g_{rr}$) can be expressed as
\[
\lambda_{rr} = \sum_{i=1}^{B_{12}} L_i a_{i1} a_{i2} \, I(X_{i1}^{*} = r, X_{i2}^{*} = r) \, / \, g_{rr}, \quad r = 0, 1, 2, \ldots
\]
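We first approximate the expected value of the numerator of $\lambda_{rr}$. Note that we can write
\[
\begin{aligned}
E\Big[\sum_{i=1}^{B_{12}} L_i a_{i1} a_{i2} \, I(X_{i1}^{*} = r, X_{i2}^{*} = r)\Big]
&= \sum_{i=1}^{B_{12}} L_i a_{i1} a_{i2} \binom{n_1}{r} a_{i1}^{r} (1 - a_{i1})^{n_1 - r} \binom{n_2}{r} a_{i2}^{r} (1 - a_{i2})^{n_2 - r} \\
&\approx \sum_{i=1}^{B_{12}} L_i \binom{n_1}{r} \binom{n_2}{r} a_{i1}^{r+1} (1 - a_{i1})^{n_1 - (r+1)} a_{i2}^{r+1} (1 - a_{i2})^{n_2 - (r+1)} \\
&= \frac{\binom{n_1}{r} \binom{n_2}{r}}{\binom{n_1}{r+1} \binom{n_2}{r+1}} \, E(g_{r+1,r+1}) \\
&= \frac{(r+1)^2 \, E(g_{r+1,r+1})}{(n_1 - r)(n_2 - r)}.
\end{aligned}
\]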
Thus, the numerator of $\lambda_{rr}$ can be estimated by $(r+1)^2 g_{r+1,r+1} / [(n_1 - r)(n_2 - r)]$. We then
obtain the generalized two-assemblage phylogenetic Good-Turing formula:
\[
\hat{\lambda}_{rr} = \frac{(r+1)^2 \, g_{r+1,r+1}}{(n_1 - r)(n_2 - r) \, g_{rr}}, \quad r = 0, 1, 2, \ldots
\]
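The final simplification in the derivation above rests on a standard identity for ratios of binomial coefficients. It is spelled out here as a reading aid and is not part of the original proof text.

```latex
\[
\frac{\binom{n}{r}}{\binom{n}{r+1}}
= \frac{n!}{r!\,(n-r)!} \cdot \frac{(r+1)!\,(n-r-1)!}{n!}
= \frac{r+1}{n-r}.
\]
% Applying the identity once with n = n_1 and once with n = n_2 gives
\[
\frac{\binom{n_1}{r}\binom{n_2}{r}}{\binom{n_1}{r+1}\binom{n_2}{r+1}}
= \frac{(r+1)^2}{(n_1-r)(n_2-r)},
\]
% which equals 1/(n_1 n_2) for r = 0 and 4/[(n_1-1)(n_2-1)] for r = 1.
```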
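In particular, we have
\[
\widehat{\lambda_{00} g_{00}} = \frac{g_{11}}{n_1 n_2}; \qquad
\hat{\lambda}_{11} = \frac{4 g_{22}}{(n_1 - 1)(n_2 - 1) \, g_{11}}.
\]
Then we obtain a lower bound of $g_{00}$:
\[
\hat{g}_{00} = \frac{\widehat{\lambda_{00} g_{00}}}{\hat{\lambda}_{00}}
\ge \frac{\widehat{\lambda_{00} g_{00}}}{\hat{\lambda}_{11}}
= \frac{g_{11} / (n_1 n_2)}{4 g_{22} / [(n_1 - 1)(n_2 - 1) \, g_{11}]}
= \frac{(n_1 - 1)}{n_1} \frac{(n_2 - 1)}{n_2} \frac{g_{11}^2}{4 g_{22}}.
\]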
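The lower bound for the undetected shared branch length can be evaluated directly from branch-level data. The Python sketch below first tabulates the totals of shared branch lengths by abundance pair and then applies the bound. The function names and the input layout (parallel lists giving, for each branch, its length and its sample branch abundance in Samples I and II) are assumptions for illustration; only branches observed in both samples contribute to the two totals used here.

```python
def shared_branch_totals(lengths, x1, x2):
    """Map each abundance pair (r, v) to the total length of branches with
    abundance r in Sample I and abundance v in Sample II."""
    totals = {}
    for length, r, v in zip(lengths, x1, x2):
        totals[(r, v)] = totals.get((r, v), 0.0) + length
    return totals


def g00_lower_bound(lengths, x1, x2, n1, n2):
    """Evaluate k1 * k2 * g11**2 / (4 * g22), with k_i = (n_i - 1) / n_i."""
    totals = shared_branch_totals(lengths, x1, x2)
    g11 = totals.get((1, 1), 0.0)
    g22 = totals.get((2, 2), 0.0)
    if g22 <= 0:
        # Assumption: this sketch does not handle a zero g22.
        raise ValueError("g22 must be positive for this formula")
    k1 = (n1 - 1) / n1
    k2 = (n2 - 1) / n2
    return k1 * k2 * g11 ** 2 / (4 * g22)


# Toy data (hypothetical): four branches, n1 = 10 and n2 = 8 individuals.
lengths = [2.0, 1.0, 3.0, 0.5]
x1 = [1, 1, 2, 5]
x2 = [1, 1, 2, 4]
print(g00_lower_bound(lengths, x1, x2, n1=10, n2=8))
```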
All the above derivation steps are parallel to those in Appendix S1 for estimating shared
species richness. Appendix S2: Table S2 summarizes the analogy between shared species
richness estimation and shared PD estimation. The analogy lies in that any unit-length branch
shared by two assemblages plays the same role as a “shared species” in the estimation of
shared species richness. Thus, each unit-length branch segment can be regarded as a shared
phylogenetic entity. For example, a shared branch of 8-unit length is counted as 8 shared
phylogenetic entities. In the estimation of shared PD, we are measuring the total number of
shared phylogenetic entities or, equivalently, the total shared branch length (because each entity
has length of unity). Using this perspective, the counts of shared branch lengths
$\{g_{rv}; r, v = 0, 1, 2, \ldots\}$ in estimating shared PD play a role corresponding to the frequency
counts $\{f_{rv}; r, v = 0, 1, 2, \ldots\}$ in estimating shared species richness. Hence, not only within-
assemblage but also between-assemblage measures defined at the species level can be directly
converted to their PD equivalents.
Table S2. A summary of the analogy between shared species richness estimation and
shared PD estimation. $B_{12}$: number of shared branches between two assemblages. A shared
phylo-entity (= shared phylogenetic entity) is defined as a shared branch with a unit of length.
A shared phylogenetic entity in estimating shared PD plays the same role as a shared species
in estimating shared species richness. $k_i = (n_i - 1)/n_i$, i = 1, 2.
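| Description | Shared species richness: formula | Shared PD: formula | Description |
| --- | --- | --- | --- |
| Shared species richness | $S_{shared}$ or $S_{12}$ | $PD_{shared} = \sum_{i=1}^{B_{12}} L_i$ | Shared phylo-entity richness |
| Observed shared species richness | $S_{shared,obs} = \sum_{r,v>0} f_{rv} = \sum_{i=1}^{S_{12}} I(X_{i1} > 0, X_{i2} > 0)$ | $PD_{shared,obs} = \sum_{r,v>0} g_{rv} = \sum_{i=1}^{B_{12}} L_i I(X_{i1}^{*} > 0, X_{i2}^{*} > 0)$ | Observed shared phylo-entity richness |
| Undetected shared species richness | $f_{0+} + f_{+0} + f_{00}$ | $g_{0+} + g_{+0} + g_{00}$ | Undetected shared phylo-entity richness |
| Two-assemblage Good-Turing estimating target | $\alpha_{rr} = \sum_{i=1}^{S_{12}} p_{i1} p_{i2} I(X_{i1} = r, X_{i2} = r) / f_{rr}$ | $\lambda_{rr} = \sum_{i=1}^{B_{12}} L_i a_{i1} a_{i2} I(X_{i1}^{*} = r, X_{i2}^{*} = r) / g_{rr}$ | Phylogenetic two-assemblage Good-Turing estimating target |
| Two-assemblage Good-Turing frequency formula | $\hat{\alpha}_{rr} = \frac{(r+1)^2 f_{r+1,r+1}}{(n_1 - r)(n_2 - r) f_{rr}}$ | $\hat{\lambda}_{rr} = \frac{(r+1)^2 g_{r+1,r+1}}{(n_1 - r)(n_2 - r) g_{rr}}$ | Phylogenetic two-assemblage Good-Turing frequency formula |
| Estimator of undetected shared species richness | $k_1 \frac{f_{1+}^2}{2 f_{2+}} + k_2 \frac{f_{+1}^2}{2 f_{+2}} + k_1 k_2 \frac{f_{11}^2}{4 f_{22}}$ | $k_1 \frac{g_{1+}^2}{2 g_{2+}} + k_2 \frac{g_{+1}^2}{2 g_{+2}} + k_1 k_2 \frac{g_{11}^2}{4 g_{22}}$ | Estimator of undetected shared phylo-entity richness |
| Chao1 shared estimator | $\hat{S}_{Chao1,shared} = S_{shared,obs}$ + estimator of undetected | $\widehat{PD}_{Chao1,shared} = PD_{shared,obs}$ + estimator of undetected | Chao1-PD shared estimator |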
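The species-richness column of the table above can be turned into a short computation. The Python sketch below assumes two abundance vectors aligned on the same species list (a zero means the species was not detected in that sample); the function name, the returned tuple, and the error raised for zero denominators are choices made for this illustration rather than anything specified in the appendix.

```python
def chao1_shared(x1, x2):
    """Return (observed shared richness, estimated undetected shared
    richness, Chao1-shared estimate) from two aligned abundance vectors."""
    n1, n2 = sum(x1), sum(x2)
    k1, k2 = (n1 - 1) / n1, (n2 - 1) / n2
    pairs = list(zip(x1, x2))

    def count(test):
        return sum(1 for a, b in pairs if test(a, b))

    s_obs = count(lambda a, b: a > 0 and b > 0)
    f_1p = count(lambda a, b: a == 1 and b > 0)   # singletons in I, detected in II
    f_2p = count(lambda a, b: a == 2 and b > 0)   # doubletons in I, detected in II
    f_p1 = count(lambda a, b: a > 0 and b == 1)   # detected in I, singletons in II
    f_p2 = count(lambda a, b: a > 0 and b == 2)   # detected in I, doubletons in II
    f_11 = count(lambda a, b: a == 1 and b == 1)
    f_22 = count(lambda a, b: a == 2 and b == 2)
    if min(f_2p, f_p2, f_22) == 0:
        # Assumption: zero denominators are not handled in this sketch.
        raise ValueError("f_2+, f_+2 and f_22 must all be positive")

    undetected = (k1 * f_1p ** 2 / (2 * f_2p)
                  + k2 * f_p1 ** 2 / (2 * f_p2)
                  + k1 * k2 * f_11 ** 2 / (4 * f_22))
    return s_obs, undetected, s_obs + undetected


# Toy data (hypothetical): eight species in the pooled list, n1 = n2 = 14.
x1 = [1, 1, 2, 2, 1, 2, 5, 0]
x2 = [1, 3, 2, 1, 1, 2, 0, 4]
print(chao1_shared(x1, x2))
```

The shared PD column follows the same pattern, with each count replaced by the corresponding total of shared branch lengths.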
References for Appendix S2
Chao, A., C.-H. Chiu, and L. Jost. 2014. Unifying species diversity, phylogenetic diversity,
functional diversity, and related similarity and differentiation measures through Hill
numbers. Annual Review of Ecology, Evolution, and Systematics 45:297−324.
Faith, D. P. 1992. Conservation evaluation and phylogenetic diversity. Biological
Conservation 61:1−10.
Appendix S3
Deciphering the Enigma of Undetected Species, Phylogenetic, and
Functional Diversity Based on Good-Turing Theory
Anne Chao, Chun-Huo Chiu, Robert K. Colwell, Luiz Fernando S. Magnago, Robin L.
Chazdon, and Nicholas J. Gotelli
To make the contents of this appendix self-contained, some definitions introduced in the main
text are repeated here.
APPENDIX S3: Functional diversity (mathematical details)
Proof of Eq. (12c) of the main text
Consider an assemblage in which all species are characterized by a set of functional traits,
which can be categorical or continuous variables. The pairwise distances are calculated by
some distance metric (e.g., Euclidean distance or Gower distance) based on the values of
species traits. Let $d_{ij}$ be the functional distance between the ith and jth species, with $d_{ij} = d_{ji} > 0$.
Walker et al.’s (1999) FAD in the entire assemblage is expressed as
\[
FAD = \sum_{i,j=1}^{S} d_{ij}.
\]
Assume that a random sample of n individuals is taken from an assemblage with S
species and relative abundance set $(p_1, p_2, \ldots, p_S)$. Let $X_i$ denote the sample frequency
(abundance) of the i-th species in the sample, i = 1, 2, …, S, with $\sum_{X_i \ge 1} X_i = n$. As defined in the
main text, let $F_{rv}$ denote the total distance of those species pairs with sample abundances r and
v respectively for the two species in each pair. That is,
\[
F_{rv} = \sum_{i,j=1}^{S} d_{ij} \, I(X_i = r, X_j = v).
\]
Assume that the sample frequencies $(X_1, X_2, \ldots, X_S)$ follow a multinomial distribution with
cell total n and cell probabilities $(p_1, p_2, \ldots, p_S)$. Then $(X_i, X_j)$ follows a trinomial distribution
with cell total n and cell probabilities $(p_i, p_j, 1 - p_i - p_j)$, leading to a general formula for
the expected value of $F_{rv}$:
\[
E(F_{rv}) = E\Big[\sum_{i,j=1}^{S} d_{ij} \, I(X_i = r, X_j = v)\Big]
= \sum_{i,j=1}^{S} d_{ij} \frac{n!}{r! \, v! \, (n - r - v)!} \, p_i^{r} p_j^{v} (1 - p_i - p_j)^{n - r - v}.
\]
Out of all the species that appeared exactly r times in the sample, consider choosing any
two to form a pair and calculating the product of their true relative abundances in the
assemblage. The distance-weighted mean (per unit-distance) of the products among all such
pairs (their total pairwise distance is $F_{rr}$) can be expressed as
\[
\theta_{rr} = \sum_{i,j=1}^{S} d_{ij} \, p_i p_j \, I(X_i = r, X_j = r) \, / \, F_{rr}, \quad r = 0, 1, 2, \ldots
\]
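We first approximate the expected value of the numerator of $\theta_{rr}$. Note that we can write
\[
\begin{aligned}
E\Big[\sum_{i,j=1}^{S} d_{ij} \, p_i p_j \, I(X_i = r, X_j = r)\Big]
&= \sum_{i,j=1}^{S} d_{ij} \frac{n!}{r! \, r! \, (n - 2r)!} \, p_i^{r+1} p_j^{r+1} (1 - p_i - p_j)^{n - 2r} \\
&\approx \sum_{i,j=1}^{S} d_{ij} \frac{n!}{r! \, r! \, (n - 2r)!} \, p_i^{r+1} p_j^{r+1} (1 - p_i - p_j)^{n - (2r + 2)} \\
&= \frac{n! \, / \, [(r!)^2 (n - 2r)!]}{n! \, / \, \{[(r+1)!]^2 (n - 2r - 2)!\}} \, E(F_{r+1,r+1}) \\
&= \frac{(r+1)^2 \, E(F_{r+1,r+1})}{(n - 2r)(n - 2r - 1)}.
\end{aligned}
\]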
Then we have the functional version of the Good-Turing formula:
\[
\hat{\theta}_{rr} = \frac{(r+1)^2 \, F_{r+1,r+1}}{(n - 2r)(n - 2r - 1) \, F_{rr}}, \quad r = 0, 1, 2, \ldots
\]
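In particular, we have
\[
\widehat{\theta_{00} F_{00}} = \frac{F_{11}}{n(n - 1)}; \qquad
\hat{\theta}_{11} = \frac{4 F_{22}}{(n - 2)(n - 3) \, F_{11}}.
\]
Then we obtain a lower bound for $F_{00}$:
\[
\hat{F}_{00} = \frac{\widehat{\theta_{00} F_{00}}}{\hat{\theta}_{00}}
\ge \frac{\widehat{\theta_{00} F_{00}}}{\hat{\theta}_{11}}
= \frac{F_{11} / [n(n - 1)]}{4 F_{22} / [(n - 2)(n - 3) \, F_{11}]}
= \frac{(n - 2)}{n} \frac{(n - 3)}{(n - 1)} \frac{F_{11}^2}{4 F_{22}}.
\]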
Appendix S3: Table S1, below, summarizes the analogy between shared species richness
estimation between two assemblages and FAD estimation in a single assemblage. A
“functional entity” (defined as a species pair with one unit of distance) plays the same role as
a “shared species” between two assemblages.
Table S1. A summary of the analogy between shared species richness estimation between
two assemblages and FAD estimation in a single assemblage. $S_{12}$: number of shared
species between two assemblages. In the table, a func-entity (= functional entity) is defined as
a species pair with a unit of distance of dissimilarity. Functional entities play the same role as
“shared species” between two assemblages; $k_i = (n_i - 1)/n_i$, i = 1, 2.
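| Description | Shared species richness: formula | FAD (Functional Attribute Diversity): formula | Description |
| --- | --- | --- | --- |
| Shared species richness | $S_{shared}$ or $S_{12}$ | $FAD = \sum_{i,j=1}^{S} d_{ij}$ | Func-entity richness |
| Observed shared species richness | $S_{shared,obs} = \sum_{r,v>0} f_{rv} = \sum_{i=1}^{S_{12}} I(X_{i1} > 0, X_{i2} > 0)$ | $FAD_{obs} = \sum_{r,v>0} F_{rv} = \sum_{i,j=1}^{S} d_{ij} I(X_i > 0, X_j > 0)$ | Observed func-entity richness |
| Undetected shared species richness | $f_{0+} + f_{+0} + f_{00}$ | $F_{0+} + F_{+0} + F_{00}$ | Undetected func-entity richness |
| Two-assemblage Good-Turing estimating target | $\alpha_{rr} = \sum_{i=1}^{S_{12}} p_{i1} p_{i2} I(X_{i1} = r, X_{i2} = r) / f_{rr}$ | $\theta_{rr} = \sum_{i,j=1}^{S} d_{ij} p_i p_j I(X_i = r, X_j = r) / F_{rr}$ | Functional Good-Turing estimating target |
| Two-assemblage Good-Turing frequency formula | $\hat{\alpha}_{rr} = \frac{(r+1)^2 f_{r+1,r+1}}{(n_1 - r)(n_2 - r) f_{rr}}$ | $\hat{\theta}_{rr} = \frac{(r+1)^2 F_{r+1,r+1}}{(n - 2r)(n - 2r - 1) F_{rr}}$ | Functional Good-Turing frequency formula |
| Estimator of undetected shared species richness | $k_1 \frac{f_{1+}^2}{2 f_{2+}} + k_2 \frac{f_{+1}^2}{2 f_{+2}} + k_1 k_2 \frac{f_{11}^2}{4 f_{22}}$ | $k \frac{F_{1+}^2}{2 F_{2+}} + k \frac{F_{+1}^2}{2 F_{+2}} + \frac{(n - 2)(n - 3)}{n(n - 1)} \frac{F_{11}^2}{4 F_{22}}$, with $k = (n - 1)/n$ | Estimator of undetected func-entity richness |
| Chao1 shared estimator | $\hat{S}_{Chao1,shared} = S_{shared,obs}$ + estimator of the undetected | $\widehat{FAD}_{Chao1} = FAD_{obs}$ + estimator of the undetected | Chao1-FAD estimator |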
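To make the FAD column of the table above concrete, the Python sketch below computes the observed FAD, the three undetected components, and their sum from a distance matrix and an abundance vector. The function names, the list-of-lists distance matrix with a zero diagonal, and the reading of the single-sample factor as k = (n - 1)/n are assumptions of this example; the appendix itself does not prescribe an implementation.

```python
def f_stat(dist, x, match_i, match_j):
    """Sum d_ij over ordered pairs (i, j) with match_i(X_i) and match_j(X_j)."""
    idx = range(len(x))
    return sum(dist[i][j] for i in idx for j in idx
               if match_i(x[i]) and match_j(x[j]))


def chao1_fad(dist, x):
    """Return (FAD_obs, estimated undetected FAD, Chao1-FAD estimate)."""
    n = sum(x)

    def detected(a):
        return a > 0

    def single(a):
        return a == 1

    def double(a):
        return a == 2

    fad_obs = f_stat(dist, x, detected, detected)
    f_p1 = f_stat(dist, x, detected, single)   # F_{+1}, equal to F_{1+} by symmetry
    f_p2 = f_stat(dist, x, detected, double)   # F_{+2}, equal to F_{2+} by symmetry
    f_11 = f_stat(dist, x, single, single)
    f_22 = f_stat(dist, x, double, double)
    if f_p2 <= 0 or f_22 <= 0:
        # Assumption: zero denominators are not handled in this sketch.
        raise ValueError("F_{+2} and F_22 must be positive")

    k = (n - 1) / n
    f_p0 = k * f_p1 ** 2 / (2 * f_p2)          # estimate of F_{+0} (and of F_{0+})
    f_00 = (n - 2) * (n - 3) / (n * (n - 1)) * f_11 ** 2 / (4 * f_22)
    undetected = 2 * f_p0 + f_00
    return fad_obs, undetected, fad_obs + undetected


# Toy data (hypothetical): four species, all pairwise distances equal to 1.
dist = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
print(chao1_fad(dist, [1, 1, 2, 2]))
```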
Proof of Eq. (14b) of the main text
Under the two-assemblage model formulation and data framework discussed in the main text,
the total distance of the species pairs shared by two assemblages can be expressed as the sum
of 16 terms, which are shown below in a 4 x 4 matrix (Appendix S3: Table S2):
\[
FAD_{shared} = \sum_{r,v \in \{+,0\}} \; \sum_{k,m \in \{+,0\}} F_{(rv)(km)}
= F_{(++)(++)} + F_{(++)(+0)} + F_{(++)(0+)} + \cdots + F_{(00)(00)}.
\]
Here $F_{(rv)(km)}$ measures the total functional distance of those pairs of shared species with
abundances (r, v) in Sample I and abundances (k, m) in Sample II for the two species in each
pair. Where an index is replaced by a “+” sign, the corresponding species occurs at least once
in the corresponding sample. In the above equation, the first term $F_{(++)(++)} = FAD_{shared,obs}$
denotes the observed shared FAD; the other 15 terms are unknown. The matrix in Appendix
S3: Table S2 is symmetric, so there are only 10 unique terms. For each term, we can
apply a similar type of Good-Turing formula to obtain an estimator (given in Appendix S3:
Table S3), leading to the following Chao1-FAD-shared estimator:
\[
\widehat{FAD}_{Chao1,shared} = F_{(++)(++)} + \hat{F}_{(++)(+0)} + \hat{F}_{(++)(0+)} + \cdots + \hat{F}_{(00)(00)}.
\]
A bootstrap method can be used to assess the sampling variance of the above estimator and to
obtain the associated confidence interval.
Table S2. The FAD shared by two assemblages can be written as the sum of 16 terms (in
colored cells). For any shared species pair (i, j) with abundances (Xi1, Xj1) in Sample I and
abundances (Xi2, Xj2) in Sample II for the two shared species in each pair, the green cell (Cell
1) denotes the case that two species i and j appear in both samples; the total distance of such
pairs is F(++)(++). Cell 3 denotes the case that the two species i and j both appear in Sample I,
but only species i appears in Sample II; the total distance of such pairs is F(++)(+0). Cell 3*
denotes the case that the two species i and j both appear in Sample I, but only species j
appears in sample II; the total distance of such pairs is F(++)(0+). Here we have F(++)(+0) =
F(++)(0+) so that the matrix is symmetric. The blue cells (Cells 2, 3, 3*) plus the green cell
denote the case that the two species i and j both appear in Sample I. The yellow cells (Cells 4,
4*, 8) plus the green cell denote the case that the two species i and j both appear in Sample II.
Other cases are in grey cells.
Shared species pair (i, j)

| | Xj1 > 0, Xj2 > 0 | Xj1 > 0, Xj2 = 0 | Xj1 = 0, Xj2 > 0 | Xj1 = 0, Xj2 = 0 |
| --- | --- | --- | --- | --- |
| Xi1 > 0, Xi2 > 0 | F(++)(++), Cell 1 | F(++)(+0), Cell 3 | F(+0)(++), Cell 4 | F(+0)(+0), Cell 5 |
| Xi1 > 0, Xi2 = 0 | F(++)(0+), Cell 3* | F(++)(00), Cell 2 | F(+0)(0+), Cell 6 | F(+0)(00), Cell 7 |
| Xi1 = 0, Xi2 > 0 | F(0+)(++), Cell 4* | F(0+)(+0), Cell 6* | F(00)(++), Cell 8 | F(00)(+0), Cell 9 |
| Xi1 = 0, Xi2 = 0 | F(0+)(0+), Cell 5* | F(0+)(00), Cell 7* | F(00)(0+), Cell 9* | F(00)(00), Cell 10 |
Table S3. The observed shared FAD (Cell 1) and the estimator for each of the 15 unknown
terms in the expression of the true FAD shared by two assemblages; see Appendix S3: Table
S2 for cell definitions.
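| Cell | Functional distance | Statistic/Estimator |
| --- | --- | --- |
| 1 | $F_{(++)(++)}$ | $F_{(++)(++)} = FAD_{shared,obs}$ |
| 2 | $F_{(++)(00)}$ | $\frac{(n_2 - 2)(n_2 - 3)}{n_2 (n_2 - 1)} \frac{F_{(++)(11)}^2}{4 F_{(++)(22)}}$ |
| 3, 3* | $F_{(++)(+0)} = F_{(++)(0+)}$ | $\frac{(n_2 - 1)}{n_2} \frac{F_{(++)(+1)}^2}{2 F_{(++)(+2)}}$ |
| 4, 4* | $F_{(+0)(++)} = F_{(0+)(++)}$ | $\frac{(n_1 - 1)}{n_1} \frac{F_{(+1)(++)}^2}{2 F_{(+2)(++)}}$ |
| 5, 5* | $F_{(+0)(+0)} = F_{(0+)(0+)}$ | $\frac{(n_1 - 1)}{n_1} \frac{(n_2 - 1)}{n_2} \frac{F_{(+1)(+1)}^2}{4 F_{(+2)(+2)}}$ |
| 6, 6* | $F_{(+0)(0+)} = F_{(0+)(+0)}$ | $\frac{(n_1 - 1)}{n_1} \frac{(n_2 - 1)}{n_2} \frac{F_{(+1)(1+)}^2}{4 F_{(+2)(2+)}}$ |
| 7, 7* | $F_{(+0)(00)} = F_{(0+)(00)}$ | $\frac{(n_1 - 1)}{n_1} \frac{(n_2 - 2)(n_2 - 3)}{n_2 (n_2 - 1)} \frac{F_{(+1)(11)}^2}{8 F_{(+2)(22)}}$ |
| 8 | $F_{(00)(++)}$ | $\frac{(n_1 - 2)(n_1 - 3)}{n_1 (n_1 - 1)} \frac{F_{(11)(++)}^2}{4 F_{(22)(++)}}$ |
| 9, 9* | $F_{(00)(+0)} = F_{(00)(0+)}$ | $\frac{(n_1 - 2)(n_1 - 3)}{n_1 (n_1 - 1)} \frac{(n_2 - 1)}{n_2} \frac{F_{(11)(+1)}^2}{8 F_{(22)(+2)}}$ |
| 10 | $F_{(00)(00)}$ | $\frac{(n_1 - 2)(n_1 - 3)}{n_1 (n_1 - 1)} \frac{(n_2 - 2)(n_2 - 3)}{n_2 (n_2 - 1)} \frac{F_{(11)(11)}^2}{16 F_{(22)(22)}}$ |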
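The estimators in Table S3 appear to follow one pattern: every 0 in the cell's index is replaced by 1 in the numerator statistic and by 2 in the denominator statistic, and each sample contributes a factor that depends on how many zeros it carries. The Python sketch below encodes that reading; the pattern rule, the function names, and the tuple encoding of indices such as ('+', 0) are assumptions of this example, and the appendix itself only lists the ten formulas.

```python
def matches(value, code):
    """Code '+' means detected (value > 0); an integer means value == code."""
    return value > 0 if code == "+" else value == code


def f_shared(dist, x1, x2, pat1, pat2):
    """F_(rv)(km): pat1 = (r, v) codes species i and j in Sample I;
    pat2 = (k, m) codes species i and j in Sample II."""
    idx = range(len(x1))
    total = 0.0
    for i in idx:
        if not (matches(x1[i], pat1[0]) and matches(x2[i], pat2[0])):
            continue
        for j in idx:
            if matches(x1[j], pat1[1]) and matches(x2[j], pat2[1]):
                total += dist[i][j]
    return total


def sample_factor(n, zeros):
    """Multiplier and divisor contributed by a sample with 0, 1 or 2 zeros."""
    if zeros == 0:
        return 1.0, 1
    if zeros == 1:
        return (n - 1) / n, 2
    return (n - 2) * (n - 3) / (n * (n - 1)), 4


def replace_zeros(pattern, new):
    """Replace every 0 code in a pattern by the integer `new`."""
    return tuple(new if code == 0 else code for code in pattern)


def cell_estimate(dist, x1, x2, pat1, pat2):
    """Estimate F_(pat1)(pat2); patterns use '+' and 0, e.g. ('+', 0)."""
    n1, n2 = sum(x1), sum(x2)
    z1, z2 = pat1.count(0), pat2.count(0)
    num = f_shared(dist, x1, x2, replace_zeros(pat1, 1), replace_zeros(pat2, 1))
    if z1 + z2 == 0:
        return num  # Cell 1 is observed directly
    den = f_shared(dist, x1, x2, replace_zeros(pat1, 2), replace_zeros(pat2, 2))
    if den <= 0:
        # Assumption: zero denominators are not handled in this sketch.
        raise ValueError("the denominator statistic must be positive")
    m1, d1 = sample_factor(n1, z1)
    m2, d2 = sample_factor(n2, z2)
    return m1 * m2 * num ** 2 / (d1 * d2 * den)


# Toy data (hypothetical): four shared species, unit distances, n1 = n2 = 6.
dist = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
x1 = x2 = [1, 1, 2, 2]
print(cell_estimate(dist, x1, x2, ("+", "+"), ("+", 0)))   # Cell 3
print(cell_estimate(dist, x1, x2, (0, 0), (0, 0)))         # Cell 10
```

Under this reading, the Chao1-FAD-shared estimate is the sum of `cell_estimate` over all 16 index combinations of '+' and 0.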
Appendix S4
Deciphering the Enigma of Undetected Species, Phylogenetic, and
Functional Diversity Based on Good-Turing Theory
Anne Chao, Chun-Huo Chiu, Robert K. Colwell, Luiz Fernando S. Magnago, Robin L.
Chazdon, and Nicholas J. Gotelli
APPENDIX S4: Simulation Results
All estimators derived from Good-Turing theory are non-parametric lower bounds of the
corresponding species, phylogenetic, and functional diversity. Here the lower bounds are
nonparametric in the sense that they are universally valid for any species abundance
distribution. Good-Turing theory reveals that each estimator is nearly unbiased if rare
species/nodes in each assemblage have approximately homogeneous abundances; the abundant
species/nodes could be highly heterogeneous without affecting the estimator. We refer to this
condition as the condition for unbiasedness.
Here we report some representative simulation results to numerically validate Good-
Turing theory and examine the performance of various estimators derived from the theory. In
our simulation study, we generated data from several theoretical species abundance models,
as specified below. In most natural assemblages, there are many rare species and only a few
abundant species. Mixture models (Colwell et al. 2004, Mao and Colwell 2005) have been
widely used to mimic such assemblages. We considered four mixture models. Each is a
mixture of two distributions; one for modeling the abundances of the group of rare species
and the other for modeling the abundances of the group of abundant species. Models 1A and
1B are mixtures of uniform distributions, whereas Models 2A and 2B are mixtures of
lognormal distributions. Under these four mixture models, when a sample of individuals is
taken from this assemblage, the singletons and undetected species in the sample are almost
surely from the rare group and thus do not greatly differ in their abundances, satisfying the
condition of unbiasedness. By contrast, we also considered a single uniform distribution
(Model 1C) and a single lognormal distribution (Model 2C). In these two models, the
abundances of singletons and undetected species in any sample span a wider range and differ
to some extent, implying the sufficient condition for unbiasedness is not satisfied. The
theoretical value of the CV (which is the ratio of standard deviation and mean) for each model
is also given to quantify the degree of heterogeneity of species relative abundances. When all
abundances are equal, CV = 0. A larger value of CV signifies a higher degree of heterogeneity
among species abundances.
The number of species in each assemblage was fixed to be 300. For each fixed model,
we first generated 300 random variables from a designated model. Then their normalized
values represent the “true” species relative abundances of the complete assemblage,
$(p_1, p_2, \ldots, p_{300})$, which we regarded as fixed parameters in the simulation. Since the generated
set of abundances varies among trials, two sets of abundances generated from the same model
may differ to some extent, and the empirical CV of each generated abundance set may deviate
slightly from the theoretical CV value. Here we report the simulation results for the following
six models.
Model 1A: A mixture of two uniform distributions denoted by pi ~ c [0.9 U(0.1, 0.2) + 0.1
U(0.2, 1)], whereby c is a normalizing constant in all cases, such that $\sum_{i=1}^{S} p_i = 1$. The
theoretical CV for this model is 0.80. In this model, the assemblage comprises roughly
90% rare species, whose abundances were generated from a uniform distribution in the
range [0.1, 0.2], and 10% abundant species, whose abundances were generated from a
uniform distribution in [0.2, 1].
Model 1B: Same as Model 1A, but the percentages for rare and abundant species are changed
to 80% and 20% respectively. The theoretical CV for this model is 0.87.
Model 1C: A single uniform distribution in the range [0, 1], denoted by pi ~ c U(0, 1),
whereby c is a normalizing constant in all cases, such that $\sum_{i=1}^{S} p_i = 1$. The theoretical
CV for this model is 0.58.
Model 2A: A mixture of lognormal distributions denoted by pi ~ c [0.9 lognormal (1, 0.2) +
0.1 lognormal (2, 1)], where lognormal (μ, σ) denotes a lognormal distribution with
mean μ and standard error (s.e.) σ, and the constant c is a normalizing constant. The
theoretical CV for this model is 1.56. In this model, the assemblage comprises roughly
90% rare species, whose abundances were generated from a lognormal distribution with
mean 1 and s.e. 0.2, and 10% abundant species, whose abundances were generated from a
lognormal distribution with mean 2 and s.e. 1.
Model 2B: Same as Model 2A, but the percentages for rare and abundant species are changed
to 80% and 20% respectively. The theoretical CV for this model is 1.74.
Model 2C: A single lognormal distribution with mean 1 and s.e. 1, denoted by pi ~ c
lognormal (1, 1), i.e., species abundances were generated from a lognormal distribution,
where c is a normalizing constant in all cases, such that $\sum_{i=1}^{S} p_i = 1$. The theoretical CV
for this model is 1.31.
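The six abundance models can be sketched in a few lines of Python. The code below is an illustration under stated assumptions, not the authors' simulation code: each species is assigned to the rare or abundant group at random with the stated mixture weight (hence "roughly" 90% or 80% rare), and lognormal (μ, σ) is taken to mean that the logarithm of the abundance is normal with mean μ and standard deviation σ. That reading is consistent with the theoretical CV of 1.31 quoted for Model 2C, since sqrt(e − 1) ≈ 1.31. All function and variable names are hypothetical.

```python
import random


def mixture_uniform(rng, s, w_rare):
    """Models 1A/1B: rare species ~ U(0.1, 0.2), abundant species ~ U(0.2, 1)."""
    return [rng.uniform(0.1, 0.2) if rng.random() < w_rare
            else rng.uniform(0.2, 1.0) for _ in range(s)]


def mixture_lognormal(rng, s, w_rare):
    """Models 2A/2B: rare ~ lognormal(1, 0.2), abundant ~ lognormal(2, 1)."""
    return [rng.lognormvariate(1, 0.2) if rng.random() < w_rare
            else rng.lognormvariate(2, 1) for _ in range(s)]


MODELS = {
    "1A": lambda rng, s: mixture_uniform(rng, s, 0.9),
    "1B": lambda rng, s: mixture_uniform(rng, s, 0.8),
    "1C": lambda rng, s: [rng.random() for _ in range(s)],
    "2A": lambda rng, s: mixture_lognormal(rng, s, 0.9),
    "2B": lambda rng, s: mixture_lognormal(rng, s, 0.8),
    "2C": lambda rng, s: [rng.lognormvariate(1, 1) for _ in range(s)],
}


def generate_abundances(model, s=300, seed=0):
    """Draw s raw abundances and normalize them so that they sum to 1."""
    rng = random.Random(seed)
    raw = MODELS[model](rng, s)
    total = sum(raw)
    return [value / total for value in raw]


def cv(p):
    """Empirical coefficient of variation (standard deviation / mean)."""
    mean = sum(p) / len(p)
    variance = sum((value - mean) ** 2 for value in p) / len(p)
    return variance ** 0.5 / mean


for name in MODELS:
    print(name, round(cv(generate_abundances(name)), 3))
```

Because the CV is unchanged by the normalizing constant c, the printed empirical values can be compared directly with the theoretical CVs listed for each model.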
Species richness estimator (Appendix S4: Fig. S1; see Eq. 3a in the main text for the
estimator)
For each of the above six models (with 300 species), we considered a range of sample sizes (n
= 400 to 1000 in increments of 100). Then for each combination of abundance model and
sample size, 200 simulated data sets were generated. The empirical species richness and the
Chao1 estimator were computed based on each generated data set. The mean values of the
observed richness and the Chao1 estimates are plotted as a function of sample size in
Appendix S4: Fig. S1. Because the simulation procedures for the following five measures
(shared richness, PD, shared PD, FAD, and shared FAD) were generally parallel, details are
omitted.
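The simulation loop described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the stand-in abundance vector, the seeds, the reduced number of replicates in the demonstration, and the fallback used when no doubletons occur are all assumptions.

```python
import random
from collections import Counter


def chao1_from_counts(counts):
    """Return (S_obs, Chao1) from the abundances of the detected species."""
    n = sum(counts)
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        f0_hat = (n - 1) / n * f1 ** 2 / (2 * f2)
    else:
        # Assumed fallback (not given in this appendix): the commonly used
        # bias-corrected form for samples without doubletons.
        f0_hat = (n - 1) / n * f1 * (f1 - 1) / 2
    return s_obs, s_obs + f0_hat


def mean_performance(p, n, reps=200, seed=1):
    """Average observed richness and Chao1 over `reps` samples of size n."""
    rng = random.Random(seed)
    species = range(len(p))
    obs_total = est_total = 0.0
    for _ in range(reps):
        sample = Counter(rng.choices(species, weights=p, k=n))
        s_obs, est = chao1_from_counts(list(sample.values()))
        obs_total += s_obs
        est_total += est
    return obs_total / reps, est_total / reps


# Hypothetical stand-in for Model 1C: 300 normalized U(0, 1) draws.
rng = random.Random(0)
raw = [rng.random() for _ in range(300)]
total = sum(raw)
p = [value / total for value in raw]
for n in range(400, 1001, 300):
    print(n, mean_performance(p, n, reps=20))
```

Plotting the two means against n, together with a horizontal line at the true richness of 300, gives a curve of the kind summarized in Fig. S1.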
Shared species richness estimator (Appendix S4: Fig. S2; see Eq. 7d in the main text for the
estimator)
To examine the performance of the Chao1-shared species estimator, we simulated data from
two assemblages; the species relative abundances for both assemblages were generated from
one of the six models (Models 1A–1C and 2A–2C) described above. The number of species in
each assemblage was fixed to be 300, out of which 250 species were shared species and 50
were non-shared. Thus, there were a total of 350 species in the pooled assemblage.
Specifically, we first generated two sets of 300 species relative abundances from each model
and then sorted the abundances in each set from the largest to smallest. We then assigned the
first 250 species as the shared species between the two assemblages (i.e., the shared species were
the more abundant ones in each assemblage) and the remaining 50 species as the non-shared
species. The CV values for the two generated sets of species relative abundances are denoted
as CV1 and CV2 in each panel of Appendix S4: Fig. S2.
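The construction of the two assemblages can be sketched in Python as below. The appendix does not say how the shared species of one assemblage are matched to those of the other, so pairing them by abundance rank is an assumption of this example, as are the function name, the `draw` callback and the layout of the pooled species list.

```python
import random


def two_assemblages(draw, s=300, shared=250, seed=0):
    """Return two relative-abundance vectors over the pooled species list.

    draw(rng) returns one raw abundance. The pooled list holds the shared
    species first, then the species unique to Assemblage I, then those
    unique to Assemblage II.
    """
    rng = random.Random(seed)
    unique = s - shared
    vectors = []
    for _ in range(2):
        # Sort from largest to smallest so the shared species are the
        # more abundant ones, then normalize to relative abundances.
        raw = sorted((draw(rng) for _ in range(s)), reverse=True)
        total = sum(raw)
        vectors.append([value / total for value in raw])
    p1, p2 = vectors
    pooled1 = p1[:shared] + p1[shared:] + [0.0] * unique
    pooled2 = p2[:shared] + [0.0] * unique + p2[shared:]
    return pooled1, pooled2


# Example with Model 1C-style raw abundances, U(0, 1).
p1, p2 = two_assemblages(lambda rng: rng.random())
print(len(p1), sum(1 for a, b in zip(p1, p2) if a > 0 and b > 0))
```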
Phylogenetic diversity (PD) estimator (Appendix S4: Fig. S3; see Eq. 8d in the main text for
the estimator)
The one-assemblage species abundance structures were generated in the same manner as
those for species richness estimation. The rooted phylogenetic tree for 300 species in each
assemblage was generated by the R function “rtree” from the “ape” package.
Shared PD estimator (Appendix S4: Fig. S4; see Eq. 10d in the main text for the estimator)
The two-assemblage abundance structures were generated in the same manner as those for
shared species richness estimation. In each assemblage, the rooted phylogenetic tree for a total
of 350 species was generated by the R function “rtree” from the “ape” package.
Functional attribute diversity (FAD) estimator (Appendix S4: Fig. S5; see Eq. 12d in the main text for
the estimator)
The one-assemblage species abundance structures were generated in the same manner as
those for species richness estimation. The non-diagonal elements of the symmetric species-
pairwise distance matrix were generated from a uniform U(0, 1) distribution; the diagonal
elements were fixed to be zero.
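A distance matrix of this kind can be generated as in the following Python sketch. The function names and the seed are hypothetical. With 300 species there are 300 × 299 ordered pairs, each with expected distance 0.5, so the expected total is 44,850, which is in line with the true FAD of 44870.4 reported in the caption of Fig. S5.

```python
import random


def random_distance_matrix(s, seed=0):
    """Symmetric s x s matrix with U(0, 1) off-diagonals and a zero diagonal."""
    rng = random.Random(seed)
    dist = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(i + 1, s):
            dist[i][j] = dist[j][i] = rng.random()
    return dist


def total_fad(dist):
    """FAD as the sum of d_ij over all ordered species pairs."""
    return sum(sum(row) for row in dist)


dist = random_distance_matrix(300)
print(round(total_fad(dist), 1))
```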
Shared FAD estimator (Appendix S4: Fig. S6; see Eq. 14b in the main text for the estimator)
The two-assemblage abundance structures were generated in the same manner as those for
shared species richness estimation. The non-diagonal elements of the symmetric species-
pairwise distance matrix were generated from a uniform U(0, 1) distribution; the diagonal
elements were fixed to be zero.
Results and Conclusion
All figures (Appendix S4: Figs. S1–S6, below) reveal consistent and similar patterns. The
empirical approach typically underestimates, as the curve for the observed diversity estimator
(dashed line in each panel) lies below the true curve (dotted line in each panel). The
magnitude of the negative bias is substantial, especially for small sample sizes. As expected,
the mean curve for each estimator derived from the Good-Turing theory (red solid line) for
Models 1A, 1B, 2A and 2B matches well with the true parameter line, implying that each
estimator is nearly unbiased. For Models 1C and 2C, although our estimator provides only a
lower bound, as predicted from theory, the lower bound always improves the estimate over
the empirical approach and greatly reduces under-sampling bias of the observed diversity. As
discussed in the main text, when rare species are heterogeneous in abundances, data are not
sufficient to accurately estimate the true diversity, and statistically we can, at best, provide
only a lower bound. In this case, we recommend the use of our estimators as reliable and
informative lower bounds. Chiu et al. (2014) further evaluated the magnitude of the bias for
the Chao1 lower bound and derived an approximate formula for the first-order bias. They
estimated the approximate bias using additional information (namely, the numbers of
tripletons and quadrupletons) and proposed an improved lower bound; see Chiu et al. (2014)
for details.
Fig. S1. The performance of the Chao1 species richness estimator as a function of sample size
under six abundance models (dotted line = true richness = 300, red solid line = Chao1
estimator, dashed line = observed species richness). Each of the six panels plots species
richness (y-axis) against sample size (x-axis, 400 to 1000). Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV = 0.767
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV = 1.246
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV = 0.863
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV = 1.916
(Model 1C) U(0, 1), CV = 0.571
(Model 2C) lognormal (1, 1), CV = 1.431
Fig. S2. The performance of the Chao1-shared estimator as a function of sample size under
six abundance models (dotted line = true shared richness = 250, red solid line = Chao1-shared
estimator, and dashed line = observed shared species richness). Each of the six panels plots
shared richness (y-axis) against sample size (x-axis, 400 to 1000). Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV1 = 0.811, CV2 = 0.822
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV1 = 1.715, CV2 = 1.387
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV1 = 0.819, CV2 = 0.893
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV1 = 1.592, CV2 = 1.339
(Model 1C) U(0, 1), CV1 = 0.618, CV2 = 0.568
(Model 2C) lognormal (1, 1), CV1 = 1.132, CV2 = 1.297
Fig. S3. The performance of the Chao1-PD estimator as a function of sample size under six
abundance models and a generated phylogenetic tree (dotted line = true PD = 297.5, red solid
line = Chao1-PD estimator, dashed line = observed PD). Each of the six panels plots PD
(y-axis) against sample size (x-axis, 400 to 1000). Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV = 0.863
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV = 1.106
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV = 0.891
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV = 1.881
(Model 1C) U(0, 1), CV = 0.591
(Model 2C) lognormal (1, 1), CV = 1.338
Fig. S4. The performance of the Chao1-PD-shared estimator as a function of sample size
under six abundance models and a generated phylogenetic tree (dotted line = true shared PD =
250.4, red solid line = Chao1-PD-shared estimator, and dashed line = observed shared PD).
Each of the six panels plots shared PD (y-axis) against sample size (x-axis, 400 to 1000).
Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV1 = 0.855, CV2 = 0.744
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV1 = 1.322, CV2 = 1.283
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV1 = 0.902, CV2 = 0.874
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV1 = 1.611, CV2 = 1.416
(Model 1C) U(0, 1), CV1 = 0.563, CV2 = 0.557
(Model 2C) lognormal (1, 1), CV1 = 1.569, CV2 = 1.204
Fig. S5. The performance of the Chao1-FAD estimator as a function of sample size under six
abundance models and a generated distance matrix (dotted line = true FAD = 44870.4, red
solid line = Chao1-FAD estimator, and dashed line = observed FAD). Each of the six panels
plots FAD (y-axis) against sample size (x-axis, 400 to 1000). Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV = 0.78
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV = 1.43
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV = 0.87
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV = 1.57
(Model 1C) U(0, 1), CV = 0.58
(Model 2C) lognormal (1, 1), CV = 1.33
Fig. S6. The performance of the Chao1-FAD-shared estimator as a function of sample size
under six abundance models and a generated distance matrix (dotted line = true shared FAD =
31176.7, red solid line = Chao1-FAD-shared estimator, and dashed line = observed shared
FAD). Each of the six panels plots shared FAD (y-axis) against sample size (x-axis, 400 to
1000). Panel titles:
(Model 1A) 0.9 U(0.1, 0.2) + 0.1 U(0.2, 1), CV1 = 0.769, CV2 = 0.788
(Model 2A) 0.9 lognormal (1, 0.2) + 0.1 lognormal (2, 1), CV1 = 1.629, CV2 = 1.177
(Model 1B) 0.8 U(0.1, 0.2) + 0.2 U(0.2, 1), CV1 = 0.859, CV2 = 0.862
(Model 2B) 0.8 lognormal (1, 0.2) + 0.2 lognormal (2, 1), CV1 = 1.239, CV2 = 1.455
(Model 1C) U(0, 1), CV1 = 0.667, CV2 = 0.601
(Model 2C) lognormal (1, 1), CV1 = 1.268, CV2 = 1.333
References for Appendix S4
Chiu, C.-H., Y. T. Wang, B. A. Walther, and A. Chao. 2014. An improved non-parametric
lower bound of species richness via a modified Good-Turing frequency formula. Biometrics
70:671–682.
Colwell, R. K., C. X. Mao, and J. Chang. 2004. Interpolating, extrapolating, and comparing
incidence-based species accumulation curves. Ecology 85:2717–2727.
Mao, C. X. and R. K. Colwell. 2005. Estimation of species richness: mixture models, the role
of rare species, and inferential challenges. Ecology 86:1143–1153.