model for estimating diversity presentation

Post on 04-Jul-2015


DESCRIPTION

A simple probability model for estimating population diversity: a hybrid model that combines the Good and the Good-Toulmin estimators.

TRANSCRIPT

Model for Estimating Population Diversity as the Prediction of Sample needed for full Coverage

with Applications in Bioinformatics

Torres, David A.; Pericchi, Luis R.

Department of Mathematics

University of Puerto Rico, Rio Piedras.

Abstract

Several methods exist for estimating community diversity using coverage (Bunge and Fitzpatrick 1993). Biologists and environmental scientists challenge statisticians to solve this problem. Here we present an approach to the estimation that uses a coverage model (Good, I. J., 1953) and a population estimator (Good, I. J. and G. H. Toulmin, 1956). We apply the method to data on the microbial diversity of the crop of the hoatzin, obtained by molecular analysis of cloned 16S rRNA genes.

Introduction

• Estimating the number of species in a community is a classical problem in ecology, biogeography, and conservation biology, and parallel problems arise in many other disciplines. This research topic has been extensively discussed in the literature; see Bunge and Fitzpatrick (1993) and Seber (1982, 1986, 1992) for a review of the historical and theoretical development.

• Ecologists and other biologists have long recognized that there are undiscovered species in almost every survey or species inventory. A parallel problem asks how many words a particular author knew (Efron and Thisted, 1976).

• A random sample is taken from a community. We will refer to this sample as the basic sample.

• Our intention is to calculate an estimator of the coverage of the community using the information provided by the basic sample, and then to estimate the number of species in the community.

• Moreover, we intend to describe a method that gives an estimator of the number of additional observations needed to reach total coverage of the community.

• An example will be presented in order to apply the theory.

Methods

• A random sample of size $N$ is drawn from a community. Let $n_r$ be the number of distinct species represented exactly $r$ times in the sample; then

$$\sum_{r=1}^{\infty} r\,n_r = N. \qquad (1)$$

• We shall be concerned with $q_r$, the community frequency of an arbitrary species that is represented $r$ times in the basic sample.

• Let $E(q_r)$ be the expected value of $q_r$. A main result used by Good (1953) is that

$$E(q_r) = \frac{r^{*}}{N}, \qquad (2)$$

where $r^{*} = (r+1)\,\dfrac{n_{r+1}}{n_r}$.
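As a minimal sketch of equations (1) and (2), the snippet below tabulates the counts $n_r$ from a sample and evaluates $E(q_r) = r^{*}/N$. The sample, the function names and the use of raw (unsmoothed) counts are assumptions made for illustration only; Good (1953) discusses smoothing the $n_r$ before applying the formula.

```python
from collections import Counter


def frequency_of_frequencies(sample):
    """Return {r: n_r}: the number of distinct species seen exactly r times."""
    species_counts = Counter(sample)            # species -> times observed
    return dict(Counter(species_counts.values()))


def good_expected_frequency(n, r):
    """Estimate E(q_r) = r*/N with r* = (r + 1) * n_{r+1} / n_r (raw counts)."""
    N = sum(k * n_k for k, n_k in n.items())    # equation (1)
    if n.get(r, 0) == 0:
        raise ValueError("n_r is zero, so r* is undefined")
    r_star = (r + 1) * n.get(r + 1, 0) / n[r]
    return r_star / N


# Hypothetical basic sample: N = 10 individuals from 5 species.
sample = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "a"]
n = frequency_of_frequencies(sample)            # {4: 1, 2: 2, 1: 2}
q1 = good_expected_frequency(n, 1)              # 2 * n_2 / n_1 / N = 0.2
```

With raw counts the estimate is zero whenever $n_{r+1} = 0$, which is one reason smoothed counts are usually preferred in practice.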

• This can be generalized to give higher moments of $q_r$. As a matter of fact,

$$E(q_r^{m}) = \frac{(r+m)^{(m)}}{N^{m}}\,\frac{n_{r+m}}{n_r}, \qquad (3)$$

where $r = 1, 2, 3, \dots$; $m = 1, 2, 3, \dots$ and $t^{(m)} = \prod_{i=1}^{m} (t - i + 1)$.
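The moment formula (3) can be sketched in a few lines. The frequency table below is hypothetical (it is not data from this study) and the function names are illustrative. The final comparison checks numerically that the second moment factorises into first moments, which follows because the ratios $n_{i+1}/n_i$ telescope.

```python
from math import prod


def falling_factorial(t, m):
    """t^(m) = t (t - 1) ... (t - m + 1)."""
    return prod(t - i + 1 for i in range(1, m + 1))


def moment(n, r, m):
    """Estimate E(q_r^m) from equation (3), using raw counts n = {r: n_r}."""
    N = sum(k * n_k for k, n_k in n.items())
    return falling_factorial(r + m, m) / N**m * n.get(r + m, 0) / n[r]


# Hypothetical frequency-of-frequencies table, N = 63.
n = {1: 20, 2: 10, 3: 5, 4: 2}

second_moment = moment(n, 1, 2)                  # E(q_1^2)
product = moment(n, 1, 1) * moment(n, 2, 1)      # E(q_1) * E(q_2)
gap = abs(second_moment - product)               # zero up to rounding
```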

• Recursively, we can rewrite (3) as

$$E(q_r^{m}) \approx \prod_{i=r}^{r+m-1} E(q_i).$$

• Moreover, the variance of $q_r$ is approximately

$$V(q_r) = \frac{(r+1)(r+2)}{N^{2}}\,\frac{n_{r+2}}{n_r} - \left(\frac{(r+1)\,n_{r+1}}{N\,n_r}\right)^{2}.$$

• Note that $r\,n_r \le \sum_{i=1}^{\infty} i\,n_i = N$; then we have that $r^{*} n_r = (r+1)\,n_{r+1} \le N$.

An estimator of the expected total chance of all species that are each represented $r$ times ($r \ge 1$) in the basic sample is

$$\frac{(r+1)\,n_{r+1}}{N}.$$

Also, the expected total chance of all species that are represented $r$ times or more in the sample is approximately

$$\frac{1}{N}\sum_{i=r+1}^{\infty} i\,n_i.$$

In particular, note that the expected total chance of all species represented in the sample is approximately

$$\frac{1}{N}\sum_{k=2}^{\infty} k\,n_k \approx 1 - \frac{E(n_1)}{N}. \qquad (4)$$
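A short numerical sketch of the variance approximation and of the "total chance" sums may help. The table `n` is hypothetical and the function names are illustrative; raw counts are used, so the variance estimate is not guaranteed to be positive for every table.

```python
def variance_q(n, r):
    """Approximate V(q_r) = E(q_r^2) - E(q_r)^2 from raw counts n = {r: n_r}."""
    N = sum(k * n_k for k, n_k in n.items())
    second = (r + 1) * (r + 2) * n.get(r + 2, 0) / (N**2 * n[r])
    first = (r + 1) * n.get(r + 1, 0) / (N * n[r])
    return second - first**2


def total_chance_at_least(n, r):
    """Approximate total chance of species seen r or more times."""
    N = sum(k * n_k for k, n_k in n.items())
    return sum(i * n_i for i, n_i in n.items() if i >= r + 1) / N


# Hypothetical frequency-of-frequencies table, N = 63.
n = {1: 20, 2: 10, 3: 5, 4: 2}
N = sum(k * n_k for k, n_k in n.items())

var_q1 = variance_q(n, 1)                  # 1/2646 - 1/3969 = 1/7938
seen_chance = total_chance_at_least(n, 1)  # 43/63, i.e. 1 - n_1/N as in (4)
```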

• Hence, the total coverage of the sample (i.e. the proportion of the community represented in the sample, which is the sum of the population frequencies of the species represented) is approximately

$$1 - \frac{E(n_1)}{N} \approx 1 - \frac{n_1}{N}. \qquad (5)$$

The chance that the next member of the community sampled will belong to a new species is estimated as $n_1/N$.

Let us write the total number of distinct species in the sample as

$$d = \sum_{x=1}^{\infty} n_x,$$

and suppose that the total number of distinct species in the community is a known finite number $s$. Then the number of species not represented in the sample is given by $n_0 = s - d$.
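The quantities just defined can be collected in one small helper. This is only an illustrative sketch: the table `n` and the value `s=50` are hypothetical, and in practice $s$ is rarely known.

```python
def coverage_summary(n, s=None):
    """Summarise a basic sample from its counts n = {r: n_r}, r >= 1."""
    N = sum(r * n_r for r, n_r in n.items())   # sample size, equation (1)
    d = sum(n.values())                        # distinct species observed
    p_new = n.get(1, 0) / N                    # chance next draw is a new species
    summary = {"N": N, "d": d, "p_new": p_new, "coverage": 1 - p_new}
    if s is not None:
        summary["n0"] = s - d                  # species not represented
    return summary


# Hypothetical table; s = 50 is an assumed community size.
n = {1: 20, 2: 10, 3: 5, 4: 2}
summary = coverage_summary(n, s=50)
```

Here the coverage is $1 - 20/63 \approx 0.68$ and, under the assumed $s$, 13 species remain unseen.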

• Then let $p_\mu$ ($\mu = 1, 2, 3, \dots, s$) be the population frequencies of the species. As in Good (1953), equation (10),

$$E(n_r) = \sum_{\mu=1}^{s} \frac{N!}{r!\,(N-r)!}\,p_\mu^{r}\,(1-p_\mu)^{N-r}. \qquad (6)$$

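For an enlarged sample of size $\lambda N$, writing $n_r(\lambda)$ for the number of species represented exactly $r$ times in it, we have similarly

$$E(n_r(\lambda)) = \sum_{\mu=1}^{s} \frac{(\lambda N)!}{r!\,(\lambda N - r)!}\,p_\mu^{r}\,(1-p_\mu)^{\lambda N - r}.$$

Assuming $p_\mu \le \tfrac{1}{2}$ for all $\mu$, the factor $(1-p_\mu)^{(\lambda-1)N}$ can be expanded as a binomial series in $p_\mu/(1-p_\mu)$:

$$
\begin{aligned}
E(n_r(\lambda)) &= \sum_{\mu=1}^{s} \frac{(\lambda N)!}{r!\,(\lambda N - r)!}\,p_\mu^{r}\,(1-p_\mu)^{N-r}\left(1 + \frac{p_\mu}{1-p_\mu}\right)^{-(\lambda-1)N} \\
&= \sum_{\mu=1}^{s}\sum_{i=0}^{\infty} (-1)^{i}\,\frac{(\lambda N)!}{r!\,(\lambda N - r)!}\,\frac{((\lambda-1)N + i - 1)!}{i!\,((\lambda-1)N - 1)!}\,p_\mu^{r+i}\,(1-p_\mu)^{N-r-i} \\
&= \sum_{i=0}^{\infty} (-1)^{i}\,\frac{(\lambda N)!}{r!\,(\lambda N - r)!}\,\frac{((\lambda-1)N + i - 1)!}{i!\,((\lambda-1)N - 1)!}\,\frac{(r+i)!\,(N-r-i)!}{N!}\,E(n_{r+i}) \\
&\approx \sum_{i=0}^{\infty} (-1)^{i}\,\frac{(r+i)!}{r!\,i!}\,\lambda^{r}\,(\lambda-1)^{i}\,E(n_{r+i}). \qquad (7)
\end{aligned}
$$

The third line uses (6) with $r$ replaced by $r+i$, and the last line holds for large $N$, where the factorial ratios reduce to $(\lambda N)^{r}$, $((\lambda-1)N)^{i}$ and $N^{-(r+i)}$.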
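Equation (7) can be evaluated directly once the expectations $E(n_{r+i})$ are replaced by observed counts. The sketch below does exactly that, under two stated assumptions: the series is truncated at the largest observed count, and the table `n` is hypothetical. The alternating series is known to become unstable as $\lambda$ grows (roughly beyond $\lambda = 2$), so results for large extrapolations should be treated with caution.

```python
from math import comb


def expected_n_r(n, r, lam):
    """Equation (7) with E(n_{r+i}) replaced by raw counts n = {r: n_r}.

    Intended for r >= 1; the sum stops at the largest observed count.
    """
    r_max = max(n)
    return sum(
        (-1) ** i * comb(r + i, i) * lam**r * (lam - 1) ** i * n.get(r + i, 0)
        for i in range(r_max - r + 1)
    )


# Hypothetical frequency-of-frequencies table.
n = {1: 20, 2: 10, 3: 5, 4: 2}

same_sample = expected_n_r(n, 1, 1.0)   # lambda = 1 returns n_1 itself
enlarged = expected_n_r(n, 1, 1.5)      # 30 - 15 + 5.625 - 1.5 = 19.125
```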

• For the case $r = 0$ we do not need to assume a value for $s$, since this assumption is not required in order to write

$$\hat{d}(\lambda) = d - \sum_{i=1}^{\infty} (-1)^{i}\,(\lambda-1)^{i}\,n_i = s - n_0(\lambda). \qquad (8)$$

• We may be particularly interested in the coverage of the community. Using equations (5) and (7) with $r = 1$, the expected coverage is approximately

$$1 - \frac{n_1(\lambda)}{\lambda N} \approx 1 - \frac{1}{N}\left[\,n_1 - 2(\lambda-1)\,n_2 + 3(\lambda-1)^{2}\,n_3 - \cdots\right]. \qquad (9)$$

For $\lambda = 1$ the left side reduces to Good's coverage estimate $1 - n_1/N$ of the basic sample.

• The expected number of distinct species represented is approximately

$$d + (\lambda-1)\,n_1 - (\lambda-1)^{2}\,n_2 + \cdots$$

• We use the coverage to estimate the value of $\lambda$ and, from it, the sample size needed to reach 100% coverage. Equation (9) is what we call the Good-Toulmin model, because it merges the two models proposed by these authors.
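One way to "use the coverage to estimate $\lambda$" is to solve the right side of (9) numerically for the $\lambda$ at which it reaches a target value. The sketch below uses plain bisection; the choice of bisection, the truncation of the series at the largest observed count, the bracket `[1, 2]` and the table `n` are all assumptions for illustration and are not taken from the original analysis.

```python
def coverage(n, lam):
    """Right side of equation (9), truncated at the largest observed count."""
    N = sum(r * n_r for r, n_r in n.items())
    x = lam - 1.0
    series = sum((-1) ** (r - 1) * r * x ** (r - 1) * n_r for r, n_r in n.items())
    return 1.0 - series / N


def distinct_species(n, lam):
    """d + (lam - 1) n_1 - (lam - 1)^2 n_2 + ..., truncated likewise."""
    x = lam - 1.0
    return sum(n.values()) + sum((-1) ** (r - 1) * x**r * n_r for r, n_r in n.items())


def lambda_for_coverage(n, target=1.0, lo=1.0, hi=2.0, tol=1e-10):
    """Bisection for the lambda in [lo, hi] where coverage(n, lambda) = target."""
    f_lo = coverage(n, lo) - target
    if f_lo * (coverage(n, hi) - target) > 0:
        raise ValueError("target coverage is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = coverage(n, mid) - target
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2


# Hypothetical table with N = 53 and d = 27.
n = {1: 10, 2: 10, 3: 5, 4: 2}
lam = lambda_for_coverage(n)           # between 1.7 and 1.8 for this table
total_sample = lam * 53                # sample size predicted for full coverage
```

The predicted sample size is $\lambda N$, so the number of additional observations is $(\lambda - 1)N$.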

Application

• The hoatzin is a South American leaf-eating bird. Its uniqueness lies in its specialized foregut (crop), the only one known in the avian class.

• Forestomach compartmentalization allows mammalian herbivores to be nourished by microbial fermentation products and microbial biomass. Bacteria are largely responsible for the fermentation of dietary components, and bacterial cells are themselves subject to digestion by gastric lysozyme expressed in the abomasum of ruminants.

• The evolutionary pressure towards foregut specialization in herbivores was presumably exerted by indigestible plant polymers (cellulose), so that producing microbial biomass at the expense of these indigestible materials has clear advantages.

• In the hoatzin, a preliminary characterization of the crop microflora was done by culture (Domínguez-Bello et al., 1993). In this study we aim to characterize the bacterial diversity in the crop of the hoatzin by a molecular analysis of cloned 16S rRNA genes.

Results

• For the 69 OTUs obtained, Good's method (left side of equation (9)) indicated a coverage of diversity of 77%.

• This means that 100% diversity corresponds to about 90 OTUs. Given that, applying the Good and Toulmin model (figure 2), we estimate λ = 1.5, which means that about 300 clones are required in total, so we need 98 (300 − 202) additional clones to obtain the 21 (90 − 69) OTUs needed to cover 100% of the diversity.
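The arithmetic behind these figures can be retraced from the numbers quoted in the text. The snippet below is only a check of that arithmetic; dividing the observed OTUs by the coverage is assumed to be how the figure of 90 was obtained, and the variable names are illustrative.

```python
clones = 202            # sequences in the basic sample (N), from the text
otus_observed = 69      # distinct OTUs in the basic sample (d), from the text
coverage = 0.77         # Good's coverage estimate, from the text
lam = 1.5               # estimated lambda, from the text

# Assumed step: total OTUs = observed OTUs / coverage.
otus_total = round(otus_observed / coverage)     # 69 / 0.77 = 89.6, i.e. 90

clones_needed = lam * clones                     # 303.0, quoted as 300
additional_clones = 300 - clones                 # 98, as quoted
```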

Conclusions (Application)

• The estimate indicates that 300 clones are needed to represent 100% of the sample diversity. 99% of the clones and 88% of the OTUs analyzed are unidentified species.

• Based on 202 sequences yielding 69 OTUs, the Good and Toulmin estimator indicates a coverage of 77% of the total diversity.

Future Research

• There are many models and procedures for calculating coverage; instead of using Good's estimator of coverage, it would be interesting to try another approach. Perhaps a Poisson process or a multinomial approach would give better estimators. Another approach could be Bayesian inference, without assuming a known distribution, using a Metropolis-Hastings procedure.

• The importance of this type of problem lies in experimental design.

• Good once stated: “I don’t believe it is usually possible to estimate the number of unseen species … but only an approximate lower bound to that number.” We will keep on the road.

Literature cited

• Godoy, F., Gao, Z., Pei, Z., Zhou, M., Garcia-Amado, M. A., Pericchi, L. R., Torres, D., Michelangeli, F., Blaser, M. J., Domínguez-Bello, M. G. High bacterial diversity in the forestomach of the hoatzin is revealed by molecular analysis of 16S rRNA genes. Affiliations: Department of Biology, University of Puerto Rico, Rio Piedras, San Juan, PR 00931 (Godoy, Domínguez-Bello); Departments of Medicine, Pathology and Microbiology, New York University School of Medicine, New York, NY 10016 (Gao, Pei, Zhou, Blaser); Venezuelan Institute of Scientific Research, CBB, Caracas, Venezuela (Garcia-Amado, Michelangeli); Department of Mathematics, University of Puerto Rico, Rio Piedras, San Juan, PR 00931 (Pericchi, Torres).

• Chao, A. and S. Lee, 1992. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association 87: 210-217.

• Domínguez-Bello, M. G., M. Lovera, P. Suarez and F. Michelangeli, 1993. Microbial inhabitants in the crop of the hoatzin (Opisthocomus hoazin): the only foregut fermented avian. Physiol. Zool. 66: 374-383.

• Good, I. J. and G. H. Toulmin, 1956. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43: 45-63.

• Good, I. J., 1953. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 40: 237-264.
