Three Frameworks for Statistical Analysis
Sample Design
Forest, N=6
Field, N=4
Count ant nests per quadrat
ID   Habitat   Number of ant nests per quadrat
1    Forest    9
2    Forest    6
3    Forest    4
4    Forest    6
5    Forest    7
6    Forest    10
7    Field     12
8    Field     9
9    Field     12
10   Field     10
Three Frameworks for Statistical Analysis
• Monte Carlo Analysis
• Parametric Analysis
• Bayesian Analysis
The model
• y_i is a measurement on a “continuous” scale for observation i, which belongs to one type of habitat
• x_i is an indicator or dummy variable for the groups (0, 1)
• The model includes three parameters:
• α: the mean of the reference group (x_i = 0)
• β: the mean difference between groups, and
• σ²: the variance of the normal distribution from which the residuals ε_i are assumed to come
For the parametric and Bayesian analyses:
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{Normal}(0, \sigma^2)$$
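A minimal sketch of how the three parameters relate to the ant nest data, using closed-form least squares. The 0/1 coding (Forest = 0, Field = 1) is an assumption; the slides do not state which habitat is the reference group.

```python
from statistics import mean

# Nest counts from the sample design table.
forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

# Assumed dummy coding (not stated in the slides): Forest = 0, Field = 1.
y = forest + field
x = [0] * len(forest) + [1] * len(field)

# Closed-form least squares for y_i = alpha + beta * x_i + eps_i.
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Residual variance with n - 2 degrees of freedom (two mean parameters estimated).
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in residuals) / (len(y) - 2)

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, sigma^2 = {sigma2:.3f}")
```

Under this coding, alpha comes out as the Forest mean and beta as the Field minus Forest difference.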
Monte Carlo Analysis
Involves a number of methods in which data are randomized or reshuffled so that observations are randomly reassigned to different treatment groups. This randomization specifies the null hypothesis under consideration.
Monte Carlo Analysis
1. Specify a test statistic or index to describe the pattern in the data
2. Create a distribution of the test statistic that would be expected under the null hypothesis
3. Decide on a one- or two-tailed test
4. Compare the observed test statistic to a distribution of simulated values and estimate the appropriate P value as a tail probability
1. Specifying the Test Statistic
$$DIF_{obs} = 10.75 - 7.00 = 3.75$$
2. Creating the Null Distribution
3. Deciding on a One or Two tailed Test
[Figure: null distribution of the simulated differences with the threshold marked; Abs(difference) = 3.750, P = 0.036]
4. Calculating the Tail Probability
Inequality          N
DIFsim > DIFobs     7
DIFsim = DIFobs     29
DIFsim < DIFobs     964
P = (7 + 29)/1000 = 36/1000 = 0.036
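The four steps can be sketched as a short randomization test. This is an illustration rather than the original analysis: the seed is arbitrary, the absolute difference between means is assumed as the test statistic, and the counts will differ slightly from the table above because each run reshuffles the data differently.

```python
import math
import random
from statistics import mean

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

def abs_diff(a, b):
    """Test statistic: absolute difference between the two group means."""
    return abs(mean(a) - mean(b))

dif_obs = abs_diff(field, forest)

rng = random.Random(1)  # arbitrary seed, used only to make the run repeatable
pooled = forest + field
n_sim = 1000
greater = equal = less = 0

for _ in range(n_sim):
    rng.shuffle(pooled)
    # Reassign the first 6 shuffled values to "Forest" and the last 4 to "Field".
    dif_sim = abs_diff(pooled[len(forest):], pooled[:len(forest)])
    if math.isclose(dif_sim, dif_obs):
        equal += 1
    elif dif_sim > dif_obs:
        greater += 1
    else:
        less += 1

# Tail probability: simulated values at least as large as the observed one.
p_value = (greater + equal) / n_sim
print(f"DIFobs = {dif_obs:.2f}")
print(f"DIFsim > DIFobs: {greater}, = DIFobs: {equal}, < DIFobs: {less}")
print(f"P = {p_value:.3f}")
```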
Differences between means
[Figure: distribution of differences between means; Difference = 3.7500, P1 = 0.0228]
$$difference_{obs} = 10.75 - 7.00 = 3.75$$
Assumptions
• The data collected represent random, independent samples
• The test statistic describes the pattern of interest
• The randomization creates an appropriate null distribution for the question
Advantages
• It makes clear and explicit the underlying assumptions and the structure of the null hypothesis
• It does not require the assumption that the data are sampled from a specified probability distribution, such as the normal
Disadvantages
• It is computer intensive and is not included in most traditional statistical packages
• Different analyses of the same data set can yield slightly different answers
• The domain of inference for a Monte Carlo analysis is subtly more restrictive than that for a parametric analysis
Parametric analysis
• Refers to statistical tests built on the assumption that the data being analyzed were sampled from a specified distribution
• Most statistical tests specify the normal distribution
Parametric analysis
1. Specify the test statistic
2. Specify the null distribution
3. Calculate the tail probability
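1. Specify the test statistic: t test
$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$
$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{df_1 \, s_1^2 + df_2 \, s_2^2}{df_T} \left( \frac{1}{N_1} + \frac{1}{N_2} \right)}$$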
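A minimal sketch of the pooled-variance t statistic applied to the ant nest counts. It assumes that the total degrees of freedom are df_T = df_1 + df_2 = N_1 + N_2 - 2, which the slide does not spell out.

```python
import math
from statistics import mean, variance

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

def pooled_t(sample1, sample2):
    """Two-sample t statistic with a pooled variance; returns (t, df_T)."""
    n1, n2 = len(sample1), len(sample2)
    df1, df2 = n1 - 1, n2 - 1
    df_t = df1 + df2  # assumed: total df = N1 + N2 - 2
    pooled_var = (df1 * variance(sample1) + df2 * variance(sample2)) / df_t
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / se_diff, df_t

t_stat, df_total = pooled_t(forest, field)
print(f"t = {t_stat:.5f}, df = {df_total}")
```

With Forest as the first sample the sign is negative, because the Forest mean is the smaller of the two.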
[Figure: null hypothesis; panel labels: Forest, Field]
2. Specify the null distribution
[Figure: null distribution of t with the critical value marked]
3. Calculate the tail probability: Student’s t table
df\p   0.4       0.25      0.1       0.05      0.025     0.01      0.005     0.0005
1      0.32492   1         3.077684  6.313752  12.7062   31.82052  63.65674  636.6192
2      0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3      0.276671  0.764892  1.637744  2.353363  3.18245   4.5407    5.84091   12.924
4      0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5      0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688
6      0.264835  0.717558  1.439756  1.94318   2.44691   3.14267   3.70743   5.9588
7      0.263167  0.711142  1.414924  1.894579  2.36462   2.99795   3.49948   5.4079
8      0.261921  0.706387  1.396815  1.859548  2.306     2.89646   3.35539   5.0413
http://www.statsoft.com/textbook/sttable.html#t
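The table lookup can be cross-checked by integrating the density of Student's t distribution numerically. The sketch below uses Simpson's rule; the function names and the number of intervals are illustrative choices, not part of the original analysis.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_obs, df, n_intervals=2000):
    """Two-tailed P via Simpson's rule on [0, |t_obs|] (n_intervals must be even)."""
    upper = abs(t_obs)
    h = upper / n_intervals
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, n_intervals):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = total * h / 3  # P(0 < T < |t_obs|)
    return 1 - 2 * central_area

p_obs = two_tailed_p(-2.96319, 8)  # t and df from the pooled-variance test
p_crit = two_tailed_p(2.306, 8)    # table entry for df = 8, p = 0.025 per tail
print(f"P(|T| >= 2.96319) = {p_obs:.3f}")
print(f"P(|T| >= 2.306)   = {p_crit:.3f}")
```

Because the table lists one-tailed probabilities, the 0.025 column corresponds to a two-tailed threshold of 0.05.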
Results of t-test
                              Levene's Test for Equality of Variances   t-test for Equality of Means
                              F        Sig.                             t          df     Sig. (2-tailed)   Mean Difference
Equal variances assumed       0.4255   0.5324                           -2.96319   8      0.018             -3.75
Equal variances not assumed                                             -3.21265   7.95   0.012             -3.75
Habitat   N   Mean    Std. Deviation   Std. Error Mean
Forest    6   7       2.19             0.89
Field     4   10.75   1.5              0.75
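The "Equal variances not assumed" row can be reproduced with the unequal-variance t statistic. The sketch below assumes that row uses the Welch-Satterthwaite approximation for the degrees of freedom, which is the usual approach but is not named in the slides.

```python
import math
from statistics import mean, variance

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

def welch_t(sample1, sample2):
    """Unequal-variance t statistic and Welch-Satterthwaite df."""
    v1 = variance(sample1) / len(sample1)  # squared standard error of mean 1
    v2 = variance(sample2) / len(sample2)  # squared standard error of mean 2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(sample1) - 1) + v2 ** 2 / (len(sample2) - 1))
    return t, df

t_welch, df_welch = welch_t(forest, field)
print(f"t = {t_welch:.5f}, df = {df_welch:.2f}")
```

The square roots of v1 and v2 are the "Std. Error Mean" values in the group statistics table.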
Assumptions
• The data collected represent random, independent samples
• The data were sampled from a specified distribution
Advantages
• It uses a powerful framework based on known probability distributions
Disadvantages
• It may not be as powerful as sophisticated Monte Carlo models that are tailored to particular questions or data
• It rarely incorporates a priori information or results from other experiments
What About Non-Parametric Analyses?
• Essentially, these analyses give the P-values that would be obtained by ranking the observations and then performing randomization tests on the ranked data
• Like other resampling methods, non-parametric analyses do not require distributional assumptions.
• However, they have less power than the equivalent parametric tests and can only be used with simple experimental designs.
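The first point can be illustrated by ranking the ant nest counts and running a randomization test on the ranks. This sketch swaps random reshuffling for exhaustive enumeration, which is feasible here because there are only 210 ways to label 4 of 10 quadrats as Field. The midrank treatment of ties and the choice of test statistic (absolute difference in mean rank) are assumptions.

```python
from itertools import combinations
from statistics import mean

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1         # lowest rank held by v
        last = first + ordered.count(v) - 1  # highest rank held by v
        rank_of[v] = (first + last) / 2
    return [rank_of[v] for v in values]

ranks = midranks(forest + field)
n_forest = len(forest)
obs = abs(mean(ranks[n_forest:]) - mean(ranks[:n_forest]))

# Enumerate every way of labelling 4 of the 10 ranks as "Field".
total_rank = sum(ranks)
extreme = n_total = 0
for idx in combinations(range(len(ranks)), len(field)):
    field_sum = sum(ranks[i] for i in idx)
    field_mean = field_sum / len(field)
    forest_mean = (total_rank - field_sum) / n_forest
    n_total += 1
    # Small tolerance so that ties with the observed value are counted.
    if abs(field_mean - forest_mean) >= obs - 1e-9:
        extreme += 1

p_value = extreme / n_total
print(f"{extreme} of {n_total} relabellings are at least as extreme; P = {p_value:.4f}")
```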
Bayesian analysis
• It includes prior information and then uses current data to build on earlier results
• It also allows us to quantify the probability of the observed difference [i.e., P(Ha|data)]
Bayesian analysis
1. Specify the hypothesis
2. Specify parameters as random variables
3. Specify the prior probability distribution
4. Calculate the likelihood
5. Calculate the posterior probability distribution
6. Interpret the results
1. Specify the hypothesis
• The primary goal of a Bayesian analysis is to determine the probability of the hypothesis given the data P(H | data)
• The hypothesis needs to be quite specific, and it needs to be quantitative:
P(diff > 2 | diffobs = 3.75)
P(hypothesis | data)
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{hypothesis}) \, P(\text{data} \mid \text{hypothesis})}{P(\text{data})}$$
The left-hand side of the equation is called the posterior probability distribution, and it is the quantity of interest.
The right-hand side of the equation consists of a fraction. In the numerator, the term P(hypothesis) is the prior probability distribution: the probability of the hypothesis of interest before you conducted the experiment.
The next term in the numerator is referred to as the likelihood of the data; it reflects the probability of observing the data given the hypothesis.
The denominator is a normalizing constant that reflects the probability of the data given all possible hypotheses. It scales the posterior probability distribution to the range [0,1].
$$P(\text{hypothesis} \mid \text{data}) \propto P(\text{hypothesis}) \, P(\text{data} \mid \text{hypothesis})$$
We can focus our attention on the numerator
2. Specify the parameters as random variables
$$\lambda_{field} \sim N(\mu_{field}, \sigma^2)$$
$$\lambda_{forest} \sim N(\mu_{forest}, \sigma^2)$$
The type of random variable used for each population parameter should reflect biological reality or mathematical convenience
3. Specify the prior probability distribution
• We can combine and re-analyze data from the literature, talk to experts, etc. to come up with reasonable estimates for the density of ant nests in fields and forests
• OR, we can use an “uninformative prior”, for which we initially estimate the density of ant nests to be equal to zero and the variances to be very large
$$\lambda_{forest} \sim N(0, 10000)$$
2/1~
sigma ~ dunif(0,10)
WinBUGS code
model {
  # Priors
  mu1 ~ dnorm(0, 0.001)
  delta ~ dnorm(0, 0.001)
  tau <- 1 / (sigma * sigma)
  sigma ~ dunif(0, 10)

  # Likelihood
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- mu1 + delta * x[i]
    residual[i] <- y[i] - mu[i]
  }

  # Derived quantities
  mu2 <- mu1 + delta
}
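For readers without WinBUGS, the same model can be approximated with a hand-written sampler. The sketch below uses a random-walk Metropolis algorithm, which is not the sampler WinBUGS itself uses. The seed, starting values, proposal widths, iteration counts, and the Forest = 0, Field = 1 coding are all hypothetical choices made for illustration.

```python
import math
import random

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]
y = forest + field
x = [0] * len(forest) + [1] * len(field)  # assumed coding: Forest = 0, Field = 1

def log_posterior(mu1, delta, sigma):
    """Log of prior x likelihood, up to an additive constant."""
    if not 0 < sigma < 10:  # sigma ~ dunif(0, 10)
        return -math.inf
    log_prior = -0.5 * 0.001 * (mu1 ** 2 + delta ** 2)  # dnorm(0, precision 0.001)
    sse = sum((yi - (mu1 + delta * xi)) ** 2 for xi, yi in zip(x, y))
    log_lik = -len(y) * math.log(sigma) - sse / (2 * sigma ** 2)
    return log_prior + log_lik

rng = random.Random(42)   # arbitrary seed for repeatability
state = (7.0, 0.0, 2.0)   # hypothetical starting values (mu1, delta, sigma)
current = log_posterior(*state)
step = (0.8, 1.2, 0.5)    # hypothetical proposal standard deviations
n_iter, burn_in = 60000, 10000
delta_draws = []

for it in range(n_iter):
    proposal = tuple(s + rng.gauss(0, w) for s, w in zip(state, step))
    candidate = log_posterior(*proposal)
    # Metropolis rule: accept with probability min(1, posterior ratio).
    if rng.random() < math.exp(min(0.0, candidate - current)):
        state, current = proposal, candidate
    if it >= burn_in:
        delta_draws.append(state[1])

post_mean = sum(delta_draws) / len(delta_draws)
p_gt_2 = sum(d > 2 for d in delta_draws) / len(delta_draws)
print(f"posterior mean of delta = {post_mean:.2f}")
print(f"P(delta > 2 | data) = {p_gt_2:.2f}")
```

The proportion of retained draws with delta above 2 is the sampler's estimate of P(diff > 2 | data).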
Comparison between approaches
• Parametric
  • Null hypothesis: P(data | H0)
  • P(tobs = 2.96 | t > F theoretical = 1.86)
  • Parameters are fixed
• Bayesian
  • Hypothesis: P(H | data)
  • P(diff > 2 | diffobs = 3.75)
  • Parameters are random variables
4. Calculate the likelihood
[Figure: likelihood panels for Field and Forest]
The likelihood is a distribution that is proportional to the probability of the observed data given the hypothesis
[Figure: likelihood surface with the maximum likelihood marked; axes: field mean and field variance]
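A minimal sketch of the likelihood for the Field counts, assuming a normal model with unknown mean and variance. It compares the closed-form maximum likelihood estimates with a coarse grid search; the grid limits and spacing are arbitrary choices.

```python
import math

field = [12, 9, 12, 10]

def log_likelihood(mu, var, data):
    """Log-likelihood of data under Normal(mu, var)."""
    n = len(data)
    sse = sum((v - mu) ** 2 for v in data)
    return -0.5 * n * math.log(2 * math.pi * var) - sse / (2 * var)

# Closed-form maximum likelihood estimates for a normal sample.
mle_mean = sum(field) / len(field)
mle_var = sum((v - mle_mean) ** 2 for v in field) / len(field)  # divides by n, not n - 1

# A coarse grid search over (mean, variance) as a cross-check.
grid = [(m / 4, v / 16) for m in range(32, 57) for v in range(8, 81)]
best = max(grid, key=lambda p: log_likelihood(p[0], p[1], field))

print(f"closed form: mean = {mle_mean}, variance = {mle_var}")
print(f"grid search: mean = {best[0]}, variance = {best[1]}")
```

The maximum likelihood variance divides by n, so it is smaller than the sample variance implied by the standard deviation in the t-test output.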
5. Calculate the posterior probability distribution
• We multiply the prior by the likelihood, and divide by the normalizing constant
• In contrast to the results of the parametric or Monte Carlo analysis, the result of a Bayesian analysis is a probability distribution, not a single P-value
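Multiplying the prior by the likelihood and dividing by the normalizing constant can be sketched on a grid of parameter values. The priors follow the WinBUGS model in this deck; the grid limits and spacing are hypothetical, and the result is a rough approximation rather than the MCMC output reported here.

```python
import math

forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]
n = len(forest) + len(field)

def frange(start, stop, step):
    """Evenly spaced grid from start to stop inclusive."""
    count = int(round((stop - start) / step)) + 1
    return [start + k * step for k in range(count)]

mu1_grid = frange(0.0, 14.0, 0.25)     # forest mean (hypothetical grid limits)
delta_grid = frange(-6.0, 14.0, 0.25)  # field minus forest
sigma_grid = frange(0.25, 10.0, 0.25)  # inside the dunif(0, 10) prior range

# Unnormalized posterior = prior x likelihood, summed over mu1 and sigma
# to give the marginal weight of each delta value. The uniform prior on
# sigma is constant over the grid, so it drops out.
delta_weight = []
for delta in delta_grid:
    weight = 0.0
    for mu1 in mu1_grid:
        prior = math.exp(-0.5 * 0.001 * (mu1 ** 2 + delta ** 2))
        sse = (sum((v - mu1) ** 2 for v in forest)
               + sum((v - mu1 - delta) ** 2 for v in field))
        for sigma in sigma_grid:
            likelihood = sigma ** (-n) * math.exp(-sse / (2 * sigma ** 2))
            weight += prior * likelihood
    delta_weight.append(weight)

# Dividing by the normalizing constant turns the weights into probabilities.
norm_const = sum(delta_weight)
posterior = [w / norm_const for w in delta_weight]

post_mean = sum(d * p for d, p in zip(delta_grid, posterior))
p_gt_2 = sum(p for d, p in zip(delta_grid, posterior) if d > 2)
print(f"posterior mean of delta = {post_mean:.2f}")
print(f"P(delta > 2 | data) = {p_gt_2:.2f}")
```

The list `posterior` is the whole distribution of delta, which is what distinguishes this output from a single P-value.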
Bayesian output
[Figure: box plot of a[1] and a[2]; posterior densities of a[1] and a[2] (sample: 650000 each), panels labelled Field and Forest; posterior density of delta, the difference (chains 1:3, sample: 2997)]
Estimates
                             Estimator
Analysis                     delta (slope)   λForest   λField   σForest   σField
Parametric                   3.75 (1.27)     7.00      10.75    0.98      0.87
Bayesian, uninformed prior   3.75 (1.61)     7.00      10.74    1.01      1.22
6. Interpreting the Results
• Given the Bayesian estimate of the mean difference (3.698), P(diff > 2 | diffobs = 3.75) = 0.87 (2607 of 2997 posterior samples).
• In other words, the analysis indicates a probability of 0.87 that ant nest densities in the two habitats differ by more than 2 nests.
Assumptions
• The data collected represent random, independent samples
• The parameters to be estimated are random variables with known distributions
Advantages
• It allows for the explicit incorporation of prior information, and the results from one experiment can be used to inform subsequent experiments
• The results are interpreted in an intuitively straightforward way, and the inferences are conditional on both the observed data and the prior information
Disadvantages
• It poses computational challenges and requires conditioning the hypothesis on the data
• Potential lack of objectivity, because different results will be obtained using different priors