BIOL 582
Lecture Set 17: Analysis of frequency and categorical data
Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence
BIOL 582 Expansion of Goodness of Fit Tests
• The first two examples included one frequency distribution and some known or true expectation.
• The first two examples included categorical data
• There are two different ways we can (and will) go
1. Goodness of Fit tests for continuous frequency data
2. Goodness of Fit tests for more than one distribution
• We have to start with one of these, so let’s start with 1.
• Before proceeding, it is important to establish two different kinds of hypotheses that are used as “null” models for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or a larger empirical pool of information (species proportions). These are extrinsic hypotheses for the basis of expected frequencies. Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test if a continuous frequency distribution is normal, we can generate expected frequencies, but we would first need to estimate the mean and variance from the sample. Thus, the degrees of freedom for the test are reduced by one for each of these additional parameter estimates.
BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies
• We have done these types of tests nearly all semester!
• Kolmogorov-Smirnov and Shapiro-Wilk are such tests
• An old-fashioned way to do GOF tests was to break the continuous data into classes, find the expected frequency of each class, and proceed as we just learned.
• For example, we could compare the heights of the columns (observed) to the points on the red line at the center of each column (expected) to test if the continuous frequency distribution for this test statistic is normally distributed.
Red notches indicate the height of the curve at the centers of bins, which would indicate expected frequencies/densities
• This method is no longer considered appropriate (as changing the number of columns can change the outcome)
• We will use the K-S test as a standard, non-parametric GOF between one distribution and either an intrinsic or extrinsic expectation of its frequency.
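The sensitivity of the old-fashioned binned approach to the number of classes can be sketched in R. This is a minimal illustration under stated assumptions: the data are simulated (not from the lecture), `binned.gof` is a hypothetical helper name, the classes are equal-width over the observed range, and the expected proportions are rescaled to ignore the tails outside that range.

```r
# Hypothetical sample: 100 simulated values (not the lecture's data)
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)

# Binned G goodness-of-fit test against a normal curve (intrinsic hypothesis)
binned.gof <- function(x, k) {
  # k equal-width classes spanning the observed range
  breaks <- seq(min(x), max(x), length.out = k + 1)
  cls <- findInterval(x, breaks, all.inside = TRUE)   # class index, 1..k
  obs <- tabulate(cls, nbins = k)                      # observed frequencies
  # Expected proportions from a normal curve with the sample mean and sd
  p <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))
  expd <- length(x) * p / sum(p)                       # rescaled to sum to n
  G <- 2 * sum(ifelse(obs > 0, obs * log(obs / expd), 0))
  df <- k - 1 - 2       # two parameters (mean, sd) were estimated from the sample
  c(k = k, G = G, df = df, P = pchisq(G, df, lower.tail = FALSE))
}

# Same data, two different numbers of classes
res <- rbind(binned.gof(x, 6), binned.gof(x, 12))
res
```

Comparing the two rows shows how G, the df, and the P-value all move when only the number of classes changes.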
BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
• What is the K-S test in a nutshell?
• The K-S test orders data from lowest to highest
• The observed “cumulative relative frequency” distribution is calculated by dividing rank by n (i.e., 1/n, 2/n, 3/n, …, n/n)
• In a stepwise fashion, a cumulative frequency function produces the expected cumulative relative frequency at each of the n steps
• The difference between observed and expected frequencies is measured at each step
• The largest absolute (vertical) distance is used as a test statistic
• This distance is compared to critical values from a Kolmogorov distribution (you can see what that is on your own)
• There might be some “adjustments” made along the way to estimate the expected frequencies. Just assume that the canned function knows when to make such adjustments.
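In symbols, the steps above can be sketched as follows. The notation is an assumption (it does not appear in the slides): x_(i) is the i-th ordered observation, F_n is the observed cumulative relative frequency, and F is the cumulative function of the hypothesized distribution.

```latex
% Observed cumulative relative frequency at the i-th ordered value
F_n\!\left(x_{(i)}\right) = \frac{i}{n}, \qquad i = 1, \dots, n

% Test statistic as described in the steps above
D = \max_{1 \le i \le n} \left| \frac{i}{n} - F\!\left(x_{(i)}\right) \right|

% Full definition, which also checks the gap just below each step
D = \max_{1 \le i \le n} \max\left\{ \frac{i}{n} - F\!\left(x_{(i)}\right),\;
      F\!\left(x_{(i)}\right) - \frac{i-1}{n} \right\}
```

The last form is the one usually implemented in canned functions, so a hand calculation based on the second form can occasionally come out slightly smaller.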
BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
• A good way to understand K-S is to test the normality of residuals from a previous example. This is an intrinsic test.
> # Residuals from an analysis
> snake<-read.csv("snake.data.csv")
> attach(snake)
> Sex<-as.factor(Sex)
>
> lm.snake<-lm(HS ~ log(SVL) + Sex)
> r<-resid(lm.snake)
> r<-r/sd(r) # divide by the standard deviation to standardize the residuals
> r<-sort(r) # sorts residuals from small to large
> n<-length(r)
>
> # Creating expected frequencies
> o<-array(1:n)/n # observed cumulative relative frequencies
> e<-pnorm(r,mean=mean(r),sd=sd(r)) # expected cumulative relative frequencies
> # Evaluation
> max(abs(o-e))
[1] 0.1032246
>
> plot(r,o,ylab="Cumulative relative frequency",
+ xlab="Standardized Residuals",
+ main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))
> # 'pnorm' indicates to get the cumulative area under
> # a curve (p) from a normal distribution

        One-sample Kolmogorov-Smirnov test

data:  r
D = 0.1032, p-value = 0.749
alternative hypothesis: two-sided
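One caveat for intrinsic tests: the R documentation for `ks.test` states that the parameters of the hypothesized distribution should be pre-specified rather than estimated from the data. When the mean and sd come from the same sample, one option is to simulate the null distribution of D with the parameters re-estimated each time. The sketch below assumes simulated stand-in residuals (the snake data are not available here); `ks.D` is a hypothetical helper that computes D the same way as the slides.

```r
# Hypothetical stand-in for the sorted, standardized residuals
set.seed(582)
r <- sort(rnorm(40))
n <- length(r)

# D as computed in the slides: largest |observed - expected| cumulative frequency
ks.D <- function(x) {
  x <- sort(x)
  o <- (1:length(x)) / length(x)
  e <- pnorm(x, mean = mean(x), sd = sd(x))   # parameters estimated from x
  max(abs(o - e))
}

D.obs <- ks.D(r)

# Null distribution of D when mean and sd are re-estimated from every sample
D.null <- replicate(2000, ks.D(rnorm(n)))
P.sim <- mean(D.null >= D.obs)

# Compare with the P-value that treats the estimates as if they were known
c(D = D.obs, P.simulated = P.sim,
  P.ks.test = ks.test(r, "pnorm", mean(r), sd(r))$p.value)
```

The simulated P-value is typically smaller than the one from `ks.test`, because fitting the parameters to the sample pulls the expected curve toward the observed one.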
BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
• Let’s repeat the test with a different model on the same data, but where the residuals are a little less normal
> lm.snake<-lm(HS ~ sqrt(SVL+0.5*SVL^2))
> r<-resid(lm.snake)
> r<-r/sd(r) # divide by the standard deviation to standardize the residuals
> r<-sort(r)
> n<-length(r)
>
> # Creating expected frequencies
> o<-array(1:n)/n
> e<-pnorm(r,mean=mean(r),sd=sd(r))
>
> # Evaluation
> max(abs(o-e))
[1] 0.1616094
>
> plot(r,o,ylab="Cumulative relative frequency",xlab="Standardized Residuals",
+ main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))

        One-sample Kolmogorov-Smirnov test

data:  r
D = 0.1616, p-value = 0.2217
alternative hypothesis: two-sided
BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov
• Let’s look at a process that produces one type of data tested against other distributions
> # Generate data from a log-normal distribution
> y<-rlnorm(50,meanlog=2.5,sdlog=0.6)
> y<-sort(y)
> r<-(y-mean(y))/sd(y)
> n<-length(r)
> o<-array(1:n)/n
>
> # Expected cumulative frequencies from three distributions
> e.norm<-pnorm(y,mean=mean(y),sd=sd(y))
> e.poisson<-ppois(y,lambda=mean(y))
> e.log.norm<-plnorm(y,meanlog=mean(log(y)),sdlog=sd(log(y)))
>
> par(mfrow=c(1,3))
> plot(y,o, main ="Compared to Normal",ylab="Density")
> points(y,e.norm,type='l')
> plot(y,o, main ="Compared to Poisson",ylab="Density")
> points(y,e.poisson,type='l')
> plot(y,o, main ="Compared to Lognormal",ylab="Density")
> points(y,e.log.norm,type='l')
> ks.test(y,'pnorm',(mean(y)),(sd(y)))

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.2198, p-value = 0.01335
alternative hypothesis: two-sided

> ks.test(y,'ppois',(mean(y)))

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.4004, p-value = 9.531e-08
alternative hypothesis: two-sided

> ks.test(y,'plnorm',(mean(log(y))),sd(log(y)))

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.096, p-value = 0.7096
alternative hypothesis: two-sided
BIOL 582 Tests of independence
• Now let’s consider the case where we have two sets of categorical frequencies, and we wish to compare them to determine if they have the same distributions of proportional outcomes (irrespective of the sample size)
• We have done this already: Contingency Table analysis
• Often Contingency tables are called Two-way or Multi-way tables because the sample size (n) can be partitioned in two or more ways
• Sokal and Rohlf (2011) also describe and recommend the following

Model   Frequency Totals           Recommended Test
I       Not fixed                  G-test for independence*
II      Fixed for one criterion    G-test for independence*
III     Fixed for both criteria    Fisher’s Exact Test
* Chi-square tests are also applicable

• The following examples are also from Sokal and Rohlf (2011)
BIOL 582 Tests of independence
• Example Model I (Box 17.6 Sokal and Rohlf 2011)
• A plant ecologist samples 100 trees of a rare species in a 400 square-mile area
• He records for each tree if it is rooted in serpentine soil, and whether its leaves are pubescent or smooth
• Question: Do trees grown in serpentine soils have different ratios of smooth:pubescent leaves?
• H0: Ratios are equal

Soil              Pubescent   Smooth   Total   Ratio (smooth:pubescent)
Serpentine               12       22      34   1.833:1
Not Serpentine           16       50      66   3.125:1
Total                    28       72     100
BIOL 582 Tests of independence
• Example Model I (Box 17.6 Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution
• For a two-way table:

            Category 1   Category 2   Total
Sample 1    a            b            a + b
Sample 2    c            d            c + d
Total       a + c        b + d        a + b + c + d

• The probability of observing the cell frequencies, a, b, c, and d, is computed from the multinomial distribution
• Via some steps reserved for additional reading, a likelihood ratio, L, is obtained
• And G is -2lnL, which can be rewritten in a computationally easier form
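The relationships on this slide can be sketched in standard notation. This is a reconstruction under assumptions, not the slide's own equations: n = a + b + c + d, and p_a, …, p_d denote cell probabilities (symbols not in the original). The last line is the form that matches the "f ln f" components tabulated in the examples.

```latex
% Multinomial probability of the observed 2 x 2 table
P(a,b,c,d) = \frac{n!}{a!\,b!\,c!\,d!}\; p_a^{\,a}\, p_b^{\,b}\, p_c^{\,c}\, p_d^{\,d}

% Likelihood ratio: independence (products of marginal proportions)
% over the observed cell proportions
L = \frac{\left[\frac{(a+b)(a+c)}{n^2}\right]^{a}
          \left[\frac{(a+b)(b+d)}{n^2}\right]^{b}
          \left[\frac{(c+d)(a+c)}{n^2}\right]^{c}
          \left[\frac{(c+d)(b+d)}{n^2}\right]^{d}}
         {\left(\frac{a}{n}\right)^{a}\left(\frac{b}{n}\right)^{b}
          \left(\frac{c}{n}\right)^{c}\left(\frac{d}{n}\right)^{d}}

% G statistic and its computationally easier form
G = -2\ln L
  = 2\,\bigl[\, a\ln a + b\ln b + c\ln c + d\ln d
     - (a+b)\ln(a+b) - (c+d)\ln(c+d)
     - (a+c)\ln(a+c) - (b+d)\ln(b+d)
     + n\ln n \,\bigr]
```

In words: add the f ln f terms for the four cells, subtract the f ln f terms for the row and column totals, add n ln n, and double the result.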
BIOL 582 Tests of independence
• Example Model I (Box 17.6 Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution
• Observed

Soil              Pubescent   Smooth   Total   Ratio (smooth:pubescent)
Serpentine               12       22      34   1.833:1
Not Serpentine           16       50      66   3.125:1
Total                    28       72     100

• G components

Soil              Pubescent   Smooth     Total
Serpentine        12ln12      22ln22     34ln34
Not Serpentine    16ln16      50ln50     66ln66
Total             28ln28      72ln72     100ln100
BIOL 582 Tests of independence
• Example Model I (Box 17.6 Sokal and Rohlf 2011)
• Expected values are based on a multinomial distribution
• Model I two-way tables have type I error rates that are higher than intended. Williams' correction is recommended: G is divided by a correction factor, q
• For the current example, the adjusted value is Gadj = 1.30277
• The df is equal to (r-1)(c-1) = 1, for two rows and two columns
• The probability of finding a value of 1.30277 or higher from a Chi-square distribution with 1 df is 0.253708; thus, do not reject the null hypothesis of equal ratios (i.e., leaf type is independent of soil type)
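The correction factor for a 2 x 2 table can be written out as below. The formula is the standard form of Williams' correction, stated here as an assumption about what the slide displayed. The numeric lines use the marginal totals of the soil example (34, 66, 28, 72; n = 100) and the G value implied by its cell counts.

```latex
% Williams' correction for a 2 x 2 table with cells a, b, c, d and n = a+b+c+d
q = 1 + \frac{\left(\dfrac{n}{a+b} + \dfrac{n}{c+d} - 1\right)
              \left(\dfrac{n}{a+c} + \dfrac{n}{b+d} - 1\right)}{6n},
\qquad G_{adj} = \frac{G}{q}

% Soil example
q = 1 + \frac{\left(\tfrac{100}{34} + \tfrac{100}{66} - 1\right)
              \left(\tfrac{100}{28} + \tfrac{100}{72} - 1\right)}{600}
  = 1 + \frac{(3.45633)(3.96032)}{600} = 1.02281

G_{adj} = \frac{1.33249}{1.02281} = 1.30277
```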
BIOL 582 Tests of independence
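• The general formula for the G stat is from now on G = 2 Σ O ln(O/E), where O and E are the observed and expected frequencies and the sum is over all cells
• Also, unless otherwise stated, this is the same: G = 2 Σ O log(O/E)
• When the base is not given, the log is assumed to be natural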
BIOL 582 Tests of independence
• Example Model II (Sokal and Rohlf 2011)
• An immunology experiment involved inoculating 111 mice with pathogenic bacteria
• 57 mice were also given antiserum
• After a sufficient amount of time, the number of dead mice was compared between the two treatments
• This is Model II because the number of mice in each treatment was fixed.
• H0: Ratios are equal

Treatment               Dead   Alive   Total   Ratio (dead:alive)
Bacteria + Antiserum      13      44      57   0.29545:1
Bacteria only             25      29      54   0.86207:1
Total                     38      73     111
BIOL 582 Tests of independence
• Example Model II (Sokal and Rohlf 2011)
• Observed

Treatment               Dead   Alive   Total   Ratio (dead:alive)
Bacteria + Antiserum      13      44      57   0.29545:1
Bacteria only             25      29      54   0.86207:1
Total                     38      73     111

• G Components (f ln f for each cell and total)

Treatment               Dead        Alive       Total
Bacteria + Antiserum     33.34434   166.50434   230.45392
Bacteria only            80.47190    97.65158   215.40514
Total                   138.22827   313.20354   522.75785

• G = 2[377.97216 - 897.29087 + 522.75785] = 6.87828
• Gadj = 6.87828/1.01552 = 6.77318
• P-value = 0.0093; reject the null hypothesis, as the ratios are different
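The hand calculations for both examples can be checked with a short R function. This is a minimal sketch, not a canned routine: `g.test.2x2` and `flnf` are hypothetical helper names, the function handles 2 x 2 tables only (df = 1), and Williams' correction is assumed to use 6n in its denominator. The soil table uses 50 smooth-leaved trees on non-serpentine soil, the count implied by the marginal totals.

```r
# G-test of independence for a 2 x 2 table, with Williams' correction
g.test.2x2 <- function(tab) {
  n <- sum(tab)
  flnf <- function(f) sum(ifelse(f > 0, f * log(f), 0))   # sum of f ln f
  # cells - row totals - column totals + grand total, then doubled
  G <- 2 * (flnf(tab) - flnf(rowSums(tab)) - flnf(colSums(tab)) + flnf(n))
  # Williams' correction factor
  q <- 1 + (sum(n / rowSums(tab)) - 1) * (sum(n / colSums(tab)) - 1) / (6 * n)
  G.adj <- G / q
  c(G = G, q = q, G.adj = G.adj,
    P = pchisq(G.adj, df = 1, lower.tail = FALSE))
}

# Model I example: soil type by leaf type
soil <- matrix(c(12, 22,
                 16, 50), nrow = 2, byrow = TRUE,
               dimnames = list(c("Serpentine", "Not serpentine"),
                               c("Pubescent", "Smooth")))

# Model II example: treatment by outcome
mice <- matrix(c(13, 44,
                 25, 29), nrow = 2, byrow = TRUE,
               dimnames = list(c("Bacteria + Antiserum", "Bacteria only"),
                               c("Dead", "Alive")))

soil.res <- g.test.2x2(soil)
mice.res <- g.test.2x2(mice)
round(rbind(soil = soil.res, mice = mice.res), 5)
```

For the soil table this gives an adjusted G of about 1.3028 (P about 0.254). For the mice it gives G of about 6.878, q of about 1.0155, and an adjusted G of about 6.773.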
BIOL 582 Tests of independence
• Next time… (Or next two times)
• Model III and Fisher’s Exact test
• More than 2 rows and columns
• Odds-ratios for proportions
• Logistic Regression