BIOL 582

Lecture Set 17: Analysis of frequency and categorical data

Part II: Goodness of Fit Tests for Continuous Frequency Distributions; Tests of Independence

BIOL 582 Expansion of Goodness of Fit Tests

• The first two examples included one frequency distribution and some known or true expectation.

• The first two examples included categorical data

• There are two different ways we can (and will) go

1. Goodness of fit tests for continuous frequency data
2. Goodness of fit tests for more than one distribution

• We have to start with one of these, so let’s start with 1.

• Before proceeding, it is important to distinguish two kinds of hypotheses that are used as “null” models for frequency expectations. In the previous two examples, the expected frequencies were established by theory (expected genotypes) or a larger empirical pool of information (species proportions). These are extrinsic hypotheses for the basis of expected frequencies. Intrinsic hypotheses can also be used for estimating frequencies. For example, if we wish to test whether a continuous frequency distribution is normal, we can generate expected frequencies, but we first need to estimate the mean and variance from the sample. Thus, the degrees of freedom for the test are reduced by one for each of these additional parameter estimates.
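As a rough illustration of the degrees-of-freedom point for a classed (binned) goodness of fit test, the bookkeeping can be written as follows. The symbols k (number of classes) and m (number of parameters estimated from the sample) are notation assumed here, not taken from the slides.

```latex
\text{Extrinsic hypothesis:}\quad df = k - 1
\qquad\qquad
\text{Intrinsic hypothesis:}\quad df = k - 1 - m

% Example: testing normality with the mean and variance estimated
% from the sample gives m = 2, so df = k - 3.
```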

BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies

• We have done these types of tests nearly all semester!

• Kolmogorov-Smirnov and Shapiro-Wilk are such tests

• An old-fashioned way to do GOF tests was to break the continuous data into classes, find the expected frequency of each class, and proceed as we just learned.

• For example, we could compare the heights of the columns (observed) to the points on the red line at the center of each column (expected) to test if the continuous frequency distribution for this test statistic is normally distributed.

Red notches indicate the height of the curve at the centers of bins, which would indicate expected frequencies/densities

• This method is no longer considered appropriate (as changing the number of columns can change the outcome)

• We will use the K-S test as a standard, non-parametric GOF test between one distribution and either an intrinsic or extrinsic expectation of its frequency.

BIOL 582 Goodness of fit: continuous frequency distributions with intrinsic expected frequencies: Kolmogorov-Smirnov

• What is the K-S test in a nutshell?

• The K-S test orders data from lowest to highest

• The observed “cumulative relative frequency” distribution is calculated by dividing rank by n (i.e., 1/n, 2/n, 3/n, …, n/n)

• In a stepwise fashion, a cumulative frequency function produces the expected cumulative relative frequency at each 1/n step

• The difference between observed and expected frequencies is measured at each step

• The largest absolute (vertical) distance is used as a test statistic.

• This distance is compared to critical values from a Kolmogorov distribution (you can see what that is on your own)

• There might be some “adjustments” made along the way to estimate the expected frequencies. Canned functions do not always make them: R's ks.test, for example, treats the parameters it is given as pre-specified, so when they are estimated from the sample (an intrinsic test) its p-value tends to be conservative.
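The steps above can be sketched in a few lines of R. This is a minimal illustration on simulated data (not the snake data), with an extrinsic expectation of mean 0 and sd 1; the seed and sample size are arbitrary assumptions. Note that the observed cumulative distribution is a step function, so the sketch checks the distance at both the top (i/n) and the bottom ((i-1)/n) of each step, which is also what ks.test does.

```r
# Hand-computed one-sample K-S statistic versus ks.test()
set.seed(582)                        # arbitrary seed, for reproducibility only
x <- sort(rnorm(40))                 # illustrative data, ordered low to high
n <- length(x)

Fx <- pnorm(x, mean = 0, sd = 1)     # expected cumulative relative frequency

D.upper <- max((1:n)/n - Fx)         # top of each step versus the curve
D.lower <- max(Fx - (0:(n - 1))/n)   # bottom of each step versus the curve
D <- max(D.upper, D.lower)           # largest vertical distance

ks <- ks.test(x, "pnorm", 0, 1)
c(by.hand = D, ks.test = unname(ks$statistic))
```

The two numbers printed on the last line should agree. Using only max(abs((1:n)/n - Fx)) gives the same answer whenever the largest gap happens to occur at the top of a step.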


• A good way to understand K-S is to test the normality of residuals from a previous example. This is an intrinsic test.

> # Residuals from an analysis
>
> snake<-read.csv("snake.data.csv")
> attach(snake)
> Sex<-as.factor(Sex)
>
> lm.snake<-lm(HS ~ log(SVL) + Sex)
> r<-resid(lm.snake)
> r<-r/sd(r) # make residuals into standardized residuals (divide by sd, not var)
> r<-sort(r) # sorts residuals from small to large
> n<-length(r)
>
> # Creating expected frequencies
> o<-array(1:n)/n # observed cumulative relative frequencies
> e<-pnorm(r,mean=mean(r),sd=sd(r)) # expected cumulative relative frequencies


> # Evaluation
> max(abs(o-e))
[1] 0.1032246
>
> plot(r,o,ylab="Cumulative relative frequency",
+   xlab="Standardized Residuals",
+   main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))
> # 'pnorm' indicates to get the cumulative area under
> # a curve (p) from a normal distribution

	One-sample Kolmogorov-Smirnov test

data:  r
D = 0.1032, p-value = 0.749
alternative hypothesis: two-sided


• Let’s repeat the test with a different model on the same data, but where the residuals are a little less normal

> lm.snake<-lm(HS ~ sqrt(SVL+0.5*SVL^2))
> r<-resid(lm.snake)
> r<-r/sd(r) # standardize by sd, not var
> r<-sort(r)
> n<-length(r)
>
> # Creating expected frequencies
> o<-array(1:n)/n
> e<-pnorm(r,mean=mean(r),sd=sd(r))
>
> # Evaluation
> max(abs(o-e))
[1] 0.1616094
>
> plot(r,o,ylab="Cumulative relative frequency",xlab="Standardized Residuals",
+   main="Circles = observed; Line = expected")
> points(r,e,type="l")
>
> ks.test(r,'pnorm',mean(r),sd(r))

	One-sample Kolmogorov-Smirnov test

data:  r
D = 0.1616, p-value = 0.2217
alternative hypothesis: two-sided


• Let’s look at a process that produces one type of data tested against other distributions

> # Generate data from a log-normal distribution
> y<-rlnorm(50,meanlog=2.5,sdlog=0.6)
> y<-sort(y)
> r<-(y-mean(y))/sd(y)
> n<-length(r)
> o<-array(1:n)/n
>
> # Expected densities from three distributions
> e.norm<-pnorm(y,mean=mean(y),sd=sd(y))
> e.poisson<-ppois(y,lambda=mean(y))
> e.log.norm<-plnorm(y,meanlog=mean(log(y)),sdlog=sd(log(y)))
>
> par(mfrow=c(1,3))
> plot(y,o, main ="Compared to Normal",ylab="Density")
> points(y,e.norm,type='l')
> plot(y,o, main ="Compared to Poisson",ylab="Density")
> points(y,e.poisson,type='l')
> plot(y,o, main ="Compared to Lognormal",ylab="Density")
> points(y,e.log.norm,type='l')


> ks.test(y,'pnorm',(mean(y)),(sd(y)))

	One-sample Kolmogorov-Smirnov test

data:  y
D = 0.2198, p-value = 0.01335
alternative hypothesis: two-sided

> ks.test(y,'ppois',(mean(y)))

	One-sample Kolmogorov-Smirnov test

data:  y
D = 0.4004, p-value = 9.531e-08
alternative hypothesis: two-sided

> ks.test(y,'plnorm',(mean(log(y))),sd(log(y)))

	One-sample Kolmogorov-Smirnov test

data:  y
D = 0.096, p-value = 0.7096
alternative hypothesis: two-sided

BIOL 582 Tests of independence

• Now let’s consider the case where we have two sets of categorical frequencies, and we wish to compare them to determine if they have the same distributions of proportional outcomes (irrespective of the sample size)

• We have done this already: Contingency Table analysis

• Often Contingency tables are called Two-way or Multi-way tables because the sample size (n) can be partitioned in two, or more ways

• Sokal and Rohlf (2011) also describe and recommend the following

Model   Frequency Totals           Recommended Test
I       Not fixed                  G-test for independence*
II      Fixed for one criterion    G-test for independence*
III     Fixed for both criteria    Fisher’s Exact Test

* Chi-square tests are also applicable

• The following examples are also from Sokal and Rohlf (2011)


• Example Model I (Box 17.6 Sokal and Rohlf 2011)

• A plant ecologist samples 100 trees of a rare species in a 400 square-mile area

• He records for each tree if it is rooted in serpentine soil, and whether its leaves are pubescent or smooth

• Question: Do trees grown in serpentine soils have different ratios of smooth:pubescent leaves?

• H0: Ratios are equal

Soil             Pubescent   Smooth   Total   Ratio (smooth:pubescent)
Serpentine       12          22       34      1.833:1
Not Serpentine   16          50       66      3.125:1
total            28          72       100


• Example Model I (Box 17.6 Sokal and Rohlf 2011)

• Expected values are based on a multinomial distribution:

• For a two-way table

           Category 1   Category 2   Total
Sample 1   a            b            a + b
Sample 2   c            d            c + d
Total      a + c        b + d        a + b + c + d

• The probability of observing the cell frequencies, a, b, c, and d, is computed as

• Via some steps reserved for additional reading,

• And G is -2lnL (computationally easier)
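For reference, the standard multinomial forms behind these bullets can be sketched as follows. The notation is an assumption made here for illustration: n is the grand total a + b + c + d, the p terms are the cell probabilities, and L is the likelihood ratio of the independence model to the observed proportions.

```latex
% Multinomial probability of the observed 2 x 2 table
P(a,b,c,d) = \frac{n!}{a!\,b!\,c!\,d!}\; p_a^{\,a}\, p_b^{\,b}\, p_c^{\,c}\, p_d^{\,d},
\qquad n = a+b+c+d

% G statistic in its computational form
G = -2\ln L = 2\Big[\, a\ln a + b\ln b + c\ln c + d\ln d
      - (a+b)\ln(a+b) - (c+d)\ln(c+d)
      - (a+c)\ln(a+c) - (b+d)\ln(b+d)
      + n\ln n \,\Big]
```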


• Example Model I (Box 17.6 Sokal and Rohlf 2011)

• Expected values are based on a multinomial distribution:

• Observed

Soil             Pubescent   Smooth   Total   Ratio (smooth:pubescent)
Serpentine       12          22       34      1.833:1
Not Serpentine   16          50       66      3.125:1
total            28          72       100

• G components (add the four cell terms, subtract the four marginal total terms, then add the grand total term; G is twice the result)

Soil             Pubescent   Smooth    Total
Serpentine       12ln12      22ln22    34ln34
Not Serpentine   16ln16      50ln50    66ln66
total            28ln28      72ln72    100ln100

• Example Model I (Box 17.6 Sokal and Rohlf 2011)

• Expected values are based on a multinomial distribution:

• Model I two-way tables have type I error rates that are higher than intended. Williams' correction is recommended

• Which for the current example is q = 1.02281

• Thus Gadj = G/q = 1.33249/1.02281 = 1.30277

• The df is equal to (r-1)(c-1) = 1, for two rows and two columns

• The probability of finding a value of 1.30277 or higher from a Chi-square distribution with 1 df is 0.253708; thus do not reject the null hypothesis of equal ratios (the data are consistent with leaf type being independent of soil type)
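The calculation for this example can be checked with a short R sketch. It takes the Not Serpentine/Smooth count as 50, the value implied by the row total of 66 and the column total of 72, and it uses the usual 2 x 2 form of Williams' correction; the variable names (cc, f.ln.f, G.adj) are chosen here for illustration only.

```r
# G-test of independence for the 2 x 2 tree table, with Williams' correction
a <- 12; b <- 22     # Serpentine: pubescent, smooth
cc <- 16; d <- 50    # Not Serpentine: pubescent, smooth
n <- a + b + cc + d  # grand total (100)

f.ln.f <- function(f) f * log(f)   # one "G component"; log() is natural log

# cells - marginal totals + grand total, then doubled
G <- 2 * (sum(f.ln.f(c(a, b, cc, d))) -
          sum(f.ln.f(c(a + b, cc + d, a + cc, b + d))) +
          f.ln.f(n))

# Williams' correction for a 2 x 2 table
q <- 1 + (n/(a + b) + n/(cc + d) - 1) * (n/(a + cc) + n/(b + d) - 1) / (6 * n)

G.adj <- G / q
p <- pchisq(G.adj, df = 1, lower.tail = FALSE)
round(c(G = G, q = q, G.adj = G.adj, p = p), 5)
```

The adjusted G and its P-value should reproduce the 1.30277 and 0.2537 quoted in the bullet above.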


• From now on, the general formula for the G stat is

• Also, unless otherwise stated, log is the same as ln

• As the base is not given, the log is assumed natural
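A common way to write the general G statistic is sketched below. The symbols are assumptions made here: f_i is an observed frequency, \hat{f}_i its expected frequency, and for a two-way table the expected frequency of a cell is its row total R_i times its column total C_j divided by the grand total n.

```latex
G = 2\sum_{i} f_i \,\ln\!\left(\frac{f_i}{\hat{f}_i}\right),
\qquad \log \equiv \ln

% Two-way table: expected cell frequencies under independence
\hat{f}_{ij} = \frac{R_i\, C_j}{n}
```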


• Example Model II (Sokal and Rohlf 2011)

• An immunology experiment involved inoculating 111 mice with a pathogenic bacterium

• 57 mice were also given antiserum

• After a sufficient amount of time, the number of dead mice was compared between the two treatments

• This is Model II because the number of mice in the treatments was fixed.

• H0: Ratios are equal

Treatment              Dead   Alive   Total   Ratio (dead:alive)
Bacteria + Antiserum   13     44      57      0.29545:1
Bacteria only          25     29      54      0.86207:1
total                  38     73      111


• Example Model II (Sokal and Rohlf 2011)

• Observed

Treatment              Dead   Alive   Total   Ratio (dead:alive)
Bacteria + Antiserum   13     44      57      0.29545:1
Bacteria only          25     29      54      0.86207:1
total                  38     73      111

• G Components

Treatment              Dead        Alive       Total
Bacteria + Antiserum   33.34434    166.50434   230.45392
Bacteria only          80.47190    97.65158    215.40514
total                  138.22827   313.20354   522.75785

• G = 2[377.97216 - 897.29087 + 522.75785] = 6.87828

• Gadj = 6.87828/1.01552 = 6.7732

• P-value = 0.0093; reject the null hypothesis (the ratios are different)
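The same test can be run from the four observed counts using expected frequencies rather than the f ln f components. This is a sketch, not a library routine: the object names (obs, expd, G.adj) are illustrative, and the Williams correction is written in its general rows-by-columns form, which reduces to the 2 x 2 version for this table.

```r
# G-test of independence for the mouse experiment, via observed and expected counts
obs <- matrix(c(13, 44,
                25, 29), nrow = 2, byrow = TRUE,
              dimnames = list(Treatment = c("Bacteria + Antiserum", "Bacteria only"),
                              Outcome   = c("Dead", "Alive")))

n    <- sum(obs)
expd <- outer(rowSums(obs), colSums(obs)) / n   # expected counts under independence

G <- 2 * sum(obs * log(obs / expd))             # G = 2 * sum(O * ln(O/E))

r <- nrow(obs); k <- ncol(obs)
q <- 1 + (n * sum(1 / rowSums(obs)) - 1) * (n * sum(1 / colSums(obs)) - 1) /
         (6 * n * (r - 1) * (k - 1))            # Williams' correction

G.adj <- G / q
p <- pchisq(G.adj, df = (r - 1) * (k - 1), lower.tail = FALSE)
round(c(G = G, q = q, G.adj = G.adj, p = p), 5)
```

Because this route and the component-sum route are algebraically the same statistic, the printed G is a direct check on the hand arithmetic.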


• Next time… (Or next two times)

• Model III and Fisher’s Exact test

• More than 2 rows and columns

• Odds-ratios for proportions

• Logistic Regression
