Sociology 6Z03: Statistical Inference for Contingency Tables
John Fox
McMaster University
Fall 2016
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 1 / 38
Outline: Statistical Inference for Contingency Tables
Introduction
The Chi-Square Test for r × c Contingency Tables
The Chi-Square Goodness-of-Fit Test
Introduction
Last lecture’s difference-of-proportions z-test can be thought of as a method for testing the statistical significance of the relationship between two categorical variables, each with two categories — that is, a contingency table in which the explanatory variable and the response variable each have two categories.
Here is an example that we looked at much earlier in the course, for white and black defendants convicted of murdering white victims (in a study of the application of the death penalty in the U.S.):
                      Death Penalty?
Race of Defendant     Yes     No    Total
White                  19    132      151
Black                  11     52       63
Total                  30    184      214
Recall that contingency tables are percentaged within categories of the explanatory variable (race of defendant) and across categories of the response (death penalty):
                      Death Penalty? (percent)
Race of Defendant     Yes     No    Total   Number
White                12.6   87.4    100.0      151
Black                17.5   82.5    100.0       63
We want to test the null hypothesis that white and black defendants convicted of killing whites are equally likely to receive the death penalty
H0: p1 = p2 or H0: p1 − p2 = 0
against the alternative hypothesis that whites are less likely than blacks to receive the death penalty
Ha: p1 < p2 or Ha: p1 − p2 < 0
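In this case, the pooled sample proportion receiving the death penalty is
\[ \hat{p} = \frac{19 + 11}{151 + 63} = \frac{30}{214} = 0.140 \]
and the test statistic is
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{151} + \frac{1}{63}\right)}} = \frac{0.126 - 0.175}{\sqrt{0.140 \times 0.860 \left(\frac{1}{151} + \frac{1}{63}\right)}} = -0.94 \]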
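The calculation above can be sketched in a few lines of Python. The language and the variable names are chosen here for illustration and are not part of the course materials. The lower-tail normal probability comes from the standard library's complementary error function. With unrounded intermediate values the one-sided P-value comes out at roughly .17, in line with the value quoted for z = −0.94.

```python
from math import erfc, sqrt

# Counts from the death-penalty table (defendants convicted of murdering white victims).
yes_white, n_white = 19, 151
yes_black, n_black = 11, 63

p1 = yes_white / n_white                                  # sample proportion, white defendants
p2 = yes_black / n_black                                  # sample proportion, black defendants
p_pooled = (yes_white + yes_black) / (n_white + n_black)  # pooled proportion under H0

# Standard error of the difference under the null hypothesis, then the z statistic.
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_white + 1 / n_black))
z = (p1 - p2) / se

# One-sided (lower-tail) P-value: P(Z <= z) for a standard normal variable Z.
p_one_sided = 0.5 * erfc(-z / sqrt(2))

print(round(p_pooled, 3), round(z, 2), round(p_one_sided, 3))
```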
Thought Question
The one-sided P-value for z = −0.94 is P = .1736. TRUE or FALSE: Consequently, we do not have strong evidence against the null hypothesis.
A TRUE.
B FALSE.
C I don’t know.
Suppose, however, that we have more than two samples to compare, or that the response variable has more than two categories.
In this case, a simple difference-of-proportions test will not suffice.
The Chi-Square Test for Two Proportions: Finding Expected Counts Under Independence
Here is another way of calculating a difference-of-proportions test that will generalize to any number of categories for the explanatory and response variables.
The first step is to find the expected count in each of the four cells of the table under the assumption that the null hypothesis is true and the row and column variables in the table are independent (unrelated).
Under this assumption of independence, we would expect the proportion receiving the death penalty to be the same for white and black defendants. We estimate this common quantity to be
\[ \hat{p} = \frac{19 + 11}{214} = 0.140 \]
Then we would expect
\[ 151 \times .140 = 151 \times \frac{30}{214} = 21.2 \]
of the 151 white defendants to receive the death penalty, and the remainder,
\[ 151 \times .860 = 151 \times \frac{184}{214} = 129.8 \]
not to receive the death penalty.
These expected counts follow from the mean of a binomial random variable.
For example, in 151 independent trials when p = .140, the expected number of “successes” (death sentences) is 151× .140 = 21.2.
The total number of blacks is 63, the proportion overall getting the death penalty is .140 = 30/214, and the proportion not getting the death penalty is .860 = 184/214.
Thought Question
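(A) TRUE, (B) FALSE, or (C) I don’t know: The expected counts for blacks are
\[ 63 \times .140 = 63 \times \frac{30}{214} = 8.8 \]
expected to receive the death penalty, and
\[ 63 \times .860 = 63 \times \frac{184}{214} = 54.2 \]
expected not to receive the death penalty.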
These expected counts are the counts that we would observe on average over many samples of 151 whites and 63 blacks if the probability of receiving the death penalty were p = .14 for both black and white defendants.
Finding Expected Counts Under Independence: Formula
Here is a simple formula for calculating the expected count in each cell of the table:
\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} \]
The table total is just the overall sample size, n.
A note on terminology: The expected counts under independence are more accurately called estimated expected counts, because they are based on the sample (not population) marginal distributions for the two variables.
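As a minimal sketch of the formula, the following Python function computes the (estimated) expected count for every cell of a table of observed counts. The function name `expected_counts` and the list-of-rows layout are assumptions made for this illustration.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)                      # table total = overall sample size
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: white, black defendants; columns: death penalty yes, no.
observed = [[19, 132],
            [11, 52]]
expected = expected_counts(observed)
rounded = [[round(e, 1) for e in row] for row in expected]
print(rounded)
```

Rounded to one decimal place, the four values agree with the expected counts worked out by hand on the preceding slides.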
Expected Counts by the Multiplication Rule for Independent Events
We can also derive the formula from the multiplication rule for independent events. For example:
The proportion overall receiving the death penalty is
\[ p_D = \frac{30}{214} = .140 \]
and the proportion of defendants who are white is
\[ p_W = \frac{151}{214} = .706 \]
If the two events are independent, then the joint proportion would be
\[ p_{DW} = p_D \, p_W = .140 \times .706 = .0988 \]
and the expected count under independence of whites receiving the death penalty is
\[ n \, p_{DW} = 214 \times .0988 = 21.2 \]
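To connect this with the row-total-times-column-total formula, here is the same argument in symbols. The letters R (row total) and C (column total) are notation introduced only for this sketch:

```latex
\[
n \, p_D \, p_W
  = n \cdot \frac{C}{n} \cdot \frac{R}{n}
  = \frac{R \times C}{n},
\qquad\text{e.g.}\qquad
\frac{151 \times 30}{214} = 21.2 .
\]
```

One factor of n cancels, which is why the expected count needs only the two marginal totals and the table total.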
Chi-Square Test Statistic
The chi-square test statistic compares the expected counts to the observed counts:
\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]
The name of the test comes from the Greek letter χ (chi), which looks like the Roman letter X .
Chi-Square Test Statistic: Example
For our example, we have the following computation:
observed   expected   obs. − exp.   (obs. − exp.)²/exp.
    19        21.2        −2.2             0.23
   132       129.8         2.2             0.04
    11         8.8         2.2             0.55
    52        54.2        −2.2             0.09
   214       214.0         0.0        X² = 0.91
The P-value for the test is P = .3422; I’ll explain later how the P-value is obtained.
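A short Python sketch of the same computation follows, using unrounded expected counts. Two things here are assumptions of the sketch rather than part of the slides: the variable names, and the use of the identity P(χ² with 1 df > x) = erfc(√(x/2)) to get the P-value. Because nothing is rounded along the way, the statistic comes out near 0.88 and the P-value near .35, slightly different from the hand-computed values above, which use expected counts rounded to one decimal place.

```python
from math import erfc, sqrt

observed = [[19, 132],
            [11, 52]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all four cells.
x2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp_count = row_totals[i] * col_totals[j] / n
        x2 += (obs - exp_count) ** 2 / exp_count

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = erfc(sqrt(x2 / 2))

print(round(x2, 2), round(p_value, 2))
```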
Chi-Square Test vs. Difference-of-Proportions Test
The chi-square test for a two-by-two table is really just a more complicated way of calculating the difference-of-proportions z-test: In this instance, X² = z².
Note that this is true for the example (within rounding error): (−0.94)² = 0.88.
The only difference between the two tests is that the z-test can be used for a directional alternative hypothesis, but the chi-square test is inherently nondirectional.
This is why in this instance the P-value for the chi-square test is twice (within rounding error) the one-sided P-value for the z-test.
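The doubling can be written out in one line. This sketch relies on a standard fact that the slides do not state explicitly: a chi-square variable with 1 degree of freedom is the square of a standard normal variable Z.

```latex
\[
P\left(\chi^2_1 \ge z^2\right)
  = P\left(|Z| \ge |z|\right)
  = 2\,P\left(Z \le -|z|\right)
\]
```

For the example, 2 × .1736 = .3472, which agrees with the reported chi-square P-value of .3422 to within the rounding of the intermediate values.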
The Chi-Square Test for r × c Contingency Tables
More generally, a contingency table has r rows and c columns.
Recall, for example, the following contingency table of counts, using data from the U.S. General Social Surveys:
                            Date of Survey
Premarital Sex       1970s    1980s    1990s    2000s     Total
Not Wrong             2423     3348     3647     4547    13,965
Sometimes Wrong       1692     1789     1797     1738      7016
Wrong                 3207     3017     3035     3114    12,373
Total                 7322     8154     8479     9399    33,354
In this case, r = 3 and c = 4.
To interpret this table, we percentaged it by columns (because the column variable — date — is the explanatory variable):
                        Date of Survey (percent)
Premarital Sex       1970s    1980s    1990s    2000s
Not Wrong             33.1     41.1     43.0     48.4
Sometimes Wrong       23.1     21.9     21.2     18.5
Wrong                 43.8     37.0     35.8     33.1
Total                100.0    100.0    100.0    100.0
Number                7322     8154     8479     9399
We cannot summarize the distribution for each decade with a single percentage because the response variable (attitude towards premarital sex) has more than two categories.
A difference-of-proportions z-test would therefore be inappropriate here, even if there were only two categories of the explanatory variable (and there are four).
Finding Expected Counts Under Independence
We can, however, calculate expected counts assuming that date and attitude towards premarital sex are unrelated, and then use the chi-square test statistic to compare the expected and observed counts.
The expected count for those who think premarital sex is “not wrong” in the 1970s, for example, is
\[ \text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} = \frac{13{,}965 \times 7322}{33{,}354} = 3065.7 \]
Thought Question
What is the expected count for “sometimes wrong” in the 1980s?
A 7016 × 7322/33,354.
C 7016 × 8154/33,354.
D I don’t know.
The complete table of expected counts is as follows:
                              Date of Survey
Premarital Sex        1970s     1980s     1990s     2000s       Total
Not Wrong            3065.7    3414.0    3550.1    3935.3    13,965.1
Sometimes Wrong      1540.2    1715.2    1783.5    1977.1      7016.0
Wrong                2716.2    3024.8    3145.4    3486.7    12,373.1
Total                7322.1    8154.0    8479.0    9399.1    33,354.2
Note that, within rounding error, the expected counts sum to the observed row and column marginals.
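The table and the remark about the marginals can be checked with a short Python sketch. The names are illustrative assumptions. Before rounding, the expected counts reproduce the observed row and column totals exactly, up to floating-point error.

```python
# Observed counts: rows are Not Wrong, Sometimes Wrong, Wrong;
# columns are the 1970s, 1980s, 1990s, 2000s.
observed = [[2423, 3348, 3647, 4547],
            [1692, 1789, 1797, 1738],
            [3207, 3017, 3035, 3114]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell: row total * column total / table total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])

# The expected counts have the same marginal totals as the observed counts.
expected_row_totals = [sum(row) for row in expected]
expected_col_totals = [sum(col) for col in zip(*expected)]
```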
Chi-Square Test Statistic
The chi-square test statistic is calculated as before:
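\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(2423 - 3065.7)^2}{3065.7} + \cdots + \frac{(3114 - 3486.7)^2}{3486.7} = 413.3 \]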
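Summing all twelve terms by hand is tedious, so here is a minimal Python sketch of the calculation. The function name `chi_square_statistic` is an assumption of this example, not something from the course materials.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp_count = row_totals[i] * col_totals[j] / n   # expected under independence
            total += (obs - exp_count) ** 2 / exp_count
    return total

observed = [[2423, 3348, 3647, 4547],
            [1692, 1789, 1797, 1738],
            [3207, 3017, 3035, 3114]]

x2 = chi_square_statistic(observed)
print(round(x2, 1))
```

Most of the total comes from the "not wrong" and "wrong" rows in the 1970s and 2000s columns, where observed and expected counts differ by several hundred.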
The Family of Chi-Square Distributions
Like the t-distributions, the chi-square distributions are a family of density curves indexed by degrees of freedom.
There is a different chi-square distribution for each number of degrees of freedom: 1, 2, 3, ...
Unlike the t and normal distributions, the chi-square distribution is positively skewed.
As degrees of freedom grow, the distribution becomes less skewed.
The average value of a chi-square random variable is equal to the number of degrees of freedom, E(X²) = df; the variance is twice the number of degrees of freedom, Var(X²) = 2 × df.
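The mean and variance can be checked by simulation. The sketch below assumes the standard characterization, not given in the slides, of a chi-square variable with df degrees of freedom as the sum of df squared independent standard normal variables. The seed, the number of draws, and all names are arbitrary choices for illustration.

```python
import random
from statistics import fmean, pvariance

def simulate_chi_square(df, draws, rng):
    """Draw chi-square values as sums of df squared standard normal variables."""
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]

rng = random.Random(6203)        # arbitrary seed so that the run is reproducible
results = {}
for df in (1, 2, 5, 10):
    sample = simulate_chi_square(df, 20000, rng)
    results[df] = (fmean(sample), pvariance(sample))
    # The sample mean should be close to df and the sample variance close to 2 * df.
    print(df, round(results[df][0], 2), round(results[df][1], 2))
```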
The Family of Chi-Square Distributions: Some Examples
[Figure: chi-square density curves for df = 1, df = 2, df = 5, and df = 10. Horizontal axis from 0 to 20; vertical axis, Density, from 0.0 to 0.7.]
The Chi-Square Distribution and the Chi-Square Test Statistic
Chi-square values cannot be smaller than zero, but they can be indefinitely large.
The chi-square test statistic is calculated from squared differences between observed and expected frequencies, so it can’t be negative.
Thought Question
(A) TRUE, (B) FALSE, or (C) I don’t know:
If the observed and expected frequencies are identical then the value of the chi-square statistic is 0.
The larger the difference between observed and expected frequencies (that is, the larger the departure from independence), the smaller the value of the chi-square test statistic.
The chi-square test is therefore inherently one-tailed: We reject the null hypothesis of no relationship between the two variables in a table when the value of the test statistic is sufficiently large.
The alternative hypothesis, however, is nondirectional — a relationship of any sort between the two variables.
Degrees of Freedom for the Chi-Square Test
The degrees of freedom for a chi-square test of the hypothesis of independence (no relationship) in a two-way contingency table are
df = (r − 1)(c − 1)
In our first example, therefore, where r = 2 and c = 2, df = (2 − 1)(2 − 1) = 1.
Thought Question
In the second example, where r = 3 and c = 4, what are the df for the chi-square test statistic?
A 12.
B 6.
C 4.
D 1.
E I don’t know.
The degrees of freedom for the chi-square test of independence follow from the constraint that the expected frequencies sum to the same row and column marginals as the observed frequencies.
This means that if we fill in all but the last row and column of the expected counts, the remaining values can be calculated by subtraction.
For the second example, we only have to fill in six expected counts before all are known:
                              Date of Survey
Premarital Sex        1970s     1980s     1990s     2000s      Total
Not Wrong            3065.7    3414.0    3550.1         ?     13,965
Sometimes Wrong      1540.2    1715.2    1783.5         ?       7016
Wrong                     ?         ?         ?         ?     12,373
Total                  7322      8154      8479      9399     33,354
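The subtraction argument can be sketched in Python: starting from the six filled-in expected counts and the marginal totals, every "?" in the table is recovered. The variable names are illustrative. Because the six starting values are rounded to one decimal place, the recovered cells match the full table of expected counts only to within rounding.

```python
row_totals = [13965, 7016, 12373]
col_totals = [7322, 8154, 8479, 9399]

# The (r - 1)(c - 1) = 6 expected counts that are filled in directly.
free_cells = [[3065.7, 3414.0, 3550.1],
              [1540.2, 1715.2, 1783.5]]

# Last column of the first two rows: row total minus the cells already known.
table = [row + [total - sum(row)] for row, total in zip(free_cells, row_totals)]

# Last row: column total minus the cells above it in that column.
last_row = [total - sum(col) for total, col in zip(col_totals, zip(*table))]
table.append(last_row)

for row in table:
    print([round(cell, 1) for cell in row])
```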
Finding the P-Value for the Chi-Square Test
P-values for chi-square statistics can be obtained from many computer programs or, approximately, from the chi-square table.
The chi-square table (Table E in the text) is set up like the t table, with degrees of freedom (df ) down the side and right-tail probabilities (p) across the top.
Here’s an extract from the table for one degree of freedom:
              p
df     .25    .20    .15
1     1.32   1.64   2.07
Thus, in our first example, where X² = 0.91 with 1 df, P > .25.
Here’s another extract from the table with six degrees of freedom:
               p
df    .0025    .001    .0005
6     20.25   22.46    24.10
For our second example, X² = 413.3 with 6 df, and so P ≪ .0005.
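For readers who want more than a table lookup, the following Python sketch computes right-tail chi-square probabilities using only the standard library. It relies on the finite-sum formulas for integer degrees of freedom, a standard result that does not appear in the slides. The function name is an assumption of this example. Statistical software reports the same quantity.

```python
from math import erfc, exp, pi, sqrt

def chi_square_upper_tail(x, df):
    """P(chi-square variable with df degrees of freedom > x), integer df >= 1, x > 0."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of (df - 1)/2 correction terms.
    term = sqrt(2 * x / pi) * exp(-half)
    total = 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return erfc(sqrt(half)) + total

# Critical values from the table extracts should give back their tail probabilities.
print(round(chi_square_upper_tail(1.32, 1), 3))
print(round(chi_square_upper_tail(20.25, 6), 4))

# The second example: far beyond the largest critical value in the table.
p_second_example = chi_square_upper_tail(413.3, 6)
print(p_second_example < 0.0005)
```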
Thought Question
TRUE or FALSE: Given this very small P-value, P ≪ .0005, we can reject the null hypothesis of independence and conclude that attitude towards premarital sex changed over time.
A TRUE.
B FALSE.
C I don’t know.
When is the Chi-Square Test Appropriate?
The chi-square test is appropriate when we want to test association in a contingency table.
We should have independent simple random samples from the populations defined by the categories of the explanatory variable, or an SRS from the entire population, classified by the explanatory and response variables.
For the P-value for the chi-square test to be accurate, the population should be at least 10 times larger than the sample.
Chi-square tests are also appropriate when subjects are randomly assigned to experimental treatments and when the response variable is categorical.
Many statisticians also argue that significance tests (including the chi-square test) are appropriate when we have data on a whole population and want to test whether an observed pattern of association could easily have been the product of chance.
I agree with this argument, by the way.
The chi-square distribution is an approximation to the exact distribution of the X² test statistic.
The approximation is good as long as the expected counts are not too small: No more than 20 percent of the expected counts should be smaller than 5, and no expected count should be smaller than 1. This requirement was met for both of our examples.
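The rule of thumb is easy to check mechanically. The sketch below is a direct translation into Python. The function name is an assumption, and the second table is a made-up illustration, not data from the course.

```python
def chi_square_approximation_ok(expected):
    """Rule of thumb: at most 20% of expected counts below 5, and none below 1."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_below_5 <= 0.20

# Expected counts from the death-penalty example (smallest is 8.8).
death_penalty_expected = [[21.2, 129.8],
                          [8.8, 54.2]]
print(chi_square_approximation_ok(death_penalty_expected))

# Hypothetical table with one expected count below 1 and half the cells below 5.
hypothetical_expected = [[0.8, 6.2],
                         [4.0, 30.0]]
print(chi_square_approximation_ok(hypothetical_expected))
```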
The Chi-Square Goodness-of-Fit Test
The chi-square statistic can also be used to test whether the distribution of a categorical variable is consistent with some probability distribution.
Suppose that the categorical variable in question has k categories; the null hypothesis gives specific probabilities for all of the categories of the variable:
\[ H_0\colon p_1 = p_{1,0},\; p_2 = p_{2,0},\; \ldots,\; p_k = p_{k,0} \]
where \(\sum p_{i,0} = 1\).
Suppose that the observed counts are \(n_1, n_2, \ldots, n_k\), and that \(n = \sum n_i\). Then the expected counts are \(n p_{i,0}\).
The chi-square goodness-of-fit test statistic is
\[ X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]
Under the null hypothesis, this statistic has a chi-square distribution with k − 1 degrees of freedom.
The Chi-Square Goodness-of-Fit Test: Example
A gambler wants to test whether or not a die is fair, and throws the die 100 times, obtaining the following distribution of observed counts:
Number of Dots     1    2    3    4    5    6   Total
Count             16    9   24   22   11   18     100
The null and alternative hypotheses are
\[ H_0\colon p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \tfrac{1}{6} \approx .1667 \]
\[ H_a\colon \text{not all } p_i = \tfrac{1}{6} \]
Here, all 6 expected counts are \(n p_{i,0} = 100 \times \tfrac{1}{6} = 16.67\), and the chi-square test statistic is
\[ X^2 = \frac{(16 - 16.67)^2}{16.67} + \cdots + \frac{(18 - 16.67)^2}{16.67} = 10.52 \]
with df = 6 − 1 = 5, for which .05 < P < .10.
We therefore have only weak evidence against the null hypothesis that the die is fair.
Note: The data were generated by using R to simulate throwing a fair die 100 times; thus H0 is true.
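The original simulation was done in R. As a rough Python analogue, the sketch below recomputes the test statistic and then approximates the P-value by simulation instead of the chi-square table: it throws a fair die 100 times, many times over, and records how often the simulated statistic is at least as large as the observed one. The seed, the number of replications, and the names are arbitrary choices made for this illustration.

```python
import random

observed = [16, 9, 24, 22, 11, 18]       # the gambler's counts for faces 1 to 6
n = sum(observed)
expected = n / 6                          # fair die: every face has probability 1/6

def gof_statistic(counts, expected_count):
    """Goodness-of-fit statistic when every category has the same expected count."""
    return sum((c - expected_count) ** 2 / expected_count for c in counts)

x2_observed = gof_statistic(observed, expected)

rng = random.Random(2016)                 # arbitrary seed, for reproducibility only
replications = 5000
at_least_as_large = 0
for _ in range(replications):
    counts = [0] * 6
    for _ in range(n):
        counts[rng.randrange(6)] += 1     # one throw of a fair die
    # Small tolerance so that exact ties are not lost to floating-point error.
    if gof_statistic(counts, expected) >= x2_observed - 1e-9:
        at_least_as_large += 1
p_simulated = at_least_as_large / replications

print(round(x2_observed, 2), p_simulated)
```

The simulated proportion should fall near the range .05 < P < .10 obtained from the chi-square table, which gives an informal check on the chi-square approximation for this sample size.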
Important Point
Do not confuse the chi-square goodness-of-fit test with the chi-square test of independence in a two-way contingency table.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 38 / 38
Statistical Inference for Contingency Tables
Outline: Statistical Inference for Proportions
Introduction
The Chi-Square Test for rc Contingency Tables
The Chi-Square Goodness-of-Fit Test
McMaster University
Fall 2016
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 1 / 38
Outline: Statistical Inference for Contingency Tables
Introduction
The Chi-Square Test for r × c Contingency Tables
The Chi-Square Goodness-of-Fit Test
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 2 / 38
Introduction
Last lecture’s difference-of-proportions z-test can be thought of as a method for testing the statistical significance of the relationship between two categorical variables, each with two categories — that is, a contingency table in which the explanatory variable and the response variable each have two categories.
Here is an example that we looked at much earlier in the course, for white and black defendants convicted of murdering white victims (in a study of the application of the death penalty in the U.S.):
Death Penalty? Race of Defendant Yes No Total
White 19 132 151 Black 11 52 63
Total 30 184 214
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 3 / 38
Introduction
Recall that contingency tables are percentaged within categories of the explanatory variable (race of defendant) and across categories of the response (death penalty):
Death Penalty? Race of Defendant Yes No Total Number
White 12.6 87.4 100.0 151 Black 17.5 82.5 100.0 63
We want to test the null hypothesis that white and black defendants convicted of killing whites are equally likely to receive the death penalty
H0: p1 = p2 or H0: p1 − p2 = 0
against the alternative hypothesis that whites are less likely than blacks to receive the death penalty
Ha: p1 < p2 or Ha: p1 − p2 < 0
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 4 / 38
Introduction
In this case, the pooled sample proportion receiving the death penalty is
p = 19 + 11
z = p1 − p2√
( 1
151 +
1
63
) = −0.94
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 5 / 38
Introduction
Thought Question
The one-sided P-value for z = −0.94 is P = .1736. TRUE or FALSE: Consequently, we do not have strong evidence against the null hypothesis.
A TRUE.
B FALSE.
C I don’t know.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 6 / 38
Introduction
Suppose, however, that we have more than two samples to compare, or that the response variable has more than two categories.
In this case, a simple difference-of-proportions test will not suffice
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 7 / 38
The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence
Here is another way of calculating a difference-of-proportions test that will generalize to any number of categories for the explanatory and response variables.
The first step is to find the expected count in each of the four cells of the table under the assumption that the null hypothesis is true and the row and column variables in the table are independent (unrelated).
Under this assumption of independence, we would expect the proportion receiving the death penalty to be the same for white and black defendants. We estimate this common quantity to be
p = 19 + 11
214 = 0.140
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 8 / 38
The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence
Then we would expect
151× .140 = 151× 30
214 = 21.2
of the 151 white defendants to receive the death penalty, and the remainder,
151× .860 = 151× 184
not to receive the death penalty.
These expected counts follow from the mean of a binomial random variable.
For example, in 151 independent trials when p = .140, the expected number of “successes” (death sentences) is 151× .140 = 21.2.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 9 / 38
The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence
The total number of blacks is 63, the proportion overall getting the death penalty is .140 = 30/214, and the proportion not getting the death penalty is .860 = 184/214.
Thought Question
(A) TRUE, (B) FALSE, or (C) I don’t know: The expected counts for blacks are
63× .140 = 63× 30
63× .860 = 63× 184
expected not to receive the death penalty.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 10 / 38
The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence
These expected counts are the counts that we would observe on average over many samples of 151 whites and 63 blacks if the probability of receiving the death penalty were p = .14 for both black and white defendants.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 11 / 38
The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence: Formula
Here is a simple formula for calculating the expected count in each cell of the table:
expected count = row total × column total
table total
The table total is just the overall sample size, n.
A note on terminology: The expected counts under independence are more accurately called estimated expected counts, because they are based on the sample (not population) marginal distributions for the two variables.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 12 / 38
The Chi-Square Test for Two Proportions Expected Counts by the Multiplication Rule for Independent Events
We can also derive the formula from the multiplication rule for independent events. For example:
The proportion overall receiving the death penalty is
pD = 30
214 = .140
pW = 151
214 = .706
If the two events are independent, then the joint proportion would be
pDW = pD pW = .140× .706 = .0988
and the expected count under independence of whites receiving the death penalty is
npDW = 214× .0988 = 21.2
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 13 / 38
The Chi-Square Test for Two Proportions Chi-Square Test Statistic
The chi-square test statistic compares the expected counts to the observed counts:
X 2 = ∑ all cells
(observed count − expected count)2
expected count
The name of the test comes from the Greek letter χ (chi), which looks like the Roman letter X .
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 14 / 38
The Chi-Square Test for Two Proportions Chi-Square Test Statistic: Example
For our example, we have the following computation:
observed expected obs. − exp. (obs. − exp.)2
exp. 19 21.2 −2.2 0.23
132 129.8 2.2 0.04 11 8.8 2.2 0.55 52 54.2 −2.2 0.09
214 214.0 √
X 2 = 0.91
The P-value for the test is P = .3422; I’ll explain later how the P-value is obtained.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 15 / 38
The Chi-Square Test for Two Proportions Chi-Square Test vs. Difference-of-Proportions Test
The chi-square test for a two-by-two table is really just a more complicated way of calculating the difference-of-proportions z-test: In this instance, X 2 = z2.
Note that this is true for the example (within rounding error): −0.942 = 0.88.
The only difference between the two tests is that the z-test can be used for a directional alternative hypothesis, but the chi-square test is inherently nondirectional.
This is why in this instance the P-value for the chi-square test is twice (within rounding error) the one-sided P-value for the z-test.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 16 / 38
The Chi-Square Test for r x c Contingency Tables
More generally, a contingency table has r rows and c columns.
Recall, for example, the following contingency table of counts, using data from the U.S. General Social Surveys:
Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s Total
Not Wrong 2423 3348 3647 4547 13 965 Sometimes Wrong 1692 1789 1797 1738 7016 Wrong 3207 3017 3035 3114 12 373
Total 7322 8154 8479 9399 33 354
In this case, r = 3 and c = 4.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 17 / 38
The Chi-Square Test for r x c Contingency Tables
To interpret this table, we percentaged it by columns (because the column variable — date — is the explanatory variable):
Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s
Not Wrong 33.1 41.1 43.0 48.4 Sometimes Wrong 23.1 21.9 21.2 18.5 Wrong 43.8 37.0 35.8 33.1
Total 100.0 100.0 100.0 100.0 Number 7322 8154 8479 9399
We cannot summarize the distribution for each decade with a single percentage because the response variable (attitude towards premarital sex) has more than two categories.
A difference-of-proportions z-test would therefore be inappropriate here, even if there were only two categories of the explanatory variable (and there are four).
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 18 / 38
The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence
We can, however, calculate expected counts assuming that date and attitude towards premarital sex are unrelated, and then use the chi-square test statistic to compare the expected and observed counts.
The expected count for those who think premartial sex is “not wrong” in the 1970s, for example, is
expected count = row total × column total
table total
= 13, 965× 7322
33, 354 = 3065.7
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 19 / 38
The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence
Thought Question
What is the expected count for “sometimes wrong” in the 1980s?
A 7016× 7322/33, 354.
C 7016× 8154/33, 354.
D I don’t know.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 20 / 38
The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence
The complete table of expected counts is as follows:
Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s Total
Not Wrong 3065.7 3414.0 3550.1 3935.3 13 965.1 Sometimes Wrong 1540.2 1715.2 1783.5 1977.1 7016.0 Wrong 2716.2 3024.8 3145.4 3486.7 12 373.1
Total 7322.1 8154.0 8479.0 9399.1 33 354.2
Note that, within rounding error, the expected counts sum to the observed row and column marginals.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 21 / 38
The Chi-Square Test for r x c Contingency Tables Chi-Square Test Statistic
The chi-square test statistic is calculated as before:
X 2 = ∑ all cells
(observed count − expected count)2
+ · · ·+ (3114− 3486.7)2
3486.7 = 413.3
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 22 / 38
The Chi-Square Test for r x c Contingency Tables The Family of Chi-Square Distributions
Like the t-distributions, the chi-square distributions are a family of density curves indexed by degrees of freedom.
There is a different chi-square distribution for each degrees of freedom, 1, 2, 3, ... .
Unlike the t and normal distributions, the chi-square distribution is positively skewed.
As degrees of freedom grow, the distribution becomes less skewed.
The average value of a chi-square random variable is equal to the number of degrees of freedom, E (X 2) = df ; the variance is twice the number of degrees of freedom, Var(X 2) = 2× df .
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 23 / 38
The Chi-Square Test for r x c Contingency Tables The Family of Chi-Square Distributions: Some Examples
0 5 10 15 20
0. 0
0. 1
0. 2
0. 3
0. 4
0. 5
0. 6
0. 7
D en
si ty
df = 1
df = 2
df = 5
df = 10
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 24 / 38
The Chi-Square Test for r x c Contingency Tables The Chi-Square Distribution and the Chi-Square Test Statistic
Chi-square values cannot be smaller than zero, but they can be indefinitely large.
The chi-square test statistic is calculated from squared differences between observed and expected frequencies, so it can’t be negative.
John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 25 / 38
Thought Question
(A) TRUE, (B) FALSE, or (C) I don’t know:
If the observed and expected frequencies are identical then the value of the chi-square statistic is 0.
The larger the difference between observed and expected frequencies (that is, the larger the departure from independence), the smaller the value of the chi-square test statistic.
The Chi-Square Test for r x c Contingency Tables The Chi-Square Distribution and the Chi-Square Test Statistic
The chi-square test is therefore inherently one-tailed: We reject the null hypothesis of no relationship between the two variables in a table when the value of the test statistic is sufficiently large.
The alternative hypothesis, however, is nondirectional — a relationship of any sort between the two variables.
The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test
The degrees of freedom for a chi-square test of the hypothesis of independence (no relationship) in a two-way contingency table are
df = (r − 1)(c − 1)
In our first example, therefore, where r = 2 and c = 2, df = (2− 1)(2− 1) = 1.
The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test
Thought Question
In the second example, where r = 3 and c = 4, what are the df for the chi-square test statistic?
A 12.
B 6.
C 4.
D 1.
E I don’t know.
The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test
The degrees of freedom for the chi-square test of independence follow from the constraint that the expected frequencies sum to the same row and column marginals as the observed frequencies.
This means that if we fill in all but the last row and column of the expected counts, the remaining values can be calculated by subtraction.
For the second example, we only have to fill in six expected counts before all are known:
                          Date of Survey
Premarital Sex     1970s    1980s    1990s    2000s     Total
Not Wrong         3065.7   3414.0   3550.1        ?    13 965
Sometimes Wrong   1540.2   1715.2   1783.5        ?      7016
Wrong                  ?        ?        ?        ?    12 373
Total               7322     8154     8479     9399    33 354
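The fill-in-by-subtraction argument can be traced step by step. The sketch below starts from the six expected counts shown in the table and recovers the six question marks from the row and column totals. The variable names are illustrative, and the results agree with the full table of expected counts only to within rounding error, because the six starting values are themselves rounded.

```python
row_totals = [13965, 7016, 12373]        # Not Wrong, Sometimes Wrong, Wrong
col_totals = [7322, 8154, 8479, 9399]    # 1970s, 1980s, 1990s, 2000s

# The six expected counts filled in directly (first two rows, first three columns)
known = [
    [3065.7, 3414.0, 3550.1],
    [1540.2, 1715.2, 1783.5],
]

r, c = len(row_totals), len(col_totals)
df = (r - 1) * (c - 1)   # number of cells that must be filled in directly

# Last column of each filled row: row total minus the known cells in that row
table = [row + [row_totals[i] - sum(row)] for i, row in enumerate(known)]
# Last row: column total minus the cells above it
table.append([col_totals[j] - sum(table[i][j] for i in range(r - 1)) for j in range(c)])

print("df =", df)
for row in table:
    print([round(x, 1) for x in row])
```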
The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test
P-values for chi-square statistics can be obtained from many computer programs or, approximately, from the chi-square table.
The chi-square table (Table E in the text) is set up like the t table, with degrees of freedom (df ) down the side and right-tail probabilities (p) across the top.
Here’s an extract from the table for one degree of freedom:
               p
df     .25    .20    .15
1     1.32   1.64   2.07
Thus, in our first example, where $X^2 = 0.91$ with 1 df, $P > .25$.
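For one degree of freedom the right-tail probability can also be computed directly instead of being bracketed from the table. The sketch below uses a standard fact not stated on this slide: with 1 df, a chi-square variable is the square of a standard normal variable, so the tail area can be written with the complementary error function. The function name is an illustrative choice.

```python
import math

def chi_square_right_tail_1df(x2):
    """P(X^2 >= x2) for 1 df, treating X^2 as a squared standard normal."""
    return math.erfc(math.sqrt(x2 / 2))

p_value = chi_square_right_tail_1df(0.91)         # first example
p_at_critical = chi_square_right_tail_1df(1.32)   # table entry for p = .25

print(f"P-value for X^2 = 0.91: {p_value:.3f}")
print(f"Right-tail area beyond 1.32: {p_at_critical:.3f}")
```

The computed P-value is larger than .25, which agrees with the conclusion drawn from the table.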
The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test
Here’s another extract from the table with six degrees of freedom:
                p
df    .0025    .001   .0005
6     20.25   22.46   24.10
For our second example, $X^2 = 413.3$ with 6 df, and so $P < .0005$.
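The table entries and the size of this P-value can be checked numerically. The sketch below is one possible implementation using only the Python standard library. It starts from the exact tail area for 1 or 2 df and steps up two degrees of freedom at a time with a standard recurrence for the chi-square tail; neither the recurrence nor the function name comes from the lecture.

```python
import math

def chi_square_right_tail(x2, df):
    """P(X^2 >= x2) for a chi-square distribution with integer df >= 1."""
    if x2 <= 0:
        return 1.0
    half = x2 / 2
    if df % 2 == 1:
        tail, nu = math.erfc(math.sqrt(half)), 1   # exact tail for 1 df
    else:
        tail, nu = math.exp(-half), 2              # exact tail for 2 df
    # Moving from nu to nu + 2 df adds (x2/2)^(nu/2) * exp(-x2/2) / Gamma(nu/2 + 1)
    while nu < df:
        tail += half ** (nu / 2) * math.exp(-half) / math.gamma(nu / 2 + 1)
        nu += 2
    return tail

for critical in (20.25, 22.46, 24.10):
    print(critical, round(chi_square_right_tail(critical, 6), 4))

p_second_example = chi_square_right_tail(413.3, 6)
print(p_second_example)
```

The three critical values reproduce the tabled tail areas, and the P-value for the second example is far smaller than .0005.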
The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test
Thought Question
TRUE or FALSE: Given this very small P-value, $P < .0005$, we can reject the null hypothesis of independence and conclude that attitude towards premarital sex changed over time.
A TRUE.
B FALSE.
C I don’t know.
The Chi-Square Test for r x c Contingency Tables When is the Chi-Square Test Appropriate?
The chi-square test is appropriate when we want to test association in a contingency table.
We should have independent simple random samples from the populations defined by the categories of the explanatory variable, or an SRS from the entire population, classified by the explanatory and response variables.
For the P-value for the chi-square test to be accurate, the population should be at least 10 times larger than the sample.
Chi-square tests are also appropriate when subjects are randomly assigned to experimental treatments and when the response variable is categorical.
The Chi-Square Test for r x c Contingency Tables When is the Chi-Square Test Appropriate?
Many statisticians also argue that significance tests (including the chi-square test) are appropriate when we have data on a whole population and want to test whether an observed pattern of association could easily have been the product of chance.
I agree with this argument, by the way.
The chi-square distribution is an approximation to the exact distribution of the $X^2$ test statistic.
The approximation is good as long as the expected counts are not too small: No more than 20 percent of the expected counts should be smaller than 5, and no expected count should be smaller than 1. This requirement was met for both of our examples.
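The rule of thumb is easy to apply mechanically. The sketch below checks it for the expected counts of the premarital-sex example and, for contrast, for a small made-up table that violates it. The function name and the second table are hypothetical and serve only as illustration.

```python
def expected_counts_adequate(expected):
    """Rule of thumb: at most 20% of expected counts below 5, and none below 1."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1

premarital = [
    [3065.7, 3414.0, 3550.1, 3935.3],
    [1540.2, 1715.2, 1783.5, 1977.1],
    [2716.2, 3024.8, 3145.4, 3486.7],
]
sparse = [[0.8, 4.2], [6.0, 9.0]]   # hypothetical table, for contrast only

premarital_ok = expected_counts_adequate(premarital)
sparse_ok = expected_counts_adequate(sparse)
print(premarital_ok, sparse_ok)
```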
The Chi-Square Goodness-of-Fit Test
The chi-square statistic can also be used to test whether the distribution of a categorical variable is consistent with some probability distribution.
Suppose that the categorical variable in question has k categories; the null hypothesis gives specific probabilities for all of the categories of the variable:
$$H_0\colon\ p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$$
where $\sum p_{i,0} = 1$.
Suppose that the observed counts are $n_1, n_2, \ldots, n_k$, and that $n = \sum n_i$. Then the expected counts are $n p_{i,0}$, one for each category.
The chi-square goodness-of-fit test statistic is
$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
Under the null hypothesis, this statistic has a chi-square distribution with k − 1 degrees of freedom.
The Chi-Square Goodness-of-Fit Test Example
A gambler wants to test whether or not a die is fair, and throws the die 100 times, obtaining the following distribution of observed counts:
Number of Dots    1    2    3    4    5    6   Total
Count            16    9   24   22   11   18     100
The null and alternative hypotheses are
$$H_0\colon\ p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \tfrac{1}{6} \approx .1667$$
$$H_a\colon\ \text{not all } p_i = \tfrac{1}{6}$$
The Chi-Square Goodness-of-Fit Test Example
Here, all 6 expected counts are $n p_{i,0} = 100 \times \tfrac{1}{6} = 16.67$, and the chi-square test statistic is
$$X^2 = \frac{(16 - 16.67)^2}{16.67} + \cdots + \frac{(18 - 16.67)^2}{16.67} = 10.52$$
with $df = 6 - 1 = 5$, for which $.05 < P < .10$.
We therefore have only weak evidence against the null hypothesis that the die is fair.
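The arithmetic for the die example can be reproduced in a few lines. The sketch below computes the expected counts, the test statistic, and the degrees of freedom from the observed counts. The right-tail probability uses a closed-form expression that holds only for 5 df; it is a standard result, not something taken from the lecture.

```python
import math

observed = [16, 9, 24, 22, 11, 18]      # counts for faces 1 through 6
null_probs = [1 / 6] * 6                # H0: the die is fair

n = sum(observed)
expected = [n * p for p in null_probs]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Right-tail probability for a chi-square distribution with 5 df
# (this closed form is specific to df = 5)
p_value = (math.erfc(math.sqrt(x2 / 2))
           + math.sqrt(2 * x2 / math.pi) * math.exp(-x2 / 2) * (1 + x2 / 3))

print(f"X^2 = {x2:.2f}, df = {df}, P = {p_value:.3f}")
```

The computed P-value falls between .05 and .10, consistent with the bounds read from the chi-square table.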
Note: The data were generated by using R to simulate throwing a fair die 100 times; thus H0 is true.
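A comparable simulation can be sketched in Python. This is an analogue of the R simulation mentioned in the note, not the code that was used for the slide. The seed is an arbitrary choice, so the counts it produces will differ from those in the example.

```python
import random
from collections import Counter

rng = random.Random(2016)                        # arbitrary seed, for reproducibility
throws = [rng.randint(1, 6) for _ in range(100)]  # 100 throws of a fair die
counts = Counter(throws)
observed = [counts.get(face, 0) for face in range(1, 7)]

expected = 100 / 6
x2 = sum((o - expected) ** 2 / expected for o in observed)
print(observed, round(x2, 2))
```

Rerunning with different seeds shows how much the statistic varies from sample to sample even though the null hypothesis is true in every run.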
Important Point
Do not confuse the chi-square goodness-of-fit test with the chi-square test of independence in a two-way contingency table.