# Sociology 6Z03 (McMaster University): Statistical Inference for Contingency Tables

John Fox

McMaster University

Fall 2016

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 1 / 38

Outline: Statistical Inference for Contingency Tables

Introduction

The Chi-Square Test for r × c Contingency Tables

The Chi-Square Goodness-of-Fit Test

Introduction

Last lecture’s difference-of-proportions z-test can be thought of as a method for testing the statistical significance of the relationship between two categorical variables, each with two categories — that is, a contingency table in which the explanatory variable and the response variable each have two categories.

Here is an example that we looked at much earlier in the course, for white and black defendants convicted of murdering white victims (in a study of the application of the death penalty in the U.S.):

| Race of Defendant | Death Penalty: Yes | Death Penalty: No | Total |
|---|---|---|---|
| White | 19 | 132 | 151 |
| Black | 11 | 52 | 63 |
| Total | 30 | 184 | 214 |

Recall that contingency tables are percentaged within categories of the explanatory variable (race of defendant) and across categories of the response (death penalty):

| Race of Defendant | Death Penalty: Yes (%) | Death Penalty: No (%) | Total (%) | Number |
|---|---|---|---|---|
| White | 12.6 | 87.4 | 100.0 | 151 |
| Black | 17.5 | 82.5 | 100.0 | 63 |

We want to test the null hypothesis that white and black defendants convicted of killing whites are equally likely to receive the death penalty

$H_0\colon p_1 = p_2$ or $H_0\colon p_1 - p_2 = 0$

against the alternative hypothesis that whites are less likely than blacks to receive the death penalty

$H_a\colon p_1 < p_2$ or $H_a\colon p_1 - p_2 < 0$

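In this case, the pooled sample proportion receiving the death penalty is

$$\hat{p} = \frac{19 + 11}{151 + 63} = \frac{30}{214} = 0.140$$

and the test statistic, computed from the sample proportions $\hat{p}_1 = 19/151$ and $\hat{p}_2 = 11/63$, is

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{151} + \dfrac{1}{63}\right)}} = -0.94$$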
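The pooled z statistic can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names `two_proportion_z` and `normal_cdf` are illustrative, and the normal probability is obtained from the standard-library `math.erfc` rather than from a table, so the P-value may differ slightly from a tabled value.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled difference-of-proportions z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)            # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# White defendants: 19 of 151; black defendants: 11 of 63.
z = two_proportion_z(19, 151, 11, 63)
p_lower = normal_cdf(z)                       # lower tail, matching Ha: p1 < p2
print(f"z = {z:.2f}, one-sided P = {p_lower:.3f}")
```

The lower tail is used because the alternative hypothesis is that whites are less likely than blacks to receive the death penalty.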

Thought Question

The one-sided P-value for z = −0.94 is P = .1736. TRUE or FALSE: Consequently, we do not have strong evidence against the null hypothesis.

A TRUE.

B FALSE.

C I don’t know.

Suppose, however, that we have more than two samples to compare, or that the response variable has more than two categories.

In this case, a simple difference-of-proportions test will not suffice.

The Chi-Square Test for Two Proportions: Finding Expected Counts Under Independence

Here is another way of calculating a difference-of-proportions test that will generalize to any number of categories for the explanatory and response variables.

The first step is to find the expected count in each of the four cells of the table under the assumption that the null hypothesis is true and the row and column variables in the table are independent (unrelated).

Under this assumption of independence, we would expect the proportion receiving the death penalty to be the same for white and black defendants. We estimate this common quantity to be

$$\hat{p} = \frac{19 + 11}{214} = \frac{30}{214} = 0.140$$

Then we would expect

$$151 \times .140 = 151 \times \frac{30}{214} = 21.2$$

of the 151 white defendants to receive the death penalty, and the remainder,

$$151 \times .860 = 151 \times \frac{184}{214} = 129.8$$

not to receive the death penalty.

These expected counts follow from the mean of a binomial random variable.

For example, in 151 independent trials when p = .140, the expected number of “successes” (death sentences) is 151× .140 = 21.2.

The total number of blacks is 63, the proportion overall getting the death penalty is .140 = 30/214, and the proportion not getting the death penalty is .860 = 184/214.

Thought Question

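(A) TRUE, (B) FALSE, or (C) I don’t know: The expected counts for blacks are

$$63 \times .140 = 63 \times \frac{30}{214} = 8.8$$

expected to receive the death penalty, and

$$63 \times .860 = 63 \times \frac{184}{214} = 54.2$$

expected not to receive the death penalty.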

These expected counts are the counts that we would observe on average over many samples of 151 whites and 63 blacks if the probability of receiving the death penalty were p = .14 for both black and white defendants.

The Chi-Square Test for Two Proportions: Finding Expected Counts Under Independence (Formula)

Here is a simple formula for calculating the expected count in each cell of the table:

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$$

The table total is just the overall sample size, n.

A note on terminology: The expected counts under independence are more accurately called estimated expected counts, because they are based on the sample (not population) marginal distributions for the two variables.
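The formula applies cell by cell to a table of any size. As a minimal sketch (the function name `expected_counts` and the list-of-rows layout are illustrative choices, not something from the slides), here it is applied to the death-penalty table:

```python
def expected_counts(table):
    """Estimated expected counts under independence for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)                       # table total
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: white, black defendants; columns: death penalty yes, no.
observed = [[19, 132], [11, 52]]
expected = expected_counts(observed)
for row in expected:
    print([round(cell, 1) for cell in row])
# [21.2, 129.8]
# [8.8, 54.2]
```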

The Chi-Square Test for Two Proportions: Expected Counts by the Multiplication Rule for Independent Events

We can also derive the formula from the multiplication rule for independent events. For example:

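The proportion overall receiving the death penalty is

$$p_D = \frac{30}{214} = .140$$

and the proportion of defendants who are white is

$$p_W = \frac{151}{214} = .706$$

If the two events are independent, then the joint proportion would be

$$p_{DW} = p_D \, p_W = .140 \times .706 = .0988$$

and the expected count under independence of whites receiving the death penalty is

$$n p_{DW} = 214 \times .0988 = 21.2$$

(The value 21.2 is obtained by carrying unrounded proportions through the calculation; multiplying by the rounded .0988 gives 21.1.)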

The Chi-Square Test for Two Proportions: Chi-Square Test Statistic

The chi-square test statistic compares the expected counts to the observed counts:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

The name of the test comes from the Greek letter χ (chi), which looks like the Roman letter X.

The Chi-Square Test for Two Proportions: Chi-Square Test Statistic (Example)

For our example, we have the following computation:

| Observed | Expected | Obs. − Exp. | (Obs. − Exp.)² / Exp. |
|---|---|---|---|
| 19 | 21.2 | −2.2 | 0.23 |
| 132 | 129.8 | 2.2 | 0.04 |
| 11 | 8.8 | 2.2 | 0.55 |
| 52 | 54.2 | −2.2 | 0.09 |
| Total: 214 | 214.0 | | X² = 0.91 |

The P-value for the test is P = .3422; I’ll explain later how the P-value is obtained.
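The per-cell computation can be reproduced without rounding the intermediate values. This is a sketch under that assumption: it keeps full precision in the expected counts, so it gives a statistic of about 0.88 rather than the 0.91 obtained above from values rounded to one decimal place.

```python
# Cells in the order used in the computation table.
observed = [19, 132, 11, 52]
row_totals = [151, 151, 63, 63]
col_totals = [30, 184, 30, 184]
n = 214

x2 = 0.0
for obs, r, c in zip(observed, row_totals, col_totals):
    exp = r * c / n                           # unrounded expected count
    contribution = (obs - exp) ** 2 / exp     # this cell's share of X^2
    x2 += contribution
    print(f"{obs:4d} {exp:7.2f} {obs - exp:6.2f} {contribution:5.2f}")
print(f"X^2 = {x2:.2f}")
```

Every cell has the same absolute difference between observed and expected counts, so the cells with the smallest expected counts contribute most to the statistic.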

The Chi-Square Test for Two Proportions: Chi-Square Test vs. Difference-of-Proportions Test

The chi-square test for a two-by-two table is really just a more complicated way of calculating the difference-of-proportions z-test: In this instance, $X^2 = z^2$.

Note that this is true for the example (within rounding error): $(-0.94)^2 = 0.88$.

The only difference between the two tests is that the z-test can be used for a directional alternative hypothesis, but the chi-square test is inherently nondirectional.

This is why in this instance the P-value for the chi-square test is twice (within rounding error) the one-sided P-value for the z-test.
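One way to see the doubling is the standard fact (not stated on the slides) that a chi-square variable with one degree of freedom is the square of a standard normal variable $Z$:

$$P(\chi^2_1 \ge z^2) = P(Z^2 \ge z^2) = P(|Z| \ge |z|) = 2\,P(Z \le -|z|)$$

For the example, $2 \times .1736 = .3472$, which agrees with the chi-square P-value of .3422 within rounding error.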

The Chi-Square Test for r × c Contingency Tables

More generally, a contingency table has r rows and c columns.

Recall, for example, the following contingency table of counts, using data from the U.S. General Social Surveys:

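| Premarital Sex (rows) by Date of Survey (columns) | 1970s | 1980s | 1990s | 2000s | Total |
|---|---|---|---|---|---|
| Not Wrong | 2423 | 3348 | 3647 | 4547 | 13,965 |
| Sometimes Wrong | 1692 | 1789 | 1797 | 1738 | 7016 |
| Wrong | 3207 | 3017 | 3035 | 3114 | 12,373 |
| Total | 7322 | 8154 | 8479 | 9399 | 33,354 |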

In this case, r = 3 and c = 4.

To interpret this table, we percentaged it by columns (because the column variable — date — is the explanatory variable):

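| Premarital Sex (rows) by Date of Survey (columns) | 1970s | 1980s | 1990s | 2000s |
|---|---|---|---|---|
| Not Wrong | 33.1 | 41.1 | 43.0 | 48.4 |
| Sometimes Wrong | 23.1 | 21.9 | 21.2 | 18.5 |
| Wrong | 43.8 | 37.0 | 35.8 | 33.1 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 |
| Number | 7322 | 8154 | 8479 | 9399 |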

We cannot summarize the distribution for each decade with a single percentage because the response variable (attitude towards premarital sex) has more than two categories.

A difference-of-proportions z-test would therefore be inappropriate here, even if there were only two categories of the explanatory variable (and there are four).

The Chi-Square Test for r × c Contingency Tables: Finding Expected Counts Under Independence

We can, however, calculate expected counts assuming that date and attitude towards premarital sex are unrelated, and then use the chi-square test statistic to compare the expected and observed counts.

The expected count for those who think premarital sex is “not wrong” in the 1970s, for example, is

$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}} = \frac{13{,}965 \times 7322}{33{,}354} = 3065.7$$

Thought Question

What is the expected count for “sometimes wrong” in the 1980s?

A 7016 × 7322/33,354.

C 7016 × 8154/33,354.

D I don’t know.

The complete table of expected counts is as follows:

| Premarital Sex (rows) by Date of Survey (columns) | 1970s | 1980s | 1990s | 2000s | Total |
|---|---|---|---|---|---|
| Not Wrong | 3065.7 | 3414.0 | 3550.1 | 3935.3 | 13,965.1 |
| Sometimes Wrong | 1540.2 | 1715.2 | 1783.5 | 1977.1 | 7016.0 |
| Wrong | 2716.2 | 3024.8 | 3145.4 | 3486.7 | 12,373.1 |
| Total | 7322.1 | 8154.0 | 8479.0 | 9399.1 | 33,354.2 |

Note that, within rounding error, the expected counts sum to the observed row and column marginals.

The Chi-Square Test for r × c Contingency Tables: Chi-Square Test Statistic

The chi-square test statistic is calculated as before:

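$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} = \frac{(2423 - 3065.7)^2}{3065.7} + \cdots + \frac{(3114 - 3486.7)^2}{3486.7} = 413.3$$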
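The twelve-term sum is tedious by hand, so here is a short sketch that computes the expected counts and the statistic directly from the observed table. The variable names are illustrative, and because the expected counts are not rounded, the result may differ from 413.3 in the first decimal place.

```python
observed = [
    [2423, 3348, 3647, 4547],   # not wrong
    [1692, 1789, 1797, 1738],   # sometimes wrong
    [3207, 3017, 3035, 3114],   # wrong
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

x2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n   # expected under independence
        x2 += (obs - exp) ** 2 / exp
print(f"X^2 = {x2:.1f}")
```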

The Chi-Square Test for r × c Contingency Tables: The Family of Chi-Square Distributions

Like the t-distributions, the chi-square distributions are a family of density curves indexed by degrees of freedom.

There is a different chi-square distribution for each number of degrees of freedom: 1, 2, 3, ...

Unlike the t and normal distributions, the chi-square distribution is positively skewed.

As degrees of freedom grow, the distribution becomes less skewed.

The average value of a chi-square random variable is equal to the number of degrees of freedom, $E(X^2) = \mathit{df}$; the variance is twice the number of degrees of freedom, $\mathrm{Var}(X^2) = 2 \times \mathit{df}$.
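The mean and variance can be checked numerically. The sketch below assumes the standard chi-square density formula, which the slides do not give, and integrates it with a simple midpoint rule; `chisq_pdf` and `moments` are illustrative names.

```python
import math

def chisq_pdf(x, df):
    """Chi-square density at x > 0 for df degrees of freedom."""
    half = df / 2
    return x ** (half - 1) * math.exp(-x / 2) / (2 ** half * math.gamma(half))

def moments(df, upper=100.0, steps=100_000):
    """Mean and variance by midpoint-rule integration over (0, upper)."""
    h = upper / steps
    mean = second = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                     # midpoint of the i-th slice
        w = chisq_pdf(x, df) * h              # probability mass in the slice
        mean += x * w
        second += x * x * w
    return mean, second - mean ** 2

for df in (2, 5, 10):
    mean, var = moments(df)
    print(f"df = {df:2d}: mean = {mean:.3f}, variance = {var:.3f}")
```

The printed means should be close to df and the variances close to 2 × df, up to small numerical integration error.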

The Chi-Square Test for r × c Contingency Tables: The Family of Chi-Square Distributions (Some Examples)

[Figure: chi-square density curves for df = 1, df = 2, df = 5, and df = 10. Horizontal axis: 0 to 20; vertical axis: Density, 0.0 to 0.7.]

The Chi-Square Test for r × c Contingency Tables: The Chi-Square Distribution and the Chi-Square Test Statistic

Chi-square values cannot be smaller than zero, but they can be indefinitely large.

The chi-square test statistic is calculated from squared differences between observed and expected frequencies, so it can’t be negative.

Thought Question

(A) TRUE, (B) FALSE, or (C) I don’t know:

If the observed and expected frequencies are identical then the value of the chi-square statistic is 0.

The larger the difference between observed and expected frequencies (that is, the larger the departure from independence), the smaller the value of the chi-square test statistic.

The chi-square test is therefore inherently one-tailed: We reject the null hypothesis of no relationship between the two variables in a table when the value of the test statistic is sufficiently large.

The alternative hypothesis, however, is nondirectional — a relationship of any sort between the two variables.

The Chi-Square Test for r × c Contingency Tables: Degrees of Freedom for the Chi-Square Test

The degrees of freedom for a chi-square test of the hypothesis of independence (no relationship) in a two-way contingency table are

df = (r − 1)(c − 1)

In our first example, therefore, where r = 2 and c = 2, df = (2 − 1)(2 − 1) = 1.

Thought Question

In the second example, where r = 3 and c = 4, what are the df for the chi-square test statistic?

A 12.

B 6.

C 4.

D 1.

E I don’t know.

The degrees of freedom for the chi-square test of independence follow from the constraint that the expected frequencies sum to the same row and column marginals as the observed frequencies.

This means that if we fill in all but the last row and column of the expected counts, the remaining values can be calculated by subtraction.

For the second example, we only have to fill in six expected counts before all are known:

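| Premarital Sex (rows) by Date of Survey (columns) | 1970s | 1980s | 1990s | 2000s | Total |
|---|---|---|---|---|---|
| Not Wrong | 3065.7 | 3414.0 | 3550.1 | ? | 13,965 |
| Sometimes Wrong | 1540.2 | 1715.2 | 1783.5 | ? | 7016 |
| Wrong | ? | ? | ? | ? | 12,373 |
| Total | 7322 | 8154 | 8479 | 9399 | 33,354 |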
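The subtraction argument can be carried out mechanically. In this sketch (the use of `None` to mark an unknown cell is an illustrative choice), the six filled-in expected counts and the marginal totals determine the remaining six cells; small discrepancies from the full table of expected counts reflect rounding of the six starting values.

```python
row_totals = [13965, 7016, 12373]
col_totals = [7322, 8154, 8479, 9399]

# The (3 - 1)(4 - 1) = 6 freely filled expected counts; None marks a "?" cell.
expected = [
    [3065.7, 3414.0, 3550.1, None],
    [1540.2, 1715.2, 1783.5, None],
    [None, None, None, None],
]

# Last column of the first r - 1 rows: row total minus the filled cells.
for i in range(len(row_totals) - 1):
    expected[i][-1] = row_totals[i] - sum(expected[i][:-1])

# Last row: column total minus the cells above it.
for j in range(len(col_totals)):
    above = sum(expected[i][j] for i in range(len(row_totals) - 1))
    expected[-1][j] = col_totals[j] - above

for row in expected:
    print([round(cell, 1) for cell in row])
```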

The Chi-Square Test for r × c Contingency Tables: Finding the P-Value for the Chi-Square Test

P-values for chi-square statistics can be obtained from many computer programs or, approximately, from the chi-square table.

The chi-square table (Table E in the text) is set up like the t table, with degrees of freedom (df ) down the side and right-tail probabilities (p) across the top.

Here’s an extract from the table for one degree of freedom:

| df | p = .25 | p = .20 | p = .15 |
|---|---|---|---|
| 1 | 1.32 | 1.64 | 2.07 |

Thus, in our first example, where $X^2 = 0.91$ with 1 df, P > .25.

Here’s another extract from the table with six degrees of freedom:

| df | p = .0025 | p = .001 | p = .0005 |
|---|---|---|---|
| 6 | 20.25 | 22.46 | 24.10 |

For our second example, $X^2 = 413.3$ with 6 df, and so P < .0005.
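As a complement to the table, right-tail probabilities can be computed directly in two special cases. This is a sketch that relies on standard closed forms not given in the slides: for 1 df the tail comes from the normal distribution, and for even df it is a finite sum. The function name `chisq_upper_tail` is illustrative, and odd df greater than 1 are deliberately not handled.

```python
import math

def chisq_upper_tail(x2, df):
    """P(chi-square >= x2) for df = 1 or any even df (closed forms only)."""
    if df == 1:
        # A 1-df chi-square variable is a squared standard normal variable.
        return math.erfc(math.sqrt(x2 / 2))
    if df % 2 == 0:
        # For df = 2k: exp(-x/2) times the sum of (x/2)^i / i! for i < k.
        half = x2 / 2
        term = total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise ValueError("only df = 1 or even df are handled in this sketch")

print(chisq_upper_tail(0.91, 1))      # first example, 1 df
print(chisq_upper_tail(413.3, 6))     # second example, 6 df
```

The first value is consistent with P > .25, and the second is far below .0005. Applying the function to the tabled critical value 22.46 with 6 df gives approximately .001, matching the table's column heading.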

Thought Question

TRUE or FALSE: Given this very small P-value, P < .0005, we can reject the null hypothesis of independence and conclude that attitude towards premarital sex changed over time.

A TRUE.

B FALSE.

C I don’t know.

The Chi-Square Test for r × c Contingency Tables: When Is the Chi-Square Test Appropriate?

The chi-square test is appropriate when we want to test association in a contingency table.

We should have independent simple random samples from the populations defined by the categories of the explanatory variable, or an SRS from the entire population, classified by the explanatory and response variables.

For the P-value for the chi-square test to be accurate, the population should be at least 10 times larger than the sample.

Chi-square tests are also appropriate when subjects are randomly assigned to experimental treatments and when the response variable is categorical.

Many statisticians also argue that significance tests (including the chi-square test) are appropriate when we have data on a whole population and want to test whether an observed pattern of association could easily have been the product of chance.

I agree with this argument, by the way.

The chi-square distribution is an approximation to the exact distribution of the $X^2$ test statistic.

The approximation is good as long as the expected counts are not too small: No more than 20 percent of the expected counts should be smaller than 5, and no expected count should be smaller than 1. This requirement was met for both of our examples.

The Chi-Square Goodness-of-Fit Test

The chi-square statistic can also be used to test whether the distribution of a categorical variable is consistent with some probability distribution.

Suppose that the categorical variable in question has k categories; the null hypothesis gives specific probabilities for all of the categories of the variable:

$$H_0\colon p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$$

where $\sum p_{i,0} = 1$.

Suppose that the observed counts are $n_1, n_2, \ldots, n_k$, and that $n = \sum n_i$. Then the expected counts are $n p_{i,0}$.

The chi-square goodness-of-fit test statistic is

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

Under the null hypothesis, this statistic has a chi-square distribution with k − 1 degrees of freedom.

The Chi-Square Goodness-of-Fit Test: Example

A gambler wants to test whether or not a die is fair, and throws the die 100 times, obtaining the following distribution of observed counts:

| Number of Dots | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| Observed Count | 16 | 9 | 24 | 22 | 11 | 18 | 100 |

The null and alternative hypotheses are

$$H_0\colon p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \tfrac{1}{6} \approx .1667$$

$$H_a\colon \text{not all } p_i = \tfrac{1}{6}$$

Here, all 6 expected counts are $n p_{i,0} = 100 \times \tfrac{1}{6} = 16.67$, and the chi-square test statistic is

$$X^2 = \frac{(16 - 16.67)^2}{16.67} + \cdots + \frac{(18 - 16.67)^2}{16.67} = 10.52$$

with df = 6 − 1 = 5, for which .05 < P < .10.

We therefore have only weak evidence against the null hypothesis that the die is fair.
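The die calculation is short enough to verify directly. In this sketch, the critical values 9.24 and 11.07 are the usual tabled chi-square values for 5 df at right-tail probabilities .10 and .05; they are brought in as an assumption, since the slides do not reproduce that row of the table.

```python
observed = [16, 9, 24, 22, 11, 18]
n = sum(observed)
null_probs = [1 / 6] * 6                      # fair die under H0

expected = [n * p for p in null_probs]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Tabled chi-square critical values for df = 5 (right-tail .10 and .05).
crit_10, crit_05 = 9.24, 11.07
print(f"X^2 = {x2:.2f}, df = {df}")
print("between the .10 and .05 critical values:", crit_10 < x2 < crit_05)
```

Because the statistic falls between the two critical values, the P-value is bracketed as .05 < P < .10.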

Note: The data were generated by using R to simulate throwing a fair die 100 times; thus H0 is true.

Important Point

Do not confuse the chi-square goodness-of-fit test with the chi-square test of independence in a two-way contingency table.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 38 / 38

Statistical Inference for Contingency Tables

Outline: Statistical Inference for Proportions

Introduction

The Chi-Square Test for rc Contingency Tables

The Chi-Square Goodness-of-Fit Test

McMaster University

Fall 2016

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 1 / 38

Outline: Statistical Inference for Contingency Tables

Introduction

The Chi-Square Test for r × c Contingency Tables

The Chi-Square Goodness-of-Fit Test

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 2 / 38

Introduction

Last lecture’s difference-of-proportions z-test can be thought of as a method for testing the statistical significance of the relationship between two categorical variables, each with two categories — that is, a contingency table in which the explanatory variable and the response variable each have two categories.

Here is an example that we looked at much earlier in the course, for white and black defendants convicted of murdering white victims (in a study of the application of the death penalty in the U.S.):

Death Penalty? Race of Defendant Yes No Total

White 19 132 151 Black 11 52 63

Total 30 184 214

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 3 / 38

Introduction

Recall that contingency tables are percentaged within categories of the explanatory variable (race of defendant) and across categories of the response (death penalty):

Death Penalty? Race of Defendant Yes No Total Number

White 12.6 87.4 100.0 151 Black 17.5 82.5 100.0 63

We want to test the null hypothesis that white and black defendants convicted of killing whites are equally likely to receive the death penalty

H0: p1 = p2 or H0: p1 − p2 = 0

against the alternative hypothesis that whites are less likely than blacks to receive the death penalty

Ha: p1 < p2 or Ha: p1 − p2 < 0

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 4 / 38

Introduction

In this case, the pooled sample proportion receiving the death penalty is

p = 19 + 11

z = p1 − p2√

( 1

151 +

1

63

) = −0.94

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 5 / 38

Introduction

Thought Question

The one-sided P-value for z = −0.94 is P = .1736. TRUE or FALSE: Consequently, we do not have strong evidence against the null hypothesis.

A TRUE.

B FALSE.

C I don’t know.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 6 / 38

Introduction

Suppose, however, that we have more than two samples to compare, or that the response variable has more than two categories.

In this case, a simple difference-of-proportions test will not suffice

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 7 / 38

The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence

Here is another way of calculating a difference-of-proportions test that will generalize to any number of categories for the explanatory and response variables.

The first step is to find the expected count in each of the four cells of the table under the assumption that the null hypothesis is true and the row and column variables in the table are independent (unrelated).

Under this assumption of independence, we would expect the proportion receiving the death penalty to be the same for white and black defendants. We estimate this common quantity to be

p = 19 + 11

214 = 0.140

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 8 / 38

The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence

Then we would expect

151× .140 = 151× 30

214 = 21.2

of the 151 white defendants to receive the death penalty, and the remainder,

151× .860 = 151× 184

not to receive the death penalty.

These expected counts follow from the mean of a binomial random variable.

For example, in 151 independent trials when p = .140, the expected number of “successes” (death sentences) is 151× .140 = 21.2.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 9 / 38

The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence

The total number of blacks is 63, the proportion overall getting the death penalty is .140 = 30/214, and the proportion not getting the death penalty is .860 = 184/214.

Thought Question

(A) TRUE, (B) FALSE, or (C) I don’t know: The expected counts for blacks are

63× .140 = 63× 30

63× .860 = 63× 184

expected not to receive the death penalty.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 10 / 38

The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence

These expected counts are the counts that we would observe on average over many samples of 151 whites and 63 blacks if the probability of receiving the death penalty were p = .14 for both black and white defendants.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 11 / 38

The Chi-Square Test for Two Proportions Finding Expected Counts Under Independence: Formula

Here is a simple formula for calculating the expected count in each cell of the table:

expected count = row total × column total

table total

The table total is just the overall sample size, n.

A note on terminology: The expected counts under independence are more accurately called estimated expected counts, because they are based on the sample (not population) marginal distributions for the two variables.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 12 / 38

The Chi-Square Test for Two Proportions Expected Counts by the Multiplication Rule for Independent Events

We can also derive the formula from the multiplication rule for independent events. For example:

The proportion overall receiving the death penalty is

pD = 30

214 = .140

pW = 151

214 = .706

If the two events are independent, then the joint proportion would be

pDW = pD pW = .140× .706 = .0988

and the expected count under independence of whites receiving the death penalty is

npDW = 214× .0988 = 21.2

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 13 / 38

The Chi-Square Test for Two Proportions Chi-Square Test Statistic

The chi-square test statistic compares the expected counts to the observed counts:

X 2 = ∑ all cells

(observed count − expected count)2

expected count

The name of the test comes from the Greek letter χ (chi), which looks like the Roman letter X .

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 14 / 38

The Chi-Square Test for Two Proportions Chi-Square Test Statistic: Example

For our example, we have the following computation:

observed expected obs. − exp. (obs. − exp.)2

exp. 19 21.2 −2.2 0.23

132 129.8 2.2 0.04 11 8.8 2.2 0.55 52 54.2 −2.2 0.09

214 214.0 √

X 2 = 0.91

The P-value for the test is P = .3422; I’ll explain later how the P-value is obtained.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 15 / 38

The Chi-Square Test for Two Proportions Chi-Square Test vs. Difference-of-Proportions Test

The chi-square test for a two-by-two table is really just a more complicated way of calculating the difference-of-proportions z-test: In this instance, X 2 = z2.

Note that this is true for the example (within rounding error): −0.942 = 0.88.

The only difference between the two tests is that the z-test can be used for a directional alternative hypothesis, but the chi-square test is inherently nondirectional.

This is why in this instance the P-value for the chi-square test is twice (within rounding error) the one-sided P-value for the z-test.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 16 / 38

The Chi-Square Test for r x c Contingency Tables

More generally, a contingency table has r rows and c columns.

Recall, for example, the following contingency table of counts, using data from the U.S. General Social Surveys:

Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s Total

Not Wrong 2423 3348 3647 4547 13 965 Sometimes Wrong 1692 1789 1797 1738 7016 Wrong 3207 3017 3035 3114 12 373

Total 7322 8154 8479 9399 33 354

In this case, r = 3 and c = 4.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 17 / 38

The Chi-Square Test for r x c Contingency Tables

To interpret this table, we percentaged it by columns (because the column variable — date — is the explanatory variable):

Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s

Not Wrong 33.1 41.1 43.0 48.4 Sometimes Wrong 23.1 21.9 21.2 18.5 Wrong 43.8 37.0 35.8 33.1

Total 100.0 100.0 100.0 100.0 Number 7322 8154 8479 9399

We cannot summarize the distribution for each decade with a single percentage because the response variable (attitude towards premarital sex) has more than two categories.

A difference-of-proportions z-test would therefore be inappropriate here, even if there were only two categories of the explanatory variable (and there are four).

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 18 / 38

The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence

We can, however, calculate expected counts assuming that date and attitude towards premarital sex are unrelated, and then use the chi-square test statistic to compare the expected and observed counts.

The expected count for those who think premartial sex is “not wrong” in the 1970s, for example, is

expected count = row total × column total

table total

= 13, 965× 7322

33, 354 = 3065.7

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 19 / 38

The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence

Thought Question

What is the expected count for “sometimes wrong” in the 1980s?

A 7016× 7322/33, 354.

C 7016× 8154/33, 354.

D I don’t know.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 20 / 38

The Chi-Square Test for r x c Contingency Tables Finding Expected Counts Under Independence

The complete table of expected counts is as follows:

Date of Survey Premaritial Sex 1970s 1980s 1990s 2000s Total

Not Wrong 3065.7 3414.0 3550.1 3935.3 13 965.1 Sometimes Wrong 1540.2 1715.2 1783.5 1977.1 7016.0 Wrong 2716.2 3024.8 3145.4 3486.7 12 373.1

Total 7322.1 8154.0 8479.0 9399.1 33 354.2

Note that, within rounding error, the expected counts sum to the observed row and column marginals.

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 21 / 38

The Chi-Square Test for r x c Contingency Tables Chi-Square Test Statistic

The chi-square test statistic is calculated as before:

X 2 = ∑ all cells

(observed count − expected count)2

+ · · ·+ (3114− 3486.7)2

3486.7 = 413.3

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 22 / 38

The Chi-Square Test for r x c Contingency Tables The Family of Chi-Square Distributions

Like the t-distributions, the chi-square distributions are a family of density curves indexed by degrees of freedom.

There is a different chi-square distribution for each degrees of freedom, 1, 2, 3, ... .

Unlike the t and normal distributions, the chi-square distribution is positively skewed.

As degrees of freedom grow, the distribution becomes less skewed.

The average value of a chi-square random variable is equal to the number of degrees of freedom, E (X 2) = df ; the variance is twice the number of degrees of freedom, Var(X 2) = 2× df .

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 23 / 38

The Chi-Square Test for r x c Contingency Tables The Family of Chi-Square Distributions: Some Examples

0 5 10 15 20

0. 0

0. 1

0. 2

0. 3

0. 4

0. 5

0. 6

0. 7

D en

si ty

df = 1

df = 2

df = 5

df = 10

John Fox (McMaster University) Soc 6Z03: Inference for Contingency Tables Fall 2016 24 / 38

The Chi-Square Test for r x c Contingency Tables The Chi-Square Distribution and the Chi-Square Test Statistic

Chi-square values cannot be smaller than zero, but they can be indefinitely large.

The chi-square test statistic is calculated from squared differences between observed and expected frequencies, so it can’t be negative.


Thought Question

(A) TRUE, (B) FALSE, or (C) I don’t know:

If the observed and expected frequencies are identical then the value of the chi-square statistic is 0.

The larger the difference between observed and expected frequencies (that is, the larger the departure from independence), the smaller the value of the chi-square test statistic.


The Chi-Square Test for r x c Contingency Tables The Chi-Square Distribution and the Chi-Square Test Statistic

The chi-square test is therefore inherently one-tailed: We reject the null hypothesis of no relationship between the two variables in a table when the value of the test statistic is sufficiently large.

The alternative hypothesis, however, is nondirectional — a relationship of any sort between the two variables.


The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test

The degrees of freedom for a chi-square test of the hypothesis of independence (no relationship) in a two-way contingency table are

df = (r − 1)(c − 1)

In our first example, therefore, where r = 2 and c = 2, df = (2 − 1)(2 − 1) = 1.


The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test

Thought Question

In the second example, where r = 3 and c = 4, what are the df for the chi-square test statistic?

A 12.

B 6.

C 4.

D 1.

E I don’t know.


The Chi-Square Test for r x c Contingency Tables Degrees of Freedom for the Chi-Square Test

The degrees of freedom for the chi-square test of independence follow from the constraint that the expected frequencies sum to the same row and column marginals as the observed frequencies.

This means that if we fill in all but the last row and column of the expected counts, the remaining values can be calculated by subtraction.

For the second example, we only have to fill in six expected counts before all are known:

                              Date of Survey
Premarital Sex        1970s     1980s     1990s     2000s      Total
Not Wrong            3065.7    3414.0    3550.1         ?     13 965
Sometimes Wrong      1540.2    1715.2    1783.5         ?       7016
Wrong                     ?         ?         ?         ?     12 373
Total                  7322      8154      8479      9399     33 354
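The subtraction argument can be traced step by step. The sketch below takes the six filled-in expected counts and the marginals from the table above and recovers the six cells marked "?"; the function name `fill_by_subtraction` is a hypothetical helper introduced only for this illustration.

```python
# The six expected counts that must be computed directly (from the table above).
known = [
    [3065.7, 3414.0, 3550.1],  # Not Wrong: 1970s, 1980s, 1990s
    [1540.2, 1715.2, 1783.5],  # Sometimes Wrong: 1970s, 1980s, 1990s
]
row_totals = [13965, 7016, 12373]
col_totals = [7322, 8154, 8479, 9399]


def fill_by_subtraction(known, row_totals, col_totals):
    """Complete an r x c table from its upper-left (r-1) x (c-1) block and the marginals."""
    # Last cell of each known row = row total minus the known entries in that row.
    # zip stops after the known rows, so the last row total is not needed here.
    rows = [row + [total - sum(row)] for row, total in zip(known, row_totals)]
    # Each cell of the last row = column total minus the entries above it.
    last_row = [
        total - sum(row[j] for row in rows)
        for j, total in enumerate(col_totals)
    ]
    return rows + [last_row]


table = fill_by_subtraction(known, row_totals, col_totals)
for row in table:
    print([round(x, 1) for x in row])
```

The recovered cells match the complete table of expected counts to within rounding error, and the last row automatically sums to its own marginal, which is why only (3 − 1)(4 − 1) = 6 cells are free.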


The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test

P-values for chi-square statistics can be obtained from many computer programs or, approximately, from the chi-square table.

The chi-square table (Table E in the text) is set up like the t table, with degrees of freedom (df) down the side and right-tail probabilities (p) across the top.

Here’s an extract from the table for one degree of freedom:

            p
df     .25     .20     .15
1     1.32    1.64    2.07

Thus, in our first example, where $X^2 = 0.91$ with 1 df, P > .25, because 0.91 is smaller than 1.32, the table value for p = .25.


The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test

Here’s another extract from the table with six degrees of freedom:

              p
df    .0025    .001    .0005
6     20.25   22.46    24.10

For our second example, $X^2 = 413.3$ with 6 df, and so P ≪ .0005, because 413.3 is far larger than 24.10, the table value for p = .0005.
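For readers who prefer software to the table, right-tail probabilities can be computed directly. The sketch below uses only Python's standard library and the known closed-form expressions for the chi-square right tail when df is a whole number (a finite sum for even df, a normal tail plus a finite sum for odd df). The function name `chi_square_right_tail` is an illustrative choice, and this is one possible implementation rather than the method used to build Table E.

```python
import math


def chi_square_right_tail(x, df):
    """Right-tail probability P(X^2 >= x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail at sqrt(x) plus a finite correction sum.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term                 # adds root^(2r-1) / (1 * 3 * ... * (2r-1))
        term *= x / (2 * r + 1)
    return tail + 2.0 * density * total


print(chi_square_right_tail(0.91, 1))    # first example, 1 df
print(chi_square_right_tail(413.3, 6))   # second example, 6 df
```

Evaluating the function at the table entries (for instance 1.32 with 1 df, or 22.46 with 6 df) should give values close to the column headings .25 and .001, which is a quick way to confirm that the table and the computation agree.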


The Chi-Square Test for r x c Contingency Tables Finding the P-Value for the Chi-Square Test

Thought Question

TRUE or FALSE: Given this very small P-value, P ≪ .0005, we can reject the null hypothesis of independence and conclude that attitude towards premarital sex changed over time.

A TRUE.

B FALSE.

C I don’t know.


The Chi-Square Test for r x c Contingency Tables When is the Chi-Square Test Appropriate?

The chi-square test is appropriate when we want to test association in a contingency table.

We should have independent simple random samples from the populations defined by the categories of the explanatory variable, or an SRS from the entire population, classified by the explanatory and response variables.

For the P-value for the chi-square test to be accurate, the population should be at least 10 times larger than the sample.

Chi-square tests are also appropriate when subjects are randomly assigned to experimental treatments and when the response variable is categorical.


The Chi-Square Test for r x c Contingency Tables When is the Chi-Square Test Appropriate?

Many statisticians also argue that significance tests (including the chi-square test) are appropriate when we have data on a whole population and want to test whether an observed pattern of association could easily have been the product of chance.

I agree with this argument, by the way.

The chi-square distribution is an approximation to the exact distribution of the X 2 test statistic.

The approximation is good as long as the expected counts are not too small: No more than 20 percent of the expected counts should be smaller than 5, and no expected count should be smaller than 1. This requirement was met for both of our examples.
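The rule of thumb about expected counts is easy to check mechanically. The sketch below encodes it as stated (at most 20 percent of expected counts below 5, none below 1) and applies it to the expected counts of the second example; the function name `expected_counts_ok` and the small failing table are hypothetical, included only to show both outcomes.

```python
def expected_counts_ok(expected):
    """Check the rule of thumb: at most 20% of expected counts below 5, none below 1."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and min(cells) >= 1


# Expected counts for the second example (attitudes by date of survey).
second_example = [
    [3065.7, 3414.0, 3550.1, 3935.3],
    [1540.2, 1715.2, 1783.5, 1977.1],
    [2716.2, 3024.8, 3145.4, 3486.7],
]

# A made-up table with small expected counts, purely for contrast.
hypothetical_small = [
    [0.8, 4.2],
    [6.0, 9.0],
]

print(expected_counts_ok(second_example))      # requirement met
print(expected_counts_ok(hypothetical_small))  # requirement not met
```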


The Chi-Square Goodness-of-Fit Test

The chi-square statistic can also be used to test whether the distribution of a categorical variable is consistent with some probability distribution.

Suppose that the categorical variable in question has k categories; the null hypothesis gives specific probabilities for all of the categories of the variable:

$H_0\colon p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$

where $\sum p_{i,0} = 1$.

Suppose that the observed counts are $n_1, n_2, \ldots, n_k$, and that $n = \sum n_i$. Then the expected counts are $n p_{i,0}$.

The chi-square goodness-of-fit test statistic is

$$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

Under the null hypothesis, this statistic has a chi-square distribution with k − 1 degrees of freedom.


The Chi-Square Goodness-of-Fit Test Example

A gambler wants to test whether or not a die is fair, and throws the die 100 times, obtaining the following distribution of observed counts:

Number of Dots      1     2     3     4     5     6    Total
Observed Count     16     9    24    22    11    18      100

The null and alternative hypotheses are

$H_0\colon p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6} \approx .1667$

$H_a\colon$ not all $p_i = \frac{1}{6}$


The Chi-Square Goodness-of-Fit Test Example

Here, all 6 expected counts are $n p_{i,0} = 100 \times \frac{1}{6} = 16.67$, and the chi-square test statistic is

$$X^2 = \frac{(16 - 16.67)^2}{16.67} + \cdots + \frac{(18 - 16.67)^2}{16.67} = 10.52$$

with df = 6 − 1 = 5, for which .05 < P < .10.

We therefore have only weak evidence against the null hypothesis that the die is fair.

Note: The data were generated by using R to simulate throwing a fair die 100 times; thus H0 is true.
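The die calculation can be reproduced in a few lines. The sketch below computes the goodness-of-fit statistic from the observed counts and the null probabilities, then brackets it between two critical values for 5 df. Those critical values (9.24 for p = .10 and 11.07 for p = .05) are standard chi-square table entries, not values quoted in the lecture, and the function name `goodness_of_fit` is an illustrative choice.

```python
observed = [16, 9, 24, 22, 11, 18]   # counts for 1 to 6 dots
null_probs = [1 / 6] * 6             # fair-die probabilities under H0


def goodness_of_fit(observed, null_probs):
    """Return the chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in null_probs]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, len(observed) - 1


x2, df = goodness_of_fit(observed, null_probs)
print(round(x2, 2), df)

# Standard table critical values for 5 df (assumed here, not from the lecture).
critical_10, critical_05 = 9.24, 11.07
print(critical_10 < x2 < critical_05)  # True means .05 < P < .10
```

Note that only one set of counts and a fully specified null distribution are needed here, in contrast to the test of independence, where the expected counts are estimated from the row and column marginals.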

Important Point

Do not confuse the chi-square goodness-of-fit test with the chi-square test of independence in a two-way contingency table.

