Upload: collin-peters

Post on 02-Jan-2016


Two Way Tables and the Chi-Square Test

● Here we study relationships between two categorical variables.

– The data can be displayed in a two-way table (also called a contingency table), showing the counts or percents of individuals that fall into various categories.

● How do we test whether there is a relationship between the independent variable and the dependent variable?

– A special case of hypothesis testing: all the general ideas apply.

– Make use of the Chi-square distribution and the Chi-square statistic.

Two-Way Tables

● Each row represents a value of the independent (or explanatory) variable; each column represents a value of the dependent variable.

– Not an iron rule. Some prefer the opposite. It doesn't matter as long as you know what you are doing and make it clear to the reader.

● The number of observations falling into each combination of categories is entered into the corresponding cell of the table.

– A visual idea of the (lack of) relationship between the two variables is based on the percents computed from the counts in the table (raw counts can be misleading because of unequal sample sizes across groups).

Two-Way Table: Example

(Abortion opinions, by level of attendance at religious services)

                            Abortion opinion
Religiosity   Never allow    Depends        Personal choice   Total
Low           44 (5.5%)      313 (39%)      445 (55.5%)       802 (100%)
Moderate      41 (9.2%)      218 (49%)      186 (41.8%)       445 (100%)
High          115 (26.6%)    237 (54.9%)    80 (18.5%)        432 (100%)
Total         200 (11.9%)    768 (45.7%)    711 (42.4%)       1,679 (100%)

Each cell shows the count, with the row percent in parentheses.
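The row percents in this table can be reproduced from the raw counts. The following is a minimal Python sketch; the names `observed`, `opinions`, and `row_percents` are illustrative and not part of the original slides.

```python
# Observed counts from the table above.
# Rows: Religiosity (Low, Moderate, High); columns: Abortion opinion.
observed = {
    "Low":      [44, 313, 445],
    "Moderate": [41, 218, 186],
    "High":     [115, 237, 80],
}
opinions = ["Never allow", "Depends", "Personal choice"]


def row_percents(counts):
    """Convert one row of counts into percentages of that row's total."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


for level, counts in observed.items():
    # Each row is normalized by its own total, so groups of different
    # sizes (802, 445, 432) become directly comparable.
    print(level, dict(zip(opinions, row_percents(counts))), "n =", sum(counts))
```

Normalizing within rows is what makes the three religiosity groups comparable despite their unequal sizes.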

The Hypotheses

● The null hypothesis: Religiosity has no effect on Abortion opinion.

● The alternative: Religiosity has an effect on Abortion opinion.

● Key idea: Under the null hypothesis, the distribution over the values of Abortion opinion should be the same regardless of the value of Religiosity, and is approximately given by the last row of the table: (11.9%, 45.7%, 42.4%).

– So ask: if the null hypothesis were true, what would the expected counts be in the cells for each value of Religiosity?

● e.g., for Religiosity = “low”, there are a total of 802 observations. How many of these should say “Never allow”, “Depends”, and “Personal choice”? Answer:

– 11.9% × 802, 45.7% × 802, 42.4% × 802,

– i.e., about (95, 367, 340).

● This is what we'd expect if the null is true. What we actually observe is instead (44, 313, 445).

● Are differences like these due to random chance? Or are they “significant”, so that we would reject the null hypothesis?
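The expected counts for every cell can be computed the same way as for the “low” row. The sketch below uses the equivalent form row total × column total / grand total, which is the column share times the row total without rounding the percentages first; the function name `expected_counts` is an assumption made for illustration.

```python
# Observed counts: rows are Religiosity (Low, Moderate, High),
# columns are Abortion opinion (Never allow, Depends, Personal choice).
observed = [
    [44, 313, 445],
    [41, 218, 186],
    [115, 237, 80],
]


def expected_counts(table):
    """Expected cell counts if the row variable has no effect on the column variable.

    Each expected count is row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

Because the slides multiply by percentages already rounded to one decimal, their expected counts can differ from these unrounded values by about one unit.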

Two-Way Table Example: Showing Expected Counts

                            Abortion opinion
Religiosity   Never allow    Depends        Personal choice   Total
Low           44 (95)        313 (367)      445 (340)         802
              5.5%           39%            55.5%             100%
Moderate      41 (53)        218 (204)      186 (188)         445
              9.2%           49%            41.8%             100%
High          115 (51)       237 (198)      80 (183)          432
              26.6%          54.9%          18.5%             100%
Total         200            768            711               1,679
              11.9%          45.7%          42.4%             100%

Each cell shows the observed count with the expected count in parentheses; the row percent of the observed count is on the line below.

Testing the Hypotheses

● Now recall the logic of hypothesis testing: we need to find the probability of observing the test statistic (or something more extreme) if the null is true (the p-value). If the p-value is “small enough”, we reject the null.

● In our current situation, what is the test statistic? What is its distribution?

● Intuitively, the test statistic should involve all the differences between the observed counts and the expected counts.

● That is indeed the case. It turns out that our test statistic follows a distribution different from the normal, the so-called “Chi-Square distribution.” That is the only change: all the other ideas are the same as in the tests we discussed before.

The Chi-square Statistic
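The formula itself is not reproduced in this transcript. The sketch below assumes the standard Pearson form, which matches the intuition stated earlier: sum, over all cells, (observed - expected)² / expected. The function name `chi_square_statistic` is illustrative.

```python
# Observed counts: rows are Religiosity (Low, Moderate, High),
# columns are Abortion opinion (Never allow, Depends, Personal choice).
observed = [
    [44, 313, 445],
    [41, 218, 186],
    [115, 237, 80],
]


def chi_square_statistic(table):
    """Pearson Chi-square statistic for a two-way table of counts.

    Assumes the standard form: sum over cells of (observed - expected)^2 / expected,
    with expected = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - exp) ** 2 / exp
    return chi2


chi2 = chi_square_statistic(observed)
print(round(chi2, 1))
```

With unrounded expected counts the result should land close to the 216.2 quoted for this example; small differences come from rounding the expected counts before summing.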

Sampling Distribution of the Chi-square Statistic

The Family of Chi-square Distributions

Table of Critical Values for Chi-Square Test

Making the Decision: Is the Relationship Statistically Significant?

● For 2×2 tables, the “magic number” is 3.84: if the Chi-Square statistic is greater than or equal to this value, the relationship is considered significant at the α = 0.05 level.

● What's the magic number for our 3×3 example, with (3-1)(3-1) = 4 degrees of freedom? (9.49)

● Our computed Chi-Square statistic in the above example is 216.2. What's our conclusion? (216.2 far exceeds 9.49, so we reject the null hypothesis.)

● Stata example:

sysuse nlsw88
tab race married, chi2
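The decision rule can be sketched in a few lines of Python. The critical values are the two quoted on the slides; the p-value helper is an addition, not from the slides, and relies on the closed-form tail probability that holds only when the degrees of freedom are even. All function names are hypothetical.

```python
import math

# Critical values at the 0.05 level, as quoted on the slides,
# keyed by degrees of freedom: 1 for a 2x2 table, 4 for a 3x3 table.
CRITICAL_05 = {1: 3.84, 4: 9.49}


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a two-way table: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_sf_even_df(x, df):
    """P(X >= x) for a Chi-square variable with an EVEN number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    Odd df needs a different method (not covered in this sketch).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only works for positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


df = degrees_of_freedom(3, 3)        # the 3x3 religiosity example
chi2 = 216.2                         # statistic quoted on the slide
reject = chi2 >= CRITICAL_05[df]     # compare with the "magic number"
p_value = chi2_sf_even_df(chi2, df)  # probability of a value this large under the null

print("df =", df, "| reject null:", reject, "| p-value:", p_value)
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision; at the critical value 9.49 the tail probability is about 0.05.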

Chi-Square Test for a Single Variable

● The Chi-Square test can easily be adapted to test whether the distribution of a single variable departs from some expected distribution.

● e.g., are some months more popular for giving birth than others?

● We have observations on the number of births in each of the 12 months.

● Under the null hypothesis that all months are equally popular, the expected distribution is 1/12 for each of the 12 months.

● Compute Chi-Square the usual way, comparing the observed and the expected counts. Degrees of freedom: 12 - 1 = 11.

● Another view: the “dependent variable” has 12 values (the two-way table has 12 columns).

● Imagine an “independent variable” that takes two values, “observed” and “expected”; for the latter, the cell counts are those given by the expected distribution.

● Compute Chi-Square the usual way, with degrees of freedom = (12-1)(2-1) = 11.
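A minimal sketch of the single-variable test follows. The slides give no birth data, so the monthly counts below are invented purely for illustration, and the function name `goodness_of_fit` is hypothetical. The 0.05 critical value for 11 degrees of freedom (19.68) is taken from a standard Chi-square table, not from the slides.

```python
def goodness_of_fit(observed, expected_shares=None):
    """Chi-square statistic and degrees of freedom for one categorical variable.

    observed: list of counts, one per category.
    expected_shares: expected proportion per category under the null;
                     defaults to equal shares (1/k for each of k categories).
    """
    n = sum(observed)
    k = len(observed)
    if expected_shares is None:
        expected_shares = [1 / k] * k
    chi2 = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_shares))
    return chi2, k - 1


# HYPOTHETICAL monthly birth counts (January to December), made up for this example.
births = [80, 75, 85, 90, 95, 100, 110, 115, 105, 95, 80, 70]

chi2, df = goodness_of_fit(births)
CRITICAL_05_DF11 = 19.68  # standard table value for df = 11 at the 0.05 level

print("chi2 =", round(chi2, 2), "| df =", df, "| reject null:", chi2 >= CRITICAL_05_DF11)
```

Passing `expected_shares` explicitly allows any other expected distribution to be tested in place of equal shares.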