Contingency Tables
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
8/25/12 1
Reading Assignment: None
Goals for this Lecture
• Understand and be able to conduct tests for discrete contingency table data
  – One-way chi-square goodness-of-fit tests
    • Homogeneity
    • Other distributions
  – Two-way chi-square tests
    • Independence
    • Homogeneity
• All assuming simple random sampling (SRS) and no finite population correction (fpc)
One-Way Classifications
• Each item classified into one (and only one) of k categories (cells)
  – Denote counts as x1, x2, …, xk with x1 + x2 + … + xk = n
[Figure: a random sample of size n is drawn from the population and each item is classified into Category 1 (cell frequency x1), Category 2 (cell frequency x2), …, Category k (cell frequency xk)]
One-Way Tables in R
• Just use table() or xtabs() on one variable
  – E.g., tabulating Q1 in the New Student Survey:
* Data from 2008 survey of NPS new students
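A minimal sketch of one-way tabulation with table() and xtabs(). The responses below are hypothetical (they are not the New Student Survey data), and the names q1, survey, and counts are illustrative only.

```r
# Hypothetical responses to a five-point Likert item (NOT the survey data)
q1 <- factor(
  c("Agree", "Neutral", "Agree", "Strongly Agree", "Disagree",
    "Agree", "Neutral", "Strongly Agree", "Agree", "Disagree"),
  levels = c("Strongly Disagree", "Disagree", "Neutral",
             "Agree", "Strongly Agree")
)
survey <- data.frame(Q1 = q1)

# One-way table of counts; unused factor levels appear with a count of 0
counts <- table(survey$Q1)
print(counts)

# The same counts through the formula interface
counts_xtabs <- xtabs(~ Q1, data = survey)
print(counts_xtabs)

# Observed proportion in each category (counts divided by n)
print(prop.table(counts))
```

Storing the item as a factor with all five levels keeps empty categories in the table, which matters later when expected counts are computed per category.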
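Two-Way Contingency Tables
• A two-way contingency table (or cross tabulation) gives counts by all pairwise combinations of variable levels
[Table layout: levels “X” and “Y” of one variable in the rows, levels “A” and “B” of the other in the columns (labeled Variable 1 and Variable 2 on the slide); the margins hold row and column totals]
            “A”       “B”       Total
“X”         # or %    # or %    # or %
“Y”         # or %    # or %    # or %
Total       # or %    # or %
  – Cell (“X”, “B”): number or percent of obs that are both “X” and “B”
  – Row “Y” total: number or percent of obs that are “Y”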
Two-Way Tables in R
• Just use table() or xtabs() on two variables
  – E.g., tabulating Q1 by gender in the New Student Survey:
* Data from 2008 survey of NPS new students
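A minimal sketch of a two-way cross tabulation. The data frame is hypothetical (not the survey responses), and the column names Q1 and Gender are assumptions chosen to mirror the slide.

```r
# Hypothetical data (NOT the survey responses)
survey <- data.frame(
  Q1     = c("Agree", "Disagree", "Agree", "Agree",
             "Disagree", "Agree", "Disagree", "Agree"),
  Gender = c("F", "F", "M", "M", "M", "F", "M", "M")
)

# Two-way table: rows are Q1 levels, columns are Gender levels
tab <- table(survey$Q1, survey$Gender, dnn = c("Q1", "Gender"))
print(tab)

# Same table through the formula interface
tab2 <- xtabs(~ Q1 + Gender, data = survey)

# Add row and column totals, and show column percentages
print(addmargins(tab))
print(round(100 * prop.table(tab, margin = 2), 1))
```

prop.table() with margin = 2 gives proportions within each column; margin = 1 would give them within each row.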
Higher-Way Tables in R
• Just keep adding variables…
  – E.g., Q1 by gender by country:
* Data from 2008 survey of NPS new students
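A short sketch of a three-way table, again on hypothetical data with assumed column names. ftable() is used only to print the three-dimensional result in a flat layout.

```r
# Hypothetical data (NOT the survey responses)
survey <- data.frame(
  Q1      = c("Agree", "Disagree", "Agree", "Agree",
              "Disagree", "Agree", "Disagree", "Agree"),
  Gender  = c("F", "F", "M", "M", "M", "F", "M", "M"),
  Country = c("US", "US", "US", "Intl", "US", "Intl", "Intl", "US")
)

# Adding a third variable gives a three-dimensional array of counts
tab3 <- xtabs(~ Q1 + Gender + Country, data = survey)
print(dim(tab3))      # one dimension per variable
print(ftable(tab3))   # flattened display of the same counts
```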
One-Way Goodness-of-Fit Test
• Have counts for k categories, x1, x2, …, xk, with x1 + x2 + … + xk = n
• (Unknown) population cell probabilities denoted p1, p2, …, pk with p1 + p2 + … + pk = 1
• Estimate each cell probability from the observed counts:
  $\hat{p}_i = x_i / n, \quad i = 1, 2, \ldots, k$
• The hypotheses to be tested are
  $H_0: p_1 = p_1^*,\ p_2 = p_2^*,\ \ldots,\ p_k = p_k^*$
  $H_a: p_i \neq p_i^*$ for at least one $i$
Goodness-of-Fit Test for Homogeneity
• Null hypothesis is that every category is equally likely:
  $p_i^* = 1/k, \quad i = 1, 2, \ldots, k$
  – I.e., the distribution of category characteristics is homogeneous in the population
• If the null is true, in each cell (in a perfect world) we would expect to observe $e_i = n p_i^*$ counts
• So, how to do a statistical test that assesses how “far away” the expected counts $e_i$ are from the observed counts $x_i$?
Answer: Chi-square Test
• Idea: Look at how far off table counts are from what is expected under the null
  $X^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i}$
• Reject if chi-square statistic too large
  – Assess “too large” using chi-squared distribution
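The statistic can be computed directly from its definition. A minimal sketch, assuming hypothetical counts x for k = 4 categories and the homogeneity null (each cell probability equal to 1/k):

```r
# Hypothetical observed counts for k = 4 categories
x <- c(18, 22, 27, 33)
n <- sum(x)          # sample size: 100
k <- length(x)       # number of categories: 4

# Expected counts under homogeneity: e_i = n * (1/k) = 25 in every cell
e <- n * rep(1 / k, k)

# Pearson chi-square statistic: sum over cells of (x_i - e_i)^2 / e_i
X2 <- sum((x - e)^2 / e)
print(X2)
```

With these counts the squared deviations are 49, 9, 4, and 64, so the statistic is 126/25 = 5.04.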
Conducting the Test
• First calculate the $X^2$ statistic
• Then calculate the p-value:
  $p\text{-value} = \Pr(\chi^2_{k-1} \geq X^2)$
• $\chi^2_{k-1}$ is the chi-square distribution with k-1 degrees of freedom
• Reject null if p-value < $\alpha$, for some pre-determined significance level $\alpha$
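The upper-tail probability is available from pchisq() with lower.tail = FALSE. A small sketch on hypothetical counts, with alpha = 0.05 chosen purely for illustration:

```r
# Hypothetical observed counts and the homogeneity null
x <- c(18, 22, 27, 33)
k <- length(x)
e <- sum(x) / k                      # expected count in every cell
X2 <- sum((x - e)^2 / e)             # test statistic

# p-value = Pr(chi-square with k - 1 df >= X2)
p_value <- pchisq(X2, df = k - 1, lower.tail = FALSE)
print(p_value)

# Decision at an illustrative significance level
alpha <- 0.05
reject <- p_value < alpha
print(reject)
```

Here the statistic (5.04) is below the 0.05 critical value for 3 degrees of freedom, qchisq(0.95, df = 3), so the null would not be rejected for these made-up counts.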
Example
• In Excel:
• In R, use the chisq.test() function
  – Default is the GoF test for homogeneity
* Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
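A minimal sketch of the default call. The counts are hypothetical, not the survey data; when chisq.test() is given a vector of counts and no p argument, it tests equal cell probabilities.

```r
# Hypothetical observed counts (NOT the survey data)
x <- c(18, 22, 27, 33)

# With a count vector and no p argument, the null is p_i = 1/k
fit <- chisq.test(x)
print(fit)

# Pieces of the result
print(fit$statistic)   # X-squared
print(fit$parameter)   # degrees of freedom, k - 1
print(fit$p.value)
print(fit$expected)    # expected counts under the null
```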
Goodness-of-Fit Test for Other Distributions
• Homogeneity is just a special case
• Can test whether the $p_i^*$s are anything, as long as
  $\sum_{i=1}^{k} p_i^* = 1$
• Might have some theory that says what the distribution should be, for example
• Remember, don’t look at the data first and then specify the probabilities…
Example
• In Excel:
• In R, again use chisq.test() function
  – Now, add a vector for the probabilities
* Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
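A sketch of supplying the null probabilities through the p argument of chisq.test(). Both the counts and the probability vector p_star are hypothetical and were fixed before looking at the (made-up) counts.

```r
# Hypothetical observed counts and hypothesized cell probabilities
x      <- c(18, 22, 27, 33)
p_star <- c(0.2, 0.2, 0.3, 0.3)

# The hypothesized probabilities must sum to 1
stopifnot(abs(sum(p_star) - 1) < 1e-12)

# Goodness-of-fit test against p_star
fit <- chisq.test(x, p = p_star)
print(fit$expected)    # n * p_star = 20, 20, 30, 30
print(fit$statistic)
print(fit$p.value)
```

With these numbers the statistic is 4/20 + 4/20 + 9/30 + 9/30 = 1, on k - 1 = 3 degrees of freedom.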
A Note
• Pearson chi-square test depends on all cells having sufficiently large expected counts:
  $e_i = n p_i^* \geq 5$
  – If not, collapse across some categories
  – E.g., count and probability for “Strongly Disagree” and “Disagree” aggregated
* Data from 2008 survey of NPS new students; remember, here we are assuming SRS and no fpc, which is actually not true for this data
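A sketch of collapsing two sparse categories before testing. The counts and null probabilities are hypothetical; the aggregation mirrors the “Strongly Disagree” plus “Disagree” example, and the combined label is an assumption.

```r
# Hypothetical counts and hypothesized probabilities (NOT the survey data)
x <- c("Strongly Disagree" = 2, "Disagree" = 6, "Neutral" = 20,
       "Agree" = 40, "Strongly Agree" = 32)
p_star <- c(0.03, 0.07, 0.20, 0.40, 0.30)

# Expected counts: the first cell falls below 5
e <- sum(x) * p_star
print(e)

# Aggregate the first two categories: add both the counts and the probabilities
x_c <- c("Disagree or Strongly Disagree" = x[[1]] + x[[2]], x[3:5])
p_c <- c(p_star[1] + p_star[2], p_star[3:5])
e_c <- sum(x_c) * p_c
print(e_c)                       # every expected count is now at least 5

# Test on the collapsed table (k is now 4, so 3 degrees of freedom)
fit <- chisq.test(x_c, p = p_c)
print(fit)
```

Calling chisq.test() on the uncollapsed table would still run, but R warns that the chi-squared approximation may be incorrect when an expected count is below 5.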
Some Notation for Two-Way Contingency Tables
• Table has r rows and c columns
• Observed cell counts are $x_{ij}$, with
  $\sum_{i=1}^{r} \sum_{j=1}^{c} x_{ij} = n$
• Denote row sums:
  $x_{i\bullet} = \sum_{j=1}^{c} x_{ij}, \quad i = 1, \ldots, r$
• Denote column sums:
  $x_{\bullet j} = \sum_{i=1}^{r} x_{ij}, \quad j = 1, \ldots, c$
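In R the row and column sums come from rowSums() and colSums(). A minimal sketch on a hypothetical 2 x 2 table of counts (the dimension names Var1 and Var2 are placeholders):

```r
# Hypothetical 2 x 2 table of counts x_ij
x <- matrix(c(30, 20,
              10, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(Var1 = c("X", "Y"), Var2 = c("A", "B")))

n        <- sum(x)        # total count
row_sums <- rowSums(x)    # x_i. : one total per row
col_sums <- colSums(x)    # x_.j : one total per column

print(row_sums)
print(col_sums)

# Table with the margins attached
print(addmargins(as.table(x)))
```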
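Chi-square Test for Independence
• Independence means the probability of being in any cell is the product of the row and column probabilities
[Table layout: levels “X” and “Y” of one variable in the rows, levels “A” and “B” of the other in the columns (labeled Variable 1 and Variable 2 on the slide); the margins hold the row and column probabilities]
            “A”              “B”              Margin
“X”         Pr(X) x Pr(A)    Pr(X) x Pr(B)    Pr(X)
“Y”         Pr(Y) x Pr(A)    Pr(Y) x Pr(B)    Pr(Y)
Margin      Pr(A)            Pr(B)
  – Pr(Y): probability that a random obs is a “Y”
  – Pr(X) x Pr(B): probability that an obs is both “X” and “B”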
The Hypotheses
• Independence means, for all cells in the table, $p_{ij} = p_{i\bullet}\, p_{\bullet j}$, where
  – $p_{i\bullet}$ is the probability of having row i characteristic
  – $p_{\bullet j}$ is the probability of having column j characteristic
• The hypotheses to be tested are
  $H_0: p_{ij} = p_{i\bullet}\, p_{\bullet j}, \quad i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, c$
  $H_a: p_{ij} \neq p_{i\bullet}\, p_{\bullet j}$ for some $i$ and $j$
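Chi-square Test Statistic
• Test statistic:
  $X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$
• Under the null, the expected count is calculated as
  $e_{ij} = n \hat{p}_{ij} = n\, \hat{p}_{i\bullet}\, \hat{p}_{\bullet j} = n \times \frac{x_{i\bullet}}{n} \times \frac{x_{\bullet j}}{n} = \frac{x_{i\bullet}\, x_{\bullet j}}{n}$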
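The expected counts are the outer product of the row and column sums divided by n. A sketch on a hypothetical 2 x 2 table, computing the statistic from its definition:

```r
# Hypothetical 2 x 2 table of observed counts x_ij
x <- matrix(c(30, 20,
              10, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(Var1 = c("X", "Y"), Var2 = c("A", "B")))
n <- sum(x)

# e_ij = (row sum i) * (column sum j) / n
e <- outer(rowSums(x), colSums(x)) / n
print(e)

# X^2 = sum over all cells of (x_ij - e_ij)^2 / e_ij
X2 <- sum((x - e)^2 / e)
print(X2)
```

For these counts the expected table is 20, 30 in each row, every cell deviates by 10, and the statistic is 100/20 + 100/30 + 100/20 + 100/30 = 50/3.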
Conducting the Test
• Now, proceed as with the goodness-of-fit test
  – Except degrees of freedom are $\nu = (r - 1)(c - 1)$
• Large values of the chi-square statistic are evidence that the null is false
• We’ll let R do the p-value calculation
  – Reject null if p-value < $\alpha$, for some pre-determined significance level $\alpha$
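A brief sketch of the degrees of freedom and p-value for a two-way table, on hypothetical counts and with alpha = 0.05 chosen for illustration:

```r
# Hypothetical 2 x 2 table of observed counts
x <- matrix(c(30, 20,
              10, 40), nrow = 2, byrow = TRUE)

# Expected counts under independence and the test statistic
e  <- outer(rowSums(x), colSums(x)) / sum(x)
X2 <- sum((x - e)^2 / e)

# Degrees of freedom: (rows - 1) * (columns - 1)
nu <- (nrow(x) - 1) * (ncol(x) - 1)

# Upper-tail p-value and decision
p_value <- pchisq(X2, df = nu, lower.tail = FALSE)
alpha   <- 0.05
print(c(X2 = X2, df = nu, p.value = p_value))
print(p_value < alpha)
```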
Example: Mobile Learning Survey
• In the mobile learning devices survey, is there an association between owning a smartphone and owning a PDA?
  – “Do you own a smartphone (such as iPhone, Android, and Blackberry)?” (yes/no)
  – “Do you own a PDA (such as iPad, Zune HD, iPod Touch, Palm, excluding previously mentioned devices)?” (yes/no)
• Conclusion: The two sets of responses are not independent, so yes, there is an association
* Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc)
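A sketch of the test of independence with chisq.test() on a 2 x 2 table. The counts are hypothetical and are not the survey results. Note that for 2 x 2 tables chisq.test() applies Yates' continuity correction by default; correct = FALSE reproduces the plain Pearson statistic.

```r
# Hypothetical ownership counts (NOT the survey results)
own <- matrix(c(45,  55,
                30, 120),
              nrow = 2, byrow = TRUE,
              dimnames = list(Smartphone = c("Yes", "No"),
                              PDA        = c("Yes", "No")))

# Default call: continuity-corrected statistic for a 2 x 2 table
fit <- chisq.test(own)
print(fit)

# Uncorrected Pearson chi-square, matching the X^2 formula
fit_nc <- chisq.test(own, correct = FALSE)
print(fit_nc$statistic)
print(fit_nc$parameter)   # (2 - 1) * (2 - 1) = 1
print(fit_nc$p.value)
```

The corrected statistic is never larger than the uncorrected one, so the default is the more conservative of the two.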
What’s the Connection?
• Those who do not own a smartphone are also slightly more likely not to own a PDA
• Similarly, those who own a smartphone are slightly more likely to own a PDA
  – Perhaps not a big surprise…
* Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc, and data cleaned up for convenience)
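One way to see the direction of an association is to compare observed with expected counts. A sketch on hypothetical counts (not the survey results), using the observed, expected, and residuals components returned by chisq.test():

```r
# Hypothetical ownership counts (NOT the survey results)
own <- matrix(c(45,  55,
                30, 120),
              nrow = 2, byrow = TRUE,
              dimnames = list(Smartphone = c("Yes", "No"),
                              PDA        = c("Yes", "No")))
fit <- chisq.test(own, correct = FALSE)

# Positive entries: more observations than independence would predict
print(fit$observed - fit$expected)

# Pearson residuals, (x_ij - e_ij) / sqrt(e_ij)
print(fit$residuals)

# Row proportions: share owning a PDA within each smartphone group
print(round(prop.table(own, margin = 1), 2))
```

In this made-up table the (Yes, Yes) and (No, No) cells exceed their expected counts, which is the same pattern the slide describes.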
Chi-square Test for Homogeneity
• The question: Is the distribution of a variable (say on a Likert scale) the same for two or more row categories?
• Idea: Each row is a population, and under the null the proportion that falls in each column category is the same for every row
• Good news: Calculation is exactly the same as test for independence!
Example: Mobile Learning Survey
• In the mobile learning devices survey, is the age distribution different for resident and DL students?
• Sure looks different, so let’s test it formally:
* Data from 2010 mobile learning devices survey of NPS students (again, assuming SRS and no fpc, and data cleaned up for convenience)
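A sketch of the homogeneity test on a 2 x 4 table. The counts and age bands are hypothetical (not the survey data); the call is the same chisq.test() used for independence.

```r
# Hypothetical counts by student type and age band (NOT the survey data)
age <- matrix(c(40, 35, 15, 10,
                10, 25, 30, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(Student = c("Resident", "DL"),
                              Age     = c("<30", "30-34", "35-39", "40+")))

# Row proportions: the age distribution within each student type
print(round(prop.table(age, margin = 1), 2))

# Same calculation as the test of independence
fit <- chisq.test(age)
print(fit$expected)    # check that every expected count is at least 5
print(fit$statistic)
print(fit$parameter)   # (2 - 1) * (4 - 1) = 3
print(fit$p.value)
```

The continuity correction only applies to 2 x 2 tables, so no correct argument is needed here.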
What We Have Just Learned
• Discussed tests for contingency tables
  – One-way chi-square goodness-of-fit tests
    • Homogeneity
    • Other distributions
  – Two-way chi-square tests
    • Independence
    • Homogeneity
• All can be useful for analyzing Likert scale and other categorical survey data
• Next class, we will learn how to modify these tests for complex sampling situations