Topic 2
LOGIT analysis of contingency tables
Contingency table
a cross-classification
A table that classifies observations by two or more variables; the purpose is to determine whether these variables are related.
                             Change in stock prices in year
Change in stock prices
in January                   UP           DOWN         TOTAL
UP                           22 (16.1)    1 (6.9)      23
DOWN                         6 (11.9)     11 (5.1)     17
TOTAL                        28           12           40
A table of this sort can be used to test whether, as some financial analysts suggest, January is a good predictor of whether stock prices will go up or down over the entire year.
H0 : the probability that stock prices go up over the entire year is the same regardless of their behaviour in January
H1 : otherwise
Expected frequencies under H0 are shown in parentheses in the table.
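As a worked illustration of where the values in parentheses come from: under independence, the standard expected frequency for a cell is its row total times its column total, divided by the grand total. The symbols $R_i$, $C_j$, $N$ and $e_{ij}$ are notation introduced here for the illustration, not taken from the slides.

```latex
e_{ij} = \frac{R_i \, C_j}{N}, \qquad
e_{11} = \frac{23 \times 28}{40} = 16.1, \quad
e_{12} = \frac{23 \times 12}{40} = 6.9, \quad
e_{21} = \frac{17 \times 28}{40} = 11.9, \quad
e_{22} = \frac{17 \times 12}{40} = 5.1
```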
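Pearson’s Chi-square statistic
$$P = \sum_{i=1}^{n} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2_{(r-1)(c-1)}$$
where f_i and e_i are the observed and expected frequencies in cell i, n is the number of cells, and r and c are respectively the numbers of rows and columns in the table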
In our example,
$$\sum_{i=1}^{4} \frac{(f_i - e_i)^2}{e_i} = \frac{(22-16.1)^2}{16.1} + \frac{(1-6.9)^2}{6.9} + \frac{(6-11.9)^2}{11.9} + \frac{(11-5.1)^2}{5.1} = 16.96$$
Now $\chi^2_{(1,\,0.05)} = 3.84 < 16.96$, so we reject the null. In other words, based on this evidence, the probability that stock prices will go up during the whole year does not seem to be independent of whether or not they go up in January.
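The arithmetic of the test can be reproduced outside SAS. The following is a minimal sketch in Python, not the course's own code: the function name `pearson_chi_square` is illustrative, and the p-value line relies on the closed form for a chi-square distribution with exactly one degree of freedom, which is an assumption that would not carry over to larger tables.

```python
import math

# Observed counts from the January/year table:
# rows = January (UP, DOWN), columns = whole year (UP, DOWN).
observed = [[22, 1],
            [6, 11]]

def pearson_chi_square(table):
    """Return (statistic, expected frequencies, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency under independence: row total * column total / N.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, df

statistic, expected, df = pearson_chi_square(observed)

# For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
# This shortcut is only valid when df == 1.
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"Pearson chi-square = {statistic:.2f} on {df} df")
print(f"p-value = {p_value:.6f}")
```

The statistic should agree with the hand calculation, and the p-value falls far below 0.05, in line with rejecting the null.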
DATA STOCK;
INPUT F YP JP;
DATALINES;
22 1 1
6 1 0
1 0 1
11 0 0
;
PROC FREQ DATA=STOCK;
WEIGHT F;
TABLES YP*JP/CHISQ CMH;
RUN;
Two Way Table
Consider the following SAS program and OUTPUT:
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD/D=B;
RUN;
But suppose we don’t have individual-level data. All we have is the following table:
         Blacks   Nonblacks   Total
Death    28       22          50
Life     45       52          97
Total    73       74          147
DATA CONT1;
INPUT F BLACKD DEATH;
DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
PROC GENMOD DATA=CONT1 DESCENDING;
FREQ F;
MODEL DEATH=BLACKD/D=B;
RUN;
Results are identical to those obtained previously.
Alternatively, we can run the following program and examine its output:
DATA CONT1;
INPUT DEATH TOTAL BLACKD;
DATALINES;
22 74 0
28 73 1
;
PROC GENMOD DATA=CONT1;
MODEL DEATH/TOTAL=BLACKD/D=B;
RUN;
Points to note:
Instead of replicating the observations, GENMOD treats the variable DEATH as having a binomial distribution with the number of trials given by TOTAL.
The deviance is 0. Why?
The deviance is a likelihood ratio test statistic that compares the fitted model with a saturated model. In this case the saturated model is also the fitted model, with two parameters for two data lines.
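The point about the saturated model can be checked by hand. The sketch below is in Python rather than SAS and is only an illustration. It assumes the usual closed-form logit estimates for a 2x2 table (intercept = log odds of a death sentence when BLACKD=0, slope = log odds ratio) and the standard binomial deviance formula. All names are hypothetical and have nothing to do with GENMOD's internals.

```python
import math

# Grouped data from the table: (deaths, total) for each value of BLACKD.
groups = {0: (22, 74), 1: (28, 73)}

def log_odds(events, trials):
    """Log odds of the event given `events` successes out of `trials`."""
    return math.log(events / (trials - events))

# One binary regressor gives two parameters for two data lines, so the
# maximum likelihood fit reproduces the observed proportions exactly.
intercept = log_odds(*groups[0])
slope = log_odds(*groups[1]) - intercept  # log odds ratio, BLACKD=1 vs 0

def fitted_probability(blackd):
    """Inverse logit of the linear predictor."""
    eta = intercept + slope * blackd
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_deviance(data):
    """2 * sum of y*log(y/mu) + (n-y)*log((n-y)/(n-mu)) over the data lines."""
    total = 0.0
    for blackd, (y, n) in data.items():
        mu = n * fitted_probability(blackd)
        total += y * math.log(y / mu) + (n - y) * math.log((n - y) / (n - mu))
    return 2.0 * total

deviance = binomial_deviance(groups)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"deviance  = {deviance:.1e}")  # zero up to floating-point noise
```

Because the fitted counts equal the observed counts, every logarithm in the deviance is log(1) = 0, which is why the reported deviance is 0.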
Three Way Table
Consider the cross-classification of race, gender and possession of a driver’s license for a sample of 17- and 18-year-olds.
                   Driver’s License
Race    Gender     Yes    No
White   Male       43     134
        Female     26     149
Black   Male       29     23
        Female     22     36
DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE/D=B;
RUN;
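Deviance = 0.0583 with a p-value of 0.8092033193.
The p-value can be obtained by executing the SAS program:
DATA;
CHI = 1 - PROBCHI(0.0583,1);
PUT CHI;
RUN;
The deviance has one degree of freedom because the only term omitted from the saturated model is the WHITE*MALE interaction. So there is no evidence of an interaction between the explanatory variables.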
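The same p-value can be cross-checked without SAS. This is a small sketch in Python, assuming one degree of freedom as in the PROBCHI call. The helper name `chi_square_upper_tail_df1` is made up for this illustration, and the closed form it uses does not generalise to other degrees of freedom.

```python
import math

def chi_square_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with exactly 1 degree of freedom."""
    # A chi-square(1) variable is the square of a standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))

deviance = 0.0583  # deviance quoted in the text for the main-effects model
p_value = chi_square_upper_tail_df1(deviance)
print(f"p-value = {p_value:.4f}")
```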
To see this more explicitly, let us fit the model with the interaction term:
DATA DRIVER;
INPUT WHITE MALE YES NO;
TOTAL = YES+NO;
DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
PROC GENMOD DATA=DRIVER;
MODEL YES/TOTAL=WHITE MALE WHITE*MALE/D=B;
RUN;
Interpretation
The coefficient of MALE is 0.6478.
Exponentiating the coefficient yields exp(0.6478) = 1.91.
=> The estimated odds of having a driver’s license are nearly twice as large for males as for females, after adjusting for racial differences.
For WHITE, the highly significant adjusted odds ratio is exp[-1.3135]=0.269, indicating that the odds of having a driver’s license for whites are a little more than ¼ of the odds for blacks.
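The two odds ratios can be reproduced directly from the quoted coefficients. The snippet below is a minimal Python illustration, with the dictionary names chosen here for convenience.

```python
import math

# Coefficients quoted in the text for the driver's license model.
coefficients = {"MALE": 0.6478, "WHITE": -1.3135}

# Exponentiating a logit coefficient gives the adjusted odds ratio.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A ratio above 1 (MALE) means higher odds than the reference group, and a ratio below 1 (WHITE) means lower odds.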
Four Way Table
Four-way tables are slightly more complicated because more interactions are possible.
Consider the following table. Our goal is to estimate a LOGIT model for the dependence of working class identification on the other three variables.
                                                 Identifies with the
                                                 working class
Country   Occupation    Father’s Occupation      Yes    No     Total
France    Manual        Manual                   85     22     107
                        Non-Manual               44     21     65
          Non-Manual    Manual                   24     42     66
                        Non-Manual               17     154    171
U.S.      Manual        Manual                   24     63     87
                        Non-Manual               22     43     65
          Non-Manual    Manual                   1      84     85
                        Non-Manual               6      142    148
DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL/D=B;
RUN;
The missing variables are the interaction terms: three two-way interactions and one three-way interaction. Because three-way interactions cannot be interpreted easily, let’s see if we can get by with just the two-way interactions.
DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*MANUAL FRANCE*FAMANUAL MANUAL*FAMANUAL/D=B;
RUN;
Examining the Wald Chi-squares, we find that FRANCE*FAMANUAL is highly significant, but the other interaction terms are not.
DATA WORKING;
INPUT FRANCE MANUAL FAMANUAL TOTAL WORKING;
DATALINES;
1 1 1 107 85
1 1 0 65 44
1 0 1 66 24
1 0 0 171 17
0 1 1 87 24
0 1 0 65 22
0 0 1 85 1
0 0 0 148 6
;
PROC GENMOD DATA=WORKING;
MODEL WORKING/TOTAL = FRANCE MANUAL FAMANUAL FRANCE*FAMANUAL/D=B;
RUN;
Interpretations of results
Coefficient for MANUAL:
exp(2.5155) = 12.4
=> Manual workers have odds of identifying with the working class that are more than 12 times the odds for non-manual workers.
Coefficient for FRANCE*FAMANUAL:
$$\frac{\partial P_i}{\partial \text{FAMANUAL}} = f(\cdot)\,(-0.3802 + 1.5061 \times \text{FRANCE})$$
If FRANCE=0, then f(.)[-0.3802] represents the effect of FAMANUAL when the respondent lives in the U.S.
If FRANCE=1, then f(.)[-0.3802+1.5061] = f(.)[1.13] represents the effect of FAMANUAL when the respondent lives in France; exp[1.13]=3.1.
In France, men whose fathers had a manual occupation have odds of identification that are more than three times the odds for men whose fathers did not have a manual occupation.
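The way the interaction shifts the FAMANUAL effect between the two countries can be made concrete with a few lines of arithmetic. The Python sketch below uses only the two coefficients quoted above. The function name is illustrative, and it works on the log-odds scale, so the f(.) factor does not appear.

```python
import math

# Coefficients quoted in the text for the model with FRANCE*FAMANUAL.
beta_famanual = -0.3802         # main effect of FAMANUAL
beta_france_famanual = 1.5061   # FRANCE*FAMANUAL interaction

def famanual_log_odds_effect(france):
    """Effect of FAMANUAL on the log odds for a given FRANCE value (0 or 1)."""
    return beta_famanual + beta_france_famanual * france

for france, country in ((0, "U.S."), (1, "France")):
    effect = famanual_log_odds_effect(france)
    print(f"{country}: log-odds effect = {effect:.4f}, "
          f"odds ratio = {math.exp(effect):.2f}")
```

The odds ratio is below 1 for the U.S. and a little above 3 for France, which is the contrast the interaction term captures.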
Overdispersion
Refers to a situation of lack of fit.
Causes of overdispersion:
Incorrectly specified model: more interactions or nonlinearity are needed in the model.
Lack of independence of observations due to unobserved heterogeneity at group level.
DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
2.106 29 10
...
2.329 5 2
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B;
RUN;
Note that the deviance and the Pearson Chi-square clearly indicate model mis-specification.
Because there’s only one independent variable, we don’t have the option of putting in interactions.
One can try allowing for nonlinearity by including powers of NIH in the model, but that won’t help.
It is quite possible that the lack of fit is due to a lack of independence among the observations.
There are many characteristics of biochemistry departments besides NIH funding that may have some bearing on whether their graduates seek and get postdoctoral training. Examples are the prestige of the department, whether the department is in an agricultural or medical school, the age of the department, and so on.
Lack of independence of this kind produces what is called extra-binomial variation. The variance of the dependent variable will be greater than what is expected under the assumption of a binomial distribution.
Besides producing a large deviance, extra-binomial variation can result in underestimates of the standard errors and overestimates of the Chi-square statistics.
Method of adjustment: divide the Pearson Chi-square statistic by its degrees of freedom, take the square root, and multiply all the standard errors by that number.
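The adjustment is simple enough to sketch by hand. The Python snippet below assumes the usual convention that the Pearson chi-square is first divided by its degrees of freedom before the square root is taken. The function names are illustrative, and the numbers are hypothetical, chosen only to make the arithmetic easy to follow. They are not the POSTDOC results.

```python
import math

def pearson_scale(pearson_chi_square, degrees_of_freedom):
    """Square root of the Pearson chi-square over its degrees of freedom."""
    return math.sqrt(pearson_chi_square / degrees_of_freedom)

def adjust_standard_errors(standard_errors, scale):
    """Inflate each standard error by the overdispersion scale factor."""
    return [se * scale for se in standard_errors]

# Hypothetical values for illustration only; NOT taken from the POSTDOC fit.
example_pearson = 80.0
example_df = 20
example_standard_errors = [0.10, 0.02]

scale = pearson_scale(example_pearson, example_df)
adjusted = adjust_standard_errors(example_standard_errors, scale)

print(f"scale = {scale:.3f}")
print("adjusted standard errors:", [round(se, 4) for se in adjusted])
```

Since each Wald Chi-square is the squared ratio of a coefficient to its standard error, multiplying the standard errors by the scale factor divides the Wald Chi-squares by the square of that factor. The coefficient estimates themselves are unchanged.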
DATA POSTDOC;
INPUT NIH DOCS PDOCS;
DATALINES;
.5 8 1
.5 9 3
.835 16 1
.998 13 6
1.027 8 2
2.036 9 2
...
13.749 12 7
14.367 29 21
14.698 19 5
15.440 10 6
17.417 10 8
18.635 14 9
21.524 18 16
;
PROC GENMOD DATA=POSTDOC;
MODEL PDOCS/DOCS=NIH /D=B PSCALE;
RUN;