lecture 07 category shaoqi rao rev
1
Chapter 6
Chi-Square Test for Categorical Variable
Shaoqi Rao, PhD
2009.11.9
Slides adapted from Dr. Zhang Jinxin’s
2
6.1 Basic logic of the χ² test
Given a set of observed frequencies
A1, A2, A3 …
we want to test whether the data follow a certain theory.
If the theory is true, then we expect a set of theoretical frequencies:
T1, T2, T3 …
Compare A1, A2, A3 … with T1, T2, T3 …
If they are quite different, then the theory might not be true;
otherwise, the theory is acceptable.
3
6.1.1 Chi-square distribution
$$\chi^2_P = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2 \text{ distribution}$$
The statistic measures the agreement between the observed frequencies f_i and the expected frequencies e_i.
DF = k - 1 - (number of parameters estimated from the sample)
For a contingency table,
DF = (# rows - 1)(# columns - 1)
4
χ² distribution
[Figure: density of the χ² distribution; horizontal axis from 0 to 10, vertical axis from 0.0 to 0.3]
5
6.1.2 χ² Test for Goodness of Fit (Large Sample)
Table 1 Frequency distribution and goodness of fit based on 136 measurements of the phantom
Interval   A    Φ(X1)     Φ(X2)     P(X)      T = n·P(X)   (A-T)²/T
1.228-      2   0.00069   0.00466   0.00397    0.5405      3.94143
1.234-      2   0.00466   0.02275   0.01809    2.4601      0.08605
1.240-      7   0.02275   0.08076   0.05801    7.8889      0.10016
1.246-     17   0.08076   0.21186   0.13110   17.8294      0.03859
1.252-     25   0.21186   0.42074   0.20888   28.4083      0.40892
1.258-     37   0.42074   0.65542   0.23468   31.9167      0.80961
1.264-     25   0.65542   0.84134   0.18592   25.2855      0.00322
1.270-     16   0.84134   0.94520   0.10386   14.1244      0.24906
1.276-      4   0.94520   0.98610   0.04090    5.5618      0.43858
1.282-      1   0.98610   0.99744   0.01135    1.5434      0.19130
Total       -   -         -         -          -           6.26692
For example, the limits of the interval 1.240- are standardized as
$$Z_1 = \frac{1.240 - 1.26}{0.01} = -2.00, \qquad Z_2 = \frac{1.246 - 1.26}{0.01} = -1.40$$
6
1. Setting up hypotheses
H0: the population follows N(1.26, 0.01²)
H1: the population doesn't follow N(1.26, 0.01²)
α = 0.05
2. Calculation of the statistic:
$$\chi^2 = \sum \frac{(A - T)^2}{T} = 6.27$$
3. P-value: ν = k-1-2 = 10-1-2 = 7
$$\chi^2_{0.5,7} = 6.35, \qquad \chi^2_{0.05,7} = 14.07$$
$$\chi^2 = 6.27 < \chi^2_{0.5,7}, \qquad P > 0.5$$
4. Conclusion: At significance level α = 0.05, H0 is not rejected. The data give no evidence that the measurements deviate from the normal distribution N(1.26, 0.01²).
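The goodness-of-fit calculation in Table 1 can be reproduced with a short script. This is a minimal sketch, not the lecture's own code: the helper name `normal_cdf` is introduced here for illustration, and the upper limit 1.288 of the last interval is inferred from the common interval width of 0.006 (it corresponds to the last Φ(X2) value in the table).

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# Lower limits of the 10 intervals in Table 1, plus the upper limit of the
# last interval (1.288, inferred from the interval width 0.006).
edges = [1.228, 1.234, 1.240, 1.246, 1.252, 1.258,
         1.264, 1.270, 1.276, 1.282, 1.288]
observed = [2, 2, 7, 17, 25, 37, 25, 16, 4, 1]  # column A
n = sum(observed)                                # 136 measurements
mean, sd = 1.26, 0.01                            # distribution under H0

chi2_stat = 0.0
for a, lo, hi in zip(observed, edges[:-1], edges[1:]):
    p = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)  # P(X)
    t = n * p                                                # T = n * P(X)
    chi2_stat += (a - t) ** 2 / t                            # (A - T)^2 / T

# Degrees of freedom: k - 1 - 2, as in step 3 of the test
df = len(observed) - 1 - 2

print(f"chi-square = {chi2_stat:.2f}, df = {df}")
```

The statistic is then compared with the tabulated critical values for 7 degrees of freedom, as in step 3.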
7
6.2 Comparison between Two Independent Sample Proportions
In Chapter 4, the Z test can only be used for comparing π with a given π0 (one sample) or comparing π1 with π2 (two samples).
If we need to compare more than two samples, the chi-square test is widely used.
8
Example 6.1 In a clinical survey, 215 patients with pulmonary heart disease were enrolled in a hospital, of whom 164 had taken digitalis and 51 had not. Each patient received an ECG examination. The results are listed in Table 6.2.
9
Table 6.2 ECG results (arrhythmia) of patients with pulmonary heart disease
                     Arrhythmia   Normal   Total   Arrhythmia rate (%)
With digitalis           81         83      164         49.39
Without digitalis        19         32       51         37.25
Total                   100        115      215         46.51
10
Table 6.2 ECG results (arrhythmia) of patients with pulmonary heart disease (theoretical frequencies in parentheses)
                     Arrhythmia    Normal       Total   Arrhythmia rate (%)
With digitalis        81 (76.28)   83 (87.72)    164         49.39
Without digitalis     19 (23.72)   32 (27.28)     51         37.25
Total                100          115            215         46.51
11
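$$\chi^2_P = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$
$$\chi^2_P = \frac{(81-76.28)^2}{76.28} + \frac{(83-87.72)^2}{87.72} + \frac{(19-23.72)^2}{23.72} + \frac{(32-27.28)^2}{27.28} = 2.3028$$
Equivalent formula for a 2×2 table:
$$\chi^2_P = \frac{(f_{11}f_{22} - f_{12}f_{21})^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}}$$
$$\chi^2_P = \frac{(81 \times 32 - 83 \times 19)^2 \times 215}{164 \times 51 \times 100 \times 115} = 2.3028$$
ν = 1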
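Both forms of the Pearson statistic for Table 6.2 can be checked numerically. The sketch below is illustrative only; the function names `chi2_from_expected` and `chi2_shortcut_2x2` are hypothetical names chosen here, not part of the lecture.

```python
def chi2_from_expected(table):
    """Pearson chi-square as the sum of (f - e)^2 / e over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # theoretical frequency
            stat += (f - e) ** 2 / e
    return stat

def chi2_shortcut_2x2(f11, f12, f21, f22):
    """Shortcut formula (f11*f22 - f12*f21)^2 * n / (r1*r2*c1*c2)."""
    n = f11 + f12 + f21 + f22
    r1, r2 = f11 + f12, f21 + f22
    c1, c2 = f11 + f21, f12 + f22
    return (f11 * f22 - f12 * f21) ** 2 * n / (r1 * r2 * c1 * c2)

# Table 6.2: rows = with / without digitalis, columns = arrhythmia / normal
table_6_2 = [[81, 83], [19, 32]]
chi2_sum = chi2_from_expected(table_6_2)
chi2_short = chi2_shortcut_2x2(81, 83, 19, 32)

print(f"{chi2_sum:.4f} {chi2_short:.4f}")
```

The two values agree because, in a 2×2 table, all four cells share the same absolute deviation |f - e|.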
12
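χ² test and Z test
According to (4.25),
$$z = \frac{P_1 - P_2}{\sqrt{P_0(1 - P_0)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \tag{4.25}$$
$$Z = \frac{81/164 - 19/51}{\sqrt{(100/215)(115/215)(1/164 + 1/51)}} = 1.5175$$
$$Z^2 = 2.3028 = \chi^2$$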
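The relation Z² = χ² for the digitalis data can be verified directly. This is a small sketch; the function name `two_proportion_z` is an assumption made here for illustration, implementing the pooled two-sample Z statistic of formula (4.25).

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample Z statistic for comparing two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 81 of 164 patients with digitalis and 19 of 51 without had arrhythmia
z = two_proportion_z(81, 164, 19, 51)
z_squared = z ** 2

print(f"Z = {z:.4f}, Z^2 = {z_squared:.4f}")
```

Squaring Z gives the uncorrected Pearson χ² of the same 2×2 table, so the two tests lead to the same two-sided conclusion.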
13
Correction for continuity
When n ≥ 40 and some cell has 1 ≤ e_ij < 5,
$$\chi^2_P = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(|f_{ij} - e_{ij}| - 0.5)^2}{e_{ij}}$$
$$\chi^2_P = \frac{(|f_{11}f_{22} - f_{12}f_{21}| - n/2)^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}}$$
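The continuity-corrected statistic is simple to compute from the four cell counts. The sketch below uses a hypothetical function name, `yates_chi2_2x2`, and applies it to the counts of Table 6.2 purely as an illustration of the arithmetic; in that table every theoretical frequency exceeds 5, so the correction is not actually required there.

```python
def yates_chi2_2x2(f11, f12, f21, f22):
    """Continuity-corrected chi-square for a 2x2 table:
    (|f11*f22 - f12*f21| - n/2)^2 * n / (r1 * r2 * c1 * c2)."""
    n = f11 + f12 + f21 + f22
    r1, r2 = f11 + f12, f21 + f22
    c1, c2 = f11 + f21, f12 + f22
    return (abs(f11 * f22 - f12 * f21) - n / 2) ** 2 * n / (r1 * r2 * c1 * c2)

# Counts from Table 6.2, used only to illustrate the calculation
corrected = yates_chi2_2x2(81, 83, 19, 32)
print(f"corrected chi-square = {corrected:.4f}")
```

The corrected value is always smaller than the uncorrected one, which makes the test more conservative.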
14
Fisher's exact test
When n < 40, or some e_ij < 1, the χ² test is not appropriate. Fisher's exact test gives an exact P value on which the conclusion is based.
This is easily done in SPSS.
15
Example 6.9
Table 6.14 Results of treatment of patients with embolic angitis
Groups           Recovery   No recovery   Total
New treatment     6 (a)       1 (b)        7 (nr1)
Control           1 (c)       4 (d)        5 (nr2)
Total             7 (nc1)     5 (nc2)     12 (n)
16
Statistical description
group * result Crosstabulation
                                   result
group                              recovery   no recovery   Total
new treatment   Count                 6           1            7
                Expected Count        4.1         2.9          7.0
                % within group       85.7%       14.3%       100.0%
control         Count                 1           4            5
                Expected Count        2.9         2.1          5.0
                % within group       20.0%       80.0%       100.0%
Total           Count                 7           5           12
                Expected Count        7.0         5.0         12.0
                % within group       58.3%       41.7%       100.0%
17
Statistical inference
Chi-Square Tests
                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             5.182(b)   1     .023
Continuity Correction(a)       2.831      1     .092
Likelihood Ratio               5.555      1     .018
Fisher's Exact Test                                           .072         .045
Linear-by-Linear Association   4.750      1     .029
N of Valid Cases               12
a. Computed only for a 2x2 table.
b. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.08.
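The exact P values reported for Fisher's exact test can be reproduced from the hypergeometric distribution. The sketch below is not SPSS's implementation; it assumes the common definition of the two-sided P value (the sum of the probabilities of all tables, with the same margins, that are no more probable than the observed one), and the function name `fisher_exact_2x2` is chosen here for illustration.

```python
from fractions import Fraction
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one-sided P, two-sided P) for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    total = comb(n, c1)

    def prob(x):
        # Probability that the top-left cell equals x, with all margins fixed
        return Fraction(comb(r1, x) * comb(r2, c1 - x), total)

    lo, hi = max(0, c1 - r2), min(r1, c1)  # feasible range of the top-left cell
    p_obs = prob(a)
    # Two-sided: all tables no more probable than the observed one (assumed definition)
    two_sided = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs)
    # One-sided: the smaller of the two tail probabilities
    upper = sum(prob(x) for x in range(a, hi + 1))
    lower = sum(prob(x) for x in range(lo, a + 1))
    return float(min(upper, lower)), float(two_sided)

# Table 6.14: new treatment 6 / 1, control 1 / 4
one_sided, two_sided = fisher_exact_2x2(6, 1, 1, 4)
print(f"one-sided P = {one_sided:.3f}, two-sided P = {two_sided:.3f}")
```

Exact fractions are used so that the comparison `prob(x) <= p_obs` is not affected by floating-point rounding.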
18
6.3 The χ² Tests for Binary Variable under a Paired Design
Example 6.2 There are 260 serum samples. Each sample is divided into two parts, and each part is tested for rheumatoid factor by one of two different immunological methods. The results are listed in Table 6.4. The question is whether the results of the two methods are independent.
19
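Test for independence between two binary variables
Example 6.2
Table 6.4 The results of two immunological tests
              B
A         +       -     Total
+        172      8      180
-         12     68       80
Total    184     76      260
$$\chi^2_P = \frac{(f_{11}f_{22} - f_{12}f_{21})^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}} = \frac{(172 \times 68 - 8 \times 12)^2 \times 260}{180 \times 80 \times 184 \times 76} = 173.74$$
With the continuity correction the statistic is 169.87; the conclusion is the same.
Proportion positive by method B: 172/180 = 95.6% among A-positive samples, 12/80 = 15% among A-negative samples.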
20
6.3.2 Comparison between two sample proportions
McNemar test
$$\chi^2 = \frac{(f_{12} - f_{21})^2}{f_{12} + f_{21}}$$
21
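H0: π1 = π2, H1: π1 ≠ π2, α = 0.05
When H0 is true, the two discordant cells have equal theoretical frequencies:
$$T_1 = T_2 = \frac{b + c}{2}$$
For a large sample (b + c > 40),
$$\chi^2 = \frac{\left(b - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}} + \frac{\left(c - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}} = \frac{(b - c)^2}{b + c}$$
If χ² > χ²_{0.05}, then reject H0.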
22
The Probability Expressions
                 Trt B
Trt A        +           -          Total
+          π11 (a)     π12 (b)      πr1
-          π21 (c)     π22 (d)      πr2
Total      πc1         πc2          1.0
H0: πc1 = πr1, H1: πc1 ≠ πr1
Since πc1 = π11 + π21 and πr1 = π11 + π12,
the test becomes: H0: π12 = π21, H1: π12 ≠ π21
23
Correction to McNemar test (f12 + f21 < 40)
$$\chi^2 = \frac{(|f_{12} - f_{21}| - 1)^2}{f_{12} + f_{21}}$$
$$\chi^2 = \frac{(|8 - 12| - 1)^2}{8 + 12} = 0.45$$
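The McNemar statistic depends only on the two discordant cells of the paired table. A minimal sketch follows; the function name `mcnemar_chi2` and its `corrected` flag are hypothetical names introduced here, and the discordant counts 8 and 12 are taken from Table 6.4.

```python
def mcnemar_chi2(b, c, corrected=False):
    """McNemar chi-square from the discordant counts b (f12) and c (f21).

    corrected=False: (b - c)^2 / (b + c)
    corrected=True:  (|b - c| - 1)^2 / (b + c), used when b + c < 40
    """
    diff = abs(b - c)
    if corrected:
        diff = max(diff - 1, 0)  # do not let the correction overshoot zero
    return diff ** 2 / (b + c)

# Table 6.4: f12 = 8 (A+, B-) and f21 = 12 (A-, B+); 8 + 12 < 40
chi2_corrected = mcnemar_chi2(8, 12, corrected=True)
chi2_uncorrected = mcnemar_chi2(8, 12)

print(f"corrected = {chi2_corrected:.2f}, uncorrected = {chi2_uncorrected:.2f}")
```

The statistic has 1 degree of freedom; the concordant cells (172 and 68) do not enter the calculation.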
24
6.4 The χ² Test for R×C Contingency Table
Table 6.6 Blood types of patients suffering from different diseases
                         Blood type
Disease status        A       B       O      Total
Digestive ulcer      679     134     983     1796
Stomach cancer       416      84     383      883
Control             2625     570    2892     6087
Total               3720     788    4258     8766
25
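The statistic for hypothesis test
$$\chi^2_P = n\left(\sum_{i}\sum_{j} \frac{f_{ij}^2}{n_{ri}\, n_{cj}} - 1\right)$$
$$\chi^2_P = 8766 \times \left(\frac{679^2}{1796 \times 3720} + \cdots + \frac{2892^2}{6087 \times 4258} - 1\right) = 40.543$$
$$\nu = (R - 1)(C - 1) = 4, \qquad \chi^2_{0.05,4} = 9.488$$
Since 40.543 > 9.488, P < 0.05.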
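The R×C statistic for Table 6.6 can be computed with the same shortcut formula. The sketch below is illustrative; `chi2_rxc` is a name chosen here, not a function from the lecture or from any library.

```python
def chi2_rxc(table):
    """Pearson chi-square for an R x C table using
    n * (sum of f_ij^2 / (n_ri * n_cj) - 1); returns (statistic, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    s = sum(
        f * f / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return n * (s - 1), df

# Table 6.6: rows = digestive ulcer, stomach cancer, control; columns = A, B, O
table_6_6 = [
    [679, 134, 983],
    [416, 84, 383],
    [2625, 570, 2892],
]
stat, df = chi2_rxc(table_6_6)
print(f"chi-square = {stat:.3f}, df = {df}")
```

The result is compared with the critical value 9.488 for 4 degrees of freedom at α = 0.05.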
26
6.4.2 Multiple comparison for R×C Table
group            +     -
I                …     …
II               …     …
III              …     …
IV               …     …
V                …     …
VI (control)     …     …
27
6.4.3 Measurement of association for R×C table
Table 6.11 Blood type of 1043 patients
                    MN system
ABO system      M      N      MN     Total
O               85    100    150      335
A               56     78    120      254
B               98    132    170      400
AB              23     25      6       54
Total          262    335    446     1043
28
Pearson contingency coefficient
$$r_P = \sqrt{\frac{\chi^2_P}{n + \chi^2_P}}$$
$$r_P = \sqrt{\frac{25.925}{1043 + 25.925}} = 0.156$$
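The contingency coefficient for Table 6.11 can be obtained by first computing the Pearson χ² of the table and then applying the formula for r_P. This is a sketch under the assumption that χ² is computed with the usual R×C shortcut formula; the function names are hypothetical.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square via n * (sum of f_ij^2 / (n_ri * n_cj) - 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    s = sum(
        f * f / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )
    return n * (s - 1), n

def contingency_coefficient(chi2_value, n):
    """Pearson contingency coefficient sqrt(chi2 / (n + chi2))."""
    return math.sqrt(chi2_value / (n + chi2_value))

# Table 6.11: rows = ABO types O, A, B, AB; columns = MN types M, N, MN
table_6_11 = [
    [85, 100, 150],
    [56, 78, 120],
    [98, 132, 170],
    [23, 25, 6],
]
chi2_value, n_total = pearson_chi2(table_6_11)
r_p = contingency_coefficient(chi2_value, n_total)
print(f"chi-square = {chi2_value:.3f}, r_P = {r_p:.3f}")
```

The coefficient lies between 0 and 1, with values near 0 indicating weak association between the two classifications.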
29
Prerequisites for the χ² test
As a rule of thumb:
The theoretical frequencies should be greater than 5 in more than 4/5 of the cells;
The theoretical frequency in any cell should be greater than 1.
Otherwise, we need to use Fisher's exact test.
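The two conditions above can be checked mechanically from the theoretical frequencies of a table. The sketch below is one possible reading of the rule of thumb, with hypothetical function names; it is applied to the counts of Table 6.2 and Table 6.14 from earlier slides.

```python
def expected_counts(table):
    """Theoretical frequencies e_ij = (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_prerequisites_met(table):
    """True if more than 4/5 of the cells have e > 5 and every cell has e > 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_above_5 = sum(e > 5 for e in cells) / len(cells)
    return share_above_5 > 4 / 5 and min(cells) > 1

ok_table_6_2 = chi2_prerequisites_met([[81, 83], [19, 32]])  # digitalis data
ok_table_6_14 = chi2_prerequisites_met([[6, 1], [1, 4]])     # embolic angitis data
print(ok_table_6_2, ok_table_6_14)
```

For Table 6.14 all four theoretical frequencies are below 5, which is why Fisher's exact test was used for that example.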
30