lecture 07 category shaoqi rao rev

1

Chapter 6

Chi-Square Test for Categorical Variable

Shaoqi Rao, PhD

2009.11.9

Slides adapted from Dr. Zhang Jinxin’s

2

6.1 Basic logic of the χ² test

Given a set of observed frequencies

A1, A2, A3 …

we want to test whether the data follow a certain theory. If the theory is true, then we will have a set of theoretical frequencies:

T1, T2, T3 …

Comparing A1, A2, A3 … and T1, T2, T3 …

If they are quite different, then the theory might not be true;

Otherwise, the theory is acceptable.

3

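6.1.1 Chi-square distribution

$$\chi^2_P = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2 \text{ distribution}$$

The statistic measures the agreement between the observed frequencies f_i and the expected frequencies e_i.

DF = k − 1 − (number of parameters estimated to obtain the e_i)

For a contingency table,

DF = (number of rows − 1)(number of columns − 1)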

4

χ² distribution

[Figure: density curve of the χ² distribution; horizontal axis from 0 to 10, vertical axis from 0.0 to 0.3]

5

6.1.2 χ² Test for Goodness of Fit (Large Sample)

Table 1 Frequency distribution and goodness of fit based on 136 measurements of the phantom

Interval    A     Φ(X1)      Φ(X2)      P(X)       T = n·P(X)   (A−T)²/T
1.228-      2     0.00069    0.00466    0.00397     0.5405      3.94143
1.234-      2     0.00466    0.02275    0.01809     2.4601      0.08605
1.240-      7     0.02275    0.08076    0.05801     7.8889      0.10016
1.246-     17     0.08076    0.21186    0.13110    17.8294      0.03859
1.252-     25     0.21186    0.42074    0.20888    28.4083      0.40892
1.258-     37     0.42074    0.65542    0.23468    31.9167      0.80961
1.264-     25     0.65542    0.84134    0.18592    25.2855      0.00322
1.270-     16     0.84134    0.94520    0.10386    14.1244      0.24906
1.276-      4     0.94520    0.98610    0.04090     5.5618      0.43858
1.282-      1     0.98610    0.99744    0.01135     1.5434      0.19130
Total       -     -          -          -           -           6.26692

For example, the limits of the interval 1.240- are standardized as

$$Z = \frac{1.240 - 1.26}{0.01} = -2.00, \qquad Z = \frac{1.246 - 1.26}{0.01} = -1.40$$

6

1. Setting up hypotheses

H0: the population follows N(1.26, 0.01²)

H1: the population doesn't follow N(1.26, 0.01²)

α = 0.05

2. Calculation of the statistic:

$$\chi^2 = \sum \frac{(A - T)^2}{T} = 6.27$$

3. P-value: ν = k − 1 − 2 = 10 − 1 − 2 = 7

$$\chi^2_{0.5,7} = 6.35, \qquad \chi^2_{0.05,7} = 14.07$$

$$\chi^2 < \chi^2_{0.5,7} < \chi^2_{0.05,7}, \qquad P > 0.5$$

4. Conclusion: At significance level α = 0.05, H0 is not rejected. The measurements can be regarded as following the normal distribution.
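The goodness-of-fit calculation in Table 1 can be retraced with a short script. This is a minimal sketch using only the Python standard library. It assumes the ten intervals have width 0.006 and that the last one ends at 1.288 (implied by Φ(X2) = 0.99744 in the last row); the function name `normal_cdf` is introduced here for illustration only.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Observed frequencies A for the 10 intervals of Table 1
observed = [2, 2, 7, 17, 25, 37, 25, 16, 4, 1]
# Interval limits 1.228, 1.234, ..., 1.288 (assumed width 0.006)
limits = [1.228 + 0.006 * i for i in range(11)]
mu, sigma = 1.26, 0.01  # distribution stated in H0

n = sum(observed)
chi2 = 0.0
for a, lower, upper in zip(observed, limits[:-1], limits[1:]):
    p = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)  # P(X)
    t = n * p                                                        # T = n * P(X)
    chi2 += (a - t) ** 2 / t                                         # (A - T)^2 / T

df = len(observed) - 1 - 2  # k - 1 - 2, as in step 3
print(f"n = {n}, chi2 = {chi2:.2f}, df = {df}")
```

The resulting statistic is then compared with the tabulated critical values for ν = 7, as in steps 3 and 4.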

7

6.2 Comparison between Two Independent Sample Proportions

In Chapter 4, the Z test can only be used for comparing π with a given π0 (one sample) or comparing π1 with π2 (two samples).

If we need to compare more than two samples, the chi-square test is widely used.

8

Example 6.1 In a clinical survey, 215 patients with pulmonary heart disease were recruited in a hospital, of whom 164 had taken digitalis and 51 had not. Each of them received an ECG examination. The results are listed in Table 6.2.

9

Table 6.2 Data of patients with pulmonary heart disease and arrhythmia

                             ECG
Group                 Arrhythmia    Normal    Total    Arrhythmia rate (%)
With digitalis            81          83       164          49.39
Without digitalis         19          32        51          37.25
Total                    100         115       215          46.51

10

Table 6.2 Data of patients with pulmonary heart disease and arrhythmia (theoretical frequencies in parentheses)

                               ECG
Group                 Arrhythmia     Normal       Total    Arrhythmia rate (%)
With digitalis        81 (76.28)     83 (87.72)    164          49.39
Without digitalis     19 (23.72)     32 (27.28)     51          37.25
Total                 100            115           215          46.51

11

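$$\chi^2_P = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$$

$$\chi^2_P = \frac{(81-76.28)^2}{76.28} + \frac{(83-87.72)^2}{87.72} + \frac{(19-23.72)^2}{23.72} + \frac{(32-27.28)^2}{27.28} = 2.3028$$

Equivalently, for a 2×2 table,

$$\chi^2_P = \frac{(f_{11}f_{22} - f_{12}f_{21})^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}}$$

$$\chi^2_P = \frac{(81 \times 32 - 83 \times 19)^2 \times 215}{164 \times 51 \times 100 \times 115} = 2.3028$$

ν = 1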
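The two ways of computing the statistic for Table 6.2 can be checked against each other numerically. The sketch below uses plain Python; the function names are hypothetical and chosen only for this illustration.

```python
def pearson_chi2_2x2(f11, f12, f21, f22):
    """Pearson chi-square from observed and theoretical frequencies."""
    n = f11 + f12 + f21 + f22
    rows = (f11 + f12, f21 + f22)
    cols = (f11 + f21, f12 + f22)
    cells = ((f11, f12), (f21, f22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n          # theoretical frequency e_ij
            chi2 += (cells[i][j] - e) ** 2 / e
    return chi2

def shortcut_chi2_2x2(f11, f12, f21, f22):
    """Shortcut formula for a 2x2 table (no continuity correction)."""
    n = f11 + f12 + f21 + f22
    num = (f11 * f22 - f12 * f21) ** 2 * n
    den = (f11 + f12) * (f21 + f22) * (f11 + f21) * (f12 + f22)
    return num / den

# Table 6.2: with digitalis (81, 83), without digitalis (19, 32)
print(round(pearson_chi2_2x2(81, 83, 19, 32), 4))
print(round(shortcut_chi2_2x2(81, 83, 19, 32), 4))
```

Both functions should agree up to floating-point rounding, since the shortcut is an algebraic rearrangement of the general sum for the 2×2 case.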

12

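χ² test and Z test

According to (4.25),

$$Z = \frac{P_1 - P_2}{\sqrt{P_0(1 - P_0)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (4.25)$$

$$Z = \frac{81/164 - 19/51}{\sqrt{(100/215)(115/215)(1/164 + 1/51)}} = 1.5175$$

$$Z^2 = 2.3028 = \chi^2$$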
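The relation between the Z test and the χ² test for Table 6.2 can be verified with a few lines of Python. This is a sketch under the pooled-proportion form of (4.25); the function name `two_proportion_z` is an assumption made for this example.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for two independent proportions with pooled P0."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)                       # pooled proportion
    se = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Arrhythmia: 81 of 164 with digitalis, 19 of 51 without digitalis
z = two_proportion_z(81, 164, 19, 51)
# Two-sided P value of the standard normal distribution
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"Z = {z:.4f}, Z^2 = {z * z:.4f}, P = {p_two_sided:.3f}")
```

Squaring Z reproduces the uncorrected 2×2 χ² statistic, so the two tests lead to the same two-sided P value.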

13

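Correction for continuity

When n ≥ 40, if any theoretical frequency satisfies 1 ≤ e_ij < 5,

$$\chi^2_P = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(|f_{ij} - e_{ij}| - 0.5)^2}{e_{ij}}$$

$$\chi^2_P = \frac{(|f_{11}f_{22} - f_{12}f_{21}| - n/2)^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}}$$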
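As a numerical check, the two forms of the continuity-corrected statistic can be evaluated on the same table. The sketch below uses the counts of Table 6.14 (6, 1, 1, 4), for which the SPSS output in Example 6.9 reports a Continuity Correction value of 2.831. That table has n = 12, so the lecture recommends Fisher's exact test for it; it is used here only because a reference value is available. Function names are hypothetical.

```python
def yates_from_expected(f11, f12, f21, f22):
    """Continuity-corrected chi-square summed over the four cells."""
    n = f11 + f12 + f21 + f22
    rows = (f11 + f12, f21 + f22)
    cols = (f11 + f21, f12 + f22)
    cells = ((f11, f12), (f21, f22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            chi2 += (abs(cells[i][j] - e) - 0.5) ** 2 / e
    return chi2

def yates_shortcut(f11, f12, f21, f22):
    """Continuity-corrected chi-square via the 2x2 shortcut formula."""
    n = f11 + f12 + f21 + f22
    num = (abs(f11 * f22 - f12 * f21) - n / 2) ** 2 * n
    den = (f11 + f12) * (f21 + f22) * (f11 + f21) * (f12 + f22)
    return num / den

print(round(yates_from_expected(6, 1, 1, 4), 3))
print(round(yates_shortcut(6, 1, 1, 4), 3))
```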

14

Fisher’s exact test

When n < 40, or when some e_ij < 1, the χ² test is not appropriate. Fisher’s exact test gives an exact P value on which the conclusion can be based.

This is easily done in SPSS.

15

Example 6.9

Table 6.14 The results of treatment of embolic angiitis patients

Groups           Recovery     No recovery    Total
New treatment    6 (a)        1 (b)          7 (nr1)
Control          1 (c)        4 (d)          5 (nr2)
Total            7 (nc1)      5 (nc2)        12 (n)

16

Statistical description

group * result Crosstabulation

                                        result
group                             recovery    no recovery    Total
new treatment   Count                6             1            7
                Expected Count       4.1           2.9          7.0
                % within group      85.7%         14.3%       100.0%
control         Count                1             4            5
                Expected Count       2.9           2.1          5.0
                % within group      20.0%         80.0%       100.0%
Total           Count                7             5           12
                Expected Count       7.0           5.0         12.0
                % within group      58.3%         41.7%       100.0%

17

Statistical inference

Chi-Square Tests

                                Value     df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                                (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square              5.182(b)  1     .023
Continuity Correction(a)        2.831     1     .092
Likelihood Ratio                5.555     1     .018
Fisher's Exact Test                                            .072          .045
Linear-by-Linear Association    4.750     1     .029
N of Valid Cases                12

a. Computed only for a 2x2 table
b. 4 cells (100.0%) have expected count less than 5. The minimum expected count is 2.08.
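The exact P values reported by SPSS for Table 6.14 can be reproduced from the hypergeometric distribution. The following is a minimal sketch, not SPSS's implementation. It assumes the usual conventions: the one-sided P value is the smaller tail probability, and the two-sided P value sums the probabilities of all tables with the same margins that are no more probable than the observed one. The function name is hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one_sided, two_sided) exact P values for [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    total = comb(n, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / total

    lo, hi = max(0, c1 - r2), min(r1, c1)   # feasible range of the top-left cell
    probs = {x: prob(x) for x in range(lo, hi + 1)}
    p_obs = probs[a]

    upper = sum(p for x, p in probs.items() if x >= a)
    lower = sum(p for x, p in probs.items() if x <= a)
    one_sided = min(upper, lower)
    # Small relative tolerance guards against floating-point ties
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return one_sided, two_sided

# Table 6.14: new treatment (6, 1), control (1, 4)
one_sided, two_sided = fisher_exact_2x2(6, 1, 1, 4)
print(f"one-sided P = {one_sided:.3f}, two-sided P = {two_sided:.3f}")
```

With the two-sided exact P value above 0.05, the conclusion differs from the one suggested by the uncorrected Pearson χ², which illustrates why the exact test matters for small samples.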

18

6.3 The χ² Tests for Binary Variable under a Paired Design

Example 6.2 There are 260 serum samples. Each sample is divided into two parts, and each part is tested for rheumatoid factor by one of two immunological methods. The results are listed in Table 6.4. The question is whether the results of the two methods are independent.

19

Example 6.2

Test for independence between two binary variables

Table 6.4 The results of two immunological tests

                 B
A            +        −       Total
+           172       8        180
−            12      68         80
Total       184      76        260

$$\chi^2_P = \frac{(f_{11}f_{22} - f_{12}f_{21})^2\, n}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}} = \frac{(172 \times 68 - 8 \times 12)^2 \times 260}{180 \times 80 \times 184 \times 76} = 173.74$$

Positive rate of B: 172/180 = 95.6% among samples positive by A, 12/80 = 15.0% among samples negative by A.

20

6.3.2 Comparison between two sample proportions

McNemar test

$$\chi^2 = \frac{(f_{12} - f_{21})^2}{f_{12} + f_{21}}$$

21

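H0: π1 = π2, H1: π1 ≠ π2, α = 0.05

When H0 is true, the theoretical frequencies of the two discordant cells b and c are

$$T_1 = T_2 = \frac{b + c}{2}$$

For large sample (b + c > 40)

$$\chi^2 = \frac{\left(b - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}} + \frac{\left(c - \frac{b+c}{2}\right)^2}{\frac{b+c}{2}} = \frac{(b - c)^2}{b + c}$$

If χ² > χ²_{0.05} (ν = 1), then reject H0.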

22

The Probability Expressions

                  Trt B
Trt A         +            −           Total
+           π11 (a)      π12 (b)       πr1
−           π21 (c)      π22 (d)       πr2
Total       πc1          πc2           1.0

H0: πc1 = πr1, H1: πc1 ≠ πr1

Since πc1 = π11 + π21 and πr1 = π11 + π12,

this test becomes: H0: π12 = π21, H1: π12 ≠ π21

23

Correction to McNemar test (f12 + f21 < 40)

$$\chi^2 = \frac{(|f_{12} - f_{21}| - 1)^2}{f_{12} + f_{21}}$$

$$\chi^2 = \frac{(|8 - 12| - 1)^2}{8 + 12} = 0.45$$
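The McNemar statistic, with and without the correction, depends only on the two discordant cells. A minimal sketch follows, using f12 = 8 and f21 = 12 from Table 6.4. The function name `mcnemar` and its `corrected` flag are assumptions for this example, and the P value uses the fact that a χ² variable with ν = 1 is the square of a standard normal variable.

```python
import math

def mcnemar(b, c, corrected=False):
    """McNemar chi-square (df = 1) and its P value from discordant counts b, c."""
    diff = abs(b - c)
    if corrected:
        diff = max(diff - 1, 0)        # continuity correction, never below zero
    chi2 = diff ** 2 / (b + c)
    # Upper-tail probability of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Table 6.4: f12 = 8, f21 = 12, so f12 + f21 = 20 < 40 and the correction applies
chi2_corr, p_corr = mcnemar(8, 12, corrected=True)
chi2_raw, p_raw = mcnemar(8, 12)
print(f"corrected chi2 = {chi2_corr:.2f}, P = {p_corr:.2f}")
print(f"uncorrected chi2 = {chi2_raw:.2f}, P = {p_raw:.2f}")
```

Only the discordant pairs enter the calculation; the 172 and 68 concordant pairs carry no information about the difference between the two methods.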

24

6.4 The χ² Test for R×C Contingency Table

Table 6.6 Blood types of patients suffering from different diseases

                           Blood type
Disease status         A        B        O       Total
Digestive ulcer       679      134      983      1796
Stomach cancer        416       84      383       883
Control              2625      570     2892      6087
Total                3720      788     4258      8766

25

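The statistic for hypothesis test

$$\chi^2_P = n\left(\sum_{i}\sum_{j} \frac{f_{ij}^2}{n_{ri}\, n_{cj}} - 1\right)$$

$$\chi^2_P = 8766\left(\frac{679^2}{1796 \times 3720} + \cdots + \frac{2892^2}{6087 \times 4258} - 1\right) = 40.543$$

$$\nu = (R - 1)(C - 1) = 4$$

$$\chi^2_{0.05} = 9.488$$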
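The R×C statistic for Table 6.6 can be recomputed directly from the cell counts. The sketch below implements the formula n(ΣΣ f²/(n_r n_c) − 1) in plain Python; the function name `chi2_rxc` is hypothetical, and the closed-form P value in the last step holds only for ν = 4.

```python
import math

def chi2_rxc(table):
    """Pearson chi-square and degrees of freedom for an R x C table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    s = sum(
        f * f / (row_tot[i] * col_tot[j])
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )
    chi2 = n * (s - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Table 6.6: rows = digestive ulcer, stomach cancer, control; columns = A, B, O
blood = [
    [679, 134, 983],
    [416, 84, 383],
    [2625, 570, 2892],
]
chi2, df = chi2_rxc(blood)

# For df = 4 only: P(chi-square > x) = exp(-x/2) * (1 + x/2)
p_value = math.exp(-chi2 / 2) * (1 + chi2 / 2)
print(f"chi2 = {chi2:.3f}, df = {df}, exceeds 9.488: {chi2 > 9.488}")
```

Because the statistic is far above the critical value 9.488, H0 of equal blood type distributions across the three groups is rejected at α = 0.05.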

26

6.4.2 Multiple comparison for R×C Table

group            +      −
I                …      …
II               …      …
III              …      …
IV               …      …
V                …      …
VI (control)     …      …

27

6.4.3 Measurement of association for R×C table

Table 6.11 Blood type of 1043 patients

                      MN system
ABO system       M       N       MN      Total
O                85     100     150       335
A                56      78     120       254
B                98     132     170       400
AB               23      25       6        54
Total           262     335     446      1043

28

Pearson contingency coefficient

$$r_P = \sqrt{\frac{\chi^2_P}{n + \chi^2_P}}$$

$$r_P = \sqrt{\frac{25.925}{1043 + 25.925}} = 0.156$$
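The contingency coefficient for Table 6.11 follows from the χ² statistic of the same table. A short sketch, assuming the statistic is computed from observed and theoretical frequencies in the usual way; the function names are introduced here for illustration.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square from observed and theoretical frequencies."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_tot[i] * col_tot[j] / n
            chi2 += (f - e) ** 2 / e
    return chi2, n

def contingency_coefficient(table):
    """Pearson contingency coefficient sqrt(chi2 / (n + chi2))."""
    chi2, n = pearson_chi2(table)
    return math.sqrt(chi2 / (n + chi2))

# Table 6.11: rows = O, A, B, AB; columns = M, N, MN
abo_mn = [
    [85, 100, 150],
    [56, 78, 120],
    [98, 132, 170],
    [23, 25, 6],
]
chi2, n = pearson_chi2(abo_mn)
r_p = contingency_coefficient(abo_mn)
print(f"chi2 = {chi2:.3f}, n = {n}, r_P = {r_p:.3f}")
```

The coefficient is bounded below 1 and close to 0 here, which indicates a weak association between the two blood group systems even though n is large.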

29

Pre-requisite for χ² test

By experience,

The theoretical frequencies should be greater than 5 in more than 4/5 of the cells;

The theoretical frequency in any cell should be greater than 1.

Otherwise, we need to use Fisher's exact test.
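The two rules of thumb can be turned into a small check on the theoretical frequencies of a table. This is a sketch of a literal reading of the rules as stated on this slide; the function names are hypothetical, and other textbooks phrase the thresholds slightly differently.

```python
def expected_counts(table):
    """Theoretical frequencies e_ij = (row total * column total) / n."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]

def meets_chi2_prerequisites(table):
    """True if more than 4/5 of the cells have e > 5 and every cell has e > 1."""
    flat = [e for row in expected_counts(table) for e in row]
    n_large = sum(1 for e in flat if e > 5)
    mostly_large = 5 * n_large > 4 * len(flat)   # integer form of n_large/len > 4/5
    none_tiny = all(e > 1 for e in flat)
    return mostly_large and none_tiny

print(meets_chi2_prerequisites([[81, 83], [19, 32]]))  # Table 6.2
print(meets_chi2_prerequisites([[6, 1], [1, 4]]))      # Table 6.14
```

For Table 6.14 all four theoretical frequencies are below 5, which is consistent with the footnote in the SPSS output and with the use of Fisher's exact test in Example 6.9.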
