stats 3000 week 1 - winter 2011

3. Inferences About Categorical Data

a) Goodness of Fit Ch. 6

b) Test of Independence Ch. 6

Many experiments result in measurements that are qualitative or categorical rather than quantitative.

People classified by ethnic origin

Cars classified by color

M&M®s classified by type (plain or peanut)


Inferences about categorical data

• Categorical variables have only a few values, which are stated in words rather than in numbers.

• Count the number of observations in each level of the categorical variable (frequency distribution).

Chi-square statistic to analyse count data

A. Goodness of fit (one categorical variable; one-dimensional table)

B. Test of independence (two categorical variables; two-dimensional or contingency tables)

A goodness-of-fit test is an inferential procedure used to determine whether a frequency distribution in the population follows a specific distribution (for example, no preference, i.e. a uniform distribution, or no difference from another known population).

It is used to make inferences about the population frequency distribution from the sample frequency distribution.

It determines how well the frequency distribution for a sample fits the distribution predicted by the null hypothesis (about the population frequency distribution).

We consider sample data consisting of observed frequency counts arranged in a single row or column (called a one-way frequency table or one-dimensional table).

We will test the claim that the observed frequency counts agree with some claimed distribution (specified by the null hypothesis), so that there is a good fit of the observed data with the claimed distribution.

Expected frequencies: Frequency values specified by the null hypothesis for each level of the categorical variable. They are the frequencies you would expect if the null hypothesis were true.

Observed frequencies: Actual frequency of responses obtained for each level of the categorical variable.

Expected frequency counts

The expected frequency count at each level of the categorical variable equals the sample size times the proportion hypothesized (in the population) for that level by the null hypothesis:

$$E = np$$

where E is the expected frequency count for a specific level of the categorical variable, n is the total sample size, and p is the hypothesized proportion of observations for that level.

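Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency at each level of the categorical variable.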

χ² is small if the two sets of frequencies are similar (good fit).

χ² is large if the two sets of frequencies are quite different (bad fit).

The null hypothesis specifies a good fit (small χ²).

The alternative hypothesis predicts a χ² value GREATER than the value expected under the null.

Goodness-of-fit hypothesis tests are always right-tailed.

Example:

Expected proportions under the null hypothesis (no preference):

                RED    GREEN   YELLOW   BLUE
p               .25    .25     .25      .25

We select n = 400 people.

Expected frequencies at each level of the categorical variable, E = np = 400 × 0.25 (n = sample size, p = proportion for a given level of the categorical variable):

                RED    GREEN   YELLOW   BLUE
E               100    100     100      100

Observed frequencies:

                RED    GREEN   YELLOW   BLUE
O               79     106     87       128

Does the distribution specified by the null correspond to the observed distribution?

Category   O      E      (O-E)   (O-E)²   (O-E)²/E
Red        79     100    -21     441      4.41
Green      106    100    6       36       0.36
Yellow     87     100    -13     169      1.69
Blue       128    100    28      784      7.84

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 4.41 + 0.36 + 1.69 + 7.84 = 14.3$$
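The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic (E = np for each level, then the sum of (O - E)²/E); the variable names are illustrative and not part of the course material.

```python
# Colour-preference example: H0 says each colour is preferred by 25% of people.
categories = ["Red", "Green", "Yellow", "Blue"]
observed = [79, 106, 87, 128]            # observed frequencies (O)
proportions = [0.25, 0.25, 0.25, 0.25]   # hypothesized proportions (p)

n = sum(observed)                        # total sample size
expected = [n * p for p in proportions]  # E = np for each level

# Contribution of each category, (O - E)^2 / E, and their sum
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for name, o, e, c in zip(categories, observed, expected, contributions):
    print(f"{name:<7} O={o:<4} E={e:<6.1f} (O-E)^2/E={c:.2f}")
print(f"chi-square = {chi_square:.1f}")
```

The printed contributions match the last column of the table, and their sum is the χ² value of 14.3.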


A large disagreement between observed and expected values will lead to a large value of χ².

A close agreement between observed and expected values will lead to a small value of χ².

χ² is a measure of discrepancy between the two distributions.

In our example, there are differences between the observed and expected frequencies.

Category   O      E
Red        79     100
Green      106    100
Yellow     87     100
Blue       128    100

But even if H0 were true, sampling variability would guarantee some difference between the distributions.

Is a discrepancy of 14.3 (between the observed and expected distributions) likely to be observed if H0 is true?

Is observing χ² = 14.3 too improbable an event for us to maintain the null?

Is the P-value of a χ² of 14.3 smaller than α?

“If the P is low (small), the null must go.”

• If the P-value is small, reject the null hypothesis that the distribution is as claimed.

• A significantly large value of χ² (small/low P-value) will cause a rejection of the null hypothesis of no difference between the observed and the expected frequencies.

Characteristics of the Chi-Square Distribution

1. It is not symmetric.

2. The shape of the chi-square distribution depends on the degrees of freedom, just like Student's t-distribution.

3. As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric.

4. The values of χ² are nonnegative, i.e., the values of χ² are greater than or equal to 0.

Degrees of freedom

d.f. = K - 1

K = number of levels of the categorical variable

In the example: 4 - 1 = 3

P-Values

1. Obtain an approximate P-value by determining the area under the chi-square distribution with K - 1 degrees of freedom to the right of the test statistic, using Appendix 2 (page 697)

OR

2. Use computer software to obtain a more precise P-value

The area to the right of a χ² value is its P-value. It gives the probability of a value at least this extreme if the null hypothesis were true.

χ²obs = 14.3

df = 3

The P-value of χ²obs is smaller than α, so we reject the null at the 0.01 level of significance.

P-value = 1 - CDF.CHISQ(quant, df) (SPSS)
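For readers without SPSS or Excel at hand, the right-tail area can also be sketched in plain Python. The helper below, chi2_sf, is a hypothetical name introduced here for illustration; it assumes integer degrees of freedom and uses the standard closed-form recurrence for the upper tail of the chi-square distribution. Statistical packages provide equivalent functions, and those should be preferred for real work.

```python
import math

def chi2_sf(x, df):
    """Right-tail area P(X >= x) of a chi-square distribution.

    Illustrative helper, assuming df is a positive integer.
    Uses Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1) with h = x / 2,
    starting from Q(1/2, h) = erfc(sqrt(h)) or Q(1, h) = exp(-h).
    """
    if not isinstance(df, int) or df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0  # the whole distribution lies to the right of 0
    h = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(h))  # odd df: start from df = 1
    else:
        a, q = 1.0, math.exp(-h)             # even df: start from df = 2
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

p_value = chi2_sf(14.3, 3)  # colour-preference example: chi-square = 14.3, df = 3
print(f"P-value = {p_value:.4f}")
print("reject H0 at alpha = 0.01" if p_value < 0.01 else "do not reject H0")
```

The result falls between 0.001 and 0.005, consistent with the conclusion that the null is rejected at the 0.01 level.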

Finding P-values using EXCEL

[Figure: Relationships among the χ² test statistic, P-value, and goodness of fit]

STEPS

1. State the null and alternative hypotheses

H0: Each of the K cell frequencies is as specified

Ha: At least one of the cell frequencies differs from the hypothesized value

2. Determine the degrees of freedom

d.f. = K - 1

K = number of levels of the categorical variable

3. Compute the expected frequencies

E = np

4. Compute χ²obs

$$\chi^2_{obs} = \sum \frac{(O - E)^2}{E}$$

5. Obtain the P-value (appendix, SPSS or Excel)

6. Reject H0 if P-value < α (.01, .05 or .1)

7. State the significance level

Requirements for the validity of the goodness-of-fit test

1. Random sampling

2. Independence of observations. Do not apply the test to repeated measures.

3. Large expected frequencies

• No expected frequency is less than 1

• At most 20% of the expected frequencies are less than 5 (no more than 20% of the cells should have expected frequencies lower than 5)
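The expected-frequency requirement lends itself to a quick automated check. The function below is a minimal sketch of the two rules as stated on this slide; the name expected_counts_ok is an assumption made for illustration, and the check says nothing about random sampling or independence, which have to be judged from the study design.

```python
def expected_counts_ok(expected):
    """Return True if no expected count is below 1 and at most 20% are below 5."""
    if not expected:
        raise ValueError("expected must contain at least one count")
    any_below_1 = any(e < 1 for e in expected)
    share_below_5 = sum(1 for e in expected if e < 5) / len(expected)
    return (not any_below_1) and share_below_5 <= 0.20

print(expected_counts_ok([100, 100, 100, 100]))   # True: colour example
print(expected_counts_ok([0.5, 20, 30, 40, 50]))  # False: one count below 1
print(expected_counts_ok([3, 4, 20, 30, 40]))     # False: 40% of counts below 5
```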


TEST OF INDEPENDENCE

To determine whether two categorical variables are independent of each other in the population.

Involves bi-variate frequency distributions.

Does the distribution of subjects across the levels of one variable change as a function of levels of the other variable?

Example question: Do you feel re-makes of old films are a good idea?

• Yes

• No

• Depends on the subject matter

Contingency Tables

A two-way table showing the association between two categorical variables.

One variable has r levels (rows)

The other variable has c levels (columns)

Contingency tables have at least two rows and at least two columns.

Contingency tables are described by the number of levels of each categorical variable.

E.g., a 2 × 2 table summarizes frequency counts for two dichotomous variables (e.g., male/female, smoker/nonsmoker)

The number of cells in the table is the product of the number of levels for each variable:

2 × 2 table = 4 cells

3 × 3 table = 9 cells

The null hypothesis is that the variables are not associated; in other words, they are independent.

The alternative hypothesis is that the variables are associated, or dependent.

There is a relationship between gender and type of beer preferred (i.e., the variables are dependent)

Observed frequencies:

                 Beer type
Gender      N      D      L      Total
Male        20     40     20     80
Female      90     10     20     120
Total       110    50     40     200

Proportions of row totals:

                 Beer type
Gender      N      D      L      Total
Male        .25    .5     .25    1
Female      .75    .08    .17    1
Total       .55    .25    .2     1
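The row proportions for the two beer tables in this section can be computed as follows. This is a small illustrative sketch; the function name row_proportions and the list layout (one inner list per gender, columns in the order N, D, L) are assumptions made here.

```python
def row_proportions(table):
    """Divide every count by its row total, giving the distribution within each row."""
    return [[count / sum(row) for count in row] for row in table]

# Rows: Male, Female. Columns: beer types N, D, L.
dependent = [[20, 40, 20], [90, 10, 20]]    # table where a relationship exists
independent = [[40, 20, 20], [60, 30, 30]]  # table with no relationship

for title, table in [("Relationship", dependent), ("No relationship", independent)]:
    print(title)
    for label, props in zip(["Male", "Female"], row_proportions(table)):
        print(f"  {label:<7}", [round(p, 2) for p in props])
```

When the variables are independent, both rows show the same proportions; when they are dependent, the row distributions differ.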

No relationship (the variables are independent)

Observed frequencies:

                 Beer type
Gender      N      D      L      Total
Male        40     20     20     80
Female      60     30     30     120
Total       100    50     50     200

Proportions of row totals:

                 Beer type
Gender      N      D      L      Total
Male        .5     .25    .25    1
Female      .5     .25    .25    1
Total       .5     .25    .25    1

The idea behind the test of independence is to compare the actual frequency counts to the counts we would expect if the null hypothesis were true. If a significant difference between the actual counts and the expected counts exists, we take this as evidence against the null hypothesis (which states that the variables are independent).

A researcher wanted to investigate whether there was a relationship between personality type (introvert, extrovert) and choice of recreational activity (going to an amusement park, taking a one-day meditation retreat).

We want to know whether personality and preferred recreational activity are dependent or independent, so the hypotheses are:

H0: personality and preferred recreational activity are independent

H1: personality and preferred recreational activity are dependent

Expected Frequencies in a Chi-Square Test for Independence

To find the expected frequency in a cell when performing a chi-square independence test, multiply the row total of the row containing the cell by the column total of the column containing the cell and divide this result by the grand total. That is,

$$E_{ij} = \frac{(\text{row total})(\text{column total})}{\text{total number}} = \frac{R_i C_j}{N}$$

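Expected frequencies for the four cells (row totals 40 and 60, column totals 55 and 45, N = 100):

$$\frac{40 \times 55}{100} = 22 \qquad \frac{60 \times 55}{100} = 33 \qquad \frac{40 \times 45}{100} = 18 \qquad \frac{60 \times 45}{100} = 27$$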
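The row-total times column-total rule can be sketched in Python as below. The 2 × 2 layout of the observed counts (introverts and extroverts as rows, amusement park and retreat as columns) is an assumption reconstructed from the counts used elsewhere in these slides, and expected_counts is an illustrative name.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Assumed layout: rows = introvert, extrovert; columns = amusement park, retreat.
observed = [[12, 28],
            [43, 17]]

expected = expected_counts(observed)
for label, row in zip(["Introvert", "Extrovert"], expected):
    print(label, row)
```

The four values agree with the hand calculation: 22 and 18 for the row with total 40, 33 and 27 for the row with total 60.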

The test statistic is

$$\chi^2 = \frac{(12 - 22)^2}{22} + \frac{(43 - 33)^2}{33} + \frac{(28 - 18)^2}{18} + \frac{(17 - 27)^2}{27} = 16.835$$

What are the degrees of freedom?

d.f. = (R - 1)(C - 1)

where R is the number of rows and C is the number of columns in the contingency table.

(2 - 1)(2 - 1) = 1

There are R = 2 rows and C = 2 columns, so we find the P-value using (2 - 1)(2 - 1) = 1 degree of freedom. The P-value is the area under the chi-square distribution with 1 degree of freedom to the right of χ² = 16.835, which is smaller than 0.005.

P-value = 0.000041
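As a rough check of these numbers, the statistic, degrees of freedom and P-value can be recomputed in Python. The sketch relies on the fact that, for 1 degree of freedom only, the right-tail area of the chi-square distribution equals erfc(sqrt(χ²/2)); for other degrees of freedom a table or statistical software is needed. Variable names are illustrative.

```python
import math

# Observed and expected counts, paired cell by cell as in the test statistic.
observed = [12, 43, 28, 17]
expected = [22, 33, 18, 27]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

rows, cols = 2, 2
df = (rows - 1) * (cols - 1)

# Valid for df = 1 only: P(X >= x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

alpha = 0.01
print(f"chi-square = {chi_square:.3f}, df = {df}, P-value = {p_value:.6f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```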

Since the P-value is less than the level of significance α = 0.01, we reject the null hypothesis.

There is sufficient evidence to conclude that personality and activity preference are dependent at the α = 0.01 level of significance.

Relative Risk Analysis after a significant Chi-Square test of independence (2 × 2 tables): risk ratio

Does the distribution of subjects across the levels of one variable change as a function of levels of the other variable?

Are introverts distributed across activities the same way as extroverts? Look at the proportions of row totals.

For people who prefer the amusement park, is the proportion of introverts the same as the proportion of extroverts?

Is the proportion of people who prefer the amusement park the same for introverts and extroverts? Comparing 30% vs 71.7% (risk ratio): .3/.717 = 0.4186 = relative risk for preferring the amusement park

$$\frac{12/40}{43/60} = 0.4186$$

Relative risk: the probability of preferring the amusement park if you are an introvert (12/40) is 0.4186 times the probability of preferring the amusement park if you are an extrovert (43/60). Is this a significant difference?

Is the proportion of people who prefer the retreat the same for introverts and extroverts? Comparing 70% vs 28.3% (risk ratio): .7/.283 = 2.47 = relative risk for preferring the retreat

Relative risk: the probability of preferring the retreat if you are an introvert (28/40) is 2.47 times the probability of preferring the retreat if you are an extrovert (17/60). Is this a significant difference?
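The two risk ratios can be verified with a short Python sketch. The function name relative_risk and its argument names are assumptions made for illustration; the counts are the ones quoted on this slide.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the proportion in group A to the proportion in group B."""
    return (events_a / total_a) / (events_b / total_b)

# Group A = introverts (n = 40), group B = extroverts (n = 60).
rr_park = relative_risk(12, 40, 43, 60)     # preferring the amusement park
rr_retreat = relative_risk(28, 40, 17, 60)  # preferring the retreat

print(f"relative risk, amusement park: {rr_park:.4f}")
print(f"relative risk, retreat:        {rr_retreat:.2f}")
```

A ratio below 1 means the outcome is less likely for introverts than for extroverts, and a ratio above 1 means it is more likely.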

Note:

• Exhibit 6.1d on page 150 of the textbook is incorrect. Disregard it.

• Office hours: Tuesdays 3-4 pm and Thursdays 12-1 pm

Exercises

Exercises 12 and 13: WebCT exercise folder

Readings for next class:

Chapter 9

Pages 246-247

Sections: 9.1, 9.2, 9.5, 9.7, 9.3, 9.4
