Stats 3000 Week 1 - Winter 2011


Upload: lauren-crosby

Post on 14-Jul-2015


Page 1: Stats 3000 Week 1 - Winter 2011

3. Inferences About Categorical Data

a) Goodness of Fit Ch. 6

b) Test of Independence Ch. 6

Page 2: Stats 3000 Week 1 - Winter 2011

Many experiments result in measurements that are qualitative or categorical rather than quantitative.

People classified by ethnic origin

Cars classified by color

M&M®s classified by type (plain or peanut)


Inferences about categorical data

Page 3: Stats 3000 Week 1 - Winter 2011

• Categorical variables: They have only a few values that are stated in words rather than in numbers.

• Count the number of observations in each level of the categorical variable (frequency distribution).

Page 4: Stats 3000 Week 1 - Winter 2011

Chi-square statistic to analyse count data

A. Goodness of fit (one categorical variable; one-dimensional table)

B. Test of independence (two categorical variables; two-dimensional or contingency tables)

Page 5: Stats 3000 Week 1 - Winter 2011

A goodness-of-fit test:

• is an inferential procedure used to determine whether a frequency distribution in the population follows a specific distribution (no preference, i.e. a uniform distribution; or no difference from another known population).

• is used to make inferences about the population frequency distribution from the sample frequency distribution.

• determines how well a frequency distribution for a sample fits the distribution predicted by the null hypothesis (about the population frequency distribution).

Page 6: Stats 3000 Week 1 - Winter 2011

We consider sample data consisting of observed frequency counts arranged in a single row or column (called a one-way frequency table or one-dimensional table).

We will test the claim that the observed frequency counts agree with some claimed distribution (specified by the null hypothesis), so that there is a good fit of the observed data with the claimed distribution.

Page 7: Stats 3000 Week 1 - Winter 2011

Expected frequencies: Frequency values specified by the null hypothesis, for each level of the categorical variable. They are the frequencies you would expect if the null hypothesis were true.

Observed frequencies: Actual frequency of responses obtained for each level of the categorical variable.

Page 8: Stats 3000 Week 1 - Winter 2011

Expected frequency counts.

The expected frequency count at each level of the categorical variable is equal to the sample size times the hypothesized proportion (in the population) at that level, as specified by the null hypothesis:

E = np

where E is the expected frequency count for a specific level of the categorical variable, n is the total sample size, and p is the hypothesized proportion of observations for that level.

Page 9: Stats 3000 Week 1 - Winter 2011

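Test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency at each level of the categorical variable, and the sum runs over all levels.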

Page 10: Stats 3000 Week 1 - Winter 2011

χ² is small if the 2 sets of frequencies are similar (Good Fit)

χ² is large if the 2 sets of frequencies are quite different (Bad Fit)

In the null hypothesis we specify a good fit (small χ²)

The alternative hypothesis predicts a χ² value GREATER than the value specified by the null

Goodness-of-fit hypothesis tests are always right-tailed.

Page 11: Stats 3000 Week 1 - Winter 2011

Example:

Expected proportions (no preference)

RED    GREEN  YELLOW  BLUE
.25    .25    .25     .25

We select n = 400 people

Expected frequencies (400 x 0.25)

RED    GREEN  YELLOW  BLUE
100    100    100     100
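The expected frequencies for this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the function name `expected_counts` are assumptions made for the example, not part of the course material.

```python
# Hypothesized proportions under the null hypothesis (no preference).
proportions = {"RED": 0.25, "GREEN": 0.25, "YELLOW": 0.25, "BLUE": 0.25}
n = 400  # total sample size


def expected_counts(n, proportions):
    """Return E = n * p for every level of the categorical variable."""
    return {level: n * p for level, p in proportions.items()}


expected = expected_counts(n, proportions)
print(expected)  # every colour gets an expected count of 100.0
```

The expected counts always add up to the sample size, which is a quick sanity check on the hypothesized proportions.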

Page 12: Stats 3000 Week 1 - Winter 2011

EXPECTED FREQUENCY at each level of the categorical variable: E = np

Where:

n = sample size

p = proportion for a given level of the categorical variable

Page 13: Stats 3000 Week 1 - Winter 2011

Observed Frequencies

RED    GREEN  YELLOW  BLUE
79     106    87      128

Does the distribution specified by the null correspond to the observed distribution?

Page 14: Stats 3000 Week 1 - Winter 2011

Category   O     E     (O-E)   (O-E)²   (O-E)²/E
Red        79    100   -21     441      4.41
Green      106   100     6      36      0.36
Yellow     87    100   -13     169      1.69
Blue       128   100    28     784      7.84

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 4.41 + 0.36 + 1.69 + 7.84 = 14.3$$
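The same computation can be checked in Python. The sketch below assumes the observed and expected counts are stored as plain lists in the order Red, Green, Yellow, Blue; the function name `chi_square` is chosen for illustration only.

```python
observed = [79, 106, 87, 128]     # Red, Green, Yellow, Blue
expected = [100, 100, 100, 100]   # E = np = 400 x 0.25 for each colour


def chi_square(observed, expected):
    """Sum (O - E)^2 / E over all levels of the categorical variable."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi2_obs = chi_square(observed, expected)
print(round(chi2_obs, 2))  # 14.3
```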

Page 15: Stats 3000 Week 1 - Winter 2011

A large disagreement between observed and expected values will lead to a large value of χ²

A close agreement between observed and expected values will lead to a small value of χ²

χ² is a measure of discrepancy between the 2 distributions.

Page 16: Stats 3000 Week 1 - Winter 2011

In our example, there are differences between the observed and expected frequencies.

Category   O     E
Red        79    100
Green      106   100
Yellow     87    100
Blue       128   100

But, even if Ho were true, sampling variability would guarantee some difference between the distributions.

Is a discrepancy of χ² = 14.3 (between the observed and expected distributions) likely to be observed if Ho is true?

Is observing a χ² = 14.3 too improbable an event for us to maintain the null?

Is the P-value of a χ² of 14.3 smaller than α?

Page 17: Stats 3000 Week 1 - Winter 2011

“If the P is low (small), the null must go.”

• If the P-value is small, reject the null hypothesis that the distribution is as claimed.

• A significantly large value of χ² (small/low P-value) will cause a rejection of the null hypothesis of no difference between the observed and the expected frequencies.

Page 18: Stats 3000 Week 1 - Winter 2011

Characteristics of the Chi-Square Distribution

1. It is not symmetric.

2. The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution.

3. As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric.

4. The values of χ² are nonnegative, i.e., the values of χ² are greater than or equal to 0.

Page 20: Stats 3000 Week 1 - Winter 2011

Degrees of freedom

RED GREEN YELLOW BLUE

degrees of freedom = K - 1

K = number of levels of the categorical variable

4 - 1 = 3

Page 21: Stats 3000 Week 1 - Winter 2011

P-Values

1. Obtain an approximate P-value by determining the area under the chi-square distribution with K - 1 degrees of freedom to the right of the test statistic, using Appendix 2, page 697

OR

2. Use computer software to obtain a more precise P-value

The area to the right of a χ² value is the P-value. It gives the probability of a value at least this extreme if the null hypothesis were true.

Page 22: Stats 3000 Week 1 - Winter 2011

χ²obs = 14.3

df = 3

The P-value of χ²obs is smaller than α, so we reject the null at the 0.01 level of significance.

Page 25: Stats 3000 Week 1 - Winter 2011

P-value = 1 - CDF.CHISQ(quant, df), where quant is the observed χ² value and df is the degrees of freedom
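As a rough cross-check of the software route, the same right-tail area can be computed with Python's standard library alone. The helper name `chi2_sf` is hypothetical. It assumes integer degrees of freedom and uses the recurrence Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Γ(a + 1) for the upper regularized gamma function, starting from the closed forms for 1 and 2 degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Right-tail area P(X >= x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # closed form for df = 1
    # Each pass raises the degrees of freedom by 2 (a = df / 2).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_value = chi2_sf(14.3, 3)   # goodness-of-fit example: chi-square = 14.3, df = 3
print(p_value < 0.01)        # True, so the null is rejected at the 0.01 level
```

The function plays the role of 1 - CDF in the formula above: it returns the area to the right of the observed statistic directly.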

Page 30: Stats 3000 Week 1 - Winter 2011

Finding P-values using EXCEL

Page 35: Stats 3000 Week 1 - Winter 2011

Relationships Among the χ² Test Statistic, P-Value, and Goodness-of-Fit

Page 36: Stats 3000 Week 1 - Winter 2011

STEPS

1. State the null and alternative hypotheses

Ho: Each of the K cell frequencies is as specified

Ha: At least one of the cell frequencies differs from the hypothesized value

2. Determine degrees of freedom

d.f. = K - 1

K = # of levels of the categorical variable

Page 37: Stats 3000 Week 1 - Winter 2011

3. Compute the expected frequencies

E = np

4. Compute χ²obs.

$$\chi^2_{obs} = \sum \frac{(O - E)^2}{E}$$

5. Obtain P-value (appendix or SPSS or Excel)

6. Reject Ho if P-value < α (.01, .05 or .1).

7. State the significance level
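The steps can be strung together in a short Python sketch. Note one substitution: instead of computing a P-value, this sketch compares χ²obs with a tabulated critical value, which leads to the same reject / do not reject decision as comparing the P-value with α. The function name `goodness_of_fit` and the small `CRITICAL` lookup (standard upper-tail chi-square table values for 1 to 4 degrees of freedom) are assumptions made for the illustration.

```python
# Upper-tail critical values of the chi-square distribution: CRITICAL[df][alpha].
CRITICAL = {
    1: {0.10: 2.706, 0.05: 3.841, 0.01: 6.635},
    2: {0.10: 4.605, 0.05: 5.991, 0.01: 9.210},
    3: {0.10: 6.251, 0.05: 7.815, 0.01: 11.345},
    4: {0.10: 7.779, 0.05: 9.488, 0.01: 13.277},
}


def goodness_of_fit(observed, proportions, alpha=0.05):
    """Return (chi2_obs, df, reject) for a one-way frequency table."""
    n = sum(observed)
    df = len(observed) - 1                       # step 2: d.f. = K - 1
    expected = [n * p for p in proportions]      # step 3: E = np
    chi2_obs = sum((o - e) ** 2 / e              # step 4: sum of (O - E)^2 / E
                   for o, e in zip(observed, expected))
    reject = chi2_obs > CRITICAL[df][alpha]      # steps 5-6, via critical value
    return chi2_obs, df, reject


chi2_obs, df, reject = goodness_of_fit([79, 106, 87, 128], [0.25] * 4, alpha=0.01)
print(round(chi2_obs, 2), df, reject)  # 14.3 3 True
```

The lookup only covers the three α levels and four df values listed, so anything outside that range would need the full table or software.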

Page 38: Stats 3000 Week 1 - Winter 2011

Requirements for the validity of the goodness-of-fit test

1. Random sampling

2. Independence of observations. Do not apply the test to repeated measures.

3. Large expected frequencies.

• No expected frequency is less than 1

• At most 20% of the expected frequencies are less than 5 (no more than 20% of the cells should have expected frequencies lower than 5).
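The third requirement is easy to check mechanically. The following is a minimal sketch, assuming the expected counts are available as a list; the function name `expected_counts_ok` is made up for this example.

```python
def expected_counts_ok(expected):
    """Check the large-expected-frequency requirement for a chi-square test."""
    # Rule 1: no expected frequency may be less than 1.
    if any(e < 1 for e in expected):
        return False
    # Rule 2: at most 20% of the expected frequencies may be less than 5.
    small = sum(1 for e in expected if e < 5)
    return small / len(expected) <= 0.20


print(expected_counts_ok([100, 100, 100, 100]))  # True
print(expected_counts_ok([0.5, 20, 20]))         # False: one cell is below 1
```

Random sampling and independence of observations depend on the study design and cannot be verified from the counts alone.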

Page 43: Stats 3000 Week 1 - Winter 2011

3. Inferences About Categorical Data

a) Goodness of Fit Ch. 6

b) Test of Independence Ch. 6

Page 44: Stats 3000 Week 1 - Winter 2011

TEST OF INDEPENDENCE

To determine if 2 categorical variables are independent of each other in the population.

Involves bivariate frequency distributions.

Page 45: Stats 3000 Week 1 - Winter 2011

Does the distribution of subjects across the levels of one variable change as a function of levels of the other variable?

Do you feel remakes of old films are a good idea?

• Yes

• No

• Depends on the subject matter

Page 46: Stats 3000 Week 1 - Winter 2011

Contingency Tables

A two-way table showing the association between 2 categorical variables.

One variable has r levels (rows)

The other variable has c levels (columns)

Contingency tables have at least two rows and at least two columns.

Page 47: Stats 3000 Week 1 - Winter 2011

Contingency Tables

Contingency tables are described by the number of levels of each categorical variable

E.g., a 2 x 2 table summarizes frequency counts for two dichotomous variables (e.g., male/female, smoker/nonsmoker)

The number of cells in the table is the product of the number of levels for each variable:

2 x 2 table = 4 cells

3 x 3 table = 9 cells

Page 48: Stats 3000 Week 1 - Winter 2011

The null hypothesis is that the variables are not associated; in other words, they are independent.

The alternative hypothesis is that the variables are associated, or dependent.

Page 49: Stats 3000 Week 1 - Winter 2011

Beer Type

Gender     N     D     L     Total
Male       20    40    20    80
Female     90    10    20    120
Total      110   50    40    200

Page 50: Stats 3000 Week 1 - Winter 2011

There is a relationship between gender and type of beer preferred (i.e., the variables are dependent)

Counts

           N     D     L     Total
Male       20    40    20    80
Female     90    10    20    120
Total      110   50    40    200

Row proportions

           N     D     L     Total
Male       .25   .5    .25   1
Female     .75   .08   .17   1
Total      .55   .25   .2    1

Page 51: Stats 3000 Week 1 - Winter 2011

No relationship (variables are independent)

Counts

           N     D     L     Total
Male       40    20    20    80
Female     60    30    30    120
Total      100   50    50    200

Row proportions

           N     D     L     Total
Male       .5    .25   .25   1
Female     .5    .25   .25   1
Total      .5    .25   .25   1
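The row proportions in the two beer tables can be recomputed with a short Python sketch. The nested-list layout (one inner list per gender, columns in the order N, D, L) and the name `row_proportions` are assumptions for the illustration.

```python
def row_proportions(table):
    """Divide each count by its row total."""
    return [[count / sum(row) for count in row] for row in table]


dependent = [[20, 40, 20], [90, 10, 20]]     # Male, Female (relationship)
independent = [[40, 20, 20], [60, 30, 30]]   # Male, Female (no relationship)

for row in row_proportions(dependent):
    print([round(p, 2) for p in row])        # the two rows differ
for row in row_proportions(independent):
    print([round(p, 2) for p in row])        # the two rows are identical
```

When the variables are independent, every row shows the same proportions as the column totals; when they are dependent, the rows differ from each other.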

Page 52: Stats 3000 Week 1 - Winter 2011

The idea behind the test of independence is to compare the actual frequency counts to the counts we would expect if the null hypothesis were true. If a significant difference between the actual counts and expected counts exists, we take this as evidence against the null hypothesis (which states that the variables are independent).

Page 53: Stats 3000 Week 1 - Winter 2011

A researcher wanted to investigate whether there was a relationship between personality type (introvert, extrovert) and choice of recreational activity (going to an amusement park, taking a one-day meditation retreat).

Page 54: Stats 3000 Week 1 - Winter 2011

We want to know whether personality and preferred recreational activity are dependent or independent so the hypotheses are:

H0: personality and preferred recreational activity are independent

H1: personality and preferred recreational activity are dependent

Page 55: Stats 3000 Week 1 - Winter 2011

Expected Frequencies in a Chi-Square Test for Independence

To find the expected frequency in a cell when performing a chi-square independence test, multiply the row total of the row containing the cell by the column total of the column containing the cell and divide this result by the grand total. That is,

$$E_{ij} = \frac{(\text{row total})(\text{column total})}{\text{total number}} = \frac{R_i C_j}{N}$$

where R_i is the total of row i, C_j is the total of column j, and N is the grand total.

Page 56: Stats 3000 Week 1 - Winter 2011

$$\frac{40 \times 55}{100} = 22$$

$$\frac{60 \times 55}{100} = 33$$

$$\frac{40 \times 45}{100} = 18$$

$$\frac{60 \times 45}{100} = 27$$

Page 57: Stats 3000 Week 1 - Winter 2011

The test statistic is

$$\chi^2 = \frac{(12 - 22)^2}{22} + \frac{(43 - 33)^2}{33} + \frac{(28 - 18)^2}{18} + \frac{(17 - 27)^2}{27} = 16.835$$
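The expected counts and the test statistic for this table can be reproduced together in Python. In the sketch below the table layout is an assumption inferred from the counts used in this example (rows: introvert, extrovert; columns: amusement park, retreat), and the function name `chi_square_independence` is hypothetical.

```python
def chi_square_independence(observed):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = (row total)(column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


observed = [[12, 28],   # introvert: amusement park, retreat
            [43, 17]]   # extrovert: amusement park, retreat
chi2, df, expected = chi_square_independence(observed)
print(round(chi2, 3), df, expected)  # 16.835 1 [[22.0, 18.0], [33.0, 27.0]]
```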

Page 58: Stats 3000 Week 1 - Winter 2011

What are the degrees of freedom?

d.f. = (R - 1)(C - 1)

where R is the number of rows and C is the number of columns in the contingency table.

(2 - 1)(2 - 1) = 1

Page 59: Stats 3000 Week 1 - Winter 2011

There are R = 2 rows and C = 2 columns, so we find the P-value using (2 - 1)(2 - 1) = 1 degree of freedom. The P-value is the area under the chi-square distribution with 1 degree of freedom to the right of χ² = 16.835, which is smaller than 0.005.

Page 60: Stats 3000 Week 1 - Winter 2011

P-value=0.000041
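For 1 degree of freedom the P-value has a simple closed form, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable: P(χ² ≥ x) = erfc(√(x/2)). The sketch below uses only Python's standard library; the function name `p_value_df1` is made up for this illustration, and the shortcut applies only when df = 1.

```python
import math


def p_value_df1(chi2_obs):
    """Right-tail area of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_obs / 2.0))


p = p_value_df1(16.835)
print(f"{p:.6f}")  # should agree with the P-value quoted above
```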

Page 61: Stats 3000 Week 1 - Winter 2011

Since the P-value is less than the level of significance α = 0.01, we reject the null hypothesis.

There is sufficient evidence to conclude that personality and activity preference are dependent at the α = 0.01 level of significance.

Page 62: Stats 3000 Week 1 - Winter 2011

Relative Risk Analysis after a significant Chi-Square test of independence (2 x 2 tables): risk ratio

Does the distribution of subjects across the levels of one variable change as a function of levels of the other variable?

Are introverts distributed across activities the same way as extroverts? Look at the proportions of row totals.

For people who prefer the amusement park, is the proportion of introverts the same as the proportion of extroverts?

Page 63: Stats 3000 Week 1 - Winter 2011

Is the proportion of people who prefer the amusement park the same for introverts and extroverts? Comparing 30% vs 71.7% (risk ratio): .3/.717 = 0.4186 = relative risk for preferring the amusement park

Relative risk: the probability of preferring the amusement park if you are an introvert (12/40) is 0.4186 times the probability of preferring the amusement park if you are an extrovert (43/60). Is this a significant difference?

Page 64: Stats 3000 Week 1 - Winter 2011

Is the proportion of people who prefer the retreat the same for introverts and extroverts? Comparing 70% vs 28.3% (risk ratio): .7/.283 = 2.47 = relative risk for preferring the retreat

Relative risk: the probability of preferring the retreat if you are an introvert (28/40) is 2.47 times the probability of preferring the retreat if you are an extrovert (17/60). Is this a significant difference?
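Both risk ratios can be verified with a small Python sketch. The function name `relative_risk` and its argument order are assumptions for the illustration; the counts are the ones quoted in this example.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the proportion in group A to the proportion in group B."""
    return (events_a / total_a) / (events_b / total_b)


# Group A = introverts (n = 40), group B = extroverts (n = 60).
rr_park = relative_risk(12, 40, 43, 60)      # preferring the amusement park
rr_retreat = relative_risk(28, 40, 17, 60)   # preferring the retreat
print(round(rr_park, 4), round(rr_retreat, 2))  # 0.4186 2.47
```

A relative risk of 1 would mean the two groups have the same proportion; values further from 1, in either direction, indicate a larger difference between the groups.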

Page 75: Stats 3000 Week 1 - Winter 2011

$$\frac{12/40}{43/60} = 0.4186$$

Page 76: Stats 3000 Week 1 - Winter 2011

Note:

• Exhibit 6.1d from page 150 of the textbook is incorrect. Disregard.

• Office hours: Tuesdays 3-4pm and Thursdays 12-1pm

Exercises

Exercises 12 and 13: WebCT exercise folder

Readings for next class:

Chapter 9

Pages 246-247

Sections: 9.1, 9.2, 9.5, 9.7, 9.3, 9.4