unit 9: inferences for proportions and count...

23
1 12/15/2008 Unit 9 - Stat 571 - Ramón V. León 1 Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 12/15/2008 Unit 9 - Stat 571 - Ramón V. León 2 Large Sample Confidence Interval for Proportion ( ) ( ) ˆ ˆ Note that (0,1) and (0,1) if is large ˆˆ / / ˆ ˆ ( 1- , 10 and 10) p p p p N N n pq n pq n q p np nq = Confidence interval for p: 2 2 ˆˆ ˆˆ ˆ ˆ pq pq p z p p z n n α α + ( ) 2 2 ˆ 1 ˆˆ p p P z z pq n α α α It follows that:

Upload: trantuyen

Post on 28-May-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

1

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 1

Unit 9: Inferences for Proportions and Count Data

Statistics 571: Statistical MethodsRamón V. León

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 2

Large Sample Confidence Interval for Proportion

( ) ( )ˆ ˆNote that (0,1) and (0,1) if is large

ˆ ˆ/ /ˆ ˆ ( 1- , 10 and 10)

p p p pN N n

pq n pq nq p np nq

− −≈ ≈

= ≥ ≥

Confidence interval for p:

2 2ˆ ˆ ˆ ˆˆ ˆpq pqp z p p zn nα α− ≤ ≤ +

( )2 2

ˆ1

ˆ ˆp p

P z zpq nα α α

⎛ ⎞−− ≤ ≤ ≈ −⎜ ⎟⎜ ⎟⎝ ⎠

It follows that:

2

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 3

A Better Confidence Interval for Proportion

CI for p:

( )2 2

ˆ1

p pP z z

pq nα α α⎛ ⎞−− ≤ ≤ ≈ −⎜ ⎟⎜ ⎟⎝ ⎠

Use this probability statement

Solve for p using quadratic equation

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 4

3

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 5

CI for Proportion in JMP

Value column has two categories. Can you imagine a situation where one would have more categories in this column?

= 0.45 x 800= 0.55 x 800

Arbitrary choice of names

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 6

CI for Proportion in JMP

4

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 7

Sample Size Determination for a Confidence Interval for Proportion

Want (1-α)-level two-sided CI:

2

22

ˆ ˆˆ where E is the margin of error. Then .

ˆ ˆSolving for gives (Formula 9.4)

pqp E E zn

zn n pq

E

α

α

± =

⎛ ⎞= ⎜ ⎟⎝ ⎠

22

1 1 1Largest value of so conservative sample size is:2 2 4

1 (Formula 9.5)4

pq

zn

⎛ ⎞⎛ ⎞= =⎜ ⎟⎜ ⎟⎝ ⎠⎝ ⎠

⎛ ⎞= ⎜ ⎟⎝ ⎠

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 8

Example 9.2: Presidential Poll

Threefold increase in precision requires ninefold increase in sample size

21.96 1 96040.01 4

n ⎛ ⎞= =⎜ ⎟⎝ ⎠

1067.11 9 9604× =

5

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 9

Largest Sample Hypothesis Test on Proportion

0 0 1 0: vs. : H p p H p p= ≠

Best test statistics: 0

0 0

p̂ pzp q n−

=

•Dual relationship between CI and test of hypothesis holds if thebetter confidence interval is used.

•There is an exact test that can be used when the sample size is small (given in Section 9.1.3). We do not cover it.

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 10

Basketball Problem: z-test

2.182

6

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 11

Test for Proportion in JMP: Baseball Problem

“Value” has two categories.

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 12

Test for Proportion in JMP: Basketball Problem

7

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 13

Sample Size for Z-Test of Proportion0 1 0: vs. :oH p p H p p≤ >

0

1 0

1 02

0 0 1 1

2

Suppose that the power for rejecting must be atleast 1- when the true proportion is .Let . Then

Replace by for two-sided test sample size.

Hp p p

p p

z p q z p qn

z z

α β

α α

βδ

δ

= >= −

⎡ ⎤+= ⎢ ⎥⎢ ⎥⎣ ⎦

0

0 0

Test based on:p̂ pzp q n−

=

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 14

Example 9.4: Pizza Testing

0

1

: Can't tell two pizzas Apart, : Can tell pizzas apart

HH

.10, We wants =.25 when 0.5 0.1 pα β= = ±

( )( ) ( )( )

2

2 0 0 1 1

21.645 .5 .5 0.675 .6 .4

132.98 1330.1

z p q z p qn α β

δ

⎡ ⎤+= ⎢ ⎥⎢ ⎥⎣ ⎦

⎡ ⎤+⎢ ⎥= = ≈⎢ ⎥⎣ ⎦

8

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 15

Multinomial Test of Proportions

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 16

Multinomial Test in JMP

Observed count does not exhibit significant deviation from the uniform model.

Note that one could have gotten confidence intervals

9

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 17

Comparing Two Proportion: Independent Sample Design

1 1 1 1 2 2 2 2

1 2 1 2

1 1 2 2

1 2

If , , , 10, thenˆ ˆ ( ) (0,1)

ˆ ˆ ˆ ˆ

n p n q n p n qp p p pZ N

p q p qn n

≥− − −

= ≈+

1 1 2 2 1 1 2 21 2 2 1 2 1 2 2

1 2 1 2

ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆˆ ˆ ˆ ˆp q p q p q p qp p z p p p p zn n n nα α− − + ≤ − ≤ − + +

Confidence Interval:

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 18

Test for Equality of Proportions (Large n) Independent Sample Design

0 1 2 1 1 2: vs. :H p p H p p= ≠

1 2

1 2

1 1 2 2

1 2 1 2

ˆ ˆTest statitics:

1 1ˆ ˆ

ˆ ˆˆwhere

p pz

pqn n

n p n p x ypn n n n

−=

⎛ ⎞+⎜ ⎟

⎝ ⎠+ +

= =+ +

•There is small sample test called Fisher’s exact test. See JMP output latter.•See Example 9.5 for application to the Salk Polio Vaccine Trial•There a test for Matched Pair Design in Section 9.2.2. Please read Example 9.9 to see its application for testing the effectiveness of presidential debates. Do voters change their minds about candidates because of debates?

10

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 19

Example 9.6 –Comparing Two Leukemia Therapies

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 20

Test for Equality of

Proportions in JMP:

Example 9.6

11

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 21

Example 9.6 JMP Output

Res

ult

0.00

0.25

0.50

0.75

1.00

Prednisone Prednisone + VCR

Drug Group

Failure

Success

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 22

Test for Equality of Proportions

in JMP

ModelErrorC. TotalN

Source 1 61 62 63

DF 2.600496 26.575470 29.175966

-LogLike0.0891

RSquare (U)

Likelihood RatioPearson

Test 5.201 5.507

ChiSquare 0.0226 0.0189

Prob>ChiSq

LeftRight2-Tail

Fisher's Exact Test 0.9958 0.0254 0.0323

Prob

Tests

Recall that the P-value of the two-sided z-test was calculated to be 0.019

z2 = (-2.347)2

Less significantresult

12

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 23

Inferences for Two-Way Count Data

Sampling Model 1: Multinomial Model (Total Sample Size Fixed)Sample of 901 from a single population that is then cross-classified

The null hypothesis is that X and Y are independent:

0 . .: ( , ) ( ) ( ) for all i, jij i jH p P X i Y j P X i P Y j p p= = = = = = =

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 24

Sampling Model 1 (Total Sample Size Fixed)

1 1

62 206 62 206Estimated Expected Frequency = 901 14.18901 901 901

np p• •

×⎛ ⎞⎛ ⎞ = =⎜ ⎟⎜ ⎟⎝ ⎠⎝ ⎠

=(Cell 1,1)

13

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 25

Chi-Square Statistics

22

1

( )ci i

i i

n ee

χ=

−=∑

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 26

Chi-Square Test Critical Value

2

29,.05

2

The d.f. for this statistics is (4-1)(4-1) = 9. Since 16.919

the calculated 11.989 is not sufficiently large to reject the hypothesis of independenceat .05 level

χχ

χ

α

=

=

=

In general df = (r-1)(c-1) where c is the number of columns and r is the number of rows.

14

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 27

JMP Analysis

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 28

JMP Analysis

Shows no significance

15

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 29

JMP Analysis

Note that most of the contribution to the chi-square statistics comes from the cornercells. Lack of significancein the chi-square statisticsis the result of the lowcontribution to the chi-square statistic comingfrom the center cells.

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 30

JMP Analysis

Restricting our chi-square analysis to the corner cellsshows a strong relationshipbetween income and levelof satisfaction.

16

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 31

Product Multinomial Model:Row Totals Fixed

Homework Problem 9.28

Notice that it does not make any sense to say that for the population 21 out of 80 returned wallets came from Big Cities.

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 32

Product Multinomial Model:Row Totals Fixed

Sampling Model 2: Product MultinomialTotal number of patients in each drug group is fixed.

•The null hypothesis is that the probability of column response (success or failure) is the same, regardless of the row population:

0 : ( | ) jH P Y j X i p= = =

17

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 33

Chi-Square Statistics

22

1

( )ci i

i i

n ee

χ=

−=∑

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 34

JMP Analysis

18

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 35

JMP Analysis

( )22

Recall Slide 19:

2.347 5.5084z = − =

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 36

Remarks About Chi-Square Test

• The distribution of the chi-square statistics under the null hypothesis is approximately chi-square only when the sample sizes are large– The rule of thumb is that all expected cell counts should be greater

than 1 and – No more than 1/5th of the expected cell counts should be less than

5.

• Combine sparse cell (having small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.

19

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 37

Odds Ratio as a Measure of Association for a 2x2 Table

11 12

21 22

Sampling Model I: Multinomialp pp p

ψ =

The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2, hence the name “odds ratio”

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 38

Odds Ratio as a Measure of Association for a 2x2 Table

( )( )

1 1

2 2

Sampling Model II: Product Multinomial11

p pp p

ψ−

=−

The two column outcomes are labeled as “success” and “failure,” then ψ is the odds of success for the row 1 population vs. the odds of success for the row 2 population

20

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 39

Odds Ratio as a Measure of Association

for a 2x2 Table

11 21

11 12 21 22

12 22

11 12 21 22

11 12 11 22

21 22

14 3814 3821 42ˆ

7 4 7 421 42

ˆ

n nn n n n

n nn n n n

n n n nn n n

ψ

ψ

⎛ ⎞ ⎛ ⎞⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟+ + ⎛ ⎞ ⎛ ⎞⎝ ⎠ ⎝ ⎠ ⎝ ⎠ ⎝ ⎠⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟= = = ⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ ⎝ ⎠ ⎝ ⎠⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟ ⎝ ⎠ ⎝ ⎠+ + ⎝ ⎠ ⎝ ⎠⎝ ⎠ ⎝ ⎠⎝ ⎠ ⎝ ⎠

= =12 21

14 4 0.210538 7

Confidence Inteval: [0.053,0.831]n

×= =

×

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 40

Case-Control Studies: The Odds Ratio Approximates the Relative Risk if the Disease is Rare

21

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 41

JMP Output for Case-Control Study

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 42

How to Do It in JMP

22

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 43

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 44

23

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 45

12/15/2008 Unit 9 - Stat 571 - Ramón V. León 46

Do You Need to Know More?578 Categorical Data Analysis (3) Log-linear analysis of multidimensional contingency tables. Logistic regression. Theory, applications, and use of statistical software. Prereq: 1 yr graduate-level statistics, regression analysis and analysis of variance, or consent of instructor. Sp

Reference: “An Introduction to Categorical Data Analysis”by Alan Agresti. Wiley.