lecture 6: dichotomous variables & chi-square testseckel/biostat2/slides/lecture6.pdf ·...

36
Lecture 6: Dichotomous Variables & Chi-square tests Sandy Eckel [email protected] 29 April 2008 1 / 36

Upload: nguyendan

Post on 15-Feb-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

Lecture 6: Dichotomous Variables &Chi-square tests

Sandy [email protected]

29 April 2008

1 / 36

Today’s lecture

Dichotomous Variables

Comparing two proportions using 2 × 2 tables

Study Designs

Relative risks, odds ratios and their confidence intervals

Chi-square tests

2 / 36

Proportions and 2 × 2 tables

Population Success Failure Total

Population 1 x1 n1 − x1 n1

Population 2 x2 n2 − x2 n2

Total x1 + x2 n − (x1 + x2) n

Row 1 shows results of a binomial experiment with n1 trials

Row 2 shows results of a binomial experiment with n2 trials

3 / 36

How do we compare these proportions

Often, we want to compare p1, the probability of success inpopulation 1, to p2, the probability of success in population 2

Usually: “Success” = DiseasePopulation 1 = Treatment 1Population 2 = Treatment 2 (maybe placebo)

How do we compare these proportions?

We’ve talked about comparing proportions by looking at theirdifferenceBut sometimes we want to look at one proportion ‘relative’ tothe otherThis approach depends on the type of study the data camefrom

4 / 36

Observational epidemiologic study designs

Cross-sectional

Cohort

Case-control

5 / 36

Cross-sectional study design

A ‘snapshot’ of a sample of the population

Commonly a survey/questionnaire or one time visit to assessinformation at a single point in time

Measures existing disease and exposure levels

Difficult to argue for a causal relation between exposure anddisease

Example: A group of Finnish residents between the ages of25 and 64 are mailed a questionnaire asking about their dailycoffee consumption habits and their systolic BP from theirlast visit to the doctor.

6 / 36

Cohort study design

Find a group of individuals without the disease and separateinto those

with the exposure andwithout the exposure

Follow over time and measure the disease rates in both groupsCompare the disease rate in the exposed and unexposedIf the exposure is harmful and associated with the disease, wewould expect to see higher rates of disease in the exposedgroup versus the unexposed groupAllows us to estimate the incidence of the disease(rate at which new disease cases occur)Example: A group of Finnish residents between the ages of25 and 64 and currently without hypertension are asked abouttheir current levels of daily coffee consumption. Theseindividuals are followed for 5 years at which point they areasked if they have been diagnosed with hypertension.

7 / 36

Case-control study design

Identify individualswith the disease of interest (case) andthose without the disease (control)

and then look retrospectively (at prior records) to find whatthe exposure levels were in these two groupsThe goal is to compare the exposure levels in the case andcontrol groupsIf the exposure is harmful and associated with the disease, wewould expect to see higher levels of exposure in the cases thanin the controlsVery useful when the disease is rareExample: A group of Finnish residents with hypertension anda group of Finnish residents without hypertension areidentified. These individuals are contacted and asked abouttheir level of daily coffee consumption in the two years prior todiagnosis with hypertension.

8 / 36

(Almost) Cohort Study Example

Aceh Vitamin A TrialExposure levels (Vitamin A) assigned at baseline and then thechildren are followed to determine survival in the two groups.

25,939 pre-school children in 450 Indonesian villages innorthern Sumatra

200,000 IU vitamin A given 1-3 months after the baselinecensus, and again at 6-8 months

Consider 23,682 out of 25,939 who were visited on apre-designed schedule

References:

1 Sommer A, Djunaedi E, Loeden A et al, Lancet 1986.

2 Sommer A, Zeger S, Statistics in Medicine 1991.

9 / 36

Trial Outcome: 12 month mortality

Alive at 12 months?Vit A No Yes Total

Yes 46 12,048 12,094No 74 11,514 11,588

Total 120 23,562 23,682

Does Vitamin A reduce mortality?

Calculate risk ratio or “relative risk”

Relative Risk abbreviated as RRCould also compare difference in proportions: called“attributable risk”

10 / 36

Relative Risk Calculation

Relative Risk =Rate with Vitamin A

Rate without Vitamin A

=p̂1

p̂2

=46/12, 094

74/11, 588

=0.0038

0.0064= 0.59

The death rate with vitamin A is 0.60 times (or 60% of) thedeath rate without vitamin A.

Equivalent interpretation: Vitamin A group had 40% lowermortality than without vitamin A group!

11 / 36

Confidence interval for RR

Step 1: Find the estimate of the log RR

log(p̂1

p̂2)

Step 2: Estimate the variance of the log(RR) as:

1− p1

n1p1+

1− p2

n2p2

Step 3: Find the 95% CI for log(RR):

log(RR)± 1.96 · SD(log RR) = (lower, upper)

Step 4: Exponentiate to get 95% CI for RR;

e(lower, upper)

12 / 36

Confidence interval for RRfrom Vitamin A Trial

95% CI for log relative risk is:

log(RR)± 1.96 · SD(log RR)

= log(0.59)± 1.96 ·√

0.9962

46+

0.9936

74= −0.53± 0.37

= (−0.90,−0.16)

95% CI for relative risk

(e−0.90, e−0.16) = (0.41, 0.85)

Does this confidence interval contain 1?

13 / 36

What if the data were from a case-control study?

Recall: in case-control studies, individuals are selected byoutcome status

Disease (mortality) status defines the population, andexposure status defines the success

p1 and p2 have a difference interpretation in a case-controlstudy than in a cohort study

Cohort:

p1 = P(Disease | Exposure)p2 = P(Disease | No Exposure)

Case-Control:

p1 = P(Exposure | Disease)p2 = P(Exposure | No Disease)

⇒ This is why we cannot estimate the relative risk fromcase-control data!

14 / 36

The Odds Ratio

The odds ratio measures association in Case-Control studies

Odds =P(event occurs)

P(event does not occur)= p

1−p

Remember the odds ratio is simply a ratio of odds!

OR = odds in group 1odds in group 2

OR = p̂1/(1−p̂1)p̂2/(1−p̂2)

15 / 36

Which p1 and p2 do we use?

We can actually calculate OR using either “case-control” or“cohort” set up

Using “case-control” p1 and p2 where we condition on diseaseor no disease

OR =(46/120)/(74/120)

(12048/23562)/(11514/23562)=

46/74

12048/11514= 0.59

Using “cohort” p1 and p2 where we condition on exposure orno exposure

OR =(46/12094)/(12048/12094)

(74/11588)/(11514/11588)=

46/12048

74/11514= 0.59

We get the same answer either way!

16 / 36

Bottom Line

The relative risk cannot be estimated from acase-control study

The odds ratio can be estimated from a case-control study

OR estimates the RR when the disease is rare in both groups

The OR is invariant to cohort or case-control designs, the RRis not

We are introducing the OR now because it is an essential ideain logistic regression

17 / 36

Confidence interval for OR

Step 1: Find the estimate of the log OR

log(p̂1/(1− p̂1)

p̂2/(1− p̂2))

Step 2: Estimate the variance of the log(OR) as:

1

n1p1+

1

n1q1+

1

n2p2+

1

n2q2

Step 3: Find the 95% CI for log(OR):

log(OR)± 1.96 · SD(log OR) = (lower, upper)

Step 4: Exponentiate to get 95% CI for OR;

e(lower, upper)

18 / 36

All kinds of χ2 tests

Test of Goodness of fit

Test of independence

Test of homogeneity or (no) association

All of these test statistics have a χ2 distribution under the null

19 / 36

The χ2 distribution

Derived from the normal distribution

χ21 = (

y − µσ

)2 = Z 2

χ2k = Z 2

1 + Z 22 + · · ·+ Z 2

k

where Z1, . . . ,Zk are all standard normal random variables

k denotes the degrees of freedom

A χ2k random variable has

mean = kvariance = 2k

Since a normal random variable can take on values in theinterval (−∞,∞), a chi-square random variable can take onvalues in the interval (0,∞)

20 / 36

χ2 Family of Distributions

21 / 36

χ2 Critical Values

We generally use only a one-sided test for the χ2 distributionArea under the curve to the right of the cutoff for each curveis 0.05Increasing critical value with increasing number of degrees offreedom

0 2 4 6 8 10 12

Chi−square value

df=1df=3df=5

22 / 36

χ2 Table

23 / 36

But...we’ll use R

For χ2 random variables with degrees of freedom = df we’ll use

1 pchisq(a, df, lower.tail=F) to find P(χdf ≥ a) =?

P(ChiSq > a ) = ?

a

P(ChiSq > ? ) = b

?

2 qchisq(b, df, lower.tail=F) to find P(χdf ≥?) = b

24 / 36

χ2 Goodness-of-Fit Test

Determine whether or not a sample of observed values of somerandom variable is compatible with the hypothesis that the samplewas drawn from a population with a specified distributional form,i.e.

Normal

Binomial

Poisson

etc...

Here, the expected cell counts would be derived from thedistributional assumption under the null hypothesis

25 / 36

The χ2 test statistic

χ2 =∑k

i=1[ (Oi−Ei )2

Ei] where

Oi = i th observed frequency

Ei = i th expected frequency in the i th cell of a table

Degrees of freedom = (# categories− 1)

Note: This test is based on frequencies (cell counts) in a table, notproportions

26 / 36

Example: Handgun survey I

Survey 200 adults regarding handgun bill:

Statement: “I agree with a ban on handguns”

Four categories: Strongly agree, agree, disagree, stronglydisagree

Can one conclude that opinions are equally distributed overfour responses?

27 / 36

Example: Handgun survey II

Response (count)1 2 3 4

Stronglyagree disagree

Stronglyagree disagree

Responding (Oi ) 102 30 60 8Expected (Ei ) 50 50 50 50

χ2 =k∑

i=1

[(Oi − Ei )

2

Ei]

=(102− 50)2

50+

(30− 50)2

50+

(60− 50)2

50+

(8− 50)2

50= 99.36

df = 4− 1 = 3

28 / 36

Example: Handgun survey III

Critical value: χ24−1,0.05 = χ2

3,0.05 = 7.81

Since 99.36 > 7.81, we conclude that our observation wasunlikely by chance alone (p < 0.05)

Based on these data, opinions do not appear to be equallydistributed among the four responses

29 / 36

χ2 Test of Independence I

Test the null hypothesis that two criteria of classification areindependent

r × c contingency table

Criterion 11 2 3 · · · c Total

Criterion 2

1 n11 n12 n13 · · · n1c n1·2 n21 n22 n23 · · · n2c n2·3 n31 n32 n33 · · · n3c n3·

......

......

......

r nr1 nr2 nr3 · · · nrc nr ·Total n·1 n·2 n·3 · · · n·c n

30 / 36

χ2 Test of Independence II

Test statistic:

χ2 =k∑

i=1

[(Oi − Ei )

2

Ei]

Degrees of freedom = (r − 1)(c − 1)where r is the number of rows and c is number of columns

Assume the marginal totals are fixed

31 / 36

χ2 Test of Homogeneity (No association)

Test the null hypothesis that the samples are drawn frompopulations that are homogenous with respect to some factor

i.e. no association between group and factor

Same test statistic as χ2 test of independence

Test statistic:

χ2 =k∑

i=1

[(Oi − Ei )

2

Ei]

Degrees of freedom = (r − 1)(c − 1)where r is the number of rows and c is number of columns

32 / 36

Example: Treatment response I

χ2 Test of Homogeneity (No association)

Response to Treatment

Treatment Yes No Total

Observed NumbersA 37 13 50B 17 53 70

Total 54 66 120

Test H0 that there is no association between the treatmentand response

Calculate what numbers of “Yes” and “No” would beexpected assuming the probability of “Yes” was the same inboth treatment groups

Condition on total the number of “Yes” and “No” responses

33 / 36

Example: Treatment response II

Expected proportion with “Yes” response = 54120 = 0.45

Expected proportion with “No” response = 66120 = 0.55

Response to Treatment

Treatment Yes No Total

Observed A 37 (22.5) 13 (27.5) 50(Expected) B 17 (31.5) 53 (38.5) 70

Total 54 66 120

Get expected number of Yes responses on treatment A:

54

120× 50 = 0.45× 50 = 22.5

Using a similar approach you get the other expected numbers

34 / 36

Example: Treatment response III

Test statistic: χ2 =k∑

i=1

[(Oi − Ei )

2

Ei]

=(37− 22.5)2

22.5+

(13− 27.5)2

27.5

+(17− 31.5)2

31.5+

(53− 38.5)2

38.5= 29.1

Degrees of freedom = (r-1)(c-1) = (2-1)(2-1) = 1

Critical value for α = 0.001 is 10.82 so we see p<0.001

Reject the null hypothesis, and conclude that the treatmentgroups are not homogenous (similar) with respect to response

Response appears to be associated with treatment35 / 36

Lecture 6 summary

Dichotomous variables

cohort studies - relative riskcase-control - odds ratios

Chi-square tests

Next time, we’ll talk about analysis of variance (ANOVA)

36 / 36