lecture 6: dichotomous variables & chi-square testseckel/biostat2/slides/lecture6.pdf ·...
TRANSCRIPT
Today’s lecture
Dichotomous Variables
Comparing two proportions using 2 × 2 tables
Study Designs
Relative risks, odds ratios and their confidence intervals
Chi-square tests
2 / 36
Proportions and 2 × 2 tables
Population Success Failure Total
Population 1 x1 n1 − x1 n1
Population 2 x2 n2 − x2 n2
Total x1 + x2 n − (x1 + x2) n
Row 1 shows results of a binomial experiment with n1 trials
Row 2 shows results of a binomial experiment with n2 trials
3 / 36
How do we compare these proportions
Often, we want to compare p1, the probability of success inpopulation 1, to p2, the probability of success in population 2
Usually: “Success” = DiseasePopulation 1 = Treatment 1Population 2 = Treatment 2 (maybe placebo)
How do we compare these proportions?
We’ve talked about comparing proportions by looking at theirdifferenceBut sometimes we want to look at one proportion ‘relative’ tothe otherThis approach depends on the type of study the data camefrom
4 / 36
Cross-sectional study design
A ‘snapshot’ of a sample of the population
Commonly a survey/questionnaire or one time visit to assessinformation at a single point in time
Measures existing disease and exposure levels
Difficult to argue for a causal relation between exposure anddisease
Example: A group of Finnish residents between the ages of25 and 64 are mailed a questionnaire asking about their dailycoffee consumption habits and their systolic BP from theirlast visit to the doctor.
6 / 36
Cohort study design
Find a group of individuals without the disease and separateinto those
with the exposure andwithout the exposure
Follow over time and measure the disease rates in both groupsCompare the disease rate in the exposed and unexposedIf the exposure is harmful and associated with the disease, wewould expect to see higher rates of disease in the exposedgroup versus the unexposed groupAllows us to estimate the incidence of the disease(rate at which new disease cases occur)Example: A group of Finnish residents between the ages of25 and 64 and currently without hypertension are asked abouttheir current levels of daily coffee consumption. Theseindividuals are followed for 5 years at which point they areasked if they have been diagnosed with hypertension.
7 / 36
Case-control study design
Identify individualswith the disease of interest (case) andthose without the disease (control)
and then look retrospectively (at prior records) to find whatthe exposure levels were in these two groupsThe goal is to compare the exposure levels in the case andcontrol groupsIf the exposure is harmful and associated with the disease, wewould expect to see higher levels of exposure in the cases thanin the controlsVery useful when the disease is rareExample: A group of Finnish residents with hypertension anda group of Finnish residents without hypertension areidentified. These individuals are contacted and asked abouttheir level of daily coffee consumption in the two years prior todiagnosis with hypertension.
8 / 36
(Almost) Cohort Study Example
Aceh Vitamin A TrialExposure levels (Vitamin A) assigned at baseline and then thechildren are followed to determine survival in the two groups.
25,939 pre-school children in 450 Indonesian villages innorthern Sumatra
200,000 IU vitamin A given 1-3 months after the baselinecensus, and again at 6-8 months
Consider 23,682 out of 25,939 who were visited on apre-designed schedule
References:
1 Sommer A, Djunaedi E, Loeden A et al, Lancet 1986.
2 Sommer A, Zeger S, Statistics in Medicine 1991.
9 / 36
Trial Outcome: 12 month mortality
Alive at 12 months?Vit A No Yes Total
Yes 46 12,048 12,094No 74 11,514 11,588
Total 120 23,562 23,682
Does Vitamin A reduce mortality?
Calculate risk ratio or “relative risk”
Relative Risk abbreviated as RRCould also compare difference in proportions: called“attributable risk”
10 / 36
Relative Risk Calculation
Relative Risk =Rate with Vitamin A
Rate without Vitamin A
=p̂1
p̂2
=46/12, 094
74/11, 588
=0.0038
0.0064= 0.59
The death rate with vitamin A is 0.60 times (or 60% of) thedeath rate without vitamin A.
Equivalent interpretation: Vitamin A group had 40% lowermortality than without vitamin A group!
11 / 36
Confidence interval for RR
Step 1: Find the estimate of the log RR
log(p̂1
p̂2)
Step 2: Estimate the variance of the log(RR) as:
1− p1
n1p1+
1− p2
n2p2
Step 3: Find the 95% CI for log(RR):
log(RR)± 1.96 · SD(log RR) = (lower, upper)
Step 4: Exponentiate to get 95% CI for RR;
e(lower, upper)
12 / 36
Confidence interval for RRfrom Vitamin A Trial
95% CI for log relative risk is:
log(RR)± 1.96 · SD(log RR)
= log(0.59)± 1.96 ·√
0.9962
46+
0.9936
74= −0.53± 0.37
= (−0.90,−0.16)
95% CI for relative risk
(e−0.90, e−0.16) = (0.41, 0.85)
Does this confidence interval contain 1?
13 / 36
What if the data were from a case-control study?
Recall: in case-control studies, individuals are selected byoutcome status
Disease (mortality) status defines the population, andexposure status defines the success
p1 and p2 have a difference interpretation in a case-controlstudy than in a cohort study
Cohort:
p1 = P(Disease | Exposure)p2 = P(Disease | No Exposure)
Case-Control:
p1 = P(Exposure | Disease)p2 = P(Exposure | No Disease)
⇒ This is why we cannot estimate the relative risk fromcase-control data!
14 / 36
The Odds Ratio
The odds ratio measures association in Case-Control studies
Odds =P(event occurs)
P(event does not occur)= p
1−p
Remember the odds ratio is simply a ratio of odds!
OR = odds in group 1odds in group 2
OR = p̂1/(1−p̂1)p̂2/(1−p̂2)
15 / 36
Which p1 and p2 do we use?
We can actually calculate OR using either “case-control” or“cohort” set up
Using “case-control” p1 and p2 where we condition on diseaseor no disease
OR =(46/120)/(74/120)
(12048/23562)/(11514/23562)=
46/74
12048/11514= 0.59
Using “cohort” p1 and p2 where we condition on exposure orno exposure
OR =(46/12094)/(12048/12094)
(74/11588)/(11514/11588)=
46/12048
74/11514= 0.59
We get the same answer either way!
16 / 36
Bottom Line
The relative risk cannot be estimated from acase-control study
The odds ratio can be estimated from a case-control study
OR estimates the RR when the disease is rare in both groups
The OR is invariant to cohort or case-control designs, the RRis not
We are introducing the OR now because it is an essential ideain logistic regression
17 / 36
Confidence interval for OR
Step 1: Find the estimate of the log OR
log(p̂1/(1− p̂1)
p̂2/(1− p̂2))
Step 2: Estimate the variance of the log(OR) as:
1
n1p1+
1
n1q1+
1
n2p2+
1
n2q2
Step 3: Find the 95% CI for log(OR):
log(OR)± 1.96 · SD(log OR) = (lower, upper)
Step 4: Exponentiate to get 95% CI for OR;
e(lower, upper)
18 / 36
All kinds of χ2 tests
Test of Goodness of fit
Test of independence
Test of homogeneity or (no) association
All of these test statistics have a χ2 distribution under the null
19 / 36
The χ2 distribution
Derived from the normal distribution
χ21 = (
y − µσ
)2 = Z 2
χ2k = Z 2
1 + Z 22 + · · ·+ Z 2
k
where Z1, . . . ,Zk are all standard normal random variables
k denotes the degrees of freedom
A χ2k random variable has
mean = kvariance = 2k
Since a normal random variable can take on values in theinterval (−∞,∞), a chi-square random variable can take onvalues in the interval (0,∞)
20 / 36
χ2 Critical Values
We generally use only a one-sided test for the χ2 distributionArea under the curve to the right of the cutoff for each curveis 0.05Increasing critical value with increasing number of degrees offreedom
0 2 4 6 8 10 12
Chi−square value
df=1df=3df=5
22 / 36
But...we’ll use R
For χ2 random variables with degrees of freedom = df we’ll use
1 pchisq(a, df, lower.tail=F) to find P(χdf ≥ a) =?
P(ChiSq > a ) = ?
a
P(ChiSq > ? ) = b
?
2 qchisq(b, df, lower.tail=F) to find P(χdf ≥?) = b
24 / 36
χ2 Goodness-of-Fit Test
Determine whether or not a sample of observed values of somerandom variable is compatible with the hypothesis that the samplewas drawn from a population with a specified distributional form,i.e.
Normal
Binomial
Poisson
etc...
Here, the expected cell counts would be derived from thedistributional assumption under the null hypothesis
25 / 36
The χ2 test statistic
χ2 =∑k
i=1[ (Oi−Ei )2
Ei] where
Oi = i th observed frequency
Ei = i th expected frequency in the i th cell of a table
Degrees of freedom = (# categories− 1)
Note: This test is based on frequencies (cell counts) in a table, notproportions
26 / 36
Example: Handgun survey I
Survey 200 adults regarding handgun bill:
Statement: “I agree with a ban on handguns”
Four categories: Strongly agree, agree, disagree, stronglydisagree
Can one conclude that opinions are equally distributed overfour responses?
27 / 36
Example: Handgun survey II
Response (count)1 2 3 4
Stronglyagree disagree
Stronglyagree disagree
Responding (Oi ) 102 30 60 8Expected (Ei ) 50 50 50 50
χ2 =k∑
i=1
[(Oi − Ei )
2
Ei]
=(102− 50)2
50+
(30− 50)2
50+
(60− 50)2
50+
(8− 50)2
50= 99.36
df = 4− 1 = 3
28 / 36
Example: Handgun survey III
Critical value: χ24−1,0.05 = χ2
3,0.05 = 7.81
Since 99.36 > 7.81, we conclude that our observation wasunlikely by chance alone (p < 0.05)
Based on these data, opinions do not appear to be equallydistributed among the four responses
29 / 36
χ2 Test of Independence I
Test the null hypothesis that two criteria of classification areindependent
r × c contingency table
Criterion 11 2 3 · · · c Total
Criterion 2
1 n11 n12 n13 · · · n1c n1·2 n21 n22 n23 · · · n2c n2·3 n31 n32 n33 · · · n3c n3·
......
......
......
r nr1 nr2 nr3 · · · nrc nr ·Total n·1 n·2 n·3 · · · n·c n
30 / 36
χ2 Test of Independence II
Test statistic:
χ2 =k∑
i=1
[(Oi − Ei )
2
Ei]
Degrees of freedom = (r − 1)(c − 1)where r is the number of rows and c is number of columns
Assume the marginal totals are fixed
31 / 36
χ2 Test of Homogeneity (No association)
Test the null hypothesis that the samples are drawn frompopulations that are homogenous with respect to some factor
i.e. no association between group and factor
Same test statistic as χ2 test of independence
Test statistic:
χ2 =k∑
i=1
[(Oi − Ei )
2
Ei]
Degrees of freedom = (r − 1)(c − 1)where r is the number of rows and c is number of columns
32 / 36
Example: Treatment response I
χ2 Test of Homogeneity (No association)
Response to Treatment
Treatment Yes No Total
Observed NumbersA 37 13 50B 17 53 70
Total 54 66 120
Test H0 that there is no association between the treatmentand response
Calculate what numbers of “Yes” and “No” would beexpected assuming the probability of “Yes” was the same inboth treatment groups
Condition on total the number of “Yes” and “No” responses
33 / 36
Example: Treatment response II
Expected proportion with “Yes” response = 54120 = 0.45
Expected proportion with “No” response = 66120 = 0.55
Response to Treatment
Treatment Yes No Total
Observed A 37 (22.5) 13 (27.5) 50(Expected) B 17 (31.5) 53 (38.5) 70
Total 54 66 120
Get expected number of Yes responses on treatment A:
54
120× 50 = 0.45× 50 = 22.5
Using a similar approach you get the other expected numbers
34 / 36
Example: Treatment response III
Test statistic: χ2 =k∑
i=1
[(Oi − Ei )
2
Ei]
=(37− 22.5)2
22.5+
(13− 27.5)2
27.5
+(17− 31.5)2
31.5+
(53− 38.5)2
38.5= 29.1
Degrees of freedom = (r-1)(c-1) = (2-1)(2-1) = 1
Critical value for α = 0.001 is 10.82 so we see p<0.001
Reject the null hypothesis, and conclude that the treatmentgroups are not homogenous (similar) with respect to response
Response appears to be associated with treatment35 / 36