HSRP 734: Advanced Statistical Methods
May 22, 2008
Course Website
• Course site in Public Health Sciences (PHS) website:
http://www.phs.wfubmc.edu/public/edu_statMeth.cfm
Course Syllabus
HSRP 734: Advanced Statistical Methods
• Categorical Data Analysis
• Logistic Regression
• Survival analysis
• Cox PH regression
What is Categorical Data Analysis?
• Statistical analysis of data that are non-continuous
• Includes dichotomous, ordinal, nominal and count outcomes
• Examples: Disease incidence, Tumor response
What is Logistic Regression?
A statistical method used to model dichotomous (binary) outcomes, though it is not limited to them, using predictor variables.
What is Logistic Regression?
• Used when the research method is focused on whether or not an event occurred, rather than when it occurred
• Time course information is not used
Logistic Regression quantifies “effects” using Odds Ratios
• Does not model the outcome directly; modeling the outcome directly (as in linear regression) leads to effect estimates quantified by means (i.e., differences in means)
• Estimates of effect are instead quantified by “Odds Ratios”
The Logistic Regression Model
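$$\ln\left(\frac{P(Y)}{1-P(Y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K$$
• $Y$: dichotomous outcome
• $X_1, \ldots, X_K$: predictor variables
• $\ln\left(\frac{P(Y)}{1-P(Y)}\right)$ is the log(odds) of the outcome.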
The Logistic Regression Model
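$$\ln\left(\frac{P(Y)}{1-P(Y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K$$
• $\beta_0$: intercept
• $\beta_1, \ldots, \beta_K$: model coefficients
• $\ln\left(\frac{P(Y)}{1-P(Y)}\right)$ is the log(odds) of the outcome.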
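As a rough illustration of how the model is read, the sketch below converts a log(odds) back to a probability and checks that exponentiating a coefficient gives the odds ratio for a one-unit change in that predictor. The coefficient values and the function names are hypothetical, chosen only for this example; they do not come from the course.

```python
import math

def predicted_probability(intercept, coefs, x):
    """P(Y = 1) implied by ln(P / (1 - P)) = b0 + b1*x1 + ... + bK*xK."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical intercept and coefficient (assumed values, for illustration only).
b0, b1 = -2.0, 0.7

p_unexposed = predicted_probability(b0, [b1], [0])  # predictor X1 = 0
p_exposed = predicted_probability(b0, [b1], [1])    # predictor X1 = 1

# The ratio of the two odds equals exp(b1), whatever the intercept is.
odds_ratio = odds(p_exposed) / odds(p_unexposed)
print(round(p_unexposed, 3), round(p_exposed, 3), round(odds_ratio, 3))
```

The odds ratio does not depend on the intercept, which is one reason logistic regression reports effects as odds ratios rather than as differences in probabilities.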
A Short Review
Philosophy of Science
• Idea: We posit a paradigm and attempt to falsify that paradigm.
• Science progresses faster via attempting to falsify a paradigm than attempting to corroborate a paradigm.
(Thomas S. Kuhn. 1970. The Structure of Scientific Revolutions. University of Chicago Press.)
Philosophy of Science
• The fastest way to progress in science under this paradigm of falsification is through perturbation experiments.
• In epidemiology,
– we are often unable to do perturbation experiments
– it becomes a process of accumulating evidence
• Statistical testing provides a rigorous data-driven framework for falsifying hypotheses
The P-Value
• What is the probability of having gotten a sample mean as extreme as 4.8 if the null hypothesis were true (H0: μ = 0)?
• P-value = probability of obtaining a result as or more “extreme” than observed if H0 were true.
• Consider for the above example, if p = 0.0089 (less than a 9 out of 1,000 chance)
• What if p = 0.0501 (about a 5 out of 100 chance)?
Hypothesis Testing
1. Set up a null and alternative hypothesis
2. Calculate test statistic
3. Calculate the p-value for the test statistic
4. Based on p-value make a decision to reject or fail to reject the null hypothesis
5. Make your conclusion
Hypothesis Testing
Your decision vs. Truth

                              Truth: H0 True                          Truth: H0 False
Decision: Fail to reject H0   Correct Decision                        Incorrect Decision: Type II Error (β)
Decision: Reject H0           Incorrect Decision: Type I Error (α)    Correct Decision (Power)
Hypothesis Testing
• Type I error (α) = the probability of rejecting the null hypothesis given that H0 is true (the significance level of a test).
• Type II error (β) = the probability of not rejecting the null hypothesis given that H0 is false (not rejecting when you should have).
• Power = 1 - β
Power
• The power of a test is the probability of rejecting a false null hypothesis under certain assumed differences between the populations.
• We like a study that has “high” power (usually at least 80%).
• Any difference can become significant if N is large enough
• Even if there is statistical significance, is there clinical significance?
Controversy around HT and p-value
“A methodological culprit responsible for spurious theoretical conclusions”
(Meehl, 1967; see Greenwald et al, 1996)
“The p-value is a measure of the credibility of the null hypothesis. The smaller the p-value is, the less likely one feels the null hypothesis can be true.”
HT and p-value
• “It cannot be denied that many journal editors and investigators use p-value < 0.05 as a yardstick for the publishability of a result.”
• “This is unfortunate because not only p-value, but also the sample size and magnitude of a physically important difference determine the quality of an experimental finding.”
HT and p-value
• Consider a new cancer drug that possibly shows significant improvements.
• Should we consider a p = 0.01 the same as a p = 0.00001 ?
HT and p-value
• “[We] endorse the reporting of estimation statistics (such as effect sizes, variabilities, and confidence intervals) for all important hypothesis tests.”
– Greenwald et al (1996)
Reporting Statistics
• Reporting I. Statistical Methods
The changes in blood pressure after oral contraceptive use were calculated for 10 women. A paired t-test was used to determine if there was a significant change in blood pressure, and a 95% confidence interval was calculated for the mean blood pressure change (after-before).
Reporting Statistics
• Reporting II. Results
Blood pressure measurements increased on average by 4.8 mmHg, with a standard deviation of 4.57 mmHg. The 95% confidence interval for the mean change was (1.53, 8.07).
There was evidence that blood pressure measurements after oral contraceptive use were significantly higher than before oral contraceptive use (p = 0.009).
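The reported interval can be reproduced from the summary statistics alone. The sketch below is a minimal check, assuming the critical value 2.262 (the 97.5th percentile of a t distribution with 9 degrees of freedom, taken from a standard t-table rather than from the slides).

```python
import math

# Summary statistics quoted in the reporting example.
n = 10              # number of women
mean_change = 4.8   # mean change in blood pressure (mmHg), after minus before
sd_change = 4.57    # standard deviation of the changes (mmHg)

# Assumption: critical value from a standard t-table, 9 df, two-sided 95%.
t_crit = 2.262

standard_error = sd_change / math.sqrt(n)
t_statistic = mean_change / standard_error  # paired t-test statistic for H0: mean change = 0

ci_low = mean_change - t_crit * standard_error
ci_high = mean_change + t_crit * standard_error
print(round(t_statistic, 2), (round(ci_low, 2), round(ci_high, 2)))
```

The interval excludes 0, which agrees with the decision to reject the null hypothesis at the 5% level.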
HSRP 734 Lecture 1:
Measures of Disease Occurrence and Association
Objectives:
1. Define and compute the measures of disease occurrence and association
2. Discuss differences in study design and their implications for inference
Example
CT images rated by radiologist (Rosner p. 65)

            Rated as normal   Rated as questionable   Rated as abnormal
Normal      39                6                       13
Abnormal    5                 2                       44

Each cell: count (cell %), row %, col %

            Rated as normal           Rated as questionable   Rated as abnormal           Total
Normal      39 (35.8%), 67%, 88.6%    6 (5.5%), 10.3%, 75%    13 (11.9%), 22.4%, 22.8%    58
Abnormal    5 (4.6%), 9.8%, 11.4%     2 (1.8%), 3.9%, 25%     44 (40.4%), 86.3%, 77.2%    51
Total       44                        8                       57                          109
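The cell, row, and column percentages in the CT rating table can be recomputed from the raw counts. The sketch below is one way to do it; the dictionary layout and the helper name `percentages` are illustrative choices, not part of the original example.

```python
# Raw counts from the CT rating example; columns are the radiologist's ratings.
ratings = ["normal", "questionable", "abnormal"]
counts = {
    "Normal": [39, 6, 13],
    "Abnormal": [5, 2, 44],
}

grand_total = sum(sum(row) for row in counts.values())
col_totals = [sum(row[j] for row in counts.values()) for j in range(len(ratings))]

def percentages(row_label, j):
    """Return (cell %, row %, column %) for one cell of the table."""
    n = counts[row_label][j]
    row_total = sum(counts[row_label])
    return (100 * n / grand_total, 100 * n / row_total, 100 * n / col_totals[j])

for row_label in counts:
    for j, rating in enumerate(ratings):
        cell, row, col = percentages(row_label, j)
        print(f"{row_label:8s} rated {rating:12s} cell {cell:5.1f}%  row {row:5.1f}%  col {col:5.1f}%")
```

Row percentages condition on the row label, and column percentages condition on the radiologist's rating, so the two answer different questions about the same counts.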
Basic Probability
• Conditional probability
– Restrict yourself to a “subspace” of the sample space
Male Female
Young 20% 10%
Old 35% 35%
Conditional probabilities
• Probability that something occurs (event B), given that event A has occurred (conditioning on A)
• Pr(B given that A is true) = Pr(B | A)
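Using the age-by-sex table from the Basic Probability slide, a conditional probability is the joint probability divided by the probability of the conditioning event. The sketch below assumes the four percentages are joint probabilities over the whole population (they sum to 100%); the function name is an illustrative choice.

```python
# Joint probabilities from the age-by-sex table (they sum to 1).
joint = {
    ("Young", "Male"): 0.20,
    ("Young", "Female"): 0.10,
    ("Old", "Male"): 0.35,
    ("Old", "Female"): 0.35,
}

def prob_age_given_sex(age, sex):
    """Pr(age | sex) = Pr(age and sex) / Pr(sex)."""
    p_sex = sum(p for (a, s), p in joint.items() if s == sex)
    return joint[(age, sex)] / p_sex

# Restricting to the "Male" subspace: 0.20 / (0.20 + 0.35).
print(round(prob_age_given_sex("Young", "Male"), 3))
```

Conditioning on "Male" restricts attention to the first column of the table and rescales it so that the column sums to 1.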
Conditional probabilities
• Categorical data analysis
– odds ratio = ratio of the odds of two conditional probabilities
• Conditional probabilities in survival analysis are of the form:
Pr(live till time t1+t2 | survive up till time t1)
Basic probability
• Example: automatic blood-pressure machine
• 84% of hypertensives and 23% of normotensives are classified as hypertensive
• Given: 20% of the adult population is hypertensive
• We now know:
Pr(machine says hypertensive | truly hypertensive) = 0.84
• What is Pr(truly hypertensive | machine says hypertensive)?
Basic probability
Machine diagnosed as hypertensive (D)
Hypertension (H) Yes No
Yes
No
Basic probability
• Positive predictive value — Probability that a randomly selected subject from the population actually has the disease given that the screening test is positive
• Negative predictive value — Probability that a randomly selected subject from the population is actually disease free given that the screening test is negative
Basic probability
• Sensitivity — Probability that the procedure is positive given that the person has the disease
• Specificity — Probability that the procedure is negative given that the person does not have the disease
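For the automatic blood-pressure machine example, the predictive values follow from sensitivity, specificity, and prevalence through Bayes' rule. The sketch below uses only the three figures quoted in that example; the variable names are illustrative.

```python
# Figures quoted for the automatic blood-pressure machine.
sensitivity = 0.84          # Pr(machine says hypertensive | truly hypertensive)
false_positive_rate = 0.23  # Pr(machine says hypertensive | truly normotensive)
prevalence = 0.20           # Pr(truly hypertensive) in the adult population

specificity = 1.0 - false_positive_rate

# Total probability that the machine says "hypertensive".
p_positive = sensitivity * prevalence + false_positive_rate * (1.0 - prevalence)

# Positive predictive value: Pr(truly hypertensive | machine says hypertensive).
ppv = sensitivity * prevalence / p_positive

# Negative predictive value: Pr(truly normotensive | machine says normotensive).
npv = specificity * (1.0 - prevalence) / (1.0 - p_positive)

print(round(ppv, 3), round(npv, 3))
```

Under these figures fewer than half of the people flagged by the machine are truly hypertensive (PPV is about 0.48), even though the sensitivity is 84%, because most of the population is normotensive.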
Review examples 3.26, 3.27, and 3.28 in Rosner
• Measures of Occurrence
– Measures using proportions (e.g., prevalence, odds)
– Rates (e.g., incidence, cumulative incidence)
• Measures of Association
– Based on odds (e.g., odds ratio)
– Based on probabilities (e.g., risk ratio)
Absolute Measures of Disease Occurrence
• Point prevalence = proportion of cases at a given point in time
– cross-sectional measure
• Incidence = number of new cases within a specified time interval
– prospective measure
Absolute Measures of Disease Occurrence
• Example:
Consider four individuals diagnosed with lung cancer
• Proportion of death = 2/4 = 0.5
• Rate of death = 2/(3+5+2+1) = 0.18 deaths per person-year

Person   Years of Follow-up   Status
1        3                    Dead
2        5                    Alive
3        2                    Alive
4        1                    Dead
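The difference between a proportion and a rate in the lung cancer example comes down to the denominator: persons versus person-years. A minimal sketch with the four follow-up records (the list-of-tuples layout is an illustrative choice):

```python
# (person, years of follow-up, status) for the four lung cancer patients.
records = [
    (1, 3, "Dead"),
    (2, 5, "Alive"),
    (3, 2, "Alive"),
    (4, 1, "Dead"),
]

deaths = sum(1 for _, _, status in records if status == "Dead")
persons = len(records)
person_years = sum(years for _, years, _ in records)

proportion_dead = deaths / persons   # numerator is a subset of the denominator
death_rate = deaths / person_years   # events per unit of person-time

print(proportion_dead, round(death_rate, 2))
```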
Absolute Measures of Disease Occurrence
• Two kinds of quantities used in measurement:
– Proportion: the numerator is a subset of the denominator, e.g., prevalence
– Rate: # events which occur during a time interval divided by the total amount of time, e.g., incidence rate
Absolute Measures of Disease Occurrence
Remarks:
1) Diseases of long duration tend to have a higher prevalence
2) Incidence tends to be more informative than prevalence for causal understanding of the disease etiology
3) Incidence is more difficult to measure & more expensive
Absolute Measures of Disease Occurrence
4) Prevalence & incidence can be influenced by the evolution of screening procedures and diagnostic tests
5) Both incidence and prevalence rates may be age dependent
Absolute Measures of Disease Occurrence
• Odds = ratio of P(event occurs) to the P(event does not occur).
Example:
The probability of a disease is 0.20.
Thus, the odds are 0.20/(1-0.20) = 0.20/0.80 =0.25 = 1:4
That is, for every one person with an event, there are 4 people without the event.
$$\text{odds} = \frac{p}{1-p}$$
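Odds and probabilities carry the same information and can be converted in both directions. The short sketch below reproduces the 0.20 example; the helper names are illustrative.

```python
def odds_from_probability(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Inverse conversion: p = odds / (1 + odds)"""
    return odds / (1.0 + odds)

disease_odds = odds_from_probability(0.20)             # 0.25, i.e. 1:4
recovered_p = probability_from_odds(disease_odds)      # back to 0.20
print(round(disease_odds, 2), round(recovered_p, 2))
```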
Absolute Measures of Disease Occurrence
• Risk of disease in time interval [t0, t1)
P(t) = Pr(developing disease in an interval of length t = t1 - t0, given disease free at the start of the interval)
• Average Prevalence = Incidence x Duration
duration = average duration of disease after onset
Measures of Disease Association
• So far we have discussed
– Prevalence
– Incidence rate
– Cumulative incidence rate
– Risk of disease within an interval t
• All absolute measures
• Next, relative measures and associations
– Exposed (E) versus Unexposed (Ē)
Measures of Disease Association
• Population versus sample
– Probabilities (population) are denoted by symbols such as
• p1 = P(disease within the exposed population)
– Sample estimates are denoted by p̂1
Measures of Disease Association
                  Exposed (E)   Not Exposed (Ē)   Total
Disease (D)       a             b                 n1
No Disease (D̄)    c             d                 n0
Total             m1            m0                n
Conditional distribution
                  Exposed (E)   Not Exposed (Ē)   Margin
Disease (D)       p1
No Disease (D̄)    1 - p1
Margin            1
Conditional distribution
                  Exposed (E)   Not Exposed (Ē)   Margin
Disease (D)                     p0
No Disease (D̄)                  1 - p0
Margin                          1
Measures of Association
• Odds ratio: Odds of disease among exposed divided by odds of disease among unexposed
$$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}$$
Measures of Association
OR > 1 implies a positive association between disease and exposure
OR < 1 implies a negative association between disease and exposure
OR for disease = OR for exposure
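The symmetry between the two odds ratios can be checked directly with cell counts. The derivation below assumes a 2x2 table in which a = exposed with disease, b = unexposed with disease, c = exposed without disease, and d = unexposed without disease.

```latex
\begin{aligned}
\text{OR for disease}  &= \frac{\text{odds of disease among exposed}}{\text{odds of disease among unexposed}}
                        = \frac{a/c}{b/d} = \frac{ad}{bc} \\
\text{OR for exposure} &= \frac{\text{odds of exposure among diseased}}{\text{odds of exposure among disease-free}}
                        = \frac{a/b}{c/d} = \frac{ad}{bc}
\end{aligned}
```

Both expressions reduce to the same cross-product ratio, which is why the odds ratio can be estimated whether sampling is done on exposure or on disease status.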
Measures of Association
• Risk ratio = ratio between P(disease for exposed) and P(disease for unexposed), both P(.) measured within the same duration of time
$$RR = \frac{p_1}{p_0}$$
Measures of Association
• Risk Difference (Excess Risk): RD = p1 - p0 (risk among exposed minus risk among unexposed)
RD is not scale free
e.g., What is the meaning of these two equal differences, both with RD = 0.009?
RD = 0.010 - 0.001 vs. RD = 0.210 - 0.201
• Attributable Risk for Exposed Persons: AR = (p1 - p0) / p1 = 1 – 1 / RR
• Measurements of risk and relative risk in different sampling designs
– Cross-sectional
– Cohort
– Case-control
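The point about the risk difference not being scale free can be made concrete with the two pairs of risks from the slide. The sketch below assumes RD, RR, and AR are all defined from the risks p1 (exposed) and p0 (unexposed); the function name is an illustrative choice.

```python
def risk_measures(p1, p0):
    """Return (risk difference, risk ratio, attributable risk for exposed persons)."""
    rd = p1 - p0
    rr = p1 / p0
    ar = (p1 - p0) / p1
    return rd, rr, ar

low_risk = risk_measures(0.010, 0.001)   # rare disease in both groups
high_risk = risk_measures(0.210, 0.201)  # common disease in both groups

for label, (rd, rr, ar) in [("0.010 vs 0.001", low_risk), ("0.210 vs 0.201", high_risk)]:
    print(f"{label}: RD = {rd:.3f}, RR = {rr:.3f}, AR = {ar:.3f}")
```

Both pairs give the same RD of 0.009, but the risk ratio is about 10 in the first case and about 1.04 in the second, so the same absolute difference corresponds to very different relative effects.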
Measures of Disease Association
                  Exposed (E)   Not Exposed (Ē)   Total
Disease (D)       a             b                 n1
No Disease (D̄)    c             d                 n0
Total             m1            m0                n
• Cross-Sectional Sampling
Randomly sample n subjects from the population at time t and determine disease and exposure status.
Important: n is fixed for this design.
1) a/m1 estimates prevalence of disease at t among exposed
2) b/m0 estimates prevalence of disease at t among unexposed
3) ad/bc estimates the OR for disease and exposure
Odds Ratio
p1 = a/m1 = disease risk among exposed
p0 = b/m0 = disease risk among unexposed
$$OR = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}$$
If p1 and p0 are small (rare disease) and the time interval is relatively short, it can be shown that OR ≈ RR
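The rare-disease approximation follows from rearranging the odds ratio; a short derivation, using the same p1 and p0 as above:

```latex
\begin{aligned}
OR &= \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
    = \frac{p_1}{p_0}\cdot\frac{1-p_0}{1-p_1}
    = RR\cdot\frac{1-p_0}{1-p_1} \\
   &\approx RR \quad \text{when } p_1 \text{ and } p_0 \text{ are both small, so that } \frac{1-p_0}{1-p_1}\approx 1
\end{aligned}
```

The correction factor (1 - p0)/(1 - p1) exceeds 1 whenever p1 > p0, so in that case the odds ratio lies further from 1 than the risk ratio.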
Cross-sectional Sampling
• Cross-sectional design is not prospective
• Can only test for association between exposure and prevalence, not incidence
• Cannot test hypotheses about causality
• Cohort Sampling
Sample n disease-free individuals from the population at time t0 and follow them until time t1.
Measure exposure history for each subject and observe which subjects develop disease in interval [t0, t1)
Important: m1, m0, and n are fixed
Cohort study: Estimates of risk
1) p1 = a/m1 estimates risk of developing disease in interval among exposed
2) p0 = b/m0 estimates risk of developing disease in interval among unexposed
3) RR ≈ p1 / p0
4) OR = ad / bc
5) IR (incidence rate): λi ≈ pi / t for i = 0, 1 (and small t)
6) RD (risk difference per unit time): RD ≈ λ1 – λ0 ≈ (p1 – p0) / t
• Case-Control Sampling
Sample n1 cases and n0 disease free controls from target population during interval [t0, t1)
Important: n1, n0, and n are fixed
1) a/m1 and b/m0 do not estimate population disease risks
2) a/n1 estimates Pr(prior exposure | disease incidence in [t0, t1))
3) c/n0 estimates Pr(prior exposure | no disease incidence in [t0, t1))
4) OR = ad / bc
5) RR ≈ OR for rare disease or short time intervals
6) IR (incidence rate) or disease risks cannot be estimated; RD (risk difference) cannot be estimated
• Hypothetical example
Frequency of disease and exposure in a target population
p1 = ? p0 = ?
RR = p1 / p0 = ? OR = ?

              Exposure   Not Exposure   Total
Disease       8          32             40
No Disease    92         868            960
Total         100        900            1000
• Hypothetical example
Frequency of disease and exposure in a target population
p1 = 8 / 100 = 0.08; p0 = 32 / 900 = 0.036
RR = p1 / p0 = 0.08 / 0.036 = 2.25
OR = (8 x 868) / (92 x 32) = 2.36

              Exposure   Not Exposure   Total
Disease       8          32             40
No Disease    92         868            960
Total         100        900            1000
• Cohort Study
50% of exposed individuals sampled
25% of unexposed individuals sampled
p1 = 4 / 50 = 0.08; p0 = 8 / 225 = 0.036
RR = p1 / p0 = 0.08 / 0.036 = 2.25
OR = (4 x 217) / (46 x 8) = 2.36

              Exposure   Not Exposure   Total
Disease       4          8              12
No Disease    46         217            263
Total         50         225            275
• Case-Control Study
100% of diseased individuals sampled
25% of disease-free individuals sampled
p1 = 8 / 31 = 0.26 ≠ 0.08; p0 = 32 / 249 = 0.13 ≠ 0.036
RR = p1 / p0 = (8/31) / (32/249) = 2.01 ≠ 2.25
OR = (8 x 217) / (23 x 32) = 2.36

              Exposure   Not Exposure   Total
Disease       8          32             40
No Disease    23         217            240
Total         31         249            280
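The three hypothetical tables can be run through the same calculation to see which measure survives each sampling scheme. The sketch below uses the cell counts from the target population, cohort, and case-control tables; the function and dictionary names are illustrative.

```python
def rr_and_or(a, b, c, d):
    """a, b = diseased (exposed, unexposed); c, d = disease-free (exposed, unexposed)."""
    p1 = a / (a + c)            # apparent risk among exposed
    p0 = b / (b + d)            # apparent risk among unexposed
    risk_ratio = p1 / p0
    odds_ratio = (a * d) / (b * c)
    return risk_ratio, odds_ratio

designs = {
    "target population": (8, 32, 92, 868),
    "cohort sample": (4, 8, 46, 217),
    "case-control sample": (8, 32, 23, 217),
}

results = {name: rr_and_or(*cells) for name, cells in designs.items()}
for name, (risk_ratio, odds_ratio) in results.items():
    print(f"{name:20s} RR = {risk_ratio:.2f}  OR = {odds_ratio:.2f}")
```

The odds ratio is 2.36 in all three tables, while the apparent risk ratio drops from 2.25 to 2.01 in the case-control sample because the column totals there no longer reflect the population at risk.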
Odds ratio
• The odds ratio is equally valid for retrospective, prospective, or cross-sectional sampling designs
• That is, regardless of the design it estimates the same population parameter
Take home messages
– Occurrence of disease measured by prevalence, or proportion
– Incidence measured by incidence rates, or proportion per unit time
– Risk is probability of developing disease over a specified period of time
Take home messages
– Association of disease with exposure measured by odds ratios and risk ratios
– Odds ratios are valid for cross-sectional, cohort, and case-control designs; risk ratios are not (they cannot be estimated directly from a case-control sample)
HW #1
• Due May 29
• Can talk to others but turn in own work