Epidemiological analysis
PhD course in epidemiology
Lau Caspar Thygesen
Associate professor, PhD
25th February 2014
Age standardization
• Incidence and prevalence are strongly age-dependent
– Risks rising (e.g. chronic diseases) or declining (e.g. measles) with age
• Comparisons between populations and over time may be very misleading
• A single age-independent index representing a set of age-specific rates may be more appropriate
Mortality in Denmark and Greenland, men, 1975
[Table not reproduced in the transcript]
Please interpret this table.
Direct standardization
IR(DK-standardized to Greenlandic age-distribution)
= 0.016*12.2+0.076*0.7+0.268*0.160+0.506*1.4+0.110*11.2+0.024*66.5
= 3.8
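The calculation above can be checked with a short Python sketch. It assumes that the first factor in each product is the Greenlandic age-group share (the six values sum to 1) and the second is the Danish age-specific rate; the slide does not state the age groups or the rate unit, so none are named here.

```python
# Age-group shares of the standard (Greenlandic) population, copied from the slide.
weights = [0.016, 0.076, 0.268, 0.506, 0.110, 0.024]
# Danish age-specific rates for the same age groups, copied from the slide.
rates_dk = [12.2, 0.7, 0.160, 1.4, 11.2, 66.5]


def direct_standardized_rate(weights, rates):
    """Weighted average of age-specific rates using the standard's age shares."""
    if len(weights) != len(rates):
        raise ValueError("weights and rates must cover the same age groups")
    return sum(w * r for w, r in zip(weights, rates))


dsr = direct_standardized_rate(weights, rates_dk)
print(round(dsr, 1))  # 3.8
```

The same function works for any standard population as long as the weights are expressed as shares that sum to 1.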
Indirect standardization
Example
• Trend study of lung cancer incidence among women
• Denmark
• 1943-2010
[Figure: Lung Cancer Denmark Women, 1943-2009 (x-axis: year; y-axis: rate, 0-9). Series: rateCrude]
[Figure: Lung Cancer Denmark Women, 1943-2009 (x-axis: year; y-axis: rate, 0-9). Series: rateCrude, segi, scand]
Example 2
• Incidence of multiple sclerosis
• Denmark
• 1950-2004
• European Standard Population
Example indirect standardization
• 19,185 subjects (3,817 women) who attended outpatient clinics for alcohol abusers
• Copenhagen
• 1952-1992
• Compare the incidence of heart disease with the incidence rate in the greater Copenhagen area
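Indirect standardization of this kind is usually summarized as a ratio of observed to expected cases (a standardized incidence or morbidity ratio). The sketch below shows the arithmetic only. The age groups, reference rates, person-years and observed count are all hypothetical and are not taken from the Copenhagen study.

```python
# Hypothetical reference rates (cases per person-year) in the standard
# population, e.g. the general population of the surrounding area.
reference_rates = {"30-49": 0.002, "50-69": 0.010, "70+": 0.030}
# Hypothetical person-years of follow-up in the study cohort, by age group.
person_years = {"30-49": 5000, "50-69": 3000, "70+": 500}
# Hypothetical number of cases actually observed in the cohort.
observed_cases = 80

# Expected cases: what the cohort would have had at the reference rates.
expected_cases = sum(reference_rates[age] * person_years[age] for age in reference_rates)
# Ratio above 1 means more cases than expected from the standard.
ratio = observed_cases / expected_cases

print(f"expected = {expected_cases:.1f}, observed/expected = {ratio:.2f}")
```

Only the cohort's age distribution and total case count are needed, which is why the method suits small study groups.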
Problems
• Direct standardisation can produce unreliable estimates when the calculations are based on small numbers
• Indirect standardisations from different populations cannot be directly compared – only compared to the standard
Compared to regression methods
• Regression based methods are available but are rarely applied in practice
• When individual data are available (presence / absence of disease, age and sex), a logistic regression can be used to estimate the standardized rate
• The main advantage is that it allows adjustment by continuous variables in addition to categorical variables
Missing data
• What does missing mean
• The pattern of missingness (nomenclature)
– How and why is it missing?
• Methods for handling
Missing values
• Common in research
– Nonresponse
– Loss to follow-up
– Lack of overlap between linked data sets (not so common)
What is item nonresponse?
• Unit Nonresponse vs. Item Nonresponse
Unit nonresponse:
ID   Q1  Q2  Q3
456  1   1   2
457  4   2   1
458  ?   ?   ?
459  3   2   1
Item nonresponse:
ID   Q1  Q2  Q3
456  1   1   2
457  4   ?   1
458  ?   2   1
459  3   2   ?
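The distinction in the two small tables above can be expressed as a simple rule: a respondent with every answer missing is a unit nonresponse, and a respondent with only some answers missing has item nonresponse. The sketch below applies that rule to the slide's data. The function name `classify` and the use of `None` for "?" are illustrative choices.

```python
# None marks a missing answer ("?" on the slide); keys are respondent IDs.
unit_example = {456: [1, 1, 2], 457: [4, 2, 1], 458: [None, None, None], 459: [3, 2, 1]}
item_example = {456: [1, 1, 2], 457: [4, None, 1], 458: [None, 2, 1], 459: [3, 2, None]}


def classify(answers):
    """Label one respondent's answers by how much is missing."""
    n_missing = sum(a is None for a in answers)
    if n_missing == 0:
        return "complete"
    if n_missing == len(answers):
        return "unit nonresponse"
    return "item nonresponse"


for name, table in (("first table", unit_example), ("second table", item_example)):
    for respondent_id, answers in table.items():
        print(name, respondent_id, classify(answers))
```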
Unit Nonresponse Examples
• Person who is not at home
• Person who does not pick up the phone
• Person who hangs up on you
• Rat that dies before the study
• The country you could not get data on
• etc.
Item Nonresponse
• “I Don’t Know”
• Refusals to respond
• Questions left blank
• Failed measurement
• etc.
Best way to deal with Missing Data is not to have any
Minimizing Unit Nonresponse
• Call back if not home
• Refusal conversion
• Don’t mess up
• Clear and understandable questionnaire
• Polite request
• Incentives
Minimizing Item Nonresponse
• Well written questions
• Minimize misunderstandings
– cross-cultural example
– Standardized vs. non-standardized
• Minimize skip patterns
What kind of missing data should be modeled?
• If an item is missing from your dataset but you suspect that it has a true value
• “I don’t know” might simply mean “I don’t know”
– Don’t model it as if there was a true value
• Dead people (attrition)
The pattern of missingness (nomenclature)
• Ignorable
– MCAR - Missing Completely at Random
– MAR - Missing at Random
• Non-ignorable
– NMAR - Not Missing at Random
Missing completely at random
Missing Completely at Random: if the data are missing completely at random, then missing values cannot be predicted any better with the information in the data set than without it
• Cause of missingness is a completely random process (like a coin flip)
• Cause uncorrelated with variables of interest
– Example: parents move
• No bias if cause omitted
• In the unlikely event that the process is missing completely at random, inferences based on complete cases are unbiased, but inefficient because we have lost some cases
Missing at random
• Missingness may be related to measured variables
• But no residual relationship with unmeasured variables
• No bias if you control for measured variables
• For example, if highly educated people are more likely to participate in a survey, then the process is missing at random as long as we know the educational level of all persons
• If data are missing at random, then inferences based on complete cases will be biased and inefficient
Missing not at random
Non-Ignorable / NMAR: the probability that a cell is missing depends on the unobserved value of the missing value itself
For example, individuals’ responses to income questions: high-income people are more likely to refuse to answer survey questions about income, and other variables in the data set cannot predict which respondents have high income
If your missing data are non-ignorable, then inferences based on complete cases will be biased and inefficient
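The difference between MCAR and NMAR for a complete-case mean can be illustrated with a small simulation. This is a sketch under stated assumptions: the "income" values, the 30% MCAR missingness, and the NMAR rule (60% missing above 55, 10% missing otherwise) are all made up for illustration.

```python
import random

random.seed(2014)  # fixed seed so the run is reproducible

# Hypothetical "income" variable for 20,000 respondents.
incomes = [random.gauss(50, 10) for _ in range(20000)]


def mean(values):
    return sum(values) / len(values)


# MCAR: every value has the same 30% chance of being missing.
mcar_observed = [x for x in incomes if random.random() > 0.3]

# NMAR: the chance of being missing depends on the value itself
# (high incomes are refused more often).
nmar_observed = [x for x in incomes if random.random() > (0.6 if x > 55 else 0.1)]

true_mean = mean(incomes)
mcar_mean = mean(mcar_observed)
nmar_mean = mean(nmar_observed)

print(f"true {true_mean:.2f}, MCAR complete cases {mcar_mean:.2f}, "
      f"NMAR complete cases {nmar_mean:.2f}")
```

Under MCAR the complete-case mean stays close to the true mean with fewer observations; under this NMAR rule it is pulled downward because high values are under-represented.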
Classical Missing Data Treatments
• Whatever you do, you are doing something
– Case Deletion
• Listwise (complete case analysis)
• Pairwise (available case analysis)
– Indicator variable (dummy variable)
– Single Imputation
• (Unconditional) Mean Imputation
• Conditional Mean Imputation (expected value)
– Weighting
Listwise Deletion
• Excludes the whole case
• Default in most software
• Works if mechanism is MCAR and if pattern and sample size allow (need to have enough complete cases)
• Can be biased
Multi-Item Pairwise Deletion
• An option for using all available information in correlation/covariance matrices
• Different calculations may be based on different populations
• Very unpredictable bias
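Listwise and pairwise deletion can be contrasted on a toy data set. The five rows and the variable names `x1`, `x2`, `y` below are hypothetical. The sketch only counts which rows each approach would use, which is enough to show that pairwise statistics rest on different subsets.

```python
# Hypothetical data: one tuple per subject, None marks a missing value.
names = ("x1", "x2", "y")
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (None, 4.0, 2.0),
    (4.0, 3.0, None),
    (5.0, 6.0, 4.0),
]

# Listwise deletion: keep only subjects with no missing value at all.
listwise_rows = [i for i, r in enumerate(rows) if None not in r]

# Pairwise deletion: for each pair of variables, keep every subject that
# has both of those two variables, whatever else is missing.
pairwise_rows = {
    (names[i], names[j]): [k for k, r in enumerate(rows)
                           if r[i] is not None and r[j] is not None]
    for i in range(len(names)) for j in range(i + 1, len(names))
}

print("listwise:", listwise_rows)
for pair, used in pairwise_rows.items():
    print(pair, used)
```

Each pair uses three subjects where listwise deletion uses two, but no two pairs use the same three subjects.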
Indicator method
• For each variable with missing values, create a missing-value indicator to accompany the variable in all analyses
• Assumes MCAR
• Even if the stratum with missing values is just a random sample of all subjects, that stratum will yield a confounded estimate of the exposure effect
Mean imputation
• Technique
– Calculate mean over cases that have values for Y
– Impute this mean where Y is missing
– Ditto for X1, X2, etc.
• Problems
– ignores relationships among X and Y
• underestimates covariances
(Unconditional) Mean Imputation
Scatterplots are from Joe Schafer’s website
Mean imputation
• Standard errors too low
• CI difficult to calculate
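A small numeric sketch shows why standard errors come out too low after mean imputation: the imputed values add observations without adding any spread. The six values of Y below are hypothetical.

```python
from statistics import mean, variance

# Hypothetical outcome Y with two missing values (None).
y = [2.0, 4.0, None, 8.0, None, 10.0]

observed = [v for v in y if v is not None]
y_mean = mean(observed)

# Unconditional mean imputation: fill every gap with the observed mean.
y_imputed = [y_mean if v is None else v for v in y]

var_observed = variance(observed)   # sample variance of the 4 observed values
var_imputed = variance(y_imputed)   # sample variance of the 6 "completed" values

print(f"mean stays at {mean(y_imputed)}")
print(f"variance falls from {var_observed:.2f} to {var_imputed:.2f}")
```

The sum of squared deviations is unchanged while n grows from 4 to 6, so both the variance and the resulting standard error of the mean shrink artificially.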
Conditional mean imputation
• Technique & implicit models
– If Y is missing
• impute mean of cases with similar values for X1, X2
– Y = b0 + X1 b1 + X2 b2
– Likewise, if X2 is missing
• impute mean of cases with similar values for X1, Y
– X2 = g0 + X1 g1 + Y g2
– If both Y and X2 are missing
• impute means of cases with similar values for X1
– Y = d0 + X1 d1
– X2 = f0 + X1 f1
• Problem
– Ignores random components (no e)
→ Underestimates variances, se’s
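The simplest case on this slide, Y = d0 + X1 d1, can be sketched with an ordinary least-squares fit on the complete cases. The five data points are hypothetical. The point of the example is that the imputed value sits exactly on the fitted line, while the observed values scatter around it.

```python
# Hypothetical data: X1 fully observed, Y missing for one subject.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, None, 7.0, 8.0]

# Fit Y = d0 + X1*d1 by least squares on the complete cases only.
pairs = [(a, b) for a, b in zip(x1, y) if b is not None]
x_bar = sum(a for a, _ in pairs) / len(pairs)
y_bar = sum(b for _, b in pairs) / len(pairs)
d1 = (sum((a - x_bar) * (b - y_bar) for a, b in pairs)
      / sum((a - x_bar) ** 2 for a, _ in pairs))
d0 = y_bar - d1 * x_bar

# Conditional mean imputation: replace each missing Y by its fitted value.
y_imputed = [d0 + d1 * a if b is None else b for a, b in zip(x1, y)]

# Residuals of observed cases are non-zero; the imputed case has residual 0.
residuals_observed = [b - (d0 + d1 * a) for a, b in pairs]
residual_imputed = y_imputed[2] - (d0 + d1 * x1[2])

print(f"d0 = {d0:.2f}, d1 = {d1:.2f}, imputed Y = {y_imputed[2]:.2f}")
```

Because no random component is added to the imputed value, the completed data look less variable than real data would.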
Imputation of Expected Value
• Good for creating expected values
• Bad for multivariate analysis
– Decreases standard errors
– Creates overconfident outcomes
– Increases probability of Type I error
Problem with single imputation
• Underestimates se’s!
• Treats imputed values like observed values
– when they are actually less certain
• Ignores imputation variation
Imputation variation
• Sampling variation
– If you take a different sample
• you get different parameter estimates
– Standard errors reflect this
– One way to estimate sampling variation
• measure variation across multiple samples
• called “bootstrapping”
• Imputation variation
– If you impute different values
• you get different parameter estimates
– Standard errors should reflect this, too
– One way to estimate imputation variation
• measure variation across multiple imputed data sets
• called “multiple imputation”
Multiple Imputation
• Models both expected value and uncertainty
• Using the missing-data model you specify, it simulates and imputes the missing values multiple times, creating M complete datasets
– (M=5 is usually OK. It is a good idea to simulate more)
• Analyze each dataset independently
• Combine the results to get unbiased estimates that reflect both uncertainty and expectation
Example
Multiple Imputation Simple Procedure
1. Impute using PROC MI
3. Do analysis: PROC REG, LOGISTIC, etc., using by _imputation_; in the procedure
4. Combine results using PROC MIANALYZE
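The combining step is commonly done with Rubin's rules: average the estimates, then add the average within-imputation variance to an inflated between-imputation variance. The sketch below shows that arithmetic for one coefficient. The five estimates and standard errors are hypothetical, and the function name `pool` is illustrative; it is not a reproduction of any SAS output.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical results for one regression coefficient from M = 5 imputed data sets.
estimates = [1.20, 1.35, 1.10, 1.28, 1.22]
std_errors = [0.30, 0.32, 0.29, 0.31, 0.30]


def pool(estimates, std_errors):
    """Combine per-imputation results with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, sqrt(total)


pooled_estimate, pooled_se = pool(estimates, std_errors)
print(f"pooled estimate {pooled_estimate:.3f}, pooled SE {pooled_se:.3f}")
```

The between-imputation term is what single imputation leaves out, which is why the pooled standard error is larger than the average of the per-imputation standard errors.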
PROC MI
• Typical syntax:
proc mi data=bmx out=impdat seed=33155;
var bmxbmi bmxht bmxwt bmxarmc bmxarml;
run;
• data= 1 copy of data with missing values
• out= 5 copies of data with imputed values (will be different across copies)
• seed= random seed, you can keep same to reconstruct your results
• var: the variables with missing values you need imputed, the other variables in your model, and any that may be helpful for the imputation
PROC MI Sample Output
PROC MI Options
• nimpute=5: number of imputations, default=5; nimpute=0 gives missing patterns
• minimum=0 0 0 0 and maximum=1 1 1 90: set min & max; sometimes doesn’t converge as well
• round=1 1 1 0.01: round off option
Output dataset
Regression
• Fit your model as if data had no missing values, using by _imputation_;
• proc reg data=impdat outest=parmcov covout;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml;
by _imputation_;
run;
• You’ll get nimpute (usually 5) sets of output
• Estimates, covariances, errors will be combined in MIANALYZE
• Need to generate parameter estimates and covariance data set (varies by procedure)
Parameter Est. & Covariance Matrix
• proc logistic data=impdat descending;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
by _imputation_;
ods output ParameterEstimates=parmsdat CovB=covbdat;
run;
• proc mixed data=impdat;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml /solution covb;
by _imputation_;
ods output covparms=parmcov;
run;
Parameter Est. & Covariance Matrix
• proc genmod data=impdat;
model bmxbmi=bmxht bmxwt bmxarmc bmxarml /covb;
by _imputation_;
ods output ParameterEstimates=parmsdat CovB=covbdat;
run;
PROC MIANALYZE
• Syntax depends on what procedure you used in previous step:
• proc mianalyze data=parmcov;
(or) proc mianalyze parms=parmsdat covb=covbdat;
(or) proc mianalyze parms=parmsdat xpxi=xpxidat;
(then type this:)
modeleffects intercept bmxht bmxwt bmxarmc bmxarml;
run;
• Note the “var” statement is now “modeleffects”
• Note that the dependent variable is omitted
PROC MIANALYZE Output
STATA
*preparing dataset for multiple imputation
mi query
mi set mlong
mi describe, detail
mi register imputed total
set seed 29390
mi impute mvn total = i.smoking i.isced4 i.samliv3 i.s57a_ i.alder4 i.gender, add(20) force
mi describe, detail
*rounding the imputed binary values to the nearest integer
*replace bingedrinking = 0 if bingedrinking <0.5
*replace bingedrinking = 1 if bingedrinking >0.5
*replace change_new = round(change_new)
*examination of imputations: comparing main descriptive statistics from some imputations to those from the observed data
mi xeq 0 1 20: summarize total
mi estimate: xtmixed total i.gender group##month || username:, mle
mi estimate: mean total, over(sex group month)
Weighted regression
• Suppose that a national survey sampled 2000 subjects: 1000 men and 1000 women
• The numbers of responders were 500 men and 750 women
• If there are large differences between men and women, a simple average of the 1250 responses will be a distorted representation of the population mean
• By down-weighting women and up-weighting men we could obtain an accurate picture of the population
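The weighting idea can be made concrete with the sample sizes from this example. Each responder is weighted by the inverse of the response proportion in their group (1000/500 for men, 1000/750 for women). The group means of the outcome are hypothetical and are only there to show the effect.

```python
# Sample sizes and responder counts from the slide.
sampled = {"men": 1000, "women": 1000}
responded = {"men": 500, "women": 750}
# Hypothetical mean outcome among responders in each group.
responder_mean = {"men": 80.0, "women": 65.0}

# Inverse response-proportion weights: men 2.0, women about 1.33.
weights = {g: sampled[g] / responded[g] for g in sampled}

# Simple average over all responders (women are over-represented).
unweighted_mean = (sum(responded[g] * responder_mean[g] for g in sampled)
                   / sum(responded.values()))

# Weighted average: each group counts according to its size in the sample.
weighted_mean = (sum(weights[g] * responded[g] * responder_mean[g] for g in sampled)
                 / sum(weights[g] * responded[g] for g in sampled))

print(f"unweighted {unweighted_mean:.1f}, weighted {weighted_mean:.1f}")
```

The weighted mean restores the 50/50 balance of the original sample; it is only correct if responders resemble non-responders within each sex.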
Values not missing at random (NMAR)
• Probability that values are missing depends on the missing values themselves
• e.g., the probability that weight Y is missing
– is higher for the overweight (depends on Y)
– is higher for women (depends on X1)
• and sometimes X1 is missing, too.
• Methods available – not today!