imputation techniques for missing data in clinical trials
DESCRIPTION
Missing data are unavoidable in clinical and epidemiological research. They lead to bias and loss of information in the analysis. We are often unaware of the missing data techniques being applied because we depend on software to handle them. The objective of this seminar is to introduce the different missing data mechanisms and imputation techniques for missing data with the help of examples.
TRANSCRIPT
1
Imputation Techniques For Missing Data In Clinical Trials
Presentation by, NITHIN GEORGE VINOD
PROJECT ASSOCIATE, CENTRE FOR LIVESTOCK DEVELOPMENT AND POLICY RESEARCH
KERALA VETERINARY AND ANIMAL SCIENCES UNIVERSITY
2
Contents
• Objectives
• Introduction to missing data
• Reasons for missing data
• Missing data mechanism
• Simple methods
• Single imputation
» Last observation carried forward (LOCF)
» Hot-deck imputation
» Arithmetic mean imputation
» Regression imputation
» Stochastic imputation
• Multiple imputation
3
Objectives
To introduce the missing data mechanisms and the different imputation techniques for handling missing data.
4
Introduction to missing data
Missing data: some of the values in the data set are lost, not observed, or not available, for natural or non-natural reasons.
(James R. Carpenter: Missing data in randomized controlled trials)
5
Reasons for missing data
• Patients are in very critical condition.
• Patients want to change the treatment.
• Measuring equipment breaks down.
• Patients fail to continue the follow-up.
• Patients fail to answer some questionnaire items.
• Patients are cured or die before the study ends.
• The investigator forgets to collect the data.
• The family migrates.
• The patient's profile may be missing.
6
Effect of missing data
• Bias
• Reduced power and distorted variability
• Inaccurate results
7
Missing data mechanism (Rubin 1976)
• Missing Completely At Random (MCAR).
• Missing At Random (MAR).
• Missing Not At Random (MNAR).
8
Missing At Random (MAR)
The probability of missing data on a variable Y is related to some other measured variables in the analysis model but not to the values of Y itself.
Examples
• Missing blood pressure measurements may be lower than measured blood pressures because younger people are more likely to have a missing blood pressure measurement.
• In a study of quality of life (QL), the psychologist finds that elderly patients and patients with less education are more likely to refuse the QL questionnaire.
9
Missing Completely At Random (MCAR)
The probability of missing data on a variable Y is unrelated to other measured variables and unrelated to the values of Y itself.
Examples
• A blood pressure measurement is missing because an automatic sphygmomanometer broke down.
• Suppose that a psychologist is studying quality of life in a group of cancer patients and finds that data for some patients are missing because they migrated to another place.
10
Missing Not At Random (MNAR)
The probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables.
Examples
• Suppose the treatment under study is not effective in reducing blood pressure. Subjects whose blood pressure stays high may then drop out, so the missing values depend on the blood pressure itself.
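The three mechanisms can be contrasted with a small simulation. This is only a sketch under stated assumptions: the blood pressure model, the age range and the missingness probabilities (0.2, 0.4, 0.1) are hypothetical values chosen for illustration, not figures from the presentation.

```python
import random

def simulate_missingness(n=10000, seed=0):
    """Return rows of (age, bp, mcar, mar, mnar); a True flag means bp is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        bp = 100 + 0.5 * age + rng.gauss(0, 10)            # hypothetical BP model
        mcar = rng.random() < 0.2                          # unrelated to age and bp
        mar = rng.random() < (0.4 if age < 40 else 0.1)    # depends on observed age only
        mnar = rng.random() < (0.4 if bp > 130 else 0.1)   # depends on bp itself
        rows.append((age, bp, mcar, mar, mnar))
    return rows

def observed_mean(rows, flag_index):
    """Mean blood pressure over the cases whose value is not flagged as missing."""
    kept = [row[1] for row in rows if not row[flag_index]]
    return sum(kept) / len(kept)

rows = simulate_missingness()
full_mean = sum(row[1] for row in rows) / len(rows)
print("complete data:", round(full_mean, 2))
for name, idx in (("MCAR", 2), ("MAR", 3), ("MNAR", 4)):
    print(name, "observed mean:", round(observed_mean(rows, idx), 2))
```

In this setup the observed mean stays close to the complete-data mean under MCAR, is pushed upward under MAR (the younger, lower-pressure subjects are missing more often) and is pushed downward under MNAR (high readings are missing more often).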
11
Different methods to deal with missing data
• Listwise deletion
• Pairwise deletion
• Last observation carried forward
• Hot-deck imputation
• Arithmetic mean imputation
• Regression imputation
• Stochastic regression imputation
• Cold-deck imputation
• Averaging the available pattern imputation
• Maximum likelihood estimation
• Markov chain Monte Carlo method
12
Simple techniques
• Listwise deletion: discards the data for any case that has one or more missing values.
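Listwise deletion can be sketched in a few lines. The records and variable names below (age, sbp, ql) are hypothetical, and None is assumed to mark a missing value.

```python
def listwise_delete(records):
    """Keep only the cases with no missing (None) value in any variable."""
    return [rec for rec in records if all(v is not None for v in rec.values())]

# Hypothetical records; None marks a missing value.
patients = [
    {"age": 35, "sbp": 120, "ql": 90},
    {"age": 41, "sbp": None, "ql": 82},
    {"age": 62, "sbp": 135, "ql": None},
    {"age": 72, "sbp": 140, "ql": 63},
]

complete_cases = listwise_delete(patients)
print(len(complete_cases), "of", len(patients), "cases kept")
```

Half of the cases are lost here even though only two individual values are missing, which illustrates the loss of information.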
13
Single Imputation
A method that replaces each missing value with a single, seemingly suitable replacement value.
14
Last Observation Carried Forward (LOCF)
LOCF takes the last available response and substitutes that value for all subsequent missing values.
Advantages
• It generates a complete data set.
• It is easy to implement.
Disadvantages
• It produces biased estimates.
• It is not sensible even when the data are MCAR.
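A minimal sketch of LOCF for one subject's repeated measurements. The visit values are hypothetical, None is assumed to mark a missed visit, and a value missing before any observation is left missing because there is nothing to carry forward.

```python
def locf(values):
    """Replace each None with the most recent observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)   # stays None until a first value has been observed
    return filled

# Hypothetical systolic blood pressure at six visits.
visits = [150, 142, None, None, 138, None]
print(locf(visits))
```

The carried-forward values freeze the subject at the last observed level, which is why the method can bias estimates when the outcome is still changing.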
15
Hot-deck Imputation (Scheuren, 2005)
Replaces each missing value with a random draw from a subsample of respondents who scored similarly on a set of matching variables.
Advantages
• It generates a complete data set.
Disadvantages
• It is not well suited for estimating measures of association.
• It can produce substantially biased estimates of correlation and regression coefficients.
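A sketch of hot-deck imputation with a single matching variable. The records, the age_group matching variable and the function name are hypothetical; real applications usually match on several variables.

```python
import random

def hot_deck(records, target, match_on, seed=0):
    """Fill missing `target` values with a random donor value from the same `match_on` class."""
    rng = random.Random(seed)
    completed = []
    for rec in records:
        rec = dict(rec)                       # copy so the input is not modified
        if rec[target] is None:
            donors = [d[target] for d in records
                      if d[target] is not None and d[match_on] == rec[match_on]]
            if donors:                        # left missing if no donor matches
                rec[target] = rng.choice(donors)
        completed.append(rec)
    return completed

# Hypothetical quality of life (QL) scores; None marks a missing value.
records = [
    {"id": 1, "age_group": "young", "ql": 90},
    {"id": 2, "age_group": "young", "ql": 88},
    {"id": 3, "age_group": "young", "ql": None},
    {"id": 4, "age_group": "old", "ql": 63},
    {"id": 5, "age_group": "old", "ql": 59},
    {"id": 6, "age_group": "old", "ql": None},
]

for rec in hot_deck(records, target="ql", match_on="age_group"):
    print(rec)
```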
16
Arithmetic Mean Imputation (Wilks, 1932)
Fills the missing values with the arithmetic mean of the available cases.
Advantages
• It can be applied to any pattern of missingness.
• It also generates a complete data set.
Disadvantages
• It reduces the variability of the data.
• It distorts the measures of association.
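The loss of variability can be checked numerically. The sketch below applies mean imputation to the twelve observed QL scores used in this presentation's regression example, with eight values treated as missing (None).

```python
from statistics import mean, stdev

def mean_impute(values):
    """Replace each None with the arithmetic mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ql = [90, 89, 88, 87, 82, 80, 78, 76, 71, 73, 70, 70] + [None] * 8
observed = [v for v in ql if v is not None]
completed = mean_impute(ql)

print("mean:", mean(observed), "->", mean(completed))
print("SD:  ", round(stdev(observed), 2), "->", round(stdev(completed), 2))
```

The mean is unchanged, but the standard deviation shrinks because every imputed value sits exactly at the centre of the distribution.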
17
Regression Imputation (Buck, 1960)
Replaces missing values with predicted scores from a regression equation that uses information from the complete variables.
Advantages
• It generates a complete data set.
• It uses the fact that variables tend to be correlated.
Disadvantages
• It imputes scores that are perfectly correlated with the predictors.
• It overestimates correlations.
• It produces biased estimates.
AGE   QL   QL_missing   RI (regression imputation)
35    90   90
36    89   89
38    88   88
38    87   87
41    82   82
45    80   80
47    78   78
48    76   76
49    71   71
55    73   73
57    70   70
59    70   70
62    68   __           65.03
65    67   __           62.37
68    67   __           59.71
72    63   __           56.17
72    60   __           56.17
73    59   __           55.28
75    52   __           53.51
76    51   __           52.63
        QL (complete)   RI
mean    72.05           70.74
SD      11.74           12.726
QL = β0 + β1*AGE
QL = 119.950 - 0.886*AGE
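The fitted equation can be approximately reproduced from the twelve complete cases in the table. The sketch below uses ordinary least squares written out by hand; the function names are hypothetical, and the imputed values agree with the RI column to within about 0.01, presumably because of rounding in the original slide.

```python
from statistics import mean

age = [35, 36, 38, 38, 41, 45, 47, 48, 49, 55, 57, 59,
       62, 65, 68, 72, 72, 73, 75, 76]
ql_missing = [90, 89, 88, 87, 82, 80, 78, 76, 71, 73, 70, 70] + [None] * 8

def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def regression_impute(x, y):
    """Fit on the complete cases, then replace each None in y with its prediction."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1 = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    completed = [b0 + b1 * xi if yi is None else yi for xi, yi in zip(x, y)]
    return completed, (b0, b1)

completed, (b0, b1) = regression_impute(age, ql_missing)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("imputed:", [round(v, 2) for v in completed[12:]])
```

All eight imputed scores lie exactly on the fitted line, which is the source of the overestimated correlation noted above.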
19
Stochastic Regression Imputation
Uses regression equations to predict the incomplete data and adds a normally distributed residual term to each predicted value.
Advantages
• It is the most appropriate of the single imputation methods.
• It gives results approximately equal to those of multiple imputation.
• It gives unbiased parameter estimates under an MAR mechanism.
Disadvantage
• It underestimates standard errors.
20
AGE   QL   QL_missing   RV (residual zi)   SI (stochastic imputation)
35    90   90
36    89   89
38    88   88
38    87   87
41    82   82
45    80   80
47    78   78
48    76   76
49    71   71
55    73   73
57    70   70
59    70   70
62    68   __            5.67              70.69
65    67   __            3.72              66.08
68    67   __           -4.13              55.57
72    63   __           -0.39              55.77
72    60   __           -7.20              48.96
73    59   __            2.39              57.66
75    52   __           -6.64              46.86
76    51   __            1.84              54.45
QL = β0 + β1*AGE + zi
QL = 119.950 - 0.886*AGE + zi
        RI       SI
mean    70.74    70.50
SD      12.726   13.61
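Stochastic regression imputation on the same data can be sketched as follows. The slides do not state which residual variance was used, so this sketch assumes the residuals are drawn from a normal distribution with the residual standard deviation estimated from the twelve complete cases. The drawn values therefore differ from the RV column and change with the seed.

```python
import random
from statistics import mean

age = [35, 36, 38, 38, 41, 45, 47, 48, 49, 55, 57, 59,
       62, 65, 68, 72, 72, 73, 75, 76]
ql_missing = [90, 89, 88, 87, 82, 80, 78, 76, 71, 73, 70, 70] + [None] * 8

def stochastic_regression_impute(x, y, seed=0):
    """Impute each None in y with a regression prediction plus a normal residual."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean([p[0] for p in obs])
    my = mean([p[1] for p in obs])
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in obs)
          / sum((xi - mx) ** 2 for xi, _ in obs))
    b0 = my - b1 * mx
    # Residual SD from the complete cases (assumption: n - 2 degrees of freedom).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in obs)
    sigma = (sse / (len(obs) - 2)) ** 0.5
    return [b0 + b1 * xi + rng.gauss(0, sigma) if yi is None else yi
            for xi, yi in zip(x, y)]

completed = stochastic_regression_impute(age, ql_missing)
print([round(v, 2) for v in completed[12:]])
```

Because a random residual is added, the imputed scores no longer fall exactly on the regression line, which restores some of the variability that plain regression imputation removes.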
21
Multiple imputation
Creates several copies of the data and imputes each copy with different plausible estimates of missing values.
22
Procedure
I. Imputation phase
• Data augmentation
» I-step
» P-step
II. Analysis phase
• The same statistical analysis is performed on each imputed data set.
III. Pooling phase
• Estimates and their standard errors are pooled into a single set of values.
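The pooling phase is commonly carried out with Rubin's rules: the point estimates are averaged, while the pooled standard error combines the within-imputation variance with the between-imputation variance. The sketch below assumes that approach; the five slope estimates and standard errors are hypothetical numbers, not results from the presentation.

```python
from statistics import mean, variance

def pool(estimates, standard_errors):
    """Pool m estimates and their standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                             # pooled point estimate
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical slope estimates from m = 5 imputed data sets.
slopes = [-0.88, -0.91, -0.85, -0.90, -0.86]
ses = [0.050, 0.052, 0.049, 0.051, 0.050]

estimate, se = pool(slopes, ses)
print("pooled estimate:", round(estimate, 3), "pooled SE:", round(se, 3))
```

The pooled standard error is larger than the average of the individual standard errors because it also reflects the uncertainty due to the missing values.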
23
Data augmentation
[Diagram: the I-step (stochastic imputation) produces a new data set, which is passed to the P-step; the cycle repeats to give Data set 1, Data set 2, ..., Data set 20.]
24
Conclusion
• Imputation is an attractive idea because it produces a complete data set and makes the data usable.
• Most single imputation methods produce biased parameter estimates.
• Stochastic regression is the only traditional approach that yields unbiased estimates under an MAR mechanism.
• Multiple imputation also produces similar estimates.
• Few techniques are available when the data are categorical or when the missing data mechanism is MNAR.
25
References
• Baraldi A N, & Enders C K. An introduction to missing data analysis. Journal of School Psychology 2009; 9-18.
• Enders C K. Applied missing data analysis. The Guilford Press, New York, London 2010; 2-85.
• Carpenter J R. Missing data in randomized controlled trials: A practical guide 2007; 4-16.
• Schafer J L, & Graham J W. Missing data: Our view of the state of the art 2002; 147-77.
26
Thank You