imputation techniques for missing data in clinical trials

1

Imputation Techniques For Missing Data In Clinical Trials

Presentation by NITHIN GEORGE VINOD

PROJECT ASSOCIATE, CENTRE FOR LIVESTOCK DEVELOPMENT AND POLICY RESEARCH

KERALA VETERINARY AND ANIMAL SCIENCES UNIVERSITY

2

Contents
• Objectives
• Introduction to missing data
• Reasons for missing data
• Missing data mechanism
• Simple methods
• Single imputation
» Last observation carried forward (LOCF)
» Hot-deck imputation
» Arithmetic mean imputation
» Regression imputation
» Stochastic imputation
• Multiple imputation

3

Objectives

To introduce different imputation techniques for handling missing data.

4

Introduction to missing data

Missing data: some of the values in the data set are lost, not observed, or not available, due to natural or non-natural reasons.

(James R. Carpenter: Missing data in randomized controlled trials)

5

Reasons for missing data

• Patients are in very critical condition.
• Patients want to change the treatment.
• Data are missing due to the breakdown of machines.
• Patients fail to continue the follow-up.
• Patients fail to answer some questionnaires.
• Patients are cured or die before the study ends.
• The investigator forgot to collect the data.
• The family migrated.
• The patient's profile may be missing.

6

Effect of missing data

• Bias
• Power and variability
• Inaccurate results

7

Missing data mechanism (Rubin 1976)

• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)

8

Missing At Random (MAR)
The probability of missing data on a variable Y is related to some other measured variables in the analysis model but not to the values of Y itself.

Examples
• Missing blood pressure measurements may be lower than measured blood pressures because younger people may be more likely to have missing blood pressure measurements.
• In a study of quality of life (QL), the psychologist finds that elderly patients and patients with less education have a higher probability of refusing the QL questionnaire.

9

Missing Completely At Random (MCAR)
The probability of missing data on a variable Y is unrelated to other measured variables and unrelated to the values of Y itself.

Examples
• A blood pressure measurement is missing because of the breakdown of an automatic sphygmomanometer.
• Suppose that a psychologist is studying quality of life in a group of cancer patients and finds that a patient's data are missing because the patient migrated to another place.

10

Missing Not At Random (MNAR)
The probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables.

Examples
• Suppose the treatment under study is not effective in reducing blood pressure; there may then be a chance that subjects drop out.
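The three mechanisms can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, the sample size, the blood pressure model, and the missingness probabilities are all assumptions chosen only for illustration.

```python
import random

def simulate_missingness(n=1000, seed=0):
    """Return (age, bp, masks); masks maps a mechanism name to a list of
    booleans where True means the blood pressure value is missing."""
    rng = random.Random(seed)
    age = [rng.uniform(20, 80) for _ in range(n)]
    # Hypothetical blood pressure that rises with age, plus noise.
    bp = [100 + 0.5 * a + rng.gauss(0, 10) for a in age]
    masks = {
        # MCAR: every value has the same 20% chance of being missing.
        "MCAR": [rng.random() < 0.2 for _ in range(n)],
        # MAR: missingness depends only on an observed covariate (age).
        "MAR": [rng.random() < (0.4 if a < 40 else 0.1) for a in age],
        # MNAR: missingness depends on the unobserved value itself.
        "MNAR": [rng.random() < (0.4 if y > 130 else 0.1) for y in bp],
    }
    return age, bp, masks

def observed_mean(values, mask):
    """Mean of the values that are not flagged as missing."""
    kept = [v for v, missing in zip(values, mask) if not missing]
    return sum(kept) / len(kept)

age, bp, masks = simulate_missingness()
full_mean = sum(bp) / len(bp)
for name, mask in masks.items():
    # Difference between the observed-case mean and the full-data mean.
    print(name, round(observed_mean(bp, mask) - full_mean, 2))
```

Under these assumptions the observed-case mean should stay close to the full-data mean for MCAR and drift away from it for MAR and MNAR; only in the MAR case can the drift be explained by a measured variable (age).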

11

Different methods to deal with missing data

• Listwise deletion
• Pairwise deletion
• Last observation carried forward
• Hot-deck imputation
• Arithmetic mean imputation
• Regression imputation
• Stochastic regression imputation
• Cold-deck imputation
• Averaging the available pattern imputation
• Maximum likelihood estimation
• Markov chain Monte Carlo method

12

Simple techniques
• Listwise deletion: discards the data for any case that has one or more missing values.
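Listwise deletion can be sketched in a few lines. This is an illustrative example only; the record layout (dictionaries with `None` marking a missing value) and the function name `listwise_delete` are assumptions, not something taken from the slides.

```python
def listwise_delete(rows):
    """Keep only the rows (dicts) in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]

rows = [
    {"age": 35, "ql": 90},
    {"age": 62, "ql": None},   # missing QL: the whole case is discarded
    {"age": None, "ql": 70},   # missing AGE: the whole case is discarded
    {"age": 45, "ql": 80},
]
complete = listwise_delete(rows)
print(len(rows), "cases before,", len(complete), "cases after")
```

Half of the cases are lost here even though only two individual values are missing, which is why the method costs power.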

13

Single Imputation

A method that replaces each missing value with a single, seemingly suitable replacement value.

14

Last Observation Carried Forward (LOCF)

LOCF takes the last available response and substitutes it for all subsequent missing values.

Advantages
• It generates a complete data set.
• It is easy to implement.
Disadvantages
• It produces biased estimates.
• It is not sensible even when the data are MCAR.
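A minimal sketch of LOCF for one subject's repeated measurements follows. The visit values and the function name `locf` are hypothetical, and `None` is assumed to mark a missed visit.

```python
def locf(series):
    """Carry the last observed value forward over later missing values."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

# Hypothetical blood pressure readings over six visits.
visits = [140, 135, None, None, 128, None]
filled = locf(visits)
print(filled)
```

Note that a value missing before the first observation cannot be filled this way, so the sketch leaves it as `None`.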

15

Hot-deck Imputation (Scheuren, 2005)

Replaces each missing value with a random draw from a subsample of respondents that scored similarly on a set of matching variables.

Advantages
• It generates a complete data set.
Disadvantages
• It is not well suited for estimating measures of association.
• It produces substantially biased estimates of correlation and regression coefficients.
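The donor-matching idea can be sketched as follows. This is one simple variant, assuming exact matching on categorical variables; the field names, the example patients, and the function `hot_deck_impute` are hypothetical.

```python
import random

def hot_deck_impute(records, target, match_on, seed=0):
    """Fill missing `target` values (None) with a random donor value taken
    from respondents that agree on every variable in `match_on`."""
    rng = random.Random(seed)
    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if rec[target] is None:
            donors = [d[target] for d in records
                      if d[target] is not None
                      and all(d[k] == rec[k] for k in match_on)]
            if donors:  # leave the value missing if no donor matches
                rec[target] = rng.choice(donors)
        completed.append(rec)
    return completed

patients = [
    {"sex": "F", "age_group": "60+", "ql": 68},
    {"sex": "F", "age_group": "60+", "ql": 63},
    {"sex": "M", "age_group": "60+", "ql": 59},
    {"sex": "F", "age_group": "60+", "ql": None},  # donors: 68 or 63
    {"sex": "M", "age_group": "<60", "ql": None},  # no donor, stays missing
]
completed = hot_deck_impute(patients, target="ql", match_on=["sex", "age_group"])
```

Exact matching is only one choice; with continuous matching variables a nearest-neighbour rule would be needed instead.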

16

Arithmetic Mean Imputation (Wilks, 1932)

Fills the missing values with the arithmetic mean of the available cases.

Advantages
• It can be applied to any type of missingness.
• It also generates a complete data set.
Disadvantages
• It reduces the variability of the data.
• It affects (attenuates) the measures of association.
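The loss of variability is easy to see in a small example. The sketch below uses made-up QL scores with `None` for missing values; the function name `mean_impute` is an assumption.

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

ql = [90, 82, None, 76, None, 70]          # hypothetical scores
observed = [v for v in ql if v is not None]
filled = mean_impute(ql)

# Every imputed value sits exactly at the mean, so the spread shrinks.
print(statistics.stdev(observed), statistics.stdev(filled))
```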

17

Regression Imputation (Buck, 1960)

Replaces missing values with predicted scores from a regression equation, using information from the complete variables.

Advantages
• It generates a complete data set.
• It makes use of the fact that variables tend to be correlated.
Disadvantages
• It imputes data with perfectly correlated scores.
• It overestimates correlations.
• It introduces bias.

AGE   QL   QL_missing   RI
35    90   90
36    89   89
38    88   88
38    87   87
41    82   82
45    80   80
47    78   78
48    76   76
49    71   71
55    73   73
57    70   70
59    70   70
62    68   __           65.03
65    67   __           62.37
68    67   __           59.71
72    63   __           56.17
72    60   __           56.17
73    59   __           55.28
75    52   __           53.51
76    51   __           52.63

        QL      RI
mean    72.05   70.74
SD      11.74   12.726

QL = β0 + β1*AGE
QL = 119.950 - 0.886*AGE
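The imputed values in the table can be recomputed with a short script. This is a sketch under the assumption that the line was fitted by ordinary least squares to the 12 cases with an observed QL; the helper name `fit_line` is hypothetical. Under that assumption the coefficients should come out close to the 119.950 and -0.886 shown on the slide.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# AGE and QL_missing columns from the slide (None marks a missing QL).
age = [35, 36, 38, 38, 41, 45, 47, 48, 49, 55,
       57, 59, 62, 65, 68, 72, 72, 73, 75, 76]
ql_missing = [90, 89, 88, 87, 82, 80, 78, 76, 71, 73, 70, 70] + [None] * 8

# Fit on the complete cases only.
obs = [(a, q) for a, q in zip(age, ql_missing) if q is not None]
b0, b1 = fit_line([a for a, _ in obs], [q for _, q in obs])

# Replace each missing QL by its predicted value on the fitted line.
imputed = [q if q is not None else b0 + b1 * a
           for a, q in zip(age, ql_missing)]
print(round(b0, 3), round(b1, 3))
```

Because every imputed value falls exactly on the fitted line, the imputed cases show a perfect correlation between AGE and QL, which is the source of the overestimated correlation.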

19

Stochastic Regression Imputation

Uses regression equations to predict the incomplete data and adds a normally distributed residual term to each prediction.

Advantages
• It is the most appropriate of the single imputation methods.
• It gives results approximately equal to those from the complete data.
• It gives unbiased parameter estimates under an MAR data mechanism.
Disadvantage
• It underestimates standard errors.

20

AGE   QL   QL_missing   RV      SI
35    90   90
36    89   89
38    88   88
38    87   87
41    82   82
45    80   80
47    78   78
48    76   76
49    71   71
55    73   73
57    70   70
59    70   70
62    68   __           5.67    70.69
65    67   __           3.72    66.08
68    67   __           -4.13   55.57
72    63   __           -0.39   55.77
72    60   __           -7.20   48.96
73    59   __           2.39    57.66
75    52   __           -6.64   46.86
76    51   __           1.84    54.45

QL = β0 + β1*AGE + z_i
QL = 119.950 - 0.886*AGE + z_i

        RI       SI
mean    70.74    70.50
SD      12.726   13.61
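Stochastic regression imputation adds a random residual to each predicted value. The sketch below assumes the residual is drawn from a normal distribution with mean zero and a standard deviation estimated from the complete-case residuals; the slides do not say how their residual values were generated, so the numbers will not match the table. The function name and the seed are hypothetical.

```python
import random

def stochastic_regression_impute(x, y, seed=0):
    """Impute missing y (None) as b0 + b1 * x + z, with z ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(obs)
    mx = sum(a for a, _ in obs) / n
    my = sum(b for _, b in obs) / n
    sxx = sum((a - mx) ** 2 for a, _ in obs)
    b1 = sum((a - mx) * (b - my) for a, b in obs) / sxx
    b0 = my - b1 * mx
    # Residual standard deviation from the complete cases (n - 2 d.f.).
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in obs)
    sigma = (sse / (n - 2)) ** 0.5
    return [b if b is not None else b0 + b1 * a + rng.gauss(0, sigma)
            for a, b in zip(x, y)]

age = [35, 36, 38, 38, 41, 45, 47, 48, 49, 55,
       57, 59, 62, 65, 68, 72, 72, 73, 75, 76]
ql_missing = [90, 89, 88, 87, 82, 80, 78, 76, 71, 73, 70, 70] + [None] * 8
completed = stochastic_regression_impute(age, ql_missing)
```

The added noise restores the scatter around the regression line that plain regression imputation removes.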

21

Multiple imputation

Creates several copies of the data and imputes each copy with different plausible estimates of missing values.

22

Procedure

I. Imputation phase
• Data augmentation
» I-step
» P-step

II. Analysis phase
• A statistical analysis is performed on each data set using the same technique.

III. Pooling phase
• Estimates and their standard errors are combined into a single set of values.
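A common way to carry out the pooling phase is Rubin's rules: the point estimates are averaged, and the pooled variance is the average within-imputation variance plus an inflated between-imputation variance. The sketch below is an illustration with made-up estimates and standard errors; the function name `pool` is hypothetical.

```python
import statistics

def pool(estimates, std_errors):
    """Pool m estimates and their standard errors with Rubin's rules.
    Returns (pooled_estimate, pooled_standard_error)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean(se ** 2 for se in std_errors)
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical mean-QL estimates from three imputed data sets.
estimates = [70.5, 71.2, 69.8]
std_errors = [2.0, 2.1, 1.9]
pooled_estimate, pooled_se = pool(estimates, std_errors)
print(round(pooled_estimate, 2), round(pooled_se, 3))
```

The between-imputation term is what lets multiple imputation reflect the uncertainty due to the missing values, which single imputation ignores.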

23

Data augmentation

[Figure: data augmentation cycle. I-step (stochastic imputation) → new data set → P-step, repeated to produce Data set 1, Data set 2, ..., Data set 20.]

24

Conclusion

• Imputation is an attractive idea because it produces a complete data set and makes the data usable.
• Most single imputation methods produce biased parameter estimates.
• Stochastic regression is the only traditional approach that yields unbiased estimates under an MAR mechanism.
• Multiple imputation also produces similar estimates.
• Techniques are scarce when the data are categorical and when the missing data mechanism is MNAR.

25

References

• Amanda N. Baraldi & Craig K. Enders: An introduction to missing data analysis. Journal of School Psychology. 2009; 9-18.

• Craig K. Enders: Applied missing data analysis. The Guilford Press. New York. London. 2010; 2-85.

• James R. Carpenter: Missing data in randomized controlled trials: A practical guide. 2007; 4-16.

• Schafer J. L. & Graham J. W.: Missing data: Our view of the state of the art. 2002; 147-77.

26

Thank You
