Improved rounding methods for
binary and ordinal variables under
multivariate normal imputation
Milena A. Jacobs
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
Department of Mathematics & Statistics.
August, 2015
Abstract
Missing data are common in epidemiological studies. Multivariate normal imputa-
tion (MVNI) is a popular method of handling missing data that imputes missing
values assuming a multivariate normal distribution. This presents a dilemma when
imputing categorical variables as these are not normally distributed. Should the
continuous imputations be rounded, and if so, which rounding method should be
used?
The objective of this study is to evaluate and compare existing methods and
develop new methods of rounding categorical variables under MVNI. We focus on
missingness in covariates rather than outcome variables. This is because MVNI
generally has little or no benefit over complete case analysis if missingness is in an
outcome variable only.
A number of different rounding methods have been proposed for binary variables,
including simple rounding, adaptive rounding and calibration. However, no studies
to date have compared adaptive rounding with calibration. We performed a large
simulation study in Stata to compare the above rounding methods with unrounded
MVNI, and with a new method that we developed called proportional rounding.
Proportional rounding produced similar results to adaptive rounding and calibration
but was faster and easier to implement.
To date, several rounding methods have been proposed for ordinal variables.
Distance-based rounding (DBR) and projected distance-based rounding (PDBR)
are indicator-based methods, while crude rounding, calibration and mean indicator-
based rounding (MIBR) are continuous methods. Previous studies have demon-
strated the inadequacy of DBR, PDBR and crude rounding for rounding categorical
variables with up to seven categories. Calibration and MIBR perform well in some
settings but they are two-stage methods that are time-consuming to implement, par-
ticularly for large data sets. An alternative method of imputing ordinal variables is
fully conditional specification (FCS). There have been no studies to date comparing
FCS with MVNI-based rounding methods for ordinal exposure variables.
We performed a comprehensive simulation study in Stata to compare FCS with
MVNI-based rounding methods for ordinal variables and with our new methods,
continuous proportional rounding (CPR) and indicator-based proportional rounding
(IBPR). These were also compared with ordinal rounding, another new method we
developed. CPR, IBPR and ordinal rounding performed as well as or better than
the other rounding methods in terms of bias, RMSE and estimates of proportions.
The main advantages of the three new methods are their computational speed and
ease of implementation compared to calibration and MIBR.
Epidemiological studies often examine the effect of levels of an ordinal expo-
sure variable on an outcome. It is therefore important to handle missing data in a
way that preserves relationships between the variables in the data set and leads to
statistically valid inferences. Currently, there are no methods for rounding ordinal
variables that preserve marginal proportions as well as associations for a non-linear
exposure-outcome relationship. Our new method IBPR is recommended over exist-
ing methods as it preserves the non-linear relationship and the marginal distribution
of the ordinal variable.
Key Words: binary, categorical, fully conditional specification, missing data,
missingness mechanisms, multiple imputation, MVNI, ordinal, rounding.
Contents
Abstract iii
List of Tables ix
List of Figures xi
Glossary xiii
Acknowledgements xv
Preface xvii
0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
0.2 Original work in this thesis . . . . . . . . . . . . . . . . . . . . . . . . xvii
0.3 Thesis organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
0.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction to Missing Data 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Types of missing data . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Missing data patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Missing data mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5.1 Missing at random (MAR) . . . . . . . . . . . . . . . . . . . . 5
1.5.2 Missing completely at random (MCAR) . . . . . . . . . . . . 8
1.5.3 Missing not at random (MNAR) . . . . . . . . . . . . . . . . . 9
1.6 Available at random (AAR) . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Ignorability and the MAR assumption . . . . . . . . . . . . . . . . . 11
1.8 Planned missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Methods of handling missing data 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Complete case analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Pairwise deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Single imputation methods . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Mean imputation . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Hot deck imputation . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Obtaining standard errors from the EM algorithm . . . . . . . 22
2.5.2 Using a hybrid method to accelerate convergence . . . . . . . 23
2.6 Multiple imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Fully conditional specification . . . . . . . . . . . . . . . . . . . . . . 25
2.8 Predictive mean matching . . . . . . . . . . . . . . . . . . . . . . . . 27
2.9 Inverse probability weighting . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 Methods for data missing not at random . . . . . . . . . . . . . . . . 29
2.10.1 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10.2 Pattern mixture model . . . . . . . . . . . . . . . . . . . . . . 31
2.10.3 Issues associated with MNAR data . . . . . . . . . . . . . . . 33
3 Multiple Imputation 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Imputation phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Analysis and pooling phases . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Multivariate normal imputation . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 The I-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 The P-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.6 Obtaining the m imputed data sets . . . . . . . . . . . . . . . 46
3.5 Comparison with ML estimation . . . . . . . . . . . . . . . . . . . . . 46
3.6 Specifying the imputation model . . . . . . . . . . . . . . . . . . . . . 47
3.7 Number of imputations . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Exploratory Data Analysis 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 The NHANESIII data set . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Relationship between high blood pressure and other variables . . . . . 58
5 Rounding methods for binary variables 61
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Rounding methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.1 Simple Rounding . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.2 Adaptive rounding . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Proportional rounding: a new rounding method . . . . . . . . . . . . 68
5.4 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5.1 Missingness models . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5.3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Rounding methods for ordinal variables 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Existing indicator-based methods . . . . . . . . . . . . . . . . . . . . 85
6.2.1 Projected distance-based rounding . . . . . . . . . . . . . . . 86
6.2.2 Distance-based rounding . . . . . . . . . . . . . . . . . . . . . 86
6.3 Comparison of DBR and PDBR . . . . . . . . . . . . . . . . . . . . . 88
6.4 Existing continuous methods . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Crude rounding . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.2 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4.3 Mean indicator-based rounding . . . . . . . . . . . . . . . . . 92
6.5 Continuous proportional rounding . . . . . . . . . . . . . . . . . . . . 93
6.6 Indicator-based proportional rounding . . . . . . . . . . . . . . . . . 94
6.7 Ordinal rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.8 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 96
6.9 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.9.1 Missingness Models . . . . . . . . . . . . . . . . . . . . . . . . 99
6.9.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.9.3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 100
6.10 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Rounding ordinal variables: non-linear relationship 113
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8 Discussion and Conclusion 127
Bibliography 131
List of Tables
4.1 Summary statistics for the full data set (n = 16963). . . . . . . . . . 52
4.2 High blood pressure by sex. . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 High blood pressure by race. . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 High blood pressure by smoking category. . . . . . . . . . . . . . . . 60
5.1 The original, duplicated and stacked data sets for calibration prior to
imputation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 True values of race coefficient β5, its standard error and proportion p
of overweight subjects for each data set. . . . . . . . . . . . . . . . . 70
5.3 Comparison of rounding methods for binary variables under MCAR. . 79
5.4 Comparison of rounding methods for binary variables under MAR. . . 80
5.5 Comparison of rounding methods for binary variables under MNAR. . 81
6.1 Estimates of coefficients β1 and β2 under MCAR. . . . . . . . . . . . 104
6.2 Estimates of coefficients β1 and β2 under MAR. . . . . . . . . . . . . 105
6.3 Estimates of proportions in each category under MCAR. . . . . . . . 106
6.4 Estimates of proportions in each category under MAR. . . . . . . . . 106
7.1 Estimates of coefficients β1 and β2 for a non-linear exposure-outcome
relationship under MCAR. . . . . . . . . . . . . . . . . . . . . . . . . 120
7.2 Estimates of coefficients β1 and β2 for a non-linear exposure-outcome
relationship under MAR. . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3 Estimates of proportions in each category for a non-linear exposure-
outcome relationship under MCAR. . . . . . . . . . . . . . . . . . . . 122
7.4 Estimates of proportions in each category for a non-linear exposure-
outcome relationship under MAR. . . . . . . . . . . . . . . . . . . . . 122
List of Figures
1.1 An example of MAR-linear missingness . . . . . . . . . . . . . . . . . 7
1.2 An example of MAR-convex missingness . . . . . . . . . . . . . . . . 8
4.1a Histogram of the variable age. . . . . . . . . . . . . . . . . . . . . . . 54
4.1b Boxplot of the variable age. . . . . . . . . . . . . . . . . . . . . . . . 54
4.2a Histogram of the variable weight (in kilograms). . . . . . . . . . . . . 55
4.2b Boxplot of the variable weight (in kilograms). . . . . . . . . . . . . . 55
4.3a Histogram of the variable height (in cm). . . . . . . . . . . . . . . . . 56
4.3b Boxplot of the variable height (in cm). . . . . . . . . . . . . . . . . . 56
4.4a Histogram of BMI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4b Boxplot of BMI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Boxplots of BMI by high blood pressure category. . . . . . . . . . . . 59
4.6 Boxplot of age by high blood pressure category. . . . . . . . . . . . . 59
5.1 Adaptive rounding thresholds for 0 < ω < 1. . . . . . . . . . . . . . . 67
5.2 Overview of simulations comparing methods for binary variables. . . . 73
6.1 Proportion by weight category in the full data set (n = 16963). . . . . 97
6.2 Proportion of observations with high blood pressure by weight cate-
gory in the full data set (n = 16963). . . . . . . . . . . . . . . . . . . 98
6.3 Odds of high blood pressure by weight category in the full data set
(n = 16963). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 RMSEs for β1 under MCAR. . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 RMSEs for β2 under MCAR. . . . . . . . . . . . . . . . . . . . . . . . 107
6.6 RMSEs for β1 under MAR. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.7 RMSEs for β2 under MAR. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.8 Euclidean distances under MCAR. . . . . . . . . . . . . . . . . . . . 109
6.9 Euclidean distances under MAR. . . . . . . . . . . . . . . . . . . . . 109
7.1 Proportion by weight category in the data set with n = 5000. . . . . . 115
7.2 Proportion of observations with high blood pressure by weight cate-
gory in the data set with n = 5000. . . . . . . . . . . . . . . . . . . . 116
7.3 Odds of high blood pressure by weight category in the data set with
n = 5000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 RMSEs for β1 for a non-linear relationship under MCAR. . . . . . . . 123
7.5 RMSEs for β2 for a non-linear relationship under MCAR. . . . . . . . 123
7.6 RMSEs for β1 for a non-linear relationship under MAR. . . . . . . . . 124
7.7 RMSEs for β2 for a non-linear relationship under MAR. . . . . . . . . 124
7.8 Euclidean distances for a non-linear relationship under MCAR. . . . . 125
7.9 Euclidean distances for a non-linear relationship under MAR. . . . . 125
Glossary
AAR available at random
CCA complete case analysis
CPR continuous proportional rounding
DBR distance-based rounding
FCS fully conditional specification
IBPR indicator-based proportional rounding
MAR missing at random
MCAR missing completely at random
MI multiple imputation
MIBR mean indicator-based rounding
ML maximum likelihood
MNAR missing not at random
MVNI multivariate normal imputation
PDBR projected distance-based rounding
Acknowledgements
First and foremost, I would like to thank my supervisor, Dr R. Nazim Khan, for his
valued advice and support. In addition I would like to thank the people that offered
me support and encouragement throughout my thesis, especially my partner Symon
Aked and my parents Melanie and Vlad.
A special thanks goes to Dr Robin K. Milne for his valuable comments that
helped to improve this manuscript. I would also like to acknowledge Prof Nicholas
de Klerk, from the Telethon Kids Institute, who encouraged me to study missing
data problems. In addition I would like to thank those that helped me in times of
stress, in particular the contributions of Mango.
This study was supported by the following scholarships: Australian Postgraduate
Award (APA), UWA Safety Net Top-Up Scholarship, Bruce and Betty Green Post-
graduate Research Top-Up Scholarship, and the Telethon Kids Institute AREST CF
Postgraduate Top-Up Scholarship.
Preface
0.1 Introduction
Multivariate normal imputation (MVNI) is a method of multiple imputation
that accommodates a general missing data pattern with missingness across different
types of variables. It ‘fills in’ or imputes missing values assuming a multivariate
normal distribution for the data. There are two important issues to consider when
using MVNI to impute categorical variables. The first is how the categorical variable
will be imputed. Nominal (unordered) variables are imputed as a set of indicator
variables. However, ordinal variables may be imputed either as a single ‘continuous’
variable or as a set of indicator variables.
Since MVNI assumes multivariate normality, all the imputed values are on a
continuous scale regardless of whether an indicator-based or continuous approach is
used. Therefore the second issue is how each imputed value is assigned to one of
the relevant categories, a process referred to as ‘rounding’. Although it is possible
to use the unrounded imputed values, this is not viable if the substantive analysis
involves estimating the relationship between the levels of an ordinal variable and an
outcome [29].
The objective of this study is to evaluate and compare existing methods and
develop new methods of rounding categorical variables under MVNI. To compare
the methods, we performed large scale simulation studies in Stata using data derived
from the NHANESIII data set [26, Chapter 6].
0.2 Original work in this thesis
The original contributions in this thesis are as follows.
1. The major original contribution is the development of three new methods for
rounding categorical variables under MVNI:
(i) continuous proportional rounding (CPR), a method for binary and ordi-
nal variables;
(ii) indicator-based proportional rounding (IBPR), a method for ordinal or
nominal variables; and
(iii) ordinal rounding, a method for ordinal variables.
The key advantages of these compared to existing methods are their ease of
implementation and computational speed.
None of the previous methods for rounding ordinal variables preserve both
marginal proportions and associations for a non-linear exposure-outcome re-
lationship. Using simulations, we show that IBPR preserves the non-linear
relationship as well as the marginal distribution of the ordinal variable.
2. We perform large scale simulation studies to compare adaptive rounding with
the calibration method for rounding binary variables.
3. We compare fully conditional specification (FCS) with MVNI-based rounding
methods when the substantive analysis involves estimating the relationship
between the levels of an ordinal exposure and an outcome.
4. A comprehensive survey of the literature and methods is also presented.
All the simulations in this study were performed using Stata [54] statistical soft-
ware, which has convenient inbuilt functions for performing large scale simulations
and multiple imputation.
0.3 Thesis organisation
The thesis is organised into eight chapters. The first chapter provides an intro-
duction to missing data, including types of missing data, missing data patterns and
missingness mechanisms. In Chapter 2 we discuss methods of missing data handling,
from traditional methods such as complete case analysis and pairwise deletion to
modern methods such as multiple imputation and maximum likelihood estimation.
Chapter 3 provides a detailed description of multiple imputation and, in particular,
MVNI. The data set and variables used in the study are described in Chapter 4.
The original work in this thesis is contained in Chapters 5, 6 and 7. In Chap-
ter 5 we examine methods for rounding binary variables under MVNI and introduce
our new method, proportional rounding. In Chapter 6 we compare MVNI-based
rounding methods for ordinal variables with fully conditional specification (FCS),
and introduce our new methods CPR, IBPR and ordinal rounding. A comparison
of FCS with MVNI-based rounding methods for a non-linear exposure-outcome re-
lationship is presented in Chapter 7. Finally, Chapter 8 is devoted to discussion and
conclusions.
0.4 Publications
A paper on CPR and ordinal rounding is undergoing final editing for submission.
A second paper on IBPR is under preparation. A third survey paper on rounding
methods is planned.
CHAPTER 1
Introduction to Missing Data
1.1 Introduction
Missing data are encountered in many research contexts, including medical stud-
ies and the social and behavioural sciences. Multivariate data sets often contain
substantial missing data for reasons such as attrition, nonresponse and errors in
data collection. Two main problems are associated with missing data: biased pa-
rameter estimates and loss of efficiency [8, p.9]. Loss of efficiency is a result of fewer
observations being available for analysis. The extent of the loss depends on the
type of analysis being undertaken as well as the proportion of missing data [8, p.9].
Missing data may also lead to biased parameter estimates if the observed values are
not representative of the full data set. It is therefore important to handle missing
data in a way that preserves relationships between the variables in the data set and
leads to statistically valid inferences. The primary analysis of interest is generally
referred to as the substantive analysis to distinguish it from the model(s) used to
handle missing data.
The outline of this chapter is as follows. In Section 1.2 we provide an overview
of two broad types of missing data: item nonresponse and wave nonresponse. Sec-
tion 1.3 defines the notation. In Section 1.4 we discuss missing data patterns, which
may be broadly classified into six different types. Section 1.5 describes the three
types of missing data mechanisms: missing at random (MAR), missing completely
at random (MCAR) and missing not at random (MNAR). In Section 1.6 we discuss
a special condition known as available at random (AAR). The concept of ‘ignorabil-
ity’ under an MAR mechanism is described in Section 1.7. The chapter concludes
with a discussion in Section 1.8 of planned missing data.
1.2 Types of missing data
Two types of missing data are described in the literature: item nonresponse
and wave nonresponse [18, p.4]. Item nonresponse occurs when a subject does not
respond to an item in a survey. There are many reasons for this type of nonresponse.
The respondent may not know the answer to the question(s) or may have run out
of time to complete the survey. Respondents may be uncomfortable disclosing the
answer to sensitive questions, for example questions about drug and alcohol use or
infidelity. The data may also be missing as a result of data collection/recording
errors or equipment malfunction.
Wave nonresponse relates to longitudinal data, collected at two or more time
points called waves. This type of nonresponse is a result of a subject not partic-
ipating in the survey at a particular wave, perhaps due to relocation or personal
circumstances. Sometimes the nonresponse may be related to the treatment itself,
for example in a drug study where the respondent experiences an adverse reaction.
There are two types of wave nonresponse. In the first, the respondent is absent
from a wave but returns to complete subsequent waves. The second type is known
as attrition or drop out, where the respondent is absent from a wave and does not
return to the study; this type is generally more problematic [18, p.9].
1.3 Notation
Let X = (xij) denote an n × k data set (n cases with k variables), where xij
corresponds to the value of variable j for case i. Let vectors Xmis and Xobs denote
the missing and observed components of X respectively. We denote the parameters
of the substantive analysis model by β.
For example, suppose X is a data set with 3 cases and 2 variables, given by
X =
x11 x12
x21 x22
x31 x32
.

Suppose that variable X1 is fully observed and variable X2 is incomplete with
missing data on x12 and x32. Then Xobs = (x11, x21, x22, x31) and Xmis = (x12, x32).
Define the n × k missing data indicator matrix R such that Rij = 1 if xij is
observed and 0 otherwise, i = 1, 2, . . . , n and j = 1, 2, . . . , k. In the example above,
the missing data matrix R is given by
R =
1 0
1 1
1 0
Let Ri be the row vector corresponding to row i of R. Then Ri is the missing
data indicator vector for case i. The complete cases have Ri = 1 = (1, 1, . . . , 1), a
k-vector of ones. The incomplete cases have Rij = 0 for at least one j = 1, 2, . . . , k.
In the example above, case 1 has missingness on variable X2 so R1 = (1, 0).
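For illustration only (the thesis itself uses Stata), the indicator matrix R for the 3 × 2 example above can be computed from a data array with missing entries coded as NaN. The observed values in this Python/NumPy sketch are arbitrary placeholders, not data from the study:

```python
import numpy as np

# The 3 x 2 example above: X2 is missing (NaN) for cases 1 and 3.
# The observed entries are arbitrary placeholder values.
X = np.array([[1.2, np.nan],
              [0.7, 3.4],
              [2.1, np.nan]])

# R_ij = 1 if x_ij is observed, 0 otherwise.
R = (~np.isnan(X)).astype(int)

# Complete cases are the rows of R equal to a vector of ones.
complete = np.all(R == 1, axis=1)
```

Here R recovers the matrix shown above, with its first row (1, 0) flagging the missingness on x12.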
1.4 Missing data patterns
A missing data pattern refers to the arrangement of observed and missing values
in a data set. If X consists of k variables, there are potentially up to 2^k missing
data patterns. For example, if X consists of two variables, X1 and X2, there are
potentially 4 missing data patterns:
1. cases that are complete for X1 and X2;
2. cases with missingness on X1 only;
3. cases with missingness on X2 only;
4. cases with missingness on both X1 and X2.
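The distinct rows of the indicator matrix R enumerate the patterns actually present in a data set. A small Python/NumPy sketch (the data are illustrative, not from the thesis; for k = 2, all four possible patterns happen to occur here):

```python
import numpy as np

# Illustrative 5 x 2 data set in which every pattern for k = 2 occurs.
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [1.0, np.nan],
              [np.nan, np.nan],
              [3.0, 4.0]])

R = (~np.isnan(X)).astype(int)

# Each distinct row of R is one missing data pattern;
# counts gives the number of cases exhibiting each pattern.
patterns, counts = np.unique(R, axis=0, return_counts=True)
```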
In general, missing data patterns are classified into six different types [15, p.4].
These are discussed below.
1. Univariate — Data are missing for only one variable. This was one of the first
missing data patterns to be addressed in the literature.
2. Unit nonresponse — This occurs in sample surveys where a subset of indi-
viduals does not complete the questionnaire. The incomplete variables are
the unanswered items and the fully observed variables are the survey design
variables measured for both respondents and nonrespondents [33, p.5].
3. Monotone — The variables can be arranged so that if variable Xj, j =
1, . . . , k− 1, is missing for a case then variables Xj+1, . . . , Xk are also missing
for that case. This pattern describes attrition in longitudinal studies where
subjects drop out before the end of the study. Monotone missing data patterns
simplify missing data handling since they do not require iterative estimation
algorithms.
4. General — A ‘haphazard’ pattern of missingness across variables.
5. Planned — An example is the three-form questionnaire design described by
Graham et al. [19], discussed in Section 1.8.
6. Latent variable — In a latent variable analysis, there is a set of observed
‘manifest’ variables and a set of unobserved ‘latent’ variables. This is a special
type of missing data pattern since the latent variables are unobservable but
are conceptualised as ‘missing data’.
1.5 Missing data mechanisms
The missing data mechanism describes the relationship between the probability
of missingness and the variables in the data set. That is, it specifies the conditional
distribution of R given X, denoted by p(R | X, ξ), where ξ represents some un-
known model parameters [33, p.12]. The notation p() refers to a probability mass
function or probability density function, as appropriate. Rubin [42] describes three
different types of missingness mechanisms: missing at random (MAR), missing com-
pletely at random (MCAR) and missing not at random (MNAR).
1.5.1 Missing at random (MAR) Here the missingness may depend on
the observed data but not on the missing values themselves, that is
p(R |X, ξ) = p(R |Xobs, ξ). (1.1)
For example, less educated respondents may be less likely to answer survey questions
about political preferences. The missingness is therefore dependent on one or more of
the observed variables (education) but not on the incomplete variable itself (political
preference).
Reading speed is a classic example of MAR missingness [18, p.13]. Slower readers
may leave items blank because they have run out of time to complete the survey.
However, reading speed is a variable that can be measured and incorporated into
the missing data handling procedure to adjust for any bias due to the nonresponse.
It is important to note that data are MAR only if there is no relationship between
the incomplete variable and the probability of missingness after controlling for the
other observed variables in the data set. It is not usually possible to determine if
data are MAR without knowing the values of the missing data [15, p.6].
Schafer [46, pp.20–22] describes some situations where an MAR mechanism is
known to hold.
1. Double sampling in sample surveys. Here characteristics X1, X2, . . . , Xp are
recorded for all subjects in the sample, while characteristics Xp+1, . . . , Xk
are recorded only for a subsample of subjects. If this subsample is chosen
based entirely on the observed values X1, X2, . . . , Xp, then the missing values
Xp+1, . . . , Xk for the subjects not included in the subsample will be MAR.
2. Nonresponse follow-up. On follow-up, if responses are obtained from a random
sample of subjects who had previously not responded, then the missingness
mechanism for the remaining subjects that were not followed up is MAR.
3. Experiments with unbalanced designs. In an unbalanced experimental design,
the sample sizes for the treatment combinations are not all equal. The ‘missing’
data have a probability of missingness equal to one and are therefore MAR.
4. Medical studies with multiple tests where not all tests are administered to all
subjects. In some medical studies, not all tests are given to all subjects. The
missing data will be MAR, provided that all information that is used to select
the samples is included in the observed data.
5. Matrix sampling of questionnaire items. If a questionnaire is divided into
sections that are given to subjects in a randomized manner, then data will be
missing for the sections of the questionnaire that were not given to some of
the subjects. The missing data will be MAR, provided that all the variables
used in the sampling process are included in the observed data.
Collins et al. [10] describe three types of MAR missingness. The first type is
called MAR-linear, in which the probability of missingness on the variable Y is a
linear function of another measured variable Z. For example,
Pr(Ymis | Z) =
0.2 if Z=1,
0.4 if Z=2,
0.6 if Z=3,
0.8 if Z=4.
That is, Pr(Ymis | Z) = 0.2Z as shown in Figure 1.1.
Figure 1.1: An example of MAR-linear missingness
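The MAR-linear mechanism above can be simulated directly: the probability that Y is missing depends only on the fully observed Z through Pr(Ymis | Z) = 0.2Z. A minimal Python/NumPy sketch (the thesis uses Stata; this version and its variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

Z = rng.integers(1, 5, size=n)      # fully observed cause, values 1..4
Y = Z + rng.normal(size=n)          # incomplete variable, related to Z

# MAR-linear: Pr(Y missing | Z) = 0.2 Z, depending on Z only.
missing = rng.random(n) < 0.2 * Z
Y_obs = np.where(missing, np.nan, Y)

# Empirical missingness rates by level of Z approximate 0.2, 0.4, 0.6, 0.8.
rates = [missing[Z == z].mean() for z in (1, 2, 3, 4)]
```

Because Z is fully observed, it can be included in the imputation model to adjust for the nonresponse, exactly as in the reading-speed example.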
The second type is called MAR-convex, in which the probability of missingness
on Y is a non-linear function of Z. For example,
Pr(Ymis | Z) =
0.8 if Z=1,
0.2 if Z=2,
0.2 if Z=3,
0.8 if Z=4,
as shown in Figure 1.2.
The third type is referred to as MAR-sinister, in which the probability of miss-
ingness on Y is a function of the correlation rXZ between X (another measured
variable) and Z (the cause of missingness). For example,
Pr(Ymis | X,Z) =
0.8 if rXZ is high,
0.2 if rXZ is low.
The interpretation of a ‘high’ or ‘low’ correlation depends upon the substantive
analysis.
Figure 1.2: An example of MAR-convex missingness
It is important to note that if the substantive analysis model and/or the missing
data analysis model do not incorporate the causes of missingness then the missing-
ness is not MAR.
MAR missingness is sometimes referred to as ignorable missingness [42]. How-
ever, as Graham [18, p.15] points out, this does not mean that the causes of missing-
ness can be ignored, only that the distribution of the missing data can be ignored.
If the variables related to the missingness are incorporated into the missing data
handling procedure, then there is no need to estimate the parameters of the missing
data distribution [42].
1.5.2 Missing completely at random (MCAR) In an MCAR mechanism,
the missingness is unrelated to any of the variables in the data set, that is
p(R |X, ξ) = p(R | ξ). (1.2)
For example, data may be missing as a result of equipment breakages, unexpected
personal events or administrative errors, none of which are related to the data. The
observed data is therefore a simple random sample of the full data set and the
missingness does not result in bias. Sometimes researchers employ planned missing
data designs to intentionally produce MCAR missingness. An example of this is the
three-form design, discussed in Section 1.8. Note that MCAR is a special case of
MAR.
1.5.3 Missing not at random (MNAR) Here the missingness is dependent
on unmeasured variables, that is
p(R |X, ξ) = p(R |Xobs,Xmis, ξ). (1.3)
For example, high income respondents may be less likely to answer survey questions
related to income, so missingness on the income variable is related to income itself.
This will result in bias since data is missing from the upper tail of the income
distribution. In general, it is not possible to determine if data are MNAR from the
observed data alone [15, p.8].
An MNAR mechanism can occur in two ways [15, p.14]. In the first, direct
MNAR, the probability of missingness is directly related to the incomplete vari-
able. In the second, indirect MNAR, the probability of missingness is related to
the incomplete variable indirectly through mutual correlation with an unmeasured
variable. A direct MNAR mechanism may produce substantial bias [10]. However,
bias is an issue for an indirect MNAR mechanism only if the correlation between
the unmeasured variable and the incomplete variable is “relatively strong” (abso-
lute value greater than 0.40) [10]. Note that an indirect MNAR mechanism becomes
an MAR mechanism when the unmeasured variable is included in the analysis [15,
p.15].
1.6 Available at random (AAR)
Until recently, it was generally accepted that an MCAR mechanism was necessary
for the complete cases to represent a simple random sample of the target sample
[42]. However, Galati & Seaton [16] proved that a less stringent condition, which
they refer to as available at random (AAR), is sufficient.
Using the notation in Section 1.3, the distribution of the complete cases is ob-
tained from the joint distribution p(X,R) by conditioning on the event Ri = 1,
that is,
p(X | Ri = 1), (1.4)
where 1 is a 1 × k vector of ones representing the response pattern for a complete
case. The MCAR mechanism implies that the complete cases form a simple random
sample of the target sample. A sufficient condition for this is that [16]
p(X = x | Ri = 1) = p(X = x), (1.5)
for all x. Now
p(X = x | Ri = 1) = p(X = x,Ri = 1) / p(Ri = 1)
                  = p(Ri = 1 |X = x) p(X = x) / p(Ri = 1). (1.6)
Substituting (1.5) in (1.6) yields
p(Ri = 1 |X = x) = p(Ri = 1). (1.7)
If the condition in (1.7) holds, the complete cases are referred to as available at
random (AAR) [16] with respect to the joint model p(X,R). That is, the probability
of a case being complete does not depend on X.
AAR is a less stringent condition than MCAR because it involves only one con-
straint (on missing data pattern Ri = 1) while MCAR involves constraints on up
to 2^k − 1 missing data patterns [16]. If there is only one incomplete variable in X
then MCAR and AAR are equivalent [16].
We note that, regardless of the number of incomplete variables, AAR and MCAR
will be equivalent if there are only two missing data patterns and some cases are
complete. For example, suppose we have variables X = (X1,X2,X3) and only two
missing data patterns (1, 1, 1) and (0, 0, 1). If AAR holds, then Pr(Ri = (1, 1, 1)) = α,
where α is a constant that does not depend on X. Since there are only two missing
data patterns, Pr(Ri = (0, 0, 1)) = 1 − α, which is also constant with respect to X. In this case,
AAR and MCAR are equivalent. This argument can be extended to a data set with
more than three variables.
Galati & Seaton [16] further demonstrated that AAR can hold for an MCAR,
MAR or MNAR mechanism provided that the probability of being a complete case
is constant. Thus if AAR holds, the complete cases form a simple random sample
of the target sample regardless of the missingness mechanism [16].
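The claim that AAR complete cases form a simple random sample of the target sample can be checked by simulation. The Python sketch below (the choice of three variables, two missing data patterns and α = 0.6 mirrors the example above but is otherwise arbitrary) makes the probability of being a complete case constant and compares complete-case summaries with full-data summaries.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X1 = rng.normal(0.0, 1.0, n)
X2 = 0.5 * X1 + rng.normal(0.0, 1.0, n)
X3 = rng.normal(2.0, 1.0, n)          # observed under both patterns

# Two missing data patterns: (1, 1, 1) with constant probability alpha,
# otherwise (0, 0, 1).  Because Pr(complete) does not depend on X, AAR holds.
alpha = 0.6
complete = rng.random(n) < alpha

# Complete cases behave like a simple random sample: their means and
# correlations match the full data (differences should be near zero).
print(round(X1[complete].mean() - X1.mean(), 3))
print(round(np.corrcoef(X1[complete], X2[complete])[0, 1]
            - np.corrcoef(X1, X2)[0, 1], 3))
```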
1.7 Ignorability and the MAR assumption
Little & Rubin [33, p.119] showed that if there is an MAR mechanism and
the parameters β and ξ are distinct, then likelihood-based inferences for β (the
parameters of the substantive analysis model) are not affected by ξ (the parameters
of the missing data distribution). This is referred to as ignorable missingness. A
loose definition of ‘distinct’ is that the value of β provides little information about
ξ and vice-versa [46, p.11]. A more precise definition of distinctness is given by
Schafer [46, p.11] as follows. From a frequentist perspective, the parameters are
distinct if the joint parameter space of (β, ξ) is the Cartesian cross-product of the
individual parameter spaces for β and ξ. From a Bayesian perspective, any joint
prior distribution applied to (β, ξ) must factor into independent marginal priors for
β and ξ [46, p.11].
Schafer [46, p.12] provides a concise proof of ignorability as follows. The joint
probability distribution of the observed data is given by
p(R,Xobs | β, ξ) = ∫ p(R,X | β, ξ) dXmis
                 = ∫ p(R |X, ξ) p(X | β) dXmis. (1.8)
Note that the integral is replaced by summation for discrete distributions.
Under an MAR assumption, p(R |X, ξ) = p(R |Xobs, ξ), so
p(R,Xobs | β, ξ) = p(R |Xobs, ξ) ∫ p(X | β) dXmis
                 = p(R |Xobs, ξ) p(Xobs | β). (1.9)
Thus the likelihood of the observed data under MAR factorises into two separate
components: a function depending on the parameters of interest β, and a function
depending on ξ, whose elements are regarded as ‘nuisance’ parameters. If both MAR and
distinctness hold, the parameters ξ of the missing data distribution are ignorable
for inferences on β [33, p.119].
Van Buuren [57, p.223] describes ignorability as “...the belief on the part of the
user that the available data are sufficient to correct for the effects of the missing
data”. In general, there is no definitive way to determine if data are MAR, but
an MAR assumption may be made more plausible by including, in the missing data
handling procedure, variables that are known to be correlated with the causes of
missingness and/or the incomplete variables [15, pp.16–17]. This is known as an
inclusive analysis strategy. Variables not of substantive interest but included in
the missing data handling procedure and/or analysis model are known as auxiliary
variables.
Ideally, missing data would be anticipated and planned for in the design of the
study to support the MAR assumption [18, p.38]. For example, variables that
may explain potential missingness should be included in the questionnaire. For
longitudinal studies, Schafer & Graham [49] recommend that respondents be asked
to report their likelihood of dropping out. In some cases, the missing data can be
‘converted’ to MAR. For example, missing data corresponding to nonresponse in
surveys may not be MAR. However, such missing data can be converted to MAR
by following up a random sample of nonrespondents [49].
1.8 Planned missing data
The planned missing data design intentionally produces MCAR missingness. It
allows researchers to collect data on all the variables of interest while reducing
the burden on respondents [15, p.23]. Missing data handling procedures, such as
multiple imputation (MI) and maximum likelihood estimation (MLE), can then be
used to analyse the data.
Graham et al. [19] developed the three-form design, which divides the question-
naire items into four sets denoted by X,A,B,C. The items in X are the questions
that are central to the research hypotheses. Three questionnaires (forms) are pre-
pared, each containing X and only two of A, B and C.
Graham [18, p.291] recommends having the same number of items in each of the
four sets, including X. For example, a questionnaire containing 200 items may be
split into four sets X, A, B and C, each containing 50 items. Each subject would
only answer 150 of the 200 items but the researcher would have information on all
200 items.
The order of the sets in each form is important [18, p.291]. Usually X is placed
first in each form since it contains the questions that are central to the research.
However, Graham [18, p.291] notes that it may sometimes be beneficial to place
some items belonging to X further along in the questionnaire. In each form, a
different set should be presented last so that respondents who do not complete the
form do not always leave the same items blank.
The main drawback of the planned missing data approach is a potential loss
of statistical power [15, p.24]. Graham et al. [21] discuss the impact of planned
missing data designs on power and the ways in which this may be mitigated. One
way to improve power is to slightly increase the number of variables in X, since this
set is common to all three forms. This increases the number of variable pairs with
complete data. Another way in which power may be improved is by considering
effect size (correlations between pairs of variables). Variables expected to produce a
small effect size have larger sample size requirements and should be placed in X to
maximise power. On the other hand, variables that are expected to produce a large
effect size have smaller sample size requirements and should be placed in A, B or
C. Using Cohen’s [9] guidelines, |ρ| = 0.10 is a small effect, |ρ| = 0.30 is a medium
effect and |ρ| = 0.50 is a large effect.
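A three-form design of this kind is easy to set up programmatically. The Python sketch below (hypothetical: 200 items, 900 subjects, random form assignment) builds the observation matrix for forms XAB, XAC and XBC and confirms that X items are observed for every subject while each A, B, C item is observed for two thirds of subjects, the missingness being MCAR by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# 200 hypothetical items split evenly into the four sets X, A, B, C.
sets = {"X": range(0, 50), "A": range(50, 100),
        "B": range(100, 150), "C": range(150, 200)}

# Each form contains X plus two of A, B, C.
forms = {1: ("X", "A", "B"), 2: ("X", "A", "C"), 3: ("X", "B", "C")}

n = 900
assignment = rng.integers(1, 4, size=n)        # forms assigned at random (MCAR)
observed = np.zeros((n, 200), dtype=bool)
for form, members in forms.items():
    idx = assignment == form
    for s in members:
        observed[np.ix_(idx, list(sets[s]))] = True

print(observed[:, :50].mean())   # X items observed for every subject: 1.0
print(observed[:, 50:].mean())   # each A, B, C item observed for 2/3 of subjects
```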
CHAPTER 2
Methods of handling missing data
2.1 Introduction
Missing data problems have been studied for almost a century [15]. Traditionally,
researchers used ‘ad hoc’ methods to handle missing data, such as deleting the
incomplete cases (deletion-based methods) or ‘filling in’ the missing values using
single imputation. In the 1970s, there were two major developments in missing data
theory: maximum likelihood (ML) estimation [14] and multiple imputation (MI)
[43]. These methods are currently regarded as ‘state of the art’ methods of handling
missing data [49].
In 1977, Dempster, Laird & Rubin (DLR) [14] published their seminal paper
on the Expectation-Maximisation (EM) algorithm, a maximum likelihood (ML) es-
timation method for a wide range of incomplete data problems, including missing
data, truncated distributions, censored and grouped data. The EM algorithm may
also be applied to other missing data paradigms, such as mixtures, log linear models
and latent variable models. Prior to the EM algorithm, ML estimates (MLEs) were
obtained using methods such as the Newton-Raphson, Fisher’s scoring and quasi-
Newton methods. The main advantage of the EM algorithm over Newton-type
methods is that it reformulates an incomplete data problem in terms of a complete
data problem which is computationally tractable [35, p.2].
A year before the DLR paper, Rubin [42] had published a paper describing
a methodological framework for modern missing data theory. This provided the
foundation for the development of multiple imputation by Rubin in 1978 [43] and
subsequent publications by Rubin [44], and Little & Rubin [32]. In 1987, Tanner
& Wong [56] published their work on data augmentation, which was later used to
implement multiple imputation in statistical software packages.
This chapter describes methods used to handle missing data and the advantages
and disadvantages of each method. In Sections 2.2–2.4 we describe traditional meth-
ods of handling missing data: complete case analysis, pairwise deletion and single
imputation methods. In Sections 2.5–2.9 we provide an overview of MAR-based
methods of missing data handling. Finally, in Section 2.10 we discuss two methods
for MNAR data: the selection model and the pattern mixture model.
2.2 Complete case analysis
Complete case analysis (CCA) involves discarding all cases that have missing
data and performing the analysis using only the cases that are fully observed. This
is the default method in most statistical software packages and a simplistic approach
still used by many practitioners. While easy to implement, it may exclude a large
proportion of the data set, resulting in biased parameter estimates and a substantial
loss of precision and power [55]. The extent of bias and loss of precision depends
on the fraction of complete cases, the pattern of missing data, the degree to which
complete and incomplete cases differ, and on the parameters of interest [33, p.42].
In general, CCA will be unbiased if at least one of the following holds:
1. The complete cases represent a simple random sample of the data set. Until
recently, it was believed that an MCAR mechanism was required for this to
hold. However, Galati & Seaton [16] demonstrated that a weaker condition,
which they call available at random (AAR), was sufficient (refer to Section 1.6
in Chapter 1).
2. The missing data occur only in an outcome variable that is measured once per
individual, provided that all the variables associated with the missingness
can be included as covariates [55].
3. The missing data occur in predictor variables and the reasons for missingness
are unrelated to the outcome variable [33, p.43].
The population mean µ for a variable with missing data may be expressed as
µ = πCC µCC + (1− πCC)µIC , (2.1)
where µCC is the population mean of the complete cases, µIC is the population mean
for the incomplete cases and πCC is the proportion of complete cases. The bias in
the complete case sample mean is then [33, p.43]
µCC − µ = µCC − πCC µCC − (1− πCC)µIC
        = (1− πCC)(µCC − µIC). (2.2)
If the complete cases are AAR, they do not differ systematically from the incomplete
cases (µCC = µIC) so the bias will be zero.
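Equations (2.1) and (2.2) can be verified numerically. The values below are hypothetical, chosen only to illustrate the identity.

```python
# Numerical check of (2.1) and (2.2) with hypothetical population values.
pi_cc = 0.7      # proportion of complete cases
mu_cc = 10.0     # population mean of the complete cases
mu_ic = 14.0     # population mean of the incomplete cases

mu = pi_cc * mu_cc + (1 - pi_cc) * mu_ic     # equation (2.1)
bias = mu_cc - mu                            # bias of the complete-case mean

# The bias matches (1 - pi_cc) * (mu_cc - mu_ic), as in (2.2).
print(round(mu, 3), round(bias, 3))
```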
2.3 Pairwise deletion
A variation of complete case analysis, known as pairwise deletion or available-
case analysis, eliminates cases with missing data depending on the analysis being
performed. This usually results in more data being retained than under complete
case analysis. For example, suppose there are two incomplete variables X and Y . To
calculate σ2X , the variance of X, pairwise deletion uses all the cases with complete
data on X, and similarly for σ2Y . However, to calculate the covariance, σXY , only
the cases with complete data on both X and Y are used. Thus a different set of
cases may be used to calculate each element of a covariance matrix. In contrast,
complete case analysis would use only the cases that have complete data on both X
and Y to calculate σ2X, σ2Y and σXY.
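The difference between the two strategies can be seen in a small numerical example. The following Python sketch (hypothetical data; three cases missing on each of X and Y) computes the pairwise deletion estimates and notes the differing case counts.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Hypothetical missingness: x missing for the first 3 cases, y for the last 3.
x_mis = np.array([True] * 3 + [False] * 9)
y_mis = np.array([False] * 9 + [True] * 3)
both = ~x_mis & ~y_mis                       # cases complete on both x and y

# Pairwise deletion: each quantity uses every case observed for it.
var_x_pw = np.var(x[~x_mis], ddof=1)         # 9 cases
var_y_pw = np.var(y[~y_mis], ddof=1)         # 9 cases
cov_xy_pw = np.cov(x[both], y[both])[0, 1]   # only the 6 jointly observed cases

# Complete case analysis uses the same 6 cases for every quantity.
var_x_cc = np.var(x[both], ddof=1)

print((~x_mis).sum(), both.sum())            # 9 6
```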
Pairwise deletion has several limitations.
1. It may produce biased parameter estimates if the complete cases are not AAR.
2. Using different sets of cases may produce nonpositive definite matrices, and in
particular, correlations with absolute values greater than 1. This may cause
problems in estimating model parameters, such as in multivariate regression
models that use a covariance matrix as input data [15, p.41].
3. Inconsistent sample sizes may cause problems when calculating standard errors
[15, p.41].
2.4 Single imputation methods
These methods impute or ‘fill in’ each missing value with a single replacement
value. The imputed data set is then analysed using standard complete-data statisti-
cal methods. In contrast to deletion-based methods, single imputation methods do
not discard incomplete observations. However, if the filled in data set is regarded
as truly ‘complete’, the estimated variance will not take into account the uncer-
tainty associated with the missing data [4]. Consequently, single imputation will
underestimate standard errors if corrective measures are not undertaken. Exam-
ples of single imputation methods are mean imputation and hot deck imputation,
discussed below.
2.4.1 Mean imputation Also known as mean substitution, this method re-
places each missing value for a variable with the arithmetic mean of the complete
cases for that variable. Since all the imputed values are equal and a measure of
central location, mean substitution underestimates variances, covariances and cor-
relations. This in turn produces biased parameter estimates, the bias increasing with
the rate of missing data [15, p.43]. Enders [15, p.43] concludes that “...simulation
studies suggest that mean imputation is possibly the worst missing data handling
method available. Consequently, in no situation is mean imputation defensible, and
you should absolutely avoid this approach”.
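The variance shrinkage under mean substitution is easy to demonstrate. In the Python sketch below (hypothetical data with roughly 30% MCAR missingness), about 30% of the imputed data set sits exactly at the mean, so the standard deviation falls by a factor of roughly sqrt(0.7).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=50.0, scale=10.0, size=10_000)
mis = rng.random(x.size) < 0.3               # ~30% MCAR missingness

x_imp = x.copy()
x_imp[mis] = x[~mis].mean()                  # mean substitution

# The imputed data set understates the spread: its standard deviation is
# well below that of the observed values.
print(round(x[~mis].std(ddof=1), 2), round(x_imp.std(ddof=1), 2))
```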
2.4.2 Hot deck imputation This method replaces missing values for a non-
respondent (the recipient) with observed values from a respondent (the donor) [4].
The donor is similar to the recipient with respect to a set of common characteristics.
A set of potential donors is referred to as the donor pool.
Andridge & Little [4] divide hot deck imputation methods into two groups de-
pending on how donors are selected:
1. random hot deck methods, where the donor is randomly selected from the donor
pool; and
2. deterministic hot deck methods, where a donor is selected based on some cri-
teria such as ‘nearest neighbour’.
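A minimal random hot deck can be sketched as follows (in Python; the region and income values are hypothetical, and a single observed class variable defines the donor pool).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical survey: income (in $1000s) missing for two respondents;
# donors are matched on an observed class variable, region.
region = np.array([0, 0, 0, 1, 1, 1, 1, 0])
income = np.array([30.0, 32.0, np.nan, 55.0, np.nan, 60.0, 58.0, 31.0])

imputed = income.copy()
for i in np.where(np.isnan(income))[0]:
    # donor pool: observed incomes in the recipient's region
    pool = income[(region == region[i]) & ~np.isnan(income)]
    imputed[i] = rng.choice(pool)            # random hot deck: random donor

print(imputed)
```

Because each imputed value is drawn from observed values in the same class, the imputations are always plausible, in line with the properties noted below.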
Hot deck imputation was originally developed by the United States Census Bu-
reau to deal with missing data in large data sets that were available for public use.
The term ‘hot deck’ originates from computer punch cards that were used for storing
data. A ‘hot deck’ is one that is currently being processed, whereas a ‘cold deck’ has
already been processed. In the context of missing data, hot deck imputation draws
donors from observed values in the same data set, whereas cold deck imputation
draws donors from an external data set.
Hot deck imputation uses information from donors to ‘fill in’ missing values in
order to produce a complete or ‘rectangular’ data set, which may then be analysed
using standard complete-data methods. Thus the information in the incomplete
cases is retained. Moreover, the imputed values are plausible since they are drawn
from the observed values in the data set.
Being a non-parametric method, hot deck imputation does not make distribu-
tional assumptions and is therefore less sensitive to model misspecification [4]. It
preserves all the complex relationships, such as interactions, in the data set [8, p.181],
and is invariant to transformations of the marginal distributions of the incomplete
variables [4].
Hot deck imputation will underestimate standard errors unless corrective mea-
sures are undertaken [15, p.49]. Andridge & Little [4] review three main approaches
for obtaining valid variance estimates from hot deck imputation:
1. explicit variance formulae that incorporate the nonresponse;
2. resampling methods such as the jackknife and the bootstrap; and
3. hot deck multiple imputation (HDMI), where multiple sets of imputations are
created to mimic imputation uncertainty.
Enders [15, p.49] notes that hot deck imputation preserves univariate distribu-
tions in the data set and does not underestimate variability to the same extent
as other single imputation methods. However, it may produce biased estimates of
correlations and regression coefficients [49].
The validity of hot deck imputation rests on identifying appropriate donor pools
and the effective matching of donors to recipients [8, p.181]. Carpenter & Kenward
[8, p.181] note that non-parametric models are “inefficient” compared to parametric
models in the sense that they produce less precise parameter estimates than those
from a correctly specified parametric model. As a non-parametric method, hot deck
imputation may be useful in very large data sets as matching donors to recipients is
easier and loss of precision may be less of a concern [4]. However, for smaller studies,
Carpenter & Kenward [8, p.181] state that parametric imputation is preferred in
most cases.
2.5 Maximum likelihood estimation
Unlike complete case analysis, maximum likelihood (ML) estimation includes
the information available from the cases with missing data in estimating the model
parameters. ML estimation generally requires the use of an iterative optimisation al-
gorithm such as the Expectation-Maximisation (EM) algorithm, developed by Demp-
ster, Laird & Rubin (DLR) [14]. The EM algorithm reformulates the incomplete
data problem in terms of a complete data problem that is more easily solved [35,
p.2]. As the name suggests, there are two steps in each iteration: the Expectation
or E-step and the Maximisation or M-step.
Suppose we have a data set X = (Y ,Z), where Y is the observed data and Z is
the missing data. Assume that X has a probability density function p(x | β), where
β = (β1, . . . , βk) is a vector of parameters. The E-step calculates Q, the conditional
expectation of the complete data log-likelihood given the observed data Y and the
current parameter estimates. The M-step then obtains the parameter estimates that
maximise Q from the E-step.
The E and M-steps are defined as follows at iteration t + 1, t = 0, 1, 2, . . . [35,
p.19]. Set t = 0 and select initial parameter values β(0).
E-step
Calculate
Q(β;β(t)) = E[ln p(X | β) | Y = y,β(t)]. (2.3)
M-step
Select β(t+1) such that
Q(β(t+1);β(t)) ≥ Q(β;β(t)). (2.4)
DLR [14] proved that the log-likelihood L(β) is non-decreasing with each iteration,
that is
L(β(t+1)) ≥ L(β(t)). (2.5)
The iterations continue until |L(β(t+1))−L(β(t))| reaches an arbitrarily small value,
at which point the algorithm is said to have converged. In general, the EM algorithm
is numerically stable and has good global convergence properties; that is, it converges
to a local maximum from any arbitrary starting point in the parameter space [35,
p.28].
It should be noted that the EM algorithm does not ‘fill in’ or ‘impute’ missing
values. Instead, the E-step replaces missing values with their conditional expecta-
tions, which contribute to the calculation of the sufficient statistics. The M-step
then uses the sufficient statistics to generate parameter estimates.
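As a concrete example of these E and M-steps, the Python sketch below applies the EM algorithm to a bivariate normal sample in which the second variable is missing for around 40% of cases. The E-step replaces the missing sufficient statistics by their conditional expectations (including the conditional variance term), and the M-step re-estimates µ and Σ; the simulation settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
data = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
y1, y2 = data[:, 0].copy(), data[:, 1].copy()
mis = rng.random(n) < 0.4                    # y2 missing (MCAR) for ~40% of cases
y2[mis] = np.nan

mu = np.zeros(2)                             # starting values
cov = np.eye(2)
for _ in range(200):
    # E-step: conditional expectations of the missing sufficient statistics
    b = cov[0, 1] / cov[0, 0]                # slope of the regression of y2 on y1
    e_y2 = np.where(mis, mu[1] + b * (y1 - mu[0]), y2)
    resid_var = cov[1, 1] - b * cov[0, 1]    # conditional variance of y2 given y1
    e_y2sq = np.where(mis, e_y2 ** 2 + resid_var, y2 ** 2)
    # M-step: re-estimate the parameters from the completed sufficient statistics
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = (y1 ** 2).mean() - mu[0] ** 2
    s12 = (y1 * e_y2).mean() - mu[0] * mu[1]
    s22 = e_y2sq.mean() - mu[1] ** 2
    cov = np.array([[s11, s12], [s12, s22]])

print(np.round(mu, 2), np.round(cov, 2))     # close to the true parameters
```

Note that, as discussed above, no missing value is ever 'filled in': only the sufficient statistics are completed.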
The EM algorithm produces unbiased parameter estimates when the missingness
mechanism is MAR [15, p.87]. For an MCAR mechanism (a special case of MAR),
it increases statistical power compared to complete case analysis since it uses all of
the information available from the observed data. When the missingness mechanism
is MNAR, it has been found to produce biased parameter estimates, although this
is usually limited to a subset of model parameters [15, p.87].
The EM algorithm has evolved considerably since it was first developed and
there are now many different EM-type algorithms with various applications. EM-
type algorithms are used in a wide range of complex missing data problems including
structural equation models with missing data [15, p.104]. The focus of recent re-
search has been mainly on Markov chain Monte Carlo (MCMC) versions of EM-type
algorithms [35].
McLachlan & Krishnan [35, p.29] discuss two disadvantages of the EM algorithm.
The first is that the EM algorithm does not automatically produce an estimate of
the covariance matrix of the maximum likelihood estimates. The second is that
convergence can be slow, particularly when there is a high proportion of missing
data. We discuss these issues below.
2.5.1 Obtaining standard errors from the EM algorithm The standard
errors of the maximum likelihood estimates β may be calculated directly as follows
[15, p.97]. First, the Hessian matrix is computed from the second-order derivatives
of the observed data log likelihood. The observed information matrix I(β;y) is
the negative of the Hessian matrix. The inverse of the observed information matrix
estimates the covariance matrix for the maximum likelihood estimates β.
If the second-order derivatives are difficult to obtain analytically, standard errors
may be estimated using Meng & Rubin’s [37] Supplemented EM algorithm (SEM),
Louis’s method [34] or bootstrapping. Details may be found in McLachlan & Kr-
ishnan [35].
2.5.2 Using a hybrid method to accelerate convergence Hybrid meth-
ods combine the EM algorithm with a Newton-type method to accelerate conver-
gence. To take advantage of its global convergence properties, the EM algorithm is
performed for a few iterations, followed by the Newton-Raphson or another
Newton-type method with rapid local convergence [41]. Redner & Walker [41]
found that 95% of the change in log-likelihood from initial to maximum value gen-
erally occurred in the first five iterations of the EM algorithm.
Aitkin & Aitkin [1] developed a hybrid EM/GN (Gauss-Newton) algorithm, EM-
GN5, which is a faster alternative to the EM algorithm for finite mixture distribu-
tions. The algorithm begins with five EM iterations then switches to GN until
convergence or until the log-likelihood decreases. The method was illustrated in the
context of a two-component normal mixture [35].
The EM-GN5 algorithm took 70% of the time required for the EM algorithm
to converge, consistently over all initial values, and provided asymptotic standard
errors [1]. However, the log-likelihood generally decreased when the GN step was
first applied and sometimes required a large number of EM controlling steps before
the log-likelihood increased. It then rapidly converged to the same maximum as the
EM algorithm. Aitkin & Aitkin provide an interesting analogy to describe this [1,
p.130]:
“...we formed the impression of a traveller following the narrow EM path up a
hazardous mountain with chasms on all sides. When in sight of the summit, the
GN path leapt to the top, but when followed earlier, it caused repeated falls into
the chasms, from which the traveller had to be pulled back on to the EM track”.
2.6 Multiple imputation
Multiple imputation (MI) [43] consists of three distinct phases. The procedure
starts with the imputation phase, where the missing values are ‘filled in’ to produce
m ≥ 2 completed data sets. Next, in the analysis phase, the m completed data sets
are analysed separately using standard complete-data statistical procedures. Finally,
in the pooling phase, the results obtained from the analysis phase are combined
using Rubin’s rules [44] to produce overall parameter estimates. The three phases
are described in detail in Chapter 3.
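The pooling phase can be sketched directly. In the Python fragment below, the m point estimates and within-imputation variances are hypothetical; the pooled estimate is the average of the m estimates, and the total variance combines within- and between-imputation components as in Rubin's rules.

```python
import numpy as np

# Hypothetical results of the analysis phase: m = 5 point estimates q and
# their squared standard errors u from the five completed data sets.
q = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
u = np.array([0.25, 0.24, 0.26, 0.25, 0.25])

m = len(q)
q_bar = q.mean()                   # pooled point estimate
w = u.mean()                       # within-imputation variance
b = q.var(ddof=1)                  # between-imputation variance
t = w + (1 + 1 / m) * b            # total variance of the pooled estimate

print(round(q_bar, 3), round(t, 3))
```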
Note that the ‘filled-in’ values in the imputation phase are not of interest in
themselves; they are simply a means of recovering missing information in order
to obtain unbiased parameter estimates and valid statistical inferences. It is also
worth noting that all of the variables in the imputation model are treated as input
variables — there is no distinction between explanatory and outcome variables in
the imputation phase.
Suppose X = (X1, . . . ,Xk) is a vector of k random variables with k-variate
distribution p(X | β). The general procedure for creating imputations X∗ for Xmis
is as follows [44].
1. Calculate the posterior distribution p(β | Xobs) of β based on the observed
data Xobs.
2. Draw β∗ from p(β |Xobs).
3. Draw X∗ from p(Xmis |Xobs,β = β∗).
Steps 2 and 3 are performed m times to create m sets of imputations.
There are many different types of MI algorithms, including fully conditional
specification (FCS) [40], predictive mean matching [30] and multivariate normal
imputation (MVNI) [44]. This study focuses on MVNI, which will be described in
detail in the next chapter. MVNI accommodates a general missing data pattern,
is straightforward to implement and is available in a range of statistical software
packages. In the following sections we outline some other methods of MI. Note that
the MI methods differ only with respect to the imputation phase. The analysis and
pooling phases are the same for all MI methods.
It is important to note that MI (or any other method of handling missing data)
is not always superior to complete case analysis. Lee & Carlin [28] caution that
potential gains from MI may be mitigated by bias from an incorrectly specified
imputation model, particularly for high rates of missingness. Some guidelines for
specifying the imputation model are discussed in Section 3.6 in Chapter 3.
2.7 Fully conditional specification
Also known as multiple imputation with chained equations (MICE), fully condi-
tional specification (FCS) is a semiparametric method that generates imputations
based on a sequence of univariate imputation models, one for each incomplete vari-
able [40]. A variable with missing data is regressed on some or all of the other
variables and the missing values replaced by simulated draws from the correspond-
ing posterior predictive distribution. FCS accommodates a general missing data
pattern with missingness across different types of variables.
For each incomplete variable, the univariate imputation model will depend on the
type of variable being imputed. For example, normal linear regression is generally
used to impute continuous variables, while a logistic regression model is suitable
for imputing binary variables. An advantage of FCS is that a different type of
imputation model may be specified for each incomplete variable.
FCS uses the following iterative estimation algorithm [8, p.86].
1. First, the variables X1, . . . ,Xk are ordered so that the missingness pattern
is as close to monotone as possible (refer to Section 1.4 in Chapter 1). Stata
imputes variables in order from most to least observed [53, p.160].
2. To start the algorithm, the missing values for each incomplete variable Xj are
filled in by drawing, with replacement, from the observed values for Xj.
3. For each j = 1, . . . , k in turn, perform the following steps.
(a) Regress the observed values of Xj on the remaining variables, with miss-
ing values set at their current imputed values. If Xj is binary, use a
logistic regression model. If Xj is continuous, use a linear regression
model.
(b) Using the regression model in (a), impute the missing values of Xj.
Performing steps (a) and (b) for j = 1, . . . , k is referred to as a cycle [59].
To stabilise the estimates, a fixed number of cycles are performed to produce
a single imputed data set. Van Buuren [57] recommends between 5 and 20
cycles in most cases. White et al. [59] state that around 10 or 20 cycles are
generally sufficient to produce a single imputed data set, although more might
be required if the incomplete variables are strongly correlated.
4. Step 3 is performed m times to produce m imputed data sets.
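The algorithm above can be sketched compactly. The Python fragment below imputes one continuous and one binary variable over ten cycles; to keep it short it substitutes a linear-probability fit for the logistic model in step 3(a) and omits the posterior draw of the regression parameters, so it is illustrative rather than a proper FCS implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
x = rng.normal(size=n)                                    # continuous variable
z = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)  # related binary variable

x_mis = rng.random(n) < 0.2                               # ~20% missing on each
z_mis = rng.random(n) < 0.2
x_obs, z_obs = ~x_mis, ~z_mis
xi, zi = x.copy(), z.copy()

# Step 2: initialise by drawing with replacement from the observed values.
xi[x_mis] = rng.choice(x[x_obs], x_mis.sum())
zi[z_mis] = rng.choice(z[z_obs], z_mis.sum())

for _ in range(10):                                       # ten cycles
    # Steps 3(a)-(b) for x: linear regression of x on z, impute with noise.
    X = np.column_stack([np.ones(n), zi])
    beta, *_ = np.linalg.lstsq(X[x_obs], x[x_obs], rcond=None)
    sigma = np.std(x[x_obs] - X[x_obs] @ beta, ddof=2)
    xi[x_mis] = X[x_mis] @ beta + rng.normal(0.0, sigma, x_mis.sum())
    # Steps 3(a)-(b) for z: a linear-probability fit stands in for the
    # logistic model; missing values are drawn as Bernoulli variables.
    Zm = np.column_stack([np.ones(n), xi])
    gamma, *_ = np.linalg.lstsq(Zm[z_obs], z[z_obs], rcond=None)
    p = np.clip(Zm[z_mis] @ gamma, 0.0, 1.0)
    zi[z_mis] = (rng.random(z_mis.sum()) < p).astype(float)

print(round(float(xi.mean()), 2), round(float(zi.mean()), 2))
```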
White et al. [59] describe the procedure for imputing binary variables using FCS
as follows. Suppose that Z is an incomplete binary variable whose missing values
are imputed from a set of variables X using the logistic regression model
logit Pr(Z = 1 |X;β) = βX.
Let β̂ be the estimated parameter vector from this regression model, with estimated
variance-covariance matrix V . Let β∗ represent a draw from the posterior distribution
of β, approximated by MVN(β̂,V ). For each missing value Zi, let
p∗i = [1 + exp (−β∗Xi)]−1.
Draw an imputed value Z*_i = 1 if u_i < p*_i, and 0 otherwise, where u_i is a random
draw from a uniform distribution on (0, 1). White et al. [59] note that problems
can occur when one or more observations has a fitted probability of exactly 0 or 1,
which causes difficulty in drawing β∗. This is known as perfect prediction and can
occur when imputing binary, ordinal or nominal variables under FCS.
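The drawing step described by White et al. can be sketched as follows. The coefficient vector β* and the covariate rows are illustrative stand-ins only; in practice β* would be drawn from MVN(β̂, V) after fitting the logistic regression to the observed values of Z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values: a perturbed coefficient draw beta* and the covariate
# rows X_i of the cases with Z missing (in practice beta* is a draw from
# MVN(beta_hat, V) based on the fitted logistic regression).
beta_star = np.array([-0.5, 1.2])
X_mis = np.column_stack([np.ones(1000), rng.normal(size=1000)])

p_star = 1.0 / (1.0 + np.exp(-(X_mis @ beta_star)))  # p*_i
u = rng.random(p_star.size)                          # u_i ~ Uniform(0, 1)
Z_imp = (u < p_star).astype(int)                     # Z*_i = 1 if u_i < p*_i
```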
Another criticism of FCS is that the conditional densities do not always form
a multivariate joint conditional distribution. This is referred to as incompatibility
of conditionals [5]. To what extent incompatibility of the conditionals affects the
quality of the imputations is largely unknown. However, Van Buuren [57, p.228]
remarks that “in imputation, the objective is to augment the data and preserve the
relations in the data. In that case, the joint distribution is more like a nuisance
factor that has no intrinsic value”.
2.8 Predictive mean matching
Predictive mean matching (PMM) is a semiparametric imputation method developed by
Little [30] that combines normal linear regression with nearest neighbour imputation. It matches a
missing value to the observed value with the closest predicted mean or linear predic-
tion. Suppose we have an incomplete variable y = (y1, . . . , yn), with normal linear
regression model
yi | xi ∼ N(x′iβ, σ2), (2.6)
where xi = (xi1, . . . , xip)′ are the values of the predictors of y for observation i,
β = (β1, . . . , βp)′ are the unknown regression coefficients and σ2 is an unknown
variance.
The predictive mean matching algorithm applies the following steps.
1. Fit the regression model in (2.6) to the observed data to produce parameter
estimates β̂ and σ̂².
2. Draw new parameter values β* and σ²* from their joint posterior distribution
under the noninformative improper prior p(β, σ²) ∝ 1/σ².
3. For each incomplete case yj, perform the following steps.
i. Calculate the absolute difference |ŷ_j − ŷ_{c_d}| between the linear prediction
ŷ_j for y_j and the linear prediction ŷ_{c_d} for each complete case y_{c_d}, for
d = 1, 2, . . . , l, where l is the number of complete cases in y.
ii. Determine the k minimum absolute differences and denote the corre-
sponding complete cases by yc1 , . . . , yck , where k is arbitrarily chosen.
iii. Randomly draw an imputed value for yj from yc1 , . . . , yck .
4. Repeat steps 2 and 3 above to produce m imputed data sets.
Choosing the number k of nearest neighbours is a trade-off between bias and variance
[50]. The smaller the value of k, the higher the variability of the MI estimates. On
the other hand, a large value of k may increase bias. An advantage of predictive
mean matching is that it produces plausible imputed values since they are drawn
from the observed values in the data set. Note that PMM may be used for the
conditional specifications within FCS.
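Steps 3(i)–(iii) of the algorithm can be sketched as follows. The predictions and donor pool are toy values, and the function is an illustrative sketch rather than any particular package's PMM implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def pmm_impute(y_hat_mis, y_hat_obs, y_obs, k=5, rng=rng):
    """Match each incomplete case to its k nearest complete cases by
    linear prediction and draw the imputation from their observed values."""
    imputed = np.empty(len(y_hat_mis))
    for j, pred in enumerate(y_hat_mis):
        d = np.abs(pred - y_hat_obs)             # step 3(i): absolute differences
        donors = np.argsort(d)[:k]               # step 3(ii): k nearest donors
        imputed[j] = y_obs[rng.choice(donors)]   # step 3(iii): random donor draw
    return imputed

# Toy example: linear predictions for 3 incomplete and 8 complete cases.
y_obs = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y_hat_obs = y_obs + 0.1
y_hat_mis = np.array([1.6, 3.1, 4.4])
imp = pmm_impute(y_hat_mis, y_hat_obs, y_obs, k=3)
```

Because each imputation is drawn from the observed donor values, the imputed values are always plausible, as noted above.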
2.9 Inverse probability weighting
In inverse probability weighting (IPW), the complete cases are weighted by the
inverse of their probability of being a complete case [25]. To illustrate this method,
consider a generalised linear model with outcome variable Y regressed on a set
of covariates X. The parameter estimates β are the values that solve the score
equations [51]
∑_{i=1}^{n} U_i(β) = 0, (2.7)
where U_i(β) is the first derivative with respect to β of the log-likelihood contribution of case i.
Let C_i = 1 if case i is complete and C_i = 0 otherwise, and let C = (C_1, C_2, . . . , C_n). In
the IPW approach, the parameter estimates β are the solution of the IPW score
equations [51]
∑_{i=1}^{n} C_i w_i U_i(β) = 0, (2.8)
where wi is the weight for case i. Generally, a logistic regression model is fitted with
C as the response variable and predictors taken from X, Y and Z, where Z is a set
of measured variables not included in the substantive analysis model. The weights
wi are then taken as the inverse of the fitted probability that case i is complete [51].
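As a sketch of the approach, the following toy example shows the IPW solution of (2.8) for a simple mean, compared with the biased complete case estimate. It is illustrative only: the missingness model here is saturated in a single binary covariate, so the fitted completeness probabilities are just within-group rates (a logistic regression of C on X would give the same fitted values).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: binary covariate X fully observed; Y missing more often when X = 1.
n = 20_000
x = rng.integers(0, 2, n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
c = rng.random(n) < np.where(x == 1, 0.5, 0.9)   # C_i = 1 if case i is complete

# Missingness model: observed completeness rate within each level of X.
p_hat = np.array([c[x == 0].mean(), c[x == 1].mean()])[x]
w = 1.0 / p_hat                                  # inverse probability weights

# For the mean, solving sum_i C_i w_i (y_i - mu) = 0 gives a weighted average.
mean_cc = y[c].mean()                            # complete case estimate (biased)
mean_ipw = np.sum(c * w * y) / np.sum(c * w)     # IPW estimate of E[Y]
```

Here E[Y] = 2.75; the complete case mean under-represents the X = 1 group, while the IPW estimate reweights the complete cases to restore it.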
IPW specifies a ‘missingness model’ while MI specifies an imputation model.
MI has two main advantages over IPW [51]. The first is that MI can use partially
observed variables. This is in contrast to IPW, which can only use fully observed
variables unless there is monotone missingness or a relational Markov model (RMM)
is used. Second, MI is generally more efficient than IPW. The advantage of IPW
is that it is, arguably, easier to understand than MI and simpler to use. Interested
readers may refer to [51] for further details.
2.10 Methods for data missing not at random
In this section, we outline methods of missing data handling when the data are
MNAR. Recall that an MNAR mechanism means that the missingness is dependent
on unmeasured variables. This type of missingness is also referred to as non-ignorable
missingness.
Under an MNAR mechanism, the data and the probability of missingness have
a joint distribution. Alternative factorisations of this joint distribution produce
two types of MNAR models: the selection model and the pattern mixture model
[15, p.290]. The selection model consists of the substantive analysis model and a
model that predicts the probability of missingness. The pattern mixture model,
on the other hand, groups the data set by missingness pattern and estimates the
substantive analysis model separately for each pattern.
It is important to note that both types of MNAR models depend on assumptions
that are not possible to verify. Consequently, Enders [15, p.327] states that the most
useful application of MNAR models is for sensitivity analysis. By applying different
models to the data, the sensitivity of the parameter estimates to various assumptions
can be determined.
2.10.1 Selection model Suppose we have a data set with variables X =
(X1, . . . ,Xk) and missingness indicator R. A selection model for this data set is
given by [8, p.17]
p(X,R) = p(R |X) p(X), (2.9)
where p(X,R) is the joint distribution of the missingness and the data, p(R | X)
is the conditional distribution of the missingness given the data, and p(X) is the
substantive analysis model.
An example of a classic selection model is the Heckman selection model [22],
which corrects bias in a regression model with MNAR missingness on the outcome
variable. Suppose that a researcher is interested in the factors determining wages
but only has wage data for those who are in paid employment. The wage data are
MNAR since people who are not in paid employment are excluded from the sample.
For example, women with low wages may decide not to work outside the home. The
regression equation for wages is
Wi = βXi + εi, (2.10)
where Wi is the wage, Xi are the explanatory variables, β are the regression coef-
ficients and εi is the error term for the ith subject. The propensity for missingness
on W is
R∗i = γZi + ζi, (2.11)
where R∗i is the latent propensity for missingness, Zi are the explanatory variables,
γ are the regression coefficients and ζi is the error term for the ith subject. The
binary missingness indicator Ri is a manifest indicator for R∗i and is estimated using
the probit regression model
p(R∗i > 0) = p(Ri = 1 | Zi) = Φ(γZi), (2.12)
where Φ is the cumulative standard normal distribution function. The error terms
ε and ζ have a bivariate normal distribution and are assumed to be independent of
the explanatory variables X and Z. The correlation between ε and ζ captures the
dependency between the outcome variable W and the propensity for missingness
R∗. A non-zero correlation between the error terms implies that missingness is
related to the outcome variable after controlling for the explanatory variables in the
substantive analysis model.
The parameters in equations 2.10 and 2.12 may be estimated using Heckman’s
two-step method based on ordinary least squares regression [22] or using maximum
likelihood estimation. Note that the selection model is sensitive to departures from
the bivariate normality assumption for the error terms [15, pp.293–294].
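A numerical sketch of the two-step method is given below. It is illustrative Python, not part of the thesis: the data are simulated under equations (2.10)–(2.11) with correlated errors, and the first-stage probit coefficients γ = (0.5, 1.0) are taken as known to keep the sketch short, whereas in practice they are estimated from (2.12). The second step augments the wage regression with the inverse Mills ratio λ_i = φ(γZ_i)/Φ(γZ_i).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)

def phi(a):                           # standard normal density
    return np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)

def Phi(a):                           # standard normal CDF
    return np.array([0.5 * (1 + erf(v / sqrt(2))) for v in a])

# Simulate wages with MNAR selection and correlated errors (rho = 0.6).
n = 50_000
x = rng.normal(size=n)                # wage-equation covariate
z = rng.normal(size=n)                # selection-equation covariate
eps, zeta = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
w = 1.0 + 2.0 * x + eps               # wage equation (2.10)
sel = (0.5 + 1.0 * z + zeta) > 0      # selection equation (2.11): wage observed

# Step 2: include the inverse Mills ratio as an extra regressor, with the
# first-stage coefficients gamma = (0.5, 1.0) taken as known for brevity.
lam = phi(0.5 + z) / Phi(0.5 + z)
A = np.column_stack([np.ones(sel.sum()), x[sel], lam[sel]])
beta0_hat, beta1_hat, rho_sigma_hat = np.linalg.lstsq(A, w[sel], rcond=None)[0]

# Naive OLS on the complete cases: its intercept absorbs E[eps | selected]
# and is therefore biased upwards in this example.
naive = np.linalg.lstsq(np.column_stack([np.ones(sel.sum()), x[sel]]),
                        w[sel], rcond=None)[0]
```

The coefficient on λ estimates ρσ_ε (0.6 in this simulation), and including λ removes the selection bias from the intercept that the naive complete case regression exhibits.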
In order to reduce bias from MNAR missingness, the selection model must cor-
rectly specify the conditional distribution of the missingness given the data. Enders
[15, p.296] states that “...in many realistic scenarios, the model can produce esti-
mates that are even worse than those of MAR-based missing data handling meth-
ods.” Since the causes of missingness are generally unknown, it is not possible to
evaluate the performance of a selection model for a real life data set with missing
data.
2.10.2 Pattern mixture model A pattern mixture model uses the alterna-
tive factorisation of the joint distribution in (2.9) [8, p.17],
p(X,R) = p(X | R) p(R), (2.13)
where p(X | R) is the conditional distribution of the data given R, and p(R) is the
distribution of the missingness.
A pattern mixture model estimates parameters separately for each missing data
pattern then calculates a weighted average for each parameter to produce a final set
of parameter estimates. The ‘weight’ for a missing data pattern is the proportion of
cases in that pattern. Suppose we have two incomplete variables X1 and X2 with
three missing data patterns: (1) cases with observed values for both X1 and X2,
(2) cases with an observed value for X1 only, and (3) cases with an observed value
for X2 only. Estimating the parameters is straightforward for the first missing data
pattern since both variables are fully observed. However, the other patterns have
missingness in one of the variables. The model is said to be underidentified since
there is a set of inestimable parameters [15, p.299]. Estimating these parameters
requires assumptions, known as identifying restrictions. For example, Little’s [31]
complete case missing variable restriction replaces the inestimable parameters with
the parameter estimates from the complete cases.
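The weighted averaging across patterns can be sketched in a few lines; all values below are hypothetical, and in practice the pattern-specific estimates for incompletely observed patterns would first be filled in using an identifying restriction.

```python
import numpy as np

# Hypothetical per-pattern estimates of a mean and cases per pattern.
theta = np.array([3.2, 2.6, 3.9])    # estimate within each missing data pattern
n_pat = np.array([600, 250, 150])    # cases in each pattern
pi = n_pat / n_pat.sum()             # pattern 'weights' (proportions of cases)

theta_pooled = np.sum(pi * theta)    # weighted average across patterns
```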
Pattern mixture models may be used to model drop-out in longitudinal studies.
Hedeker & Gibbons [23] describe a pattern mixture model for psychiatric drug trial
data where cases with missing values are combined into a single missing data pattern.
This simplifies the model and avoids the need for parameter substitution methods.
Since the parameter estimates from a pattern mixture model are weighted aver-
ages of the estimates from each missing data pattern, additional steps are required
to obtain standard errors [15, p.309]. The delta method [23] is used to calculate ap-
proximate standard errors for parameter estimates obtained from a pattern mixture
model. The mathematical details of the delta method are beyond the scope of this
thesis but interested readers may refer to Hedeker & Gibbons [23], and Molenberghs
& Kenward [38].
The advantage of pattern mixture modelling over selection modelling is that the
former does not make distributional assumptions. However, its potential to reduce
bias depends on the appropriateness of the identifying restrictions [15, pp.300–301].
The sensitivity of the parameter estimates may be examined using a range of values
for the inestimable parameters.
2.10.3 Issues associated with MNAR data Most methods of handling
missing data assume that the data are MAR and that the missing data distribution
is ignorable. Thus the parameters of the missing data distribution are ignored when
performing an MAR-based analysis such as MI or ML estimation. Under an MNAR
mechanism, the parameters of the missing data distribution contain unique informa-
tion about the substantive model parameters [15, p.290]. Ignoring the missing data
distribution will therefore produce biased parameter estimates for MNAR data.
The MNAR-based methods described above aim to model the joint distribution
of the data and the probability of missingness. However, both of these methods rely
on assumptions that cannot be verified. The selection model makes distributional
assumptions while the pattern mixture model assumes values for inestimable pa-
rameters. Enders [15, p.287] notes that violation of these assumptions can produce
estimates that are even worse than those from an MAR-based analysis. Demirtas
& Schafer [13, p.2573] state that “the best way to handle drop-out is to make it
ignorable” and argue that an ignorability-based (MAR) analysis that includes good
predictors of attrition is often more plausible than an MNAR-based analysis.
CHAPTER 3
Multiple Imputation
3.1 Introduction
Multiple imputation (MI) was introduced by Rubin [43] in 1978 and is considered
a ‘state of the art’ method of missing data handling [49]. It was originally developed
to handle missing data in complex surveys for creating large public-use data sets
[45], but is now used in a variety of research contexts.
MI consists of three distinct phases [44]:
1. The imputation phase: each missing value is replaced with m ≥ 2 imputed
values to produce m completed or ‘imputed’ data sets.
2. The analysis phase: each imputed data set is analysed separately using stan-
dard complete-data methods.
3. The pooling phase: the results obtained from the analysis phase are combined
using Rubin’s rules [44] to produce overall parameter estimates.
MI assumes that the data are missing at random (MAR) and that the missing
data distribution is ignorable [33], as defined in Section 1.7.
Although Rubin’s original justification for MI uses frequentist arguments, Rubin
[44] recommends creating the imputations using a Bayesian approach. This involves
specifying a parametric model for the complete data, a prior distribution for the
model parameters, and then making m draws from the conditional distribution of
the missing data given the observed data [47]. Schafer [46, pp.105–106] describes
multiple imputations as “Bayesianly proper” if they are independent realisations
of p(Xmis |Xobs), the posterior predictive distribution of the missing data under a
complete-data model and prior.
In practice, multiple imputation is performed using algorithms such as data aug-
mentation [56], which produce imputed values with stationary distribution p(Xmis |Xobs).
Thus multiple imputation can be described as a three stage approximation to a full
Bayesian analysis [8, p.48].
An appeal of MI as a method of missing data handling is that the imputation
phase is separate from the analysis phase. This has two main advantages. First, it
allows the imputation and analysis phases to be performed by different individuals.
Thus an incomplete data set, once imputed, may be used by different end users for a
wide range of statistical analyses. According to Rubin [45], data collectors generally
know more about the reasons for missingness and are better equipped to handle
missing data than end users. Second, it allows auxiliary variables not necessarily of
interest in the analysis phase to be included in the imputation phase. The auxiliary
variables are predictors of the incomplete variables and/or predictors of missingness
and help improve the quality of the imputations.
Schafer [47] discusses inconsistencies between the model of the imputer and the
analyst. If the model of the imputer is more general (makes fewer assumptions)
than that of the analyst, then the inferences obtained under MI will be valid, albeit
with some loss of power. If the model of the imputer is less general than that of
the analyst, and the additional assumptions by the imputer are plausible, then the
MI estimates may be more precise [36]. Rubin [45] refers to this as superefficiency.
However, MI estimates may be biased if the additional assumptions are not plausible.
Therefore the imputer should aim to preserve distributional features that will be used
in the analysis [47].
The outline of this chapter is as follows. We describe the imputation phase in
Section 3.2, followed by the analysis and pooling phases in Section 3.3. In Section 3.4
we present a detailed description of multivariate normal imputation (MVNI). A
comparison of MI and maximum likelihood (ML) estimation is given in Section 3.5.
Finally, in Sections 3.6 and 3.7 we discuss some important considerations in the
implementation of MI: specifying the imputation model and determining the number
of imputations to perform.
3.2 Imputation phase
The first phase of MI, the imputation phase, involves replacing each missing
value with m ≥ 2 imputed values to create m completed data sets. From a Bayesian
perspective, the imputation phase alternates between two steps [44]:
1. random draws of the parameters β from their conditional distributions given
the observed data Xobs and the imputed values X∗; and
2. random draws of the missing values Xmis from their conditional distribution
given the observed data Xobs and the parameters β.
The process repeats steps 1 and 2 until convergence and produces p(β | Xobs), the
posterior distribution of the parameters given the observed data, and p(Xmis |Xobs),
the posterior distribution of the missing values given the observed data [44].
3.3 Analysis and pooling phases
In the analysis phase, each of the m imputed data sets is analysed using standard
complete data statistical methods. The results are then combined using Rubin’s
rules [44] in the pooling phase. For a single (scalar) parameter β, performing the
imputation and analysis phases produces m completed data sets with corresponding
estimates β_1, . . . , β_m and variances (squared standard errors) σ²_1, . . . , σ²_m. An overall
estimate of β is obtained by averaging the estimates from the m imputed data sets
using Rubin’s rules [44], giving
β_MI = (1/m) ∑_{j=1}^{m} β_j. (3.1)
The variance of the parameter estimate is
V(β_MI) = W + (1 + 1/m) B, (3.2)
where W is the average within-imputation variance given by
W = (1/m) ∑_{j=1}^{m} σ²_j, (3.3)
and B is the between-imputations variance given by
B = (1/(m − 1)) ∑_{j=1}^{m} (β_j − β_MI)². (3.4)
The within-imputation variance in (3.3) represents the ‘natural variability’ of the
dataset (had there been no missing data). The between-imputations variance in
(3.4) measures the variability of a parameter estimate across the m imputed data
sets and represents the additional sampling error due to the missing data. If the
number of imputations m is large, then from (3.2)
V(β_MI) ≈ W + B. (3.5)
The fraction of missing information (FMI) is the proportion of the total sampling
variance that is due to the missing data. For large m, this is given by [15, p.225]
FMI ≈ B / (W + B). (3.6)
The FMI depends on the missing data rate and the correlations among the variables
[15, p.204]. When the variables are uncorrelated, the FMI is approximately equal
to the missingness rate. However, when the variables are correlated, the FMI will
be less than the missingness rate. This is because correlation between the variables
offsets some of the loss of information. Including auxiliary variables that are highly
predictive of the incomplete variables mitigates information loss and decreases FMI.
The relative increase in variance (RIV) is the proportional increase in the sam-
pling variance due to the missing data and is given by [15, p.226]
RIV = FMI / (1 − FMI). (3.7)
It is worth noting that for large m,
RIV ≈ B / W. (3.8)
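Rubin's rules reduce to a few lines of arithmetic. The following sketch pools hypothetical estimates and squared standard errors from m = 5 imputed data sets; the function and variable names are our own, and the FMI and RIV are computed in their large-m forms.

```python
import numpy as np

def pool(estimates, variances):
    """Combine m completed-data estimates and squared standard errors
    using Rubin's rules (equations 3.1-3.4) and the large-m FMI and RIV."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = q.size
    beta_mi = q.mean()                               # (3.1) pooled estimate
    W = u.mean()                                     # (3.3) within-imputation
    B = q.var(ddof=1)                                # (3.4) between-imputations
    T = W + (1 + 1 / m) * B                          # (3.2) total variance
    fmi = B / (W + B)                                # (3.6), large-m form
    riv = fmi / (1 - fmi)                            # (3.7); equals B / W
    return beta_mi, T, fmi, riv

est, T, fmi, riv = pool([1.9, 2.1, 2.0, 2.2, 1.8],
                        [0.04, 0.05, 0.04, 0.05, 0.04])
```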
Carpenter & Kenward [8, p.43] highlight three important points regarding mul-
tiple imputation.
1. Rubin’s rules [44] are generic and do not require model-specific calculations.
2. Rubin’s rules [44] should be applied to estimators that are normally or asymp-
totically normally distributed.
3. Multiple imputation has good frequentist properties for a relatively small num-
ber of imputations.
3.4 Multivariate normal imputation
Multivariate normal imputation (MVNI) [44] uses data augmentation [56], a
Bayesian iterative Markov chain Monte Carlo (MCMC) procedure, to impute missing
values assuming a multivariate normal distribution for the data. MCMC methods
generate pseudorandom draws from probability distributions through the use of
Markov chains [7]. The target distribution is the density f(.), which is often difficult
to draw from directly. Instead, we construct a Markov chain
M0,M1, . . . ,Mt, . . .
with a stationary distribution that converges to the target distribution f(.). For each
t ≥ 0, M_{t+1} is sampled from a distribution p(M_{t+1} | M_t) that depends only on
the current element M_t and not on any earlier element in the chain. Thus,
p(Mt+1 |M0,M1, . . . ,Mt) = p(Mt+1 |Mt).
If the value of t is large enough, Mt approximates a random draw from the target
distribution.
MVNI accommodates a general missing data pattern with a haphazard pattern
of missingness across variables in the data set. It replaces missing values by drawing
from the posterior predictive distribution of the missing data given the observed
data. Each iteration of MVNI consists of two steps: an imputation step (I-step) and
a posterior step (P-step). The I-step produces the imputations, while the P-step
generates the parameter estimates that are needed to produce the imputations in the
next iteration. Data augmentation starts with initial estimates of the mean vector
and covariance matrix. These are generally maximum likelihood (EM) estimates.
The I-steps and P-steps are then repeated until convergence is achieved, producing
a single imputed data set. The m imputed data sets are drawn from the data
augmentation chain(s) and used in the subsequent analysis and pooling phases.
Data augmentation was originally developed to approximate the posterior distri-
bution p(β |Xobs) of the model parameters β in missing data problems. Augmenting
the observed data Xobs with the unobserved data Xmis produces a posterior distri-
bution p(β |Xobs,Xmis) that is easier to simulate from. Little & Rubin [33, p.201]
describe data augmentation as the Bayesian analogue of the EM algorithm where
the I-step corresponds to the E-step and the P-step corresponds to the M-step. As in
the EM algorithm, DA involves the application of complete-data methods to missing
data problems.
Since MVNI assumes a multivariate normal distribution for the data, the im-
puted values produced are on a continuous scale. This leads to the issue of handling
imputed values for variables that are clearly not normally distributed, such as cate-
gorical variables. This issue is central to this thesis and will be examined further in
Chapter 5.
3.4.1 The I-step From a Bayesian perspective, the I-step in data augmen-
tation replaces missing values with draws from the posterior predictive distribution
of the missing data given the observed data and the current parameter estimates.
Thus
X∗t ∼ p(Xmis |Xobs,β∗t−1), t = 1, 2, . . . , (3.9)
where X∗t denotes the imputed values at iteration t, Xmis is the missing data, Xobs
is the observed data and β∗t−1 represents the current parameter estimates. The
posterior distribution in (3.9) approximates p(Xmis |Xobs).
In essence, the I-step performs what is known as stochastic regression imputation
[15, p.190]. It uses current draws of the mean vector and covariance matrix to
generate a set of regression equations that predict the incomplete variables from the
observed variables. For a multivariate analysis with a single incomplete variable X1
and observed variables X2 and X3, the imputation regression equation is
X1i = β0 + β1X2i + β2X3i + zi, i = 1, . . . , n,
where X1i is the imputed value for observation i, β0, β1 and β2 are the current
values of the regression coefficients and z_i is a normally distributed random residual
with a mean of zero and variance σ²_{X1|X2,X3}. The imputed values for the incomplete
variable are calculated by substituting the values for the observed variables into the
imputation regression equation and adding a normally distributed residual term.
This residual term adds variability to the imputed data.
In the case of two or more incomplete variables, each missing data pattern will
have its own regression equation. For a multivariate analysis with two incomplete
variables X1 and X2 and fully observed variable X3, the imputation regression
equations are
X1i = β0 + β1X2i + β2X3i + zi,
X2i = β0 + β1X1i + β2X3i + zi.
The residual distribution is multivariate normal and is given by Z_i ∼ N(0, Σ_{X1,X2|X3}),
where ΣX1,X2|X3 is the residual covariance matrix from the multivariate regression
of the incomplete variables X1 and X2 on the fully observed variable X3.
3.4.2 The P-step This step uses Monte Carlo simulation to generate new pa-
rameter estimates from their conditional posterior distribution given the augmented
data, which consists of the observed data and the imputed data from the preceding
I-step. Thus
β∗t ∼ p(β |Xobs,X∗t ), t = 1, 2, . . . . (3.10)
At convergence, the posterior distribution in (3.10) gives p(β |Xobs).
At iteration t, the P-step uses the augmented data from the preceding I-step to
calculate the sample means µt and the sample sum of squares and cross products
matrix Λt [15, p.193]. These define the posterior distribution of the covariance
matrix at iteration t, given by
p(Σ | µt,X) ∼ W−1(n− 1, Λt), (3.11)
where X is the augmented data from the preceding I-step, W−1 is the inverse
Wishart distribution and n − 1 is the degrees of freedom for sample size n. A
new covariance matrix, Σ∗t , is then drawn from this posterior using Monte Carlo
simulation.
The posterior distribution of the mean vector at iteration t is
p(µ |X,Σ) ∼ N(µt, n−1Σ∗t ). (3.12)
Monte Carlo simulation is used to draw a new mean vector µ∗t from this posterior
distribution. The new estimates of the mean vector and covariance matrix are used
to calculate new parameter estimates β∗t , which are used in the I-step at the next
iteration.
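The two conditional draws in (3.11) and (3.12) can be sketched as follows. This is an illustrative numpy-only sketch, not the thesis's Stata implementation: the 'augmented data' are a random stand-in, and the inverse Wishart draw uses the sum-of-outer-products construction for the Wishart, which assumes integer degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_step(X_aug, rng=rng):
    """One P-step: draw (mu*, Sigma*) from their posteriors given the
    augmented data, following equations (3.11) and (3.12)."""
    n, k = X_aug.shape
    mu_t = X_aug.mean(axis=0)                 # sample means
    dev = X_aug - mu_t
    Lambda_t = dev.T @ dev                    # sums of squares / cross products
    # Sigma* ~ W^{-1}(n - 1, Lambda_t): draw W ~ Wishart(n - 1, Lambda_t^{-1})
    # as a sum of outer products of normal draws, then invert.
    L = np.linalg.cholesky(np.linalg.inv(Lambda_t))
    A = rng.standard_normal((n - 1, k)) @ L.T
    Sigma_star = np.linalg.inv(A.T @ A)
    Sigma_star = (Sigma_star + Sigma_star.T) / 2   # symmetrise round-off
    # mu* ~ N(mu_t, Sigma*/n), equation (3.12)
    mu_star = rng.multivariate_normal(mu_t, Sigma_star / n)
    return mu_star, Sigma_star

X_aug = rng.normal(size=(200, 3))             # stand-in augmented data
mu_star, Sigma_star = p_step(X_aug)
```

The drawn (μ*, Σ*) would then define the regression equations used by the I-step at the next iteration.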
It should be noted that the parameter values generated by the P-step may vary
considerably from one iteration to the next [15, p.199]. However, since the parameter
estimates at iteration t are used to generate the imputations at iteration t + 1, the
parameter values and imputations for successive iterations will be correlated.
3.4.3 Prior distributions In the Bayesian paradigm, the posterior distri-
bution is proportional to the product of the prior distribution and the likelihood
function. The posterior distribution of β is given by
p(β |Xobs,Xmis) ∝ p(β)L(Xobs,Xmis | β), (3.13)
where p(β) is the prior distribution and L(Xobs,Xmis | β) is the likelihood function.
Noninformative priors assign an equal probability to every possible value of the pa-
rameter, while informative priors assign different probabilities to values depending
on the (subjective) belief regarding their relative probabilities.
Multiple imputation generally uses a noninformative prior [15, p.186]. Thus the
posterior distribution is determined solely by the likelihood function. The prior
distribution for the mean is Jeffreys’ prior p(µ) = 1, while the prior distribution for
the covariance matrix is also a Jeffreys’ prior of the form [15, p.184]
p(Σ) ∝ |Σ|^{−(k+1)/2}. (3.14)
This is a conjugate prior based on the inverse Wishart distribution, where | Σ | is
the determinant of Σ and k is the number of variables.
3.4.4 Convergence Data augmentation starts with initial estimates (µ0, Σ0)
of the mean vector and covariance matrix. These are usually taken as EM estimates.
The initial estimates could also be taken as the complete case ML estimates. The
I-step and the P-step are then applied successively to create an MCMC sequence
{(X∗t ,β∗t ) : t = 1, 2, . . . },
where X∗t is the set of imputed values from the I-step and β∗t is the set of parameter
estimates from the P-step at iteration t. Iterations continue until the sequence
stabilises or ‘converges’ to a stationary distribution. The sequence of imputed values
converges to p(Xmis |Xobs) and the sequence of parameter estimates converges to
p(β |Xobs).
Convergence of the MCMC sequence depends on the FMI and RIV (refer to
Section 3.3) and the initial parameter estimates [15]. Using EM estimates as initial
values usually leads to more rapid convergence [15, p.204]. Convergence is assessed
by examining the sequence of parameter estimates, since these are often easier to
work with than the sequence of imputations. Trace plots or time series plots that
graph the parameter estimates against the iteration number are usually examined.
The burn-in period b is the minimum number of iterations required to achieve sta-
tionarity. This is the point where the sequence of parameter estimates has stabilised.
Note that parameters tend to converge at different rates due to different rates of
missingness among variables. The value of b is constrained by the parameter that is
the slowest to converge.
Another method that is used to assess convergence of data augmentation is the
worst linear function (WLF) of the parameters [46]. This is a weighted sum of the
parameter estimates from the P-step at iteration t,
WLF_t = νᵀβ*_t, (3.15)
where β∗t is a column vector of the parameter estimates and ν is a column vec-
tor of weights that represents the convergence rates of the corresponding maximum
likelihood (EM) estimates. Parameters that converge quickly are given a smaller
weighting, while parameters that converge more slowly are given a larger weighting.
A trace plot of the worst linear function provides a conservative estimate of conver-
gence. Stata [54] has an option called mcmconly that allows the user to obtain the
WLF estimates without performing multiple imputations.
As well as assessing convergence, dependence in the sequence of imputed values
also needs to be examined. This is because Bayesianly proper imputations [46,
pp.105–106] must be independent. The first step is to determine the number of
iterations k such that the imputations at iteration t + k are independent of the
imputations at iteration t. This may be done by examining an autocorrelation plot
to determine the lag k at which the autocorrelations for all parameter values have
fallen to zero. The value of k is determined from the parameter that is the slowest
to achieve serial independence.
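Reading the lag k off an autocorrelation plot can be approximated programmatically. The sketch below is illustrative only: an AR(1) series stands in for a parameter trace, and the 0.1 cut-off for 'fallen to zero' is an arbitrary tolerance of our own choosing.

```python
import numpy as np

def first_negligible_lag(series, max_lag=50, tol=0.1):
    """Smallest lag k at which the sample autocorrelation of a parameter
    trace has fallen (in absolute value) below tol, or None if it never does."""
    x = np.asarray(series, float) - np.mean(series)
    denom = np.dot(x, x)
    for k in range(1, max_lag + 1):
        if abs(np.dot(x[:-k], x[k:]) / denom) < tol:
            return k
    return None

rng = np.random.default_rng(5)
# AR(1) trace mimicking a slowly mixing parameter sequence.
trace = np.empty(5000)
trace[0] = 0.0
for t in range(1, 5000):
    trace[t] = 0.8 * trace[t - 1] + rng.normal()
k = first_negligible_lag(trace)
```

As noted above, k would be taken from whichever parameter trace is slowest to achieve serial independence.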
3.4.5 Convergence issues Sometimes data augmentation fails to converge
for reasons including [15, pp.255–256]:
1. the number of variables is close to or greater than the number of observations;
2. groups of variables are concurrently missing;
3. values of a variable may be completely missing for certain values of another
variable.
Convergence issues can sometimes be alleviated by deleting the variables that are
causing the problem. Another option is to use the ridge prior distribution for the
covariance matrix, a semi-informative prior that smooths the correlation elements
in the covariance matrix towards zero. The ridge prior has an inverse Wishart
distribution with two parameters: degrees of freedom df_p and an estimate Λ̄ of the
sum of squares and cross products matrix. The prior sum of squares and cross products
matrix at iteration t is [15, p.257]
Λ̄_t = df_p Σ̄_t, (3.16)
where Σ̄_t is a covariance matrix with correlation elements equal to zero and variance
elements obtained using the augmented data in the preceding I-step. The posterior
distribution of the covariance matrix with a ridge prior is [15, p.258]
p(Σ | µ, X) ∼ W^{−1}(df_p + n − 1, Λ_t + Λ̄_t). (3.17)
The degrees of freedom are df_p + n − 1 and the sum of squares and cross products
matrix is Λ_t + Λ̄_t, where Λ_t is the data-based matrix from (3.11). This is in contrast
to the posterior in (3.11), which has parameters n − 1 and Λ_t.
The ridge prior alleviates convergence problems by effectively increasing the sample size by dfp
and decreasing correlations between the variables. However, it also adds bias to the
parameter estimates and imputed values. To minimise bias, it is recommended that
dfp be as small as possible and this is determined on a case-by-case basis [15, p.258].
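The construction of Λt in (3.16) can be sketched as follows (an illustrative plain-Python function, representing matrices as nested lists; the function name is mine, not from the thesis):

```python
def ridge_prior_lambda(cov, df_p):
    """Build the ridge prior's sum of squares matrix (3.16): keep the
    variance (diagonal) elements of the covariance matrix, set the
    covariance (off-diagonal) elements to zero -- i.e. smooth all
    correlations towards zero -- and scale by the degrees of freedom df_p."""
    p = len(cov)
    return [[df_p * cov[i][j] if i == j else 0.0 for j in range(p)]
            for i in range(p)]
```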
3.4.6 Obtaining the m imputed data sets The aim of the imputation
phase is to generate m imputed data sets that represent independent, random draws
from the distribution of missing values. Once convergence of the MCMC sequence
is achieved, the imputed data sets are drawn from the sequence of imputed values
in the data augmentation chain(s). Two methods are currently used.
The first is sequential data augmentation in which the m imputed data sets are
drawn from the imputed values at iterations b, b+ k, b+ 2k, . . . , b+ (m− 1)k, where
b is the burn-in period and k is the number of iterations required to achieve serial
independence.
The second method is parallel data augmentation. This method generates m
data augmentation chains and draws each imputed data set from the last iteration
in each chain. The number of iterations is determined from the greater of b and
k. Of the two methods, sequential data augmentation is easier to implement and
is used in statistical software packages such as Stata [54]. Provided there are no
problems with convergence, the two methods are likely to produce similar results
[15, p.212].
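The two draw schedules can be sketched as follows (illustrative Python; function names are mine):

```python
def sequential_draws(b, k, m):
    """Iterations of a single chain from which the m imputed data sets
    are taken: b, b + k, b + 2k, ..., b + (m - 1)k."""
    return [b + j * k for j in range(m)]

def parallel_chain_length(b, k):
    """Under parallel data augmentation each of the m chains runs for
    the greater of b and k iterations; the final iteration is used."""
    return max(b, k)
```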
3.5 Comparison with ML estimation
If the sample size and number of imputations are large, the comparability of MI
and ML estimation depends on the variables that are included in the imputation and
analysis models as well as the relative complexity of the models [48]. The imputation
and analysis models are said to be congenial if they estimate the same number of
parameters and use the same variables [36]. If the imputation and analysis models
are congenial then MI and ML estimation will produce similar parameter estimates
and standard errors [48].
If the imputation and analysis models are uncongenial [36] but use the same set
of variables, then the parameter estimates produced by MI and ML estimation will
be similar, although standard errors under MI may be slightly higher [48]. However,
if the imputation model includes auxiliary variables that are not part of the analysis
model then MI and ML estimation will produce different results.
Schafer [47, p.7] notes that for smaller samples, MI may be better at identifying
certain features in the data set, such as skewness and multiple modes. This is
because it approximates the observed data posterior density by a finite mixture of
normal densities as opposed to a single normal density.
3.6 Specifying the imputation model
The imputation phase ‘fills in’ the missing values so that the data can be anal-
ysed using standard statistical methods. The imputation model should therefore
include features of the data that are of interest to the substantive analysis, such
as interactions between the variables [47]. In general, the imputation model should
include a larger set of variables than the substantive analysis model. Rubin [45]
recommends including as many variables as possible in the imputation model. How-
ever, including too many variables can lead to estimation problems. In particular,
the number of variables should not exceed the number of observations [15, p.201].
In general, the imputation model should include variables that predict the in-
complete variable(s) and/or predict the probability of missingness. White et al. [59]
note that including predictors of the incomplete variables improves the quality of
the imputations and reduces standard errors in addition to making the MAR as-
sumption more plausible. Spratt et al. [52] found that including variables related
to the variable with the most missing data had the greatest effect on estimates and
standard errors, while variables related only to the probability of missingness had
the smallest effect. Thus the most useful auxiliary variables are those that are highly
correlated with the incomplete variables (|r| > 0.40) [15, p.133].
To avoid bias, all the variables in the substantive analysis model must be included
in the imputation model [46, p.140]. In particular, when imputing missing values
for covariates, the outcome must be included in the imputation model, as otherwise
the resulting regression coefficients will be biased towards zero [39].
When specifying the imputation model, it is important to address skewness in
continuous variables. A simulation study by Lee & Carlin [27] concluded that ig-
noring skewness in continuous variables led to large biases for the corresponding
regression parameter estimates. One approach for dealing with skewness is using a
log transformation [27]. An alternative approach involves using a log transformation
with an offset chosen such that the observed values of the transformed variable have
zero skewness. This is referred to as the “log-skew()” transformation [27].
3.7 Number of imputations
An important issue in multiple imputation is determining the number m of im-
putations to perform. MI standard errors decrease as the number of imputations
increases — an infinite number of imputations produces the lowest possible standard
error [15, p.212]. The relative efficiency (RE) is the variance of an estimate based on
an infinite number of imputations divided by the variance based on m imputations.
Rubin [44] showed that this is approximately
RE = (1 + FMI/m)−1, (3.18)
where FMI is the fraction of missing information (3.6). For example, if FMI = 0.3,
the standard error of an estimate with m = 3 imputations is √(1 + 0.3/3) = 1.0488
times as large as the standard error of an estimate with infinitely many imputations.
On that basis, early literature stated that a small number of imputations, such as 3 or
5, would be adequate for statistical efficiency [46, pp.106–107]. However, subsequent
research [52, 59] indicates that a greater number of imputations may be necessary.
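Equation (3.18) and the corresponding standard error inflation factor can be computed as follows (an illustrative Python sketch of the formula above; function names are mine):

```python
def relative_efficiency(fmi, m):
    """RE = (1 + FMI/m)^(-1), equation (3.18)."""
    return 1.0 / (1.0 + fmi / m)

def se_inflation(fmi, m):
    """Factor by which the SE from m imputations exceeds the SE from
    infinitely many imputations: sqrt(1 + FMI/m)."""
    return (1.0 + fmi / m) ** 0.5
```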
Simulation studies by Spratt et al. [52] showed that when only 5 or 10 imputations
were performed, variability due to imputation was large enough to affect
statistical inference. They recommended that at least 25 imputations be performed
to reduce the effect of random sampling from multiple imputation. White et al.
[59] suggest a rule of thumb that m should be at least equal to the percentage of
incomplete cases to ensure an adequate level of reproducibility. However, they state
that this rule may not be universally appropriate.
Graham et al. [20] concluded that the number of imputations has a greater effect
on statistical power than on relative efficiency. Performing more than 10 imputations
improved statistical power and 20 imputations were comparable to ML estimation
in terms of statistical power. Performing more than 20 imputations only improved
power if the FMI was very high. On that basis, 20 imputations may be regarded as
sufficient for most purposes. However, Enders [15, p.214] notes that it is possible to
use larger values of m while adding little to total processing time.
CHAPTER 4
Exploratory Data Analysis
4.1 Introduction
The aim of this study is to develop new methods for rounding categorical vari-
ables under MVNI and compare their performance with existing methods. To com-
pare the methods, we performed large scale simulation studies in Stata [54] with
missingness imposed on an otherwise complete data set. In Section 4.2, we describe
the data set used in this study and provide summary statistics for each variable.
Section 4.3 explores the relationship between the outcome variable and the other
variables in the data set.
4.2 The NHANESIII data set
The data set used in this study was derived from the National Health and Nutrition
Examination Survey (NHANESIII) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. This was the
third in a series of surveys designed by the NCHS to collect health and nutrition
data on the population of the United States. Data were collected from physical
examinations and clinical and laboratory tests. For the purposes of this study, we
considered only adults aged 20 years or older comprising 17030 observations and 16
variables. From this, 67 subjects with incomplete records were deleted resulting in a
data set of 16963 subjects aged 20 years or older, with complete data on the variables
age, sex, race, height, weight, smoking category and high blood pressure. Summary
statistics for this data set are contained in Table 4.1. The outcome variable in our
study is high blood pressure, defined by an average systolic blood pressure of more
than 140 mmHg.
Table 4.1: Summary statistics for the full data set (n = 16963).

Variable   Description                Range          LQ      Median   UQ      Mean    Std Dev
age        age (years)                20.0–90.0      32.0    45.0     65.0    48.8    19.694
weight     body weight (kg)           21.8–241.3     62.3    72.8     84.6    74.8    17.946
height     standing height (cm)       118.6–206.5    159.0   166.1    173.2   166.2   9.934
BMI        Body Mass Index (kg/m2)    11.7–79.4      23.0    26.1     29.9    27.0    5.833

Variable   Description            Category         Proportion
sex        gender                 1 = male         0.4674
                                  0 = female       0.5326
race       race                   1 = Caucasian    0.6821
                                  0 = other        0.3179
smoke      smoking status         1 = never        0.4933
                                  2 = former       0.2497
                                  3 = current      0.2570
hbp        high blood pressure    1 = yes          0.2049
                                  0 = no           0.7951
Age
The age of the subjects in the data set is a continuous variable ranging from 20
to 90 years. The histogram in Figure 4.1a shows a right skewed and multimodal
distribution. The boxplot in Figure 4.1b also indicates right skewness. Note that
the data are left truncated since all ages are 20 or more.
Weight
The continuous variable weight represents the body weight (in kilograms) of subjects
in the data set. The weight of subjects ranges from 21.8 kg (for an 80 year old
female) to 241.3 kg (for a 33 year old male). The histogram in Figure 4.2a looks
fairly symmetric but with a slightly longer right tail. The boxplot in Figure 4.2b
shows many large values in the right tail indicating right skewness. The single low
value of 21.8kg is also visible in the boxplot.
Height
The continuous variable height represents the standing height (in centimetres) of
subjects in the data set. The height of subjects ranges from 118.6 cm (for the 80
year old female with the lowest body weight) to 206.5 cm (for a 30 year old male).
The histogram in Figure 4.3a and boxplot in Figure 4.3b show that the distribution
of height is symmetric.
Body Mass Index (BMI)
We created a new variable BMI representing Body Mass Index (BMI) as follows:
BMI = weight (kg) / (height (m))².
The BMI of subjects ranges from 11.7 to 79.4, with a mean of 27.0 and a median
of 26.1. The histogram in Figure 4.4a is skewed to the right, while the boxplot in
Figure 4.4b has many large values in the right tail.
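The calculation can be sketched as follows, assuming height is converted from the centimetres recorded in Table 4.1 to metres (an illustrative Python function, not part of the thesis's Stata code):

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kilograms divided by the square of
    height in metres (heights in Table 4.1 are recorded in centimetres)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2
```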
Figure 4.1a: Histogram of the variable age.
Figure 4.1b: Boxplot of the variable age.
Figure 4.2a: Histogram of the variable weight (in kilograms).
Figure 4.2b: Boxplot of the variable weight (in kilograms).
Figure 4.3a: Histogram of the variable height (in cm).
Figure 4.3b: Boxplot of the variable height (in cm).
Figure 4.4a: Histogram of BMI.
Figure 4.4b: Boxplot of BMI.
Sex
The binary variable sex indicates the gender of subjects in the data set (sex=1 if
the subject is male and 0 if the subject is female).
Race
The binary variable race indicates the race of subjects in the data set (race=1 if a
subject is Caucasian and 0 otherwise).
Smoke
The variable smoke is a nominal variable that describes the smoking status of sub-
jects in the data set and consists of three categories as follows.
1. smoke=1 (never) if the subject did not smoke more than 100 cigarettes in their
lifetime,
2. smoke=2 (former) if the subject smoked more than 100 cigarettes in their
lifetime but does not currently smoke,
3. smoke=3 (current) if the subject smoked more than 100 cigarettes in their
lifetime and currently smokes.
High blood pressure
The binary variable hbp indicates the blood pressure status of subjects in the data set
(hbp=1 if the subject has high blood pressure and 0 otherwise). This is the outcome
variable of interest in our study. Note that high blood pressure for an individual
was defined as an average systolic blood pressure greater than 140 mmHg.
4.3 Relationship between high blood pressure and other variables
High blood pressure and BMI
Subjects with high blood pressure have a slightly higher median BMI. However,
subjects without high blood pressure have a higher range of BMI values, as shown
in Figure 4.5.
Figure 4.5: Boxplots of BMI by high blood pressure category.
High blood pressure and age
Subjects with high blood pressure tend to be substantially older, as shown in
Figure 4.6.
Figure 4.6: Boxplot of age by high blood pressure category.
Two way tables
The frequency of high blood pressure by sex, race and smoking category is shown
in Tables 4.2–4.4.
Table 4.2: High blood pressure by sex.
sex
Female Male Total
hbp No 7208 6279 13487
Yes 1826 1650 3476
Total 9034 7929 16963
Table 4.3: High blood pressure by race.
race
Caucasian Other Total
hbp No 9136 4351 13487
Yes 2435 1041 3476
Total 11571 5392 16963
Table 4.4: High blood pressure by smoking category.
smoke
Never Former Current Total
hbp No 6718 3083 3686 13487
Yes 1649 1153 674 3476
Total 8367 4236 4360 16963
CHAPTER 5
Rounding methods for binary variables
5.1 Introduction
Multivariate normal imputation (MVNI) [44] is a popular method of handling
missing data since it accommodates a general missing data pattern (a haphazard
pattern of missingness across variables). However, it presents a dilemma when im-
puting discrete variables, such as binary or categorical variables, which are clearly
not normally distributed. When imputing a binary outcome, the continuous impu-
tations must be rounded to either 0 or 1, as in a logistic regression analysis. When
imputing a binary covariate, it is possible to use the continuous unrounded imputa-
tions; however, this may result in implausible values, for example a value of −0.65
for a sex variable. Since rounding is not strictly necessary for binary covariates,
should the continuous imputations be rounded, and if so, which method should be
used?
Until recently, the advice has been to round the imputed values for a binary
variable to the nearer of 0 or 1, essentially using a fixed threshold of 0.5. This
method is known as simple rounding or crude rounding [46]. Previous studies [3, 24]
compared unrounded MVNI with simple rounding and concluded that rounding
produced biased parameter estimates. However, other rounding methods have not
been evaluated in comparison with unrounded MVNI.
In contrast to simple rounding, adaptive rounding [6] does not use a fixed thresh-
old, but applies a rounding threshold to each imputed data set based on the normal
approximation to the binomial distribution. Another rounding method, known as
coin flipping [6], is based on a Bernoulli draw where the imputed value represents
the probability of a 1 (imputed values less than 0 or greater than 1 are rounded
to the nearer of 0 or 1). Simulation studies by Bernaards et al. [6] suggest that
adaptive rounding is superior to both simple rounding and coin flipping.
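Coin flipping as described above can be sketched as follows (an illustrative Python function; the clipping of out-of-range values follows the description in Bernaards et al. [6]):

```python
import random

def coin_flip_round(imputed, rng=random):
    """Coin flipping: treat the imputed value as the probability of a one
    and take a Bernoulli draw; imputed values below 0 or above 1 are
    first rounded to the nearer of 0 or 1."""
    p = min(max(imputed, 0.0), 1.0)
    return 1 if rng.random() < p else 0
```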
Yucel et al. [60] proposed a two-stage rounding method, known as calibration,
which uses a subset of the imputed values to determine a rounding threshold that
reproduces the proportions of zeros and ones in the observed data. Under an MCAR
mechanism, by construction calibration produces unbiased estimates of means. How-
ever, under an MAR mechanism, “relationships of imputed values to other variables
are biased, engendering biases in means” [60, p.128]. Their simulations suggested
that with modest amounts of missing data these biases are likely to be tolerable.
When there is a large amount of missing data, there are more imputed values to
be rounded and hence more potential for bias. Although calibration is intuitively
appealing, the two stage process is time consuming to implement, particularly for
large data sets. To the best of our knowledge, there have been no studies to date
comparing calibration with adaptive rounding.
Demirtas [11] compared simple rounding and adaptive rounding with regression-
based rounding methods that incorporate information from other variables in the
data set. According to Demirtas [11, p.677], “a good rule should be driven by
borrowing information from other variables in the system rather than relying on
the marginal characteristics”. However, Lee et al. [29] note that regression-based
rounding is not a general approach, since the analyst must determine the variables to
be included in the regression model on a case-by-case basis. They state that a good
rounding method should preserve associations in the data as well as the marginal
distribution of the categorical variable.
We introduce our new method, which we call proportional rounding, where the
imputed values are rounded so that the overall proportions of zeros and ones match
those observed in the complete cases. Unlike regression-based methods, proportional
rounding is a general approach. Similarly to calibration, it preserves the marginal
proportions in the observed data and will therefore produce unbiased estimates of
marginal proportions if the complete cases are available at random (AAR). This is
because if AAR holds, the observed data represents a simple random sample of the
full data set [16]. As discussed in Chapter 1, AAR is a weaker assumption than
MCAR and can apply to an MCAR, MAR or MNAR mechanism. The advantage
of proportional rounding over calibration is that duplication of the data set is not
required and imputation is performed only once, making implementation faster and
easier.
According to Lee & Carlin [28], MVNI is more likely to be of benefit when
missingness is in a confounding variable than when missingness is in a covariate
of interest. For this reason we will impose missingness on a binary confounding
variable and examine the effect on the covariate of interest.
In this chapter, we compare the performance of unrounded MVNI with simple
rounding, adaptive rounding, calibration and proportional rounding using a simula-
tion study. Simulations are performed for three missing data mechanisms and five
different sample sizes in the context of a logistic regression analysis with substantial
missingness in a binary confounding variable.
The outline of this chapter is as follows. In Section 5.2 we provide a descrip-
tion of existing rounding methods for binary variables under MVNI. In Section 5.3
we introduce proportional rounding, our new method. The data set used in this
study and the substantive analysis model are described in Section 5.4. Section 5.5
describes the method, and includes the missingness models and evaluation criteria.
In Section 5.6 we summarise the results, followed by a discussion in Section 5.7.
5.2 Rounding methods
5.2.1 Simple Rounding Imputed values are rounded to 0 if they are less
than 0.5; otherwise they are rounded to 1 [46, p.148]. The disadvantage of simple
rounding is that it uses a fixed threshold that does not take into account the marginal
distribution of the binary variable.
5.2.2 Adaptive rounding Introduced by Bernaards et al. [6], this method
uses the normal approximation to the binomial distribution to calculate a rounding
threshold for each imputed data set. If ωj is the mean of the (unrounded) imputed
binary variable for imputed data set j = 1, . . . ,m, then the corresponding rounding
threshold cj is given by
cj = ωj − Φ−1(ωj) √(ωj(1 − ωj)), (5.1)
where Φ−1 is the inverse of the standard normal cumulative distribution function. Imputed
values that exceed the threshold are rounded to one, while the rest are rounded to
zero. Note that a rounding threshold must be calculated for each imputed data set.
Figure 5.1 shows the adaptive rounding thresholds for different values of ω.
According to Bernaards et al. [6], when a category is relatively rare, the adaptive
rounding threshold would reflect greater variability in the imputations than simple
rounding. Note that even if ω < 0.5, the adaptive rounding threshold can be greater
than 0.5 (refer to Figure 5.1). Similarly, even if ω > 0.5, the threshold can be less
than 0.5. We note that since adaptive rounding is based on the mean of the imputed
binary variable ω, any bias in the imputation model will affect the calculation of the
rounding threshold.
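The threshold calculation in (5.1) can be sketched as follows (an illustrative Python sketch using the standard library's statistics.NormalDist for the inverse normal CDF; the thesis's implementation was in Stata):

```python
from statistics import NormalDist

def adaptive_threshold(imputed_values):
    """Adaptive rounding threshold (5.1) for one imputed data set:
    c_j = w - InvPhi(w) * sqrt(w * (1 - w)), where w is the mean of the
    unrounded imputed binary variable and InvPhi is the inverse of the
    standard normal cumulative distribution function."""
    w = sum(imputed_values) / len(imputed_values)
    return w - NormalDist().inv_cdf(w) * (w * (1.0 - w)) ** 0.5

def adaptive_round(imputed_values):
    """Values exceeding the threshold are rounded to one, the rest to zero."""
    c = adaptive_threshold(imputed_values)
    return [1 if v > c else 0 for v in imputed_values]
```

For a mean of 0.3 the threshold is about 0.540, illustrating that the threshold can exceed 0.5 even when ω < 0.5.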
5.2.3 Calibration This is a two-stage approach that applies the following
steps [60].
Stage 1
1. Create a copy of the data set and in this delete the observed values of the
incomplete binary variable. This leaves no observed values for the binary
variable in the duplicated data set.
2. Vertically ‘stack’ the original and the duplicated data sets to create a single
stacked data set.
3. Impute the missing values in the entire stacked data set to create m imputed
data sets.
The following steps are performed for each imputed data set.
4. Identify the subset of imputed values in the duplicated data set that correspond
to observed values in the original data set.
5. For this subset of imputed values, identify a rounding threshold that produces
the same proportion of zeros and ones as in the observed data.
Stage 2
1. Restore the original data set and impute the missing values to create m im-
puted data sets.
2. For each imputed data set, use the rounding threshold obtained in stage 1 to
round the imputed values for the binary variable.
Thus imputation is performed twice, first to determine the rounding threshold, and
second to impute the missing values in the original data set. Note that a rounding
threshold must be calculated for each imputed data set.
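The threshold search in steps 4 and 5 of stage 1 can be sketched as follows. This is an illustrative order-statistic rule that reproduces the observed proportions; the exact search used by Yucel et al. [60] may differ.

```python
def calibration_threshold(imputed_for_observed, observed_values):
    """Stage 1 (steps 4-5): choose a threshold so that rounding the
    imputed values that correspond to observed cases reproduces the
    observed proportions of zeros and ones."""
    prop_zeros = sum(1 for v in observed_values if v == 0) / len(observed_values)
    ordered = sorted(imputed_for_observed)
    n_zeros = round(prop_zeros * len(ordered))
    if n_zeros == 0:
        return float("-inf")  # everything rounds to one
    return ordered[n_zeros - 1]  # largest imputed value that rounds to zero

def threshold_round(values, c):
    """Stage 2: round to 0 at or below the threshold, else to 1."""
    return [0 if v <= c else 1 for v in values]
```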
Table 5.1 shows the original, duplicated and stacked data sets prior to imputation
for a variable with n = 5 consisting of 3 observed values (cases 1–3) and 2 missing
values (cases 4 & 5). The observed values are denoted by an asterisk (∗), while the
missing values are denoted by a dash (–). In the duplicated data, all 5 values are
designated as ‘missing’. The stacked data therefore has n = 10 with 3 observed
values and 7 missing values. When the stacked data set is imputed, all the missing
values will be replaced by imputed values. This means that cases 1–3 will also have
imputed values in the duplicated part of the stacked data set.
Re-imputation of the missing values in stage 2 is required for the following rea-
sons. Firstly, in stage 1, imputed values will be calculated for all subjects with
missing values in the stacked data set. This includes subjects that have observed
values for the incomplete binary variable in the original data set. This affects the
calculation of the sample mean and covariance matrix for the incomplete binary
variable in the posterior step (P-step) of MVNI, and subsequently, the calculation
of the imputed values in the imputation step (I-step). Secondly, the stacked data set
contains twice the number of observations as the original data set. Using a sample
size of 2n instead of n affects the posterior distributions for the mean vector and
covariance matrix in the P-step, and hence the calculation of the imputed values in
the I-step. It is therefore necessary to re-impute the missing values in stage 2 using
the original data set, as described above.
Note that the rounding thresholds calculated in stage 1 are based on the imputed
values obtained using the ‘stacked’ data set. These will be different to the imputed
values obtained in stage 2, for the reasons given above. The lack of ‘correspondence’
between the imputed values in stages 1 and 2 is another drawback of the calibration
method.
Table 5.1: The original, duplicated and stacked data sets for calibration prior to impu-
tation.
ID Original data Duplicated data Stacked data
1    ∗    –    ∗
2    ∗    –    ∗
3    ∗    –    ∗
4    –    –    –
5    –    –    –
1              –
2              –
3              –
4              –
5              –
Figure 5.1: Adaptive rounding thresholds for 0 < ω < 1.
5.3 Proportional rounding: a new rounding method
In our new method, the imputed values are rounded so that the overall proportion
of zeros and ones matches the observed proportions in the complete cases. The steps
in this method are as follows.
1. Determine the proportion p of ones and proportion 1−p of zeros in the complete
cases.
2. Calculate the required number of zeros, n0 = (1−p)×number of missing values.
Round this value to the nearest integer.
3. Impute the missing values to create m imputed data sets.
The following steps are performed for each imputed data set.
4. Sort the imputed values in ascending order.
5. Round the first n0 (sorted) imputed values to zero and the rest to one.
Note that there is no need to calculate any rounding thresholds. The only cal-
culation that is necessary is the required number of zeros and this will be the same
for each imputed data set. This makes proportional rounding considerably easier to
implement than adaptive rounding and calibration.
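The steps above can be sketched in Python as follows (illustrative only; the thesis's simulations were implemented in Stata):

```python
def proportional_round(imputed_values, observed_values):
    """Proportional rounding (steps 1-5): round so that the proportion
    of zeros among the imputed values matches the complete cases."""
    p = sum(1 for v in observed_values if v == 1) / len(observed_values)
    n0 = round((1 - p) * len(imputed_values))  # required number of zeros
    order = sorted(range(len(imputed_values)), key=lambda i: imputed_values[i])
    rounded = [1] * len(imputed_values)
    for i in order[:n0]:  # the n0 smallest imputed values become zeros
        rounded[i] = 0
    return rounded
```

Because n0 depends only on the complete cases, it is computed once and reused for every imputed data set.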
Proportional rounding assumes that the observed proportions reasonably ap-
proximate the true proportions in the data set; that is, the complete cases are AAR.
As discussed in Chapter 1, Galati & Seaton [16] demonstrated that AAR can hold
for an MCAR, MAR or MNAR mechanism provided that the probability of being
a complete case does not depend on the data. If AAR holds, the complete cases
constitute a simple random sample of the data set [16]. Thus proportional rounding
does not require an MCAR mechanism to produce unbiased estimates of marginal
proportions.
5.4 Substantive analysis model
The data set used in this study was derived from the National Health and Nutri-
tion Survey (NHANESIII) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4. The binary variable overweight was generated as
follows:
overweight = 1 if BMI > 25, and 0 otherwise.
Of the 16963 subjects, 59.23% are overweight. Thus the mean of the binary variable
overweight is 0.5923.
The substantive analysis is a logistic regression,
logit Pr(hbp) = β0 + β1 age + β2 smoke1 + β3 smoke2 + β4 overweight + β5 race + error, (5.2)
which calculates the log odds of high blood pressure based on a subject’s age (age),
smoking habits (smoke), overweight status (overweight) and race. Gender (sex )
was not a significant predictor of high blood pressure after adjusting for the other
covariates, so it was excluded from the model. Note that all the variables in this
regression model are categorical with the exception of the continuous predictor age.
Thus the variables in this study are not jointly multivariate normally distributed.
For illustration, the question we have chosen is: Are Caucasians more likely or
less likely to develop high blood pressure compared to other races, after adjusting for
all the other covariates? This can be answered by considering the coefficient β5
in (5.2). For the purposes of this analysis, missingness was imposed on the binary
covariate overweight, a confounding variable in this study.
5.5 Method
The rounding methods were compared for five data sets with sample sizes n=16963,
5000, 1000, 500 and 200, comprising the full data set and four subsamples. Each
subsample was obtained by drawing a simple random sample from the full dataset
of 16963 subjects described in Section 5.4. For each of the above data sets, the true
proportion p of overweight subjects was calculated and logistic regression was used
to obtain the true value of the race coefficient β5 and its standard error, shown in
Table 5.2 below.
Table 5.2: True values of race coefficient β5, its standard error and proportion p of
overweight subjects for each data set.
Data set β5 SE p
n = 16963   −0.4574   0.0490   0.5923
n = 5000    −0.3365   0.0910   0.5974
n = 1000    −0.7024   0.2025   0.5860
n = 500     −0.6715   0.2926   0.5880
n = 200     −1.3956   0.4694   0.5550
Missingness on the variable overweight was imposed for three different missing-
ness mechanisms: MCAR, MAR and MNAR, described in Subsection 5.5.1.
For each combination of five data sets, three missingness mechanisms and six
methods, we performed 1000 simulation replicates to produce a total of 90000 sim-
ulation runs. An overview of the simulations is provided in Figure 5.2.
5.5.1 Missingness models The model for each missingness mechanism is
described below.
MAR mechanism
Missingness was imposed on the binary variable overweight using a logistic regression
model, with the probability of missingness dependent on age, sex, race and hbp but
not overweight itself. All of these variables are observed variables in our analysis,
in accordance with an MAR mechanism.
The MAR missingness model was
logit Pr(overweight missing) = 2− 0.025× age− sex− race+ hbp. (5.3)
The model coefficients above were chosen to create a substantial association between
the variables and missingness as well as a reasonable amount of missingness [27].
According to the model, subjects who are young, female, non-Caucasian and have
high blood pressure have the highest probability of missingness on the variable
overweight. For example, the probability of missingness for a 40 year old Caucasian
female with high blood pressure is calculated as follows:
logit Pr(overweight missing) = 2 − 0.025 × 40 − 0 − 1 + 1 = 1,
Pr(overweight missing) = e^1 / (1 + e^1) = 0.731.
The missingness rates using this model ranged from 45% to 52%, so on average
around half of the observations were missing.
MNAR mechanism
Under an MNAR mechanism, missingness is dependent on the variable overweight.
The missingness model was
logit Pr(overweight missing) = −2 + 2.8624× overweight. (5.4)
Thus an overweight subject has a probability of missingness of 0.703, while a non-
overweight subject has a probability of missingness of only 0.119. The missingness
rates using this model ranged from 44% to 47%, so on average around half of the
observations were missing.
MCAR mechanism
The probability of missingness for the variable overweight was set to 48% for each
subject so that it was comparable to the average missingness rates for the MAR and
MNAR mechanisms above. Note that since there is only one incomplete variable
(overweight), AAR and MCAR are equivalent [16].
Imposing missingness
To determine whether an observation for overweight was to be declared missing, a
pseudo-random number between 0 and 1 was generated from a uniform distribution.
If the number was less than the probability of missingness, calculated as above for
each missingness model, the observation was declared missing.
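The procedure can be sketched as follows, using the MAR model (5.3) as an example (illustrative Python; the thesis used Stata's pseudo-random number generator):

```python
import math
import random

def mar_missing_prob(age, sex, race, hbp):
    """Probability of missingness on overweight under the MAR model (5.3):
    logit Pr(missing) = 2 - 0.025*age - sex - race + hbp."""
    logit = 2 - 0.025 * age - sex - race + hbp
    return math.exp(logit) / (1 + math.exp(logit))

def impose_missing(prob, rng=random):
    """Declare the observation missing if a Uniform(0,1) pseudo-random
    draw falls below its missingness probability."""
    return rng.random() < prob
```

For the worked example in the text (a 40 year old Caucasian female with high blood pressure), the probability of missingness is 0.731.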
5.5.2 Simulations The following steps were performed for each simulation
replicate i = 1, . . . , 1000 for each combination of sample size, missingness mechanism
and rounding method.
1. Impose missingness on the variable overweight as appropriate, depending on
the missingness mechanism.
2. Impute the missing overweight values using MVNI to create 30 imputed data
sets. The variables hbp, age, sex, race and smoke were included in the imputa-
tion model. These are all the variables in the substantive analysis model (5.2)
and/or variables that are predictors of missingness [15, p.201].
3. Round the imputed values to either 0 or 1 using the rounding method.
4. Use the command mi estimate to combine the imputed data sets using Ru-
bin’s rules [44] and obtain an estimated regression coefficient β5i, its corre-
sponding standard error SEi and estimated proportion pi of overweight sub-
jects.
The estimates from the 1000 simulation replicates were averaged to produce β5, its
standard error SE and estimated proportion of overweight subjects p.
It is important that the simulations are performed in a way that ensures com-
parability between the rounding methods. We set the random seed in Stata at the
Figure 5.2: Overview of simulations comparing methods for binary variables.
beginning of each set of 1000 simulation replicates to ensure that the pseudo-random
numbers generated for each sample size were the same for each method. This en-
sures that any differences in the results are due to the methods themselves and not
due to simulation error.
5.5.3 Evaluation criteria Using the notation in Subsection 5.5.2, the follow-
ing criteria were used to compare the methods.
1. Bias. For the parameter β5, this is defined as E(β5) − β5. In this study,
E(β5) is estimated by (1/1000) ∑_{i=1}^{1000} β5i, the mean race coefficient across the 1000
simulation replicates.

2. Standard error (SE). This is calculated as (1/1000) ∑_{i=1}^{1000} SEi, the average standard
error for the race coefficient over the 1000 simulation replicates [27].

3. Standard deviation (s) of β5 across the 1000 simulation replicates, defined as

s = √[(1/1000) ∑_{i=1}^{1000} (β5i − E(β5))²].

4. Root mean square error (RMSE). In this study, this is defined as

RMSE = √[(1/1000) ∑_{i=1}^{1000} (β5i − β5)²].

According to Demirtas [11, p.684], RMSE “. . . is arguably the best criterion
for evaluating (a parameter estimate) in terms of combined accuracy and precision”.
Note that for a parameter estimate, RMSE can be written as

RMSE = √(bias² + s²).

5. p̄ − p. This is the difference between the estimated proportion p̄ and the true
proportion p of overweight subjects, where p̄ = (1/1000) ∑_{i=1}^{1000} pi.
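The four criteria can be computed as follows (an illustrative Python sketch; the function name and the toy replicate values are ours, not thesis results):

```python
import math

def summarise(estimates, std_errors, beta_true):
    """Bias, average SE, empirical SD (s) and RMSE over simulation replicates."""
    n = len(estimates)
    mean_beta = sum(estimates) / n
    bias = mean_beta - beta_true
    avg_se = sum(std_errors) / n
    s = math.sqrt(sum((b - mean_beta) ** 2 for b in estimates) / n)
    rmse = math.sqrt(sum((b - beta_true) ** 2 for b in estimates) / n)
    return bias, avg_se, s, rmse

# Toy replicates (hypothetical numbers, not thesis results)
bias, avg_se, s, rmse = summarise([-0.45, -0.47, -0.44, -0.46],
                                  [0.05, 0.05, 0.06, 0.05], beta_true=-0.4574)
# The decomposition RMSE^2 = bias^2 + s^2 holds by construction
assert abs(rmse ** 2 - (bias ** 2 + s ** 2)) < 1e-12
```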
5.6 Results
The results are summarised in Tables 5.3–5.5. For all missingness mechanisms
and sample sizes, complete case analysis produced inflated standard errors and RM-
SEs compared to MVNI. Standard errors and RMSEs increased as the sample size
decreased. The differences between the methods were more pronounced for the
smaller sample sizes.
MCAR mechanism
Rounding resulted in a slightly lower RMSE than unrounded MVNI for all sample
sizes except n = 16963 (Table 5.3). Adaptive rounding, calibration and proportional
rounding produced the lowest RMSEs, except for the smallest sample size (n =
200). All three rounding methods, which use the marginal distribution of the binary
variable, produced very similar results in terms of bias and RMSE.
Complete case analysis and proportional rounding had the lowest values of p̄ − p
and therefore produced the best estimates of the proportion overweight for each
sample size (except n = 200). This was expected since both complete case analysis
and proportional rounding are based on the complete cases, which are a simple
random sample of the full data set under an MCAR missingness mechanism. As
noted previously, since there is only one incomplete variable, MCAR and AAR are
equivalent.
MAR mechanism
Rounding resulted in lower RMSEs compared to unrounded MVNI but only for
n ≤ 1000 (Table 5.4). No method was uniformly superior in terms of bias and
RMSE. Adaptive rounding, calibration and proportional rounding produced very
similar results.
For sample sizes n ≥ 5000, complete case analysis and proportional rounding
had the lowest values of p̄ − p and thus produced the best estimates of proportions.
For sample sizes n ≤ 1000, no method was uniformly superior for estimating propor-
tions but adaptive rounding produced better estimates than proportional rounding.
This is because adaptive rounding uses the mean of the imputed binary variable
(observed and imputed values) to calculate the rounding threshold, whereas propor-
tional rounding is based on the proportions observed in the complete cases.
Under the MAR missingness model in (5.3), the complete cases are not AAR so
proportional rounding was expected to exhibit some bias in estimating proportions.
However, this was limited to the smaller sample sizes (n ≤ 1000).
MNAR mechanism
Rounding resulted in a slightly lower RMSE than unrounded MVNI for all sample
sizes except n = 16963 (Table 5.5). However, no method was uniformly superior in
terms of bias and RMSE.
All of the methods substantially underestimated the proportion of overweight
subjects. This is because under the MNAR missingness model in (5.4) the complete
cases are much less likely to be overweight. Not surprisingly, complete case analysis
and proportional rounding produced the worst estimates of proportions since they
are based on the proportions in the complete cases.
5.7 Discussion
This study highlights the advantages of MVNI over complete case analysis when
there is substantial missingness in a binary confounding variable. The results show
that there are clear benefits to using a rounding method in conjunction with MVNI
when imputing a binary variable.
Adaptive rounding, proportional rounding and calibration produced similar re-
sults and performed slightly better than simple rounding. This is because they
utilise the marginal distribution of the binary variable, in contrast to simple round-
ing which uses a fixed rounding threshold. For an MNAR mechanism, no method
was uniformly superior but complete case analysis was the worst-performing method
in terms of bias, RMSE and estimates of proportion. Thus MVNI has considerable
advantages over complete case analysis even when the data are MNAR.
Although complete case analysis and proportional rounding produced identical
estimates of the proportion overweight, proportional rounding produced substan-
tially lower standard errors, biases and RMSEs due to the recovery of the missing
cases under MVNI. As a rounding method, proportional rounding is very straight-
forward to implement and has intuitive appeal. In contrast to calibration, there is no
need to duplicate the data set or perform two sets of imputations. There is also no
need to calculate any rounding thresholds. For the full data set with 16963 subjects
and 48% MCAR missingness, proportional rounding took, on average, one third of
the time to implement compared to calibration. This gives proportional rounding a
significant advantage over calibration in terms of computational efficiency. In con-
trast to simple rounding, proportional rounding uses the marginal distribution of
the binary variable.
To the best of our knowledge, this is the first study to compare adaptive round-
ing with the calibration method. The results of our simulation study show that
the performance of these two methods is very similar in terms of bias, RMSE and
estimates of proportions.
In this study, we imputed missing data in a single binary variable using MVNI.
Another multiple imputation method that could have been used is fully conditional
specification (FCS) [40], described in Chapter 2. In the case of a missing binary
variable, a logistic regression imputation model would be used. Lee & Carlin [27]
concluded that FCS and MVNI produced similar results and that MVNI performed
as well as FCS when imputing binary variables. An advantage of FCS is that a
separate regression imputation model can be specified for each incomplete variable.
However, this may result in inconsistencies between imputation models. An advan-
tage of MVNI over FCS is that it is easier to assess convergence [15, p.276].
In summary, adaptive rounding, proportional rounding and calibration produced
similar results and performed slightly better than simple rounding, particularly when
estimating proportions. However, proportional rounding was the fastest and sim-
plest method to implement as well as having intuitive appeal.
Table 5.3: Comparison of rounding methods for binary variables under MCAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0679   -0.0012   0.0477   0.0477    0.0000
Unrounded MVNI           0.0491    0.0005   0.0035   0.0035    0.0000
Simple rounding          0.0491    0.0005   0.0035   0.0035    0.0000
Adaptive rounding        0.0490   -0.0017   0.0034   0.0030   -0.0001
Calibration              0.0490   -0.0018   0.0034   0.0030    0.0036
Proportional rounding    0.0490   -0.0017   0.0034   0.0030    0.0000

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1264   -0.0018   0.0912   0.0912   -0.0002
Unrounded MVNI           0.0913    0.0011   0.0080   0.0080   -0.0002
Simple rounding          0.0912   -0.0021   0.0074   0.0071   -0.0093
Adaptive rounding        0.0911   -0.0020   0.0068   0.0065   -0.0004
Calibration              0.0911   -0.0020   0.0068   0.0065   -0.0005
Proportional rounding    0.0911   -0.0020   0.0068   0.0065   -0.0002

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.2840   -0.0128   0.2020   0.2016   -0.0001
Unrounded MVNI           0.2042    0.0109   0.0264   0.0240    0.0003
Simple rounding          0.2031    0.0049   0.0199   0.0193   -0.0083
Adaptive rounding        0.2032    0.0042   0.0199   0.0194   -0.0005
Calibration              0.2031    0.0039   0.0197   0.0193    0.0137
Proportional rounding    0.2032    0.0042   0.0198   0.0193   -0.0001

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4153   -0.0022   0.2872   0.2871   -0.0002
Unrounded MVNI           0.2970    0.0156   0.0433   0.0404   -0.0002
Simple rounding          0.2954    0.0057   0.0342   0.0337   -0.0097
Adaptive rounding        0.2947    0.0052   0.0320   0.0315   -0.0019
Calibration              0.2947    0.0049   0.0318   0.0315   -0.0011
Proportional rounding    0.2947    0.0050   0.0319   0.0315   -0.0002

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.7077   -0.1267   0.6421   0.6295    0.0020
Unrounded MVNI           0.4851    0.0018   0.1010   0.1010    0.0025
Simple rounding          0.4738    0.0105   0.0668   0.0659   -0.0054
Adaptive rounding        0.4780    0.0055   0.0806   0.0804   -0.0006
Calibration              0.4774    0.0031   0.0781   0.0780    0.0393
Proportional rounding    0.4778    0.0055   0.0799   0.0797    0.0020

*This is the standard deviation of β5 across the 1000 simulation replicates.
Table 5.4: Comparison of rounding methods for binary variables under MAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0792    0.2572   0.2642   0.0604    0.0001
Unrounded MVNI           0.0490   -0.0096   0.0102   0.0034   -0.0015
Simple rounding          0.0490   -0.0117   0.0121   0.0027   -0.0088
Adaptive rounding        0.0490   -0.0104   0.0108   0.0028   -0.0013
Calibration              0.0490   -0.0099   0.0102   0.0028    0.0024
Proportional rounding    0.0490   -0.0102   0.0106   0.0027    0.0001

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1489    0.3001   0.3230   0.1196   -0.0001
Unrounded MVNI           0.0914   -0.0113   0.0141   0.0084   -0.0012
Simple rounding          0.0912   -0.0150   0.0165   0.0068   -0.0082
Adaptive rounding        0.0912   -0.0127   0.0144   0.0068   -0.0012
Calibration              0.0912   -0.0127   0.0143   0.0067   -0.0009
Proportional rounding    0.0911   -0.0126   0.0142   0.0066   -0.0001

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.3318    0.2976   0.3951   0.2598   -0.0120
Unrounded MVNI           0.2037   -0.0059   0.0216   0.0208   -0.0047
Simple rounding          0.2025   -0.0110   0.0191   0.0156   -0.0152
Adaptive rounding        0.2027   -0.0103   0.0195   0.0166   -0.0065
Calibration              0.2029   -0.0084   0.0188   0.0168    0.0014
Proportional rounding    0.2026   -0.0121   0.0199   0.0158   -0.0120

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4960    0.3782   0.5326   0.3751   -0.0121
Unrounded MVNI           0.2973    0.0122   0.0411   0.0393   -0.0023
Simple rounding          0.2945   -0.0030   0.0293   0.0291   -0.0082
Adaptive rounding        0.2943   -0.0009   0.0287   0.0287   -0.0054
Calibration              0.2943   -0.0022   0.0276   0.0275   -0.0096
Proportional rounding    0.2940   -0.0037   0.0271   0.0269   -0.0121

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.9409    0.1045   0.9921   0.9866    0.0050
Unrounded MVNI           0.5024    0.0238   0.1327   0.1305    0.0068
Simple rounding          0.4768    0.0236   0.0733   0.0694   -0.0133
Adaptive rounding        0.4840    0.0199   0.0932   0.0911    0.0026
Calibration              0.4838    0.0392   0.0940   0.0855    0.0524
Proportional rounding    0.4823    0.0161   0.0870   0.0855    0.0050

*This is the standard deviation of β5 across the 1000 simulation replicates.
Table 5.5: Comparison of rounding methods for binary variables under MNAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0708   -0.0558   0.0702   0.0425   -0.2635
Unrounded MVNI           0.0491    0.0003   0.0038   0.0037   -0.2593
Simple rounding          0.0491   -0.0016   0.0037   0.0033   -0.2468
Adaptive rounding        0.0490   -0.0021   0.0037   0.0031   -0.2600
Calibration              0.0490   -0.0021   0.0037   0.0031   -0.2596
Proportional rounding    0.0490   -0.0022   0.0038   0.0031   -0.2635

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1323   -0.0585   0.1024   0.0840   -0.2643
Unrounded MVNI           0.0914    0.0005   0.0082   0.0082   -0.2590
Simple rounding          0.0913   -0.0021   0.0076   0.0073   -0.2469
Adaptive rounding        0.0912   -0.0025   0.0071   0.0066   -0.2597
Calibration              0.0912   -0.0026   0.0071   0.0066   -0.2623
Proportional rounding    0.0912   -0.0027   0.0071   0.0066   -0.2643

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.2940   -0.0304   0.1874   0.1849   -0.2628
Unrounded MVNI           0.2042    0.0107   0.0272   0.0250   -0.2600
Simple rounding          0.2033    0.0060   0.0208   0.0199   -0.2460
Adaptive rounding        0.2033    0.0048   0.0210   0.0204   -0.2592
Calibration              0.2034    0.0054   0.0211   0.0204   -0.2486
Proportional rounding    0.2033    0.0045   0.0210   0.0205   -0.2628

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4405   -0.1460   0.3035   0.2660   -0.2634
Unrounded MVNI           0.2975    0.0093   0.0413   0.0403   -0.2583
Simple rounding          0.2965    0.0040   0.0348   0.0346   -0.2448
Adaptive rounding        0.2959    0.0023   0.0329   0.0328   -0.2570
Calibration              0.2958    0.0018   0.0327   0.0326   -0.2610
Proportional rounding    0.2958    0.0015   0.0328   0.0328   -0.2634

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.6761    0.0412   0.4672   0.4654   -0.2596
Unrounded MVNI           0.4804    0.0068   0.0856   0.0853   -0.2366
Simple rounding          0.4723    0.0169   0.0604   0.0580   -0.2263
Adaptive rounding        0.4752    0.0127   0.0700   0.0688   -0.2361
Calibration              0.4761    0.0106   0.0689   0.0681   -0.2122
Proportional rounding    0.4738    0.0149   0.0693   0.0677   -0.2596

*This is the standard deviation of β5 across the 1000 simulation replicates.
CHAPTER 6
Rounding methods for ordinal variables
6.1 Introduction
The previous chapter considered rounding for binary variables under multivariate
normal imputation (MVNI). We now extend our approach to ordinal variables with
more than two categories. Under MVNI, an ordinal variable may be imputed as
either a single continuous variable or as a set of indicator variables. In either case,
the imputed values are then assigned or ‘rounded’ to one of the ordinal categories.
Note that it is not possible to use the unrounded imputed values if the substantive
analysis involves estimating the relationship between the levels of an ordinal variable
and an outcome [29]. For this reason, we will not consider unrounded MVNI in this
chapter.
Crude rounding [46], calibration [61] and mean indicator-based rounding (MIBR)
[29] impute an ordinal variable as a single continuous variable. We refer to these
methods as continuous methods. Schafer [46, p.148] recommends crude rounding
when the variable has similar proportions in each category and the percentage of
missing data is not very high. However, this method was shown to introduce bias into
the marginal distribution of the categorical variable [24, 61]. Calibration and MIBR
perform well in some settings but are two-stage methods that are computationally
intensive and time-consuming to implement, particularly for large data sets [29].
Both distance-based rounding (DBR) [12] and projected distance-based rounding
(PDBR) [2] impute an ordinal variable as a set of indicator variables. We refer to
these methods as indicator-based methods. Demirtas [12] showed that DBR was
slightly better than crude rounding at estimating the proportions in each category but
noted that it performs best when the number of categories is small.
Galati et al. [17] compared PDBR with DBR and crude rounding. They demon-
strated, both empirically and theoretically, that PDBR was superior to both DBR
and crude rounding. However, they noted that none of the three rounding methods
take into account the marginal distribution of the ordinal variable and all introduce
bias into the marginal distribution.
We extend our new method for rounding binary variables, proportional rounding,
to ordinal variables with more than two categories. Under proportional rounding,
an ordinal variable may be imputed as either a single continuous variable, a method
we call continuous proportional rounding (CPR), or as a set of indicator variables,
a method we call indicator-based proportional rounding (IBPR). IBPR can also be
used for nominal variables. By construction, proportional rounding preserves the
proportions in the observed data and should therefore produce unbiased estimates of
proportions when the complete cases are AAR. As noted previously, the advantage
of proportional rounding over calibration is that duplication of the data set is not
required. CPR and IBPR are one-stage methods so they require only one set of
imputations, in contrast to two-stage methods such as calibration and MIBR.
In addition, we introduce an alternative new method, which we call ordinal
rounding. This is a one-stage continuous method that is suitable for ordinal variables
only.
Ordinal variables may also be imputed using fully conditional specification (FCS)
[40], a method of multiple imputation using chained equations, described in Chap-
ter 2. Under FCS, an ordinal variable is imputed using ordinal logistic regression.
Note that each imputed value is one of the ordinal categories so no rounding is
necessary under FCS.
An advantage of FCS is that a separate regression imputation model may be
specified for each type of incomplete variable. However, this may result in incon-
sistencies between imputation models [27]. We compare the performance of the
MVNI-based rounding methods with FCS when there is an ordinal exposure and a
binary outcome for an MCAR and an MAR mechanism. To the best of our knowl-
edge, there have been no studies to date comparing MVNI-based rounding methods
with FCS in the context of estimating the relationship between the levels of an
ordinal exposure and an outcome.
The outline of this chapter is as follows. In Section 6.2 we describe existing
indicator-based methods, followed by a comparison of DBR and PDBR in Sec-
tion 6.3. An overview of existing continuous methods is given in Section 6.4. In
Sections 6.5–6.7 we describe our new methods CPR, IBPR and ordinal rounding.
The data set used in this study and the substantive analysis model are described
in Section 6.8. Section 6.9 describes the method, including the missingness models
and evaluation criteria. In Section 6.10 we summarise the results, followed by a
discussion in Section 6.11.
6.2 Existing indicator-based methods
In general, an ordinal variable with k levels can be represented by k−1 indicator
(‘dummy’) variables, one for each ordinal level excluding the reference group. Each
indicator variable is a binary variable denoting membership of the corresponding
category. An observation that belongs to the reference group will have a value of
‘0’ for each of the indicator variables. Otherwise, an observation that belongs to
category j will have a value of ‘1’ for the indicator variable corresponding to category
j and a value of ‘0’ for each of the other indicator variables.
Indicator-based methods impute an incomplete ordinal variable as a set of k− 1
indicator variables. Thus after imputation, each missing observation will have a set
of k − 1 imputed values. The imputed value corresponding to the reference group
is calculated by subtracting the sum of the k − 1 imputed values from 1. Note that
since the imputed values are generated from a multivariate normal distribution, it
is possible to have imputed values that are less than 0 or greater than 1.
Example: Consider an ordinal variable weight with three categories: under-
weight, normal and overweight, with underweight as the reference group. This ordi-
nal variable has two indicator variables: one for normal and another for overweight.
Each observation may be represented as a vector (In, Iow), where In has a value of
1 if the subject is normal weight and 0 otherwise; similarly, Iow has a value of 1 if
the subject is overweight and 0 otherwise. If the imputed values for a missing ob-
servation are (0.3, 0.1) then the imputed value for the reference group underweight
is 1− (0.3 + 0.1) = 0.6.
6.2.1 Projected distance-based rounding Proposed by Allison [2], PDBR
assigns an incomplete observation to the category with the highest imputed value.
In the example above, the missing observation would be assigned to the underweight
category as this has the highest imputed value of 0.6. PDBR can be used to round
nominal (unordered) or ordinal variables. It is unclear how PDBR would assign an
observation in the event that two or more indicators have the same imputed value.
6.2.2 Distance-based rounding This method was proposed by Demirtas
[12] and can be used to round nominal or ordinal variables. In DBR, a missing
observation is assigned to the ordinal category with the smallest Euclidean distance
to its imputed values. Let w = (w1, . . . , wk−1) be the vector of imputed values for
an incomplete observation and let vj = (I1j, . . . , I(k−1)j) be the vector corresponding
to category j where
Iij = 1 if i = j, and Iij = 0 if i ≠ j.
The Euclidean distance dj from the set of imputed values to category j is

dj = √[∑_{i=1}^{k−1} (wi − Iij)²], j = 1, . . . , k. (6.1)
A three-level ordinal variable has three indicator vectors: v1 = (1, 0), v2 = (0, 1)
and the reference group v3 = (0, 0). In the example above, the vector of imputed
values is (0.3, 0.1) and the unit vectors representing normal and overweight are (1, 0)
and (0, 1) respectively. The Euclidean distance to each of the weight categories is
calculated as follows:
d_underweight = √((0.3 − 0)² + (0.1 − 0)²) = 0.32,
d_normal = √((0.3 − 1)² + (0.1 − 0)²) = 0.71,
d_overweight = √((0.3 − 0)² + (0.1 − 1)²) = 0.95.
The missing observation would be assigned to the category underweight since the
imputed values have the smallest Euclidean distance to this category. It is unclear
how DBR would assign an observation in the event that two or more indicators have
the same Euclidean distance.
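The worked example can be checked in code. The sketch below (Python, with our own function names) implements both assignment rules for a three-category variable, with indices 0, 1 and 2 standing for normal, overweight and the reference group underweight:

```python
import math

def pdbr(w):
    """PDBR: assign to the category with the highest imputed value,
    where the reference-group value 1 - sum(w) is appended last."""
    vals = list(w) + [1 - sum(w)]
    return max(range(len(vals)), key=lambda j: vals[j])

def dbr(w):
    """DBR: assign to the category whose indicator vector is closest
    (Euclidean distance) to the vector of imputed values."""
    k = len(w) + 1
    def indicator(j):
        return [1.0 if i == j else 0.0 for i in range(k - 1)]  # reference -> zeros
    return min(range(k), key=lambda j: math.dist(w, indicator(j)))

labels = ["normal", "overweight", "underweight"]  # reference group last
w = (0.3, 0.1)
print(labels[pdbr(w)], labels[dbr(w)])  # underweight underweight
```

Both rules assign the example observation to the reference group, matching the hand calculation.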
We show that for a binary variable, PDBR and DBR are equivalent to simple
(crude) rounding for w ≠ 0.5. Suppose that a binary variable has a missing obser-
vation with an imputed value of w. The Euclidean distance to 0 is √((w − 0)²) = |w|
and the Euclidean distance to 1 is √((w − 1)²) = |w − 1|. Under DBR, the imputed
value is rounded to 0 if |w| < |w − 1|, i.e. if w < 0.5, and rounded to 1 if
|w − 1| < |w|, i.e. if w > 0.5. This is equivalent to simple rounding for binary
variables (for w ≠ 0.5).
A similar argument can be made for PDBR. If a missing observation has an
imputed value of w, then the imputed value corresponding to the reference group
is 1 − w. Since PDBR rounds to the category with the highest imputed value, the
missing observation is rounded to 0 if 1 − w > w, i.e. if w < 0.5, and rounded to 1
if w > 1 − w, i.e. if w > 0.5. Thus PDBR is also equivalent to simple rounding for
binary variables (for w ≠ 0.5).
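The equivalence can be checked numerically (an illustrative Python sketch; function names are ours):

```python
def simple_round(w):
    """Simple rounding with a fixed threshold of 0.5."""
    return 1 if w > 0.5 else 0

def dbr_binary(w):
    """DBR for a binary variable: round to whichever of 0 or 1 is closer."""
    return 0 if abs(w) < abs(w - 1) else 1

def pdbr_binary(w):
    """PDBR for a binary variable: round to the larger of the imputed value w
    (category 1) and the derived reference value 1 - w (category 0)."""
    return 1 if w > 1 - w else 0

# Imputed values may fall outside [0, 1]; the tie at w = 0.5 is excluded
for w in (-0.2, 0.1, 0.49, 0.51, 0.9, 1.3):
    assert simple_round(w) == dbr_binary(w) == pdbr_binary(w)
```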
6.3 Comparison of DBR and PDBR
Galati et al. [17] compared DBR and PDBR from a theoretical standpoint. We
illustrate their argument in more detail with the following example. Suppose we
have a variable with k = 3 categories and w = (w1, w2) is the vector of k − 1 = 2
imputed values for an incomplete observation. Let category 1 be represented by
the unit vector (1,0), category 2 by the unit vector (0,1) and the reference group
(category 3) by the origin (0,0). Under DBR, the squared Euclidean distance from
w to the reference group denoted by the origin (0,0) is given by
d² = w1² + w2². (6.2)
The squared Euclidean distance to category 1 represented by the unit vector (1,0)
is given by
d² = (w1 − 1)² + w2² = −2w1 + 1 + w1² + w2². (6.3)
The squared Euclidean distance to category 2 represented by the unit vector (0,1)
is given by
d² = w1² + (w2 − 1)² = −2w2 + 1 + w1² + w2². (6.4)
Under DBR, an incomplete observation is assigned to the category 1 if the
squared Euclidean distance to (1,0) is less than each of the squared distances to
(0,0) and (0,1). Using (6.2-6.4) we have
−2w1 + 1 + w1² + w2² < w1² + w2²,
∴ w1 > 1/2, (6.5)
and
−2w1 + 1 + w1² + w2² < −2w2 + 1 + w1² + w2²,
∴ w1 > w2. (6.6)
Thus an incomplete observation will be assigned to category 1 if w1 > 0.5 and
w1 > w2. That is, it will be assigned to category 1 if w1 is the maximum of w1, w2, w3
where w3 = 1−w1 −w2 is the imputed value corresponding to the reference group.
Note that under PDBR, if w1 is the maximum of w1, w2, w3 then the incomplete
observation will also be assigned to category 1. Thus if DBR assigns an observation
to category 1 then PDBR will also assign the observation to category 1.
A similar argument can be made for assigning observations to category 2. Under
DBR, an incomplete observation is assigned to category 2 if the squared Euclidean
distance to (0,1) is less than each of the squared distances to (1,0) and (0,0). Using
(6.2-6.4) we have
−2w2 + 1 + w1² + w2² < w1² + w2²,
∴ w2 > 1/2, (6.7)
and
−2w2 + 1 + w1² + w2² < −2w1 + 1 + w1² + w2²,
∴ w2 > w1. (6.8)
Thus an incomplete observation will be assigned to category 2 if w2 is the maximum
of w1, w2, w3. On that basis, PDBR will also assign the observation to category 2.
Under DBR, an incomplete observation is assigned to the reference group when
the squared distance to (0,0) is less than each of the squared distances to (1,0) and
(0,1). Using (6.2-6.4) we have
w1² + w2² < −2w1 + 1 + w1² + w2²,
∴ w1 < 1/2, (6.9)
and
w1² + w2² < −2w2 + 1 + w1² + w2²,
∴ w2 < 1/2. (6.10)
Thus an incomplete observation is assigned to the reference group if w1 < 0.5 and
w2 < 0.5. However, this does not mean that PDBR will assign the observation to
the reference group since w3 is not necessarily the maximum of w1, w2, w3.
The above arguments can be extended to variables with k ≥ 3 categories and
may be summarised as follows [17]:
1. DBR and PDBR assign an observation to the same category if neither of them
assigns it to the reference group.
2. DBR and PDBR differ only with respect to rounding imputed values to the
reference group.
Galati et al. [17] also demonstrated that DBR biases the rounding of imputed
values towards the reference group and that the bias increases with the number of
categories k. In general, if there are k categories the average vector of imputed
values is (1/k, 1/k, . . . , 1/k). As k increases, the value of 1/k approaches 0, with the
result that observations are more likely to be assigned to the reference group [17].
For this reason, DBR performs best when the number of categories is small.
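Property 2 above can be verified numerically over a grid of imputed-value vectors. The sketch below (Python, with our own function names) asserts that whenever DBR and PDBR disagree, the reference group is involved:

```python
import itertools
import math

def pdbr(w):
    """PDBR: category with the highest imputed value (reference appended last)."""
    vals = list(w) + [1 - sum(w)]
    return max(range(len(vals)), key=lambda j: vals[j])

def dbr(w):
    """DBR: category whose indicator vector is closest to w (reference last)."""
    k = len(w) + 1
    def indicator(j):
        return [1.0 if i == j else 0.0 for i in range(k - 1)]  # reference -> zeros
    return min(range(k), key=lambda j: math.dist(w, indicator(j)))

REF = 2  # index of the reference group for a three-category variable
grid = [x / 20 for x in range(-4, 25)]  # includes values outside [0, 1]
for w in itertools.product(grid, repeat=2):
    a, b = dbr(w), pdbr(w)
    # if the two methods disagree, the reference group must be involved
    assert a == b or REF in (a, b)
```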
6.4 Existing continuous methods
These methods impute an incomplete ordinal variable as a single continuous
variable. Thus after imputation, each missing observation will have only one corre-
sponding imputed value. Note that continuous methods cannot be used for nominal
(unordered) categorical variables.
6.4.1 Crude rounding In crude rounding, the imputed values are rounded
to the nearest category [46, p.148]. Using the example in Section 6.2, if we denote
underweight by ‘0’, normal by ‘1’ and overweight by ‘2’, the imputed values would
be rounded as follows:

rounded value = 0 (underweight) if imputed value < 0.5,
rounded value = 1 (normal) if 0.5 ≤ imputed value < 1.5,
rounded value = 2 (overweight) if imputed value ≥ 1.5.
Note that crude rounding is a fixed threshold method that does not take into account
the marginal distribution of the ordinal variable.
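The rounding rule translates directly into code (a sketch; the function name is ours):

```python
def crude_round(imputed_value):
    """Crude rounding for the three-level weight example:
    0 = underweight, 1 = normal, 2 = overweight."""
    if imputed_value < 0.5:
        return 0
    elif imputed_value < 1.5:
        return 1
    return 2

print([crude_round(v) for v in (-0.3, 0.4, 0.8, 1.49, 1.5, 2.7)])
# [0, 0, 1, 1, 2, 2]
```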
6.4.2 Calibration The calibration method for rounding binary variables [60]
is readily extended to ordinal variables [61]. This two-stage approach applies the
following steps [61].
Stage 1
1. Create a copy of the data set and delete from it the observed values of the
incomplete ordinal variable. This leaves no observed values for the ordinal
variable in the duplicated data set.
2. Vertically ‘stack’ the original and the duplicated data sets to create a single
stacked data set.
3. Impute the ordinal variable in the stacked data set as a single continuous
variable.
The following steps are performed for each imputed data set.
4. Identify the subset of imputed values in the duplicated data set that correspond
to observed values in the original data set.
5. For this subset of imputed values, determine rounding thresholds that produce
the same proportions in each category as in the observed data.
Stage 2
1. Restore the original data set and impute the ordinal variable as a single con-
tinuous variable.
2. For each imputed data set, use the corresponding rounding thresholds obtained
in stage 1 to round the imputed values for the ordinal variable.
Note that rounding thresholds must be calculated for each imputed data set. The
disadvantages of calibration are that duplication of the data is required and imputa-
tion must be performed twice, making it time-consuming to implement. As noted in
Chapter 5, the rounding thresholds calculated in stage 1 are based on the imputed
values obtained using the ‘stacked’ data set. These will be different to the imputed
values obtained in stage 2. As stated previously, the lack of ‘correspondence’ be-
tween the imputed values in stages 1 and 2 is a further drawback of the calibration
method.
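One way the stage-1 threshold calculation might be implemented is via order statistics of the imputed values. The sketch below is our own illustration of the idea, not the exact procedure of Yucel et al.; all names and data are hypothetical:

```python
def calibration_thresholds(imputed, observed_props):
    """Stage-1 sketch: choose cut-points on the imputed continuous values
    so that rounding reproduces the observed category proportions.
    observed_props lists proportions for categories 0..k-1, lowest first."""
    ordered = sorted(imputed)
    n = len(ordered)
    thresholds, cum = [], 0.0
    for p in observed_props[:-1]:  # k - 1 cut-points for k categories
        cum += p
        thresholds.append(ordered[min(n - 1, round(cum * n))])
    return thresholds

def round_with_thresholds(value, thresholds):
    """Category = number of thresholds the value reaches."""
    return sum(value >= t for t in thresholds)

# Hypothetical stage-1 imputed values; observed proportions 0.5 / 0.3 / 0.2
imputed = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 1.4, 1.6]
cuts = calibration_thresholds(imputed, [0.5, 0.3, 0.2])
rounded = [round_with_thresholds(v, cuts) for v in imputed]
print(cuts, [rounded.count(c) for c in (0, 1, 2)])  # [0.6, 1.4] [5, 3, 2]
```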
Yucel et al. [61] presented an indicator-based approach to calibration for nominal
(unordered) variables. However, they did not specify how the method should be
implemented in order to prevent missing observations being unassigned or assigned
to more than one category.
6.4.3 Mean indicator-based rounding Lee et al. [29] proposed an alterna-
tive two-stage approach for rounding ordinal variables as follows. In the first stage,
k − 1 indicator variables are imputed using MVNI. The mean of each indicator vari-
able is then calculated for the entire imputed data set (consisting of the observed
and imputed values). The mean of the indicator variable for category j = 1, 2, . . . , k
represents an estimate of the proportion of observations in category j.
In the second stage, the original data set is restored and the ordinal variable is
imputed as a single continuous variable. The imputed values are rounded so that
the proportion of observations in category j is equal to the corresponding indicator
mean. The above steps are
performed for each imputed data set. Note that MIBR assumes that the imputation
model accurately estimates the first-order moments of the multivariate distribution
[29].
Lee et al. [29] showed that MIBR preserves the marginal distribution of the
ordinal variable. However, a disadvantage of the method is that imputations must
be performed twice.
6.5 Continuous proportional rounding
CPR is similar to proportional rounding for binary variables, except that there
are k > 2 categories. The preliminary step involves determining the proportion
p1, p2, . . . , pk of observations in each category for the complete cases, where k is the
highest ordinal category. The corresponding number n1, n2, . . . , nk of imputed values
required in each category is then calculated. For category j, the required number is
nj = pj × number of missing values, j = 1, 2, . . . , k,
where nj is rounded to the nearest integer. The following steps are then applied.
1. Impute the ordinal variable as a single continuous variable.
The following steps are performed for each imputed data set.
2. Sort the imputed values for the continuous variable in descending order.
3. Round the first nk imputed values to the highest ordinal category, the next
nk−1 imputed values to the second highest ordinal category and so on until
the last n1 imputed values have been rounded to the lowest ordinal category.
Note that there is no need to calculate any rounding thresholds. The only calcu-
lations that are necessary are the required number of ones in each category, which
will be the same for each imputed data set.
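The steps above can be sketched as follows (an illustrative Python version; the thesis implements CPR in Stata, and the function and variable names are ours):

```python
# Sketch of continuous proportional rounding (CPR) for one imputed data set.
def cpr_round(imputed, complete_cases, k):
    """imputed: continuous imputed values for the missing observations;
    complete_cases: observed ordinal values coded 1..k.
    Returns a rounded category for each imputed value."""
    m = len(imputed)
    # Required number of 'ones' per category from the observed proportions,
    # rounded to the nearest integer (if the n_j do not sum to m exactly,
    # a real implementation would adjust them; ignored in this sketch).
    n = {j: round(complete_cases.count(j) / len(complete_cases) * m)
         for j in range(1, k + 1)}
    # Sort the imputed values in descending order, remembering positions.
    order = sorted(range(m), key=lambda i: imputed[i], reverse=True)
    rounded = [None] * m
    pos = 0
    for j in range(k, 0, -1):       # the n_k largest go to the top category
        for i in order[pos:pos + n[j]]:
            rounded[i] = j
        pos += n[j]
    return rounded

cats = cpr_round([2.9, 0.2, 1.4, 2.1], [1, 2, 3, 3], 3)
```

With observed proportions 0.25/0.25/0.5, the two largest imputed values are rounded to category 3, the next largest to category 2 and the smallest to category 1, so `cats` is `[3, 1, 2, 3]`.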
In contrast to MIBR, estimates of proportions are based on the proportions ob-
served in the complete cases rather than the post-imputation indicator means. Thus
the estimates of the proportions in each category are unaffected by any bias in the
imputation model. However, proportional rounding assumes that the observed pro-
portions reasonably approximate the marginal distribution of the ordinal variable.
For an MCAR mechanism, there is little benefit to using MIBR over CPR since the
observed proportions represent unbiased estimates of proportions.
6.6 Indicator-based proportional rounding
In IBPR, the indicator variable corresponding to the category with the highest
proportion of observations is rounded first, followed by the indicator variables corre-
sponding to the other categories, in order of size. Rounding each indicator variable
in turn avoids the issue of missing observations being unassigned or assigned to more
than one category.
First, the proportion p1 ≤ p2 ≤ . . . ≤ pk of observations in each category is cal-
culated for the complete cases, where k is the category with the highest proportion
of observed values. The corresponding number n1 ≤ n2 ≤ . . . ≤ nk of ones required
in each category is then calculated. For category j, the required number of ones is
nj = pj × (number of missing values), j = 1, 2, . . . , k,
where nj is rounded to the nearest integer. The following steps are then applied.
1. Impute the ordinal variable as a set of k − 1 indicator variables.
2. For each missing observation, calculate the imputed value for the reference
category by subtracting the sum of the k − 1 imputed values from 1. There
are now k ‘filled in’ indicator variables.
The following steps are performed for each imputed data set.
3. Set j = k, the category with the highest proportion of observed values.
4. For the indicator variable corresponding to category j, sort the imputed values
in descending order.
5. Assign the first nj of these imputed values to category j. Thus the largest
nj imputed values for the indicator variable corresponding to category j are
assigned to category j.
6. If j > 1 decrement j to j − 1 (the category with the next highest proportion
of observations) and return to step 4.
Note that there is no need to calculate any rounding thresholds. The only calcu-
lations that are necessary are the required number of ones in each category, which
will be the same for each imputed data set.
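Steps 3–6 can be sketched as follows (illustrative Python, names ours; the required counts n_j come from the observed proportions as described above, and observations already assigned to a category are skipped so that none is assigned twice):

```python
# Sketch of the IBPR assignment loop for one imputed data set.
def ibpr_round(indicators, n_required):
    """indicators: dict j -> imputed indicator values for the missing
    observations (all k 'filled in' indicators, including the reference);
    n_required: dict j -> number of ones required in category j."""
    m = len(next(iter(indicators.values())))
    assigned = [None] * m
    # Categories in decreasing order of observed proportion (= required count).
    for j in sorted(n_required, key=n_required.get, reverse=True):
        # Not-yet-assigned observations, by imputed indicator value, descending.
        free = [i for i in range(m) if assigned[i] is None]
        free.sort(key=lambda i: indicators[j][i], reverse=True)
        for i in free[:n_required[j]]:
            assigned[i] = j
    return assigned

assigned = ibpr_round(
    {1: [0.1, 0.0, 0.2, 0.9],
     2: [0.2, 0.8, 0.5, 0.4],
     3: [0.9, 0.1, 0.7, 0.3]},
    {3: 2, 2: 1, 1: 1},
)
```

Category 3 (the highest observed proportion) takes the two observations with the largest category-3 indicator values, then categories 2 and 1 fill in from what remains, giving `[3, 2, 3, 1]`.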
DBR and PDBR deal with each missing case in isolation and do not take into ac-
count the marginal distribution of the ordinal variable. On the other hand, IBPR ex-
amines all of the imputed values for an indicator variable and preserves the observed
proportions in the data. The observed proportions represent unbiased estimates of
the marginal proportions when the complete cases are AAR.
Since the ordering of the values of the categorical variable has not been used,
IBPR can also be employed for nominal variables.
6.7 Ordinal rounding
We introduce another new approach, which we call ordinal rounding. This
method may be used to round ordinal variables but is not suitable for nominal
variables.
For an ordinal variable X considered as a continuous variable, let x̄ be the mean and s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² the variance of the complete cases. Let p1, p2, . . . , pk be the observed proportion in each category for the complete cases, where k is the highest ordinal category. The ordinal variable is imputed as a single continuous
variable and the rounding threshold for each category is calculated as follows.
1. Put j = k.
2. For category j, the threshold is
tj = x̄ − Φ⁻¹(pj + pj+1 + · · · + pk) s,
where Φ⁻¹ is the inverse of the standard normal cumulative distribution function. Imputed values greater than tj that have not already been assigned to a category are rounded to category j.
3. Decrement j to j − 1.
4. If j > 1 return to step 2.
For example, an ordinal variable with three categories will have two rounding
thresholds t3 and t2. Imputed values greater than t3 are assigned to the highest
ordinal category while imputed values between t2 and t3 are assigned to the second
(middle) category. The remaining imputed values are assigned to the lowest ordinal
category. Note that the rounding threshold for each category is the same for each
imputed data set.
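The threshold construction can be sketched with the standard library's NormalDist (illustrative Python; the thesis uses Stata, and the helper names and label-based generalisation are ours). The example uses the complete-data summaries reported for weight in Section 6.8 (mean 1.5204, SD 0.6272, proportions 0.0719/0.3358/0.5923 on the 0/1/2 coding):

```python
from statistics import NormalDist

# Sketch of ordinal rounding: thresholds from the complete-case mean, SD
# and category proportions, then assignment from the top category down.
def ordinal_thresholds(xbar, s, p, cats):
    """cats: ordered category labels (lowest first); p: label -> proportion.
    Returns t_c = xbar - s * Phi^{-1}(tail proportion at and above c)."""
    inv = NormalDist().inv_cdf
    return {c: xbar - s * inv(sum(p[d] for d in cats[idx:]))
            for idx, c in enumerate(cats) if idx > 0}

def round_ordinal(x, t, cats):
    """Assign x to the highest category whose threshold it exceeds,
    defaulting to the lowest category."""
    for c in reversed(cats[1:]):
        if x > t[c]:
            return c
    return cats[0]

t = ordinal_thresholds(1.5204, 0.6272,
                       {0: 0.0719, 1: 0.3358, 2: 0.5923}, [0, 1, 2])
```

The two thresholds come out near 1.37 (overweight) and 0.60 (normal), so an imputed value of 1.5 is rounded to overweight and a value of 1.0 to normal.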
Ordinal rounding is based on the information available from the complete cases
and should therefore produce unbiased estimates if the complete cases are AAR.
When the variable has only two categories (k = 2), ordinal rounding is similar to
adaptive rounding except that it uses the mean of the complete cases instead of the
mean of the imputed binary variable.
6.8 Substantive analysis model
The data set used in this study was derived from the National Health and Nutrition Examination Survey (NHANES III) conducted by the National Center for Health Statistics (NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4.
For the purposes of this analysis, weight was divided into three categories based
on Body Mass Index (BMI):
weight = 0 (underweight) if BMI < 20
         1 (normal)      if 20 ≤ BMI ≤ 25
         2 (overweight)  if BMI > 25.
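This coding can be expressed as a small helper (Python for illustration; the thesis performs the recoding in Stata):

```python
# BMI-to-weight-category coding used in the substantive analysis.
def weight_category(bmi):
    if bmi < 20:
        return 0   # underweight
    if bmi <= 25:
        return 1   # normal
    return 2       # overweight
```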
Of the 16963 subjects, 59.23% were overweight, 33.58% normal weight and 7.19%
underweight as shown in Figure 6.1. The ordinal variable weight is therefore asym-
metrical with the underweight category having a very low prevalence and the over-
weight category predominating. When represented as a continuous variable, weight
has a mean of 1.5204 with a standard deviation of 0.6272. The proportion of obser-
vations with high blood pressure was 14.27% for underweight subjects, 16.89% for
normal weight subjects and 23.29% for overweight subjects, as shown in Figure 6.2.
Figure 6.1: Proportion by weight category in the full data set (n = 16963).
The substantive analysis is a logistic regression
logit Pr(hbp) = β0 + β1 normal + β2 overweight + error, (6.11)
which calculates the log odds of high blood pressure for normal and overweight subjects compared with the reference group of underweight subjects. The parameters of interest are the coefficients β1 and β2 with corresponding odds ratios e^β1 and e^β2.
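The coefficient-to-odds-ratio relationship can be checked directly (a minimal sketch using the full-data estimates quoted in Table 6.1):

```python
import math

# Odds ratios are obtained by exponentiating the logistic regression
# coefficients: OR = exp(beta).
beta1, beta2 = 0.1990, 0.6007       # full-data estimates from Table 6.1
or1 = math.exp(beta1)               # ~1.2202 (normal vs underweight)
or2 = math.exp(beta2)               # ~1.8234 (overweight vs underweight)
```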
The odds ratio of high blood pressure for normal weight subjects is 1.2202, while for overweight subjects it is 1.8234, relative to the underweight reference group, as shown in Figure 6.3. This indicates that there is a positive relationship between high blood pressure and weight category in this data set.
6.9 Method
From the full data set with 16963 subjects, the true proportion p of subjects in
each category was calculated and logistic regression was used to obtain the ‘true
values’ of the regression coefficients β1 and β2.
Figure 6.2: Proportion of observations with high blood pressure by weight category in
the full data set (n = 16963).
Figure 6.3: Odds of high blood pressure by weight category in the full data set (n =
16963).
Note that, due to time constraints, simulations were performed using the full data
set only. Missingness was imposed on the ordinal variable weight for an MCAR and
an MAR missingness mechanism, as described in Subsection 6.9.1 below.
6.9.1 Missingness Models The MCAR and MAR missingness models are
the same as those in Chapter 5. In the MCAR model, the probability of missingness
for the variable weight was set to 48% for each subject. As noted previously, since
there is only one incomplete variable, AAR and MCAR are equivalent.
In the MAR model, missingness was imposed on weight using a logistic regression
model, with the probability of missingness dependent on age, sex, race and hbp. The
MAR missingness model was
logit Pr(weight missing) = 2 − 0.025 × age − sex − race + hbp. (6.12)
To determine if a weight observation was declared missing, a pseudo-random number
between 0 and 1 was generated from a uniform distribution. If the number was less
than the probability of missingness, as calculated for each missingness model, the
observation was declared missing. The average missingness rate was 46.5% for the
MAR model compared to 48% for the MCAR model. Thus, on average, just under
half of the observations were missing.
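The mechanism can be sketched as follows (illustrative Python; the thesis imposes missingness in Stata, and the record fields here are ours):

```python
import math
import random

# Missingness probability from the MAR model (6.12), via the inverse logit.
def missingness_prob(age, sex, race, hbp):
    lp = 2 - 0.025 * age - sex - race + hbp
    return 1 / (1 + math.exp(-lp))

# Declare weight missing when a uniform draw falls below that probability.
def impose_mar(records, rng):
    for r in records:
        if rng.random() < missingness_prob(r['age'], r['sex'],
                                           r['race'], r['hbp']):
            r['weight'] = None
    return records
```

For example, a 40-year-old with sex = race = 1 and no high blood pressure has linear predictor 2 − 1 − 1 − 1 = −1 and hence a missingness probability of about 0.27.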
6.9.2 Simulations The following steps were performed for each simulation
replicate i = 1, . . . , 1000.
1. Impose missingness on the weight variable.
2. Depending on the method, impute the missing values as either a single con-
tinuous variable or as a set of indicator variables to create 30 imputed data
sets. The outcome variable hbp was included in the imputation model, as well
as the auxiliary variables age, sex, race and smoke.
3. Where applicable, round the imputed values to one of the ordinal categories
using the rounding method.
4. Use the command mi estimate to combine the imputed data sets and obtain
estimates of the regression coefficients, corresponding standard errors and pro-
portions in each category.
The estimates from the 1000 simulation replicates were averaged to produce overall
estimates of the regression coefficients, corresponding standard errors and propor-
tions in each category for each method and missingness mechanism.
For comparability and reproducibility, we used the same random seed in Stata
at the beginning of each set of 1000 simulation replicates.
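Step 4 relies on Rubin's rules, which `mi estimate` applies internally; the pooling itself can be sketched as follows (illustrative Python, not the Stata implementation):

```python
import math

# Rubin's rules: combine estimates q_i and squared standard errors u_i
# from M imputed data sets into a pooled estimate and standard error.
def rubins_rules(estimates, variances):
    M = len(estimates)
    qbar = sum(estimates) / M                              # pooled estimate
    ubar = sum(variances) / M                              # within-imputation
    b = sum((q - qbar) ** 2 for q in estimates) / (M - 1)  # between-imputation
    return qbar, math.sqrt(ubar + (1 + 1 / M) * b)

qbar, se = rubins_rules([1.0, 1.2, 1.1], [0.04, 0.04, 0.04])
```

Note how the pooled standard error exceeds the average within-imputation standard error: the between-imputation term reflects the extra uncertainty due to the missing data.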
6.9.3 Evaluation criteria The following criteria were used to compare the
methods.
1. Bias. For the regression coefficient βj, this is defined as E(β̂j) − βj for j = 1, 2, where βj is the true value of the regression coefficient obtained from the full data set. In this study, E(β̂j) is estimated by (1/1000) ∑ᵢ₌₁¹⁰⁰⁰ β̂ij for j = 1, 2, over the simulation replicates i = 1, . . . , 1000.
2. Standard error (SE). For each regression coefficient, this is calculated as the
average standard error over the 1000 simulation replicates [27].
3. Root mean square error (RMSE). For the regression coefficient βj, j = 1, 2, this is defined as √E[(β̂j − βj)²], estimated by
√((1/1000) ∑ᵢ₌₁¹⁰⁰⁰ (β̂ij − βj)²)
over the 1000 simulation replicates.
4. Distance. This is the Euclidean distance between the actual and estimated
proportions, given by
√((p1 − E(p̂1))² + (p2 − E(p̂2))² + (p3 − E(p̂3))²),
where p1, p2, p3 are the true proportions in the full data set and E(p̂1), E(p̂2), E(p̂3) are estimated by the corresponding average proportions (over the 1000 simulation replicates) of underweight, normal weight and overweight subjects respectively.
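The four criteria reduce to a few lines of code (illustrative helpers; names are ours):

```python
import math

# Evaluation criteria over the simulation replicates.
def bias(estimates, true_value):
    return sum(estimates) / len(estimates) - true_value

def rmse(estimates, true_value):
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates)
                     / len(estimates))

def distance(p_true, p_avg):
    """Euclidean distance between true and average estimated proportions."""
    return math.sqrt(sum((t - a) ** 2 for t, a in zip(p_true, p_avg)))
```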
6.10 Results
The results of the simulations are presented in Tables 6.1–6.4. The continuous
methods, including FCS, produced better estimates of the regression coefficients
than the indicator-based methods in terms of bias and RMSE (Tables 6.1 and 6.2).
This suggests that continuous methods produce better estimates of regression co-
efficients for a positive exposure-outcome relationship. Our results show that the
performance of FCS is comparable to the continuous MVNI-based rounding meth-
ods.
All of the indicator-based methods underestimated the odds of high blood pres-
sure for normal weight and overweight subjects. DBR was the worst method overall
for estimating regression coefficients and odds ratios, in terms of bias and RMSE.
The best methods for estimating proportions were FCS and the MVNI-based rounding methods that use the marginal distribution of the ordinal variable, namely CPR, IBPR, calibration, MIBR and ordinal rounding (Tables 6.3 and 6.4). The worst method for estimating proportions was crude rounding, followed by DBR and PDBR (in that order). Graphs comparing the Euclidean distances for each method are given in Figures 6.8 and 6.9.
Crude rounding underestimated the proportions in the underweight and over-
weight categories and overestimated the proportions in the normal category. This
is because, in the full data set, the proportion of normal weight subjects is 0.3358,
while the continuous variable weight has a mean of 1.5204 and a standard deviation
of 0.6272. If weight is imputed as a normally distributed continuous variable, we
would expect roughly 43.5% of the imputed values to be between 0.5 and 1.5 (the
rounding cut-offs for the normal category). Thus crude rounding was biased towards
the normal category in this scenario.
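The 43.5% figure can be verified with the standard library:

```python
from statistics import NormalDist

# If weight is imputed as N(1.5204, 0.6272^2), the probability of an
# imputed value falling between the crude cut-offs 0.5 and 1.5 is well
# above the true normal-weight proportion of 0.3358.
w = NormalDist(mu=1.5204, sigma=0.6272)
p_normal_band = w.cdf(1.5) - w.cdf(0.5)   # ~0.435
```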
Consistent with the findings of Galati et al. [17], DBR was biased towards the
reference group and overestimated the proportion of underweight subjects.
We note that there were larger biases, RMSEs and Euclidean distances for an
MAR mechanism compared to an MCAR mechanism despite a slightly lower average
missingness rate (46.5% for the MAR model compared to 48% for the MCAR model).
Overall, the results show that ordinal rounding, IBPR and CPR are competitive
with existing methods under an MCAR and an MAR mechanism. Note that the
new methods performed well even when the complete cases were not AAR (under
the MAR missingness model in Subsection 6.9.1).
MCAR mechanism
Complete case analysis produced the lowest bias for both regression coefficients but
standard errors were inflated as a result of the reduction in sample size (Table 6.1).
The continuous methods and PDBR produced lower RMSEs than complete case
analysis for the regression coefficient β1 (normal). FCS produced the lowest RMSE
overall for β1. For the regression coefficient β2 (overweight), only the continuous
methods produced lower RMSEs when compared with complete case analysis. Crude
rounding produced the lowest RMSE overall for β2. Graphs comparing RMSEs for
β1 and β2 are given in Figures 6.4 and 6.5.
MAR mechanism
The continuous methods and PDBR produced lower RMSEs than complete case
analysis for the regression coefficient β1 (Table 6.2). CPR produced the lowest bias
and RMSE overall for β1.
All the methods produced large biases and RMSEs for the regression coefficient
β2. Crude rounding was the only method to produce a (very slightly) lower RMSE
than complete case analysis. Graphs comparing RMSEs for β1 and β2 are given in
Figures 6.6 and 6.7.
In terms of Euclidean distance, MIBR produced the best estimates of propor-
tions, while crude rounding was the worst-performing method (Table 6.4).
Table 6.1: Estimates of coefficients β1 and β2 under MCAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.2202        0.1990   0.0892
Complete case          1.2212        0.1998   0.1239    0.0008   0.0864
Indicator-based
  PDBR                 1.1536        0.1429   0.1148   -0.0561   0.0856
  IBPR                 1.0815        0.0783   0.1082   -0.1207   0.1324
  DBR                  1.0508        0.0496   0.0997   -0.1494   0.1567
Continuous
  FCS                  1.2039        0.1856   0.1094   -0.0134   0.0495
  Ordinal              1.2487        0.2221   0.1098    0.0231   0.0565
  MIBR                 1.2487        0.2221   0.1099    0.0231   0.0566
  CPR                  1.2488        0.2222   0.1098    0.0232   0.0566
  Calibration          1.2490        0.2224   0.1097    0.0234   0.0567
  Crude                1.2845        0.2504   0.1128    0.0514   0.0761

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8234        0.6007   0.0852
Complete case          1.8260        0.6021   0.1184    0.0014   0.0842
Indicator-based
  PDBR                 1.6516        0.5017   0.1105   -0.0990   0.1181
  IBPR                 1.5725        0.4526   0.1039   -0.1481   0.1582
  DBR                  1.5205        0.4190   0.0950   -0.1817   0.1879
Continuous
  Crude                1.8199        0.5988   0.1130   -0.0019   0.0683
  Calibration          1.7740        0.5732   0.1091   -0.0275   0.0691
  CPR                  1.7725        0.5724   0.1092   -0.0283   0.0695
  Ordinal              1.7724        0.5723   0.1092   -0.0284   0.0695
  FCS                  1.8077        0.5921   0.1094   -0.0086   0.0697
  MIBR                 1.7724        0.5723   0.1093   -0.0284   0.0697
Table 6.2: Estimates of coefficients β1 and β2 under MAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.2202        0.1990   0.0892
Complete case          1.1627        0.1507   0.1285   -0.0483   0.0985
Indicator-based
  PDBR                 1.1480        0.1380   0.1171   -0.0610   0.0877
  IBPR                 1.0504        0.0492   0.1096   -0.1498   0.1590
  DBR                  0.9756       -0.0247   0.0994   -0.2237   0.2284
Continuous
  CPR                  1.1809        0.1663   0.1109   -0.0327   0.0567
  Calibration          1.1787        0.1644   0.1103   -0.0346   0.0570
  Ordinal              1.1783        0.1641   0.1103   -0.0349   0.0574
  Crude                1.2643        0.2345   0.1142    0.0355   0.0619
  MIBR                 1.1617        0.1499   0.1091   -0.0491   0.0662
  FCS                  1.1484        0.1384   0.1092   -0.0606   0.0746

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8234        0.6007   0.0852
Complete case          1.5709        0.4516   0.1233   -0.1491   0.1697
Indicator-based
  PDBR                 1.4822        0.3935   0.1135   -0.2072   0.2159
  IBPR                 1.3831        0.3243   0.1061   -0.2764   0.2813
  DBR                  1.2744        0.2425   0.0952   -0.3582   0.3609
Continuous
  Crude                1.5699        0.4510   0.1160   -0.1497   0.1631
  FCS                  1.5395        0.4315   0.1118   -0.1692   0.1798
  CPR                  1.5222        0.4201   0.1118   -0.1806   0.1903
  Calibration          1.5143        0.4150   0.1116   -0.1857   0.1951
  Ordinal              1.5141        0.4149   0.1114   -0.1858   0.1952
  MIBR                 1.4972        0.4036   0.1101   -0.1971   0.2056
Table 6.3: Estimates of proportions in each category under MCAR.
Method            Underweight   Normal   Overweight   Distance
Full data              0.0719   0.3358       0.5923
Complete case          0.0719   0.3357       0.5924     0.0001
Indicator-based
  IBPR                 0.0719   0.3357       0.5924     0.0001
  PDBR                 0.0606   0.3512       0.5882     0.0195
  DBR                  0.0936   0.3351       0.5712     0.0303
Continuous
  CPR                  0.0719   0.3358       0.5923     0.0001
  MIBR                 0.0719   0.3358       0.5922     0.0001
  Calibration          0.0720   0.3358       0.5923     0.0001
  Ordinal              0.0720   0.3358       0.5922     0.0002
  FCS                  0.0720   0.3360       0.5920     0.0003
  Crude                0.0624   0.3834       0.5542     0.0618
Table 6.4: Estimates of proportions in each category under MAR.
Method            Underweight   Normal   Overweight   Distance
Full data              0.0719   0.3358       0.5923
Complete case          0.0672   0.3407       0.5921     0.0068
Indicator-based
  IBPR                 0.0672   0.3407       0.5921     0.0068
  PDBR                 0.0573   0.3547       0.5880     0.0243
  DBR                  0.0902   0.3387       0.5711     0.0282
Continuous
  MIBR                 0.0714   0.3378       0.5908     0.0025
  CPR                  0.0672   0.3405       0.5924     0.0066
  Calibration          0.0684   0.3428       0.5888     0.0086
  Ordinal              0.0685   0.3428       0.5887     0.0086
  FCS                  0.0683   0.3430       0.5887     0.0088
  Crude                0.0597   0.3870       0.5532     0.0656
[Bar chart of RMSE values; methods in ascending order: FCS, Ord, MIBR, CPR, Cal, Crude, PDBR, CCA, IBPR, DBR.]
Figure 6.4: RMSEs for β1 under MCAR.
[Bar chart of RMSE values; methods in ascending order: Crude, Cal, CPR, Ord, FCS, MIBR, CCA, PDBR, IBPR, DBR.]
Figure 6.5: RMSEs for β2 under MCAR.
[Bar chart of RMSE values; methods in ascending order: CPR, Cal, Ord, Crude, MIBR, FCS, PDBR, CCA, IBPR, DBR.]
Figure 6.6: RMSEs for β1 under MAR.
[Bar chart of RMSE values; methods in ascending order: Crude, CCA, FCS, CPR, Cal, Ord, MIBR, PDBR, IBPR, DBR.]
Figure 6.7: RMSEs for β2 under MAR.
[Bar chart of Euclidean distances; methods in ascending order: CCA, IBPR, CPR, MIBR, Cal, Ord, FCS, PDBR, DBR, Crude.]
Figure 6.8: Euclidean distances under MCAR.
[Bar chart of Euclidean distances; methods in ascending order: MIBR, CPR, IBPR, CCA, Cal, Ord, FCS, PDBR, DBR, Crude.]
Figure 6.9: Euclidean distances under MAR.
6.11 Discussion
The results show that for a positive exposure-outcome relationship, the best
estimates of the regression coefficients are obtained from methods that impute the
ordinal variable as a single continuous variable rather than as a set of indicators.
Our new one-stage methods, CPR and ordinal rounding, produced comparable
results to MIBR and calibration but were easier to implement and faster to run.
The worst method overall was DBR, which substantially underestimated regression
coefficients and odds ratios, particularly for an MAR mechanism.
The best estimates of proportions in terms of Euclidean distance were obtained
using either FCS or MVNI-based rounding methods that utilise the marginal distri-
bution of the ordinal variable. Not surprisingly, crude rounding produced the worst
estimates of proportions as it uses fixed rounding thresholds [17, 29].
We note that CPR and ordinal rounding performed well compared to existing
methods when applied to an asymmetrical ordinal variable where one of the cate-
gories (underweight) had a very low prevalence.
In general, the performance of FCS was comparable to the continuous MVNI-
based methods. However, FCS may be difficult to implement in a general missing
data setting in which missingness occurs across different types of variables [29]. We
note that FCS is also susceptible to perfect prediction when imputing categorical
variables, although this problem did not occur in our study. Perfect prediction occurs when one or more explanatory variables completely separate the levels of the categorical outcome; that is, the explanatory variable(s) perfectly predict the outcome variable.
To mitigate this issue, Stata includes an option known as augmented regression, in
which a few observations with small weightings are added to the data during estima-
tion [53, p.138]. Perfect prediction is a common problem when imputing categorical
variables, particularly for smaller sample sizes, and is an important consideration
when using FCS. White et al. [58] provide a detailed discussion of methods used
to handle perfect prediction, including bootstrapping, penalised regression methods
and augmented regression.
We note that MI is not always superior to complete case analysis when esti-
mating the relationship between the levels of an ordinal exposure and an outcome.
For an MCAR mechanism, all of the continuous methods produced lower RMSEs
than complete case analysis for both regression coefficients. However, for an MAR
mechanism, the MI methods produced large biases and RMSEs for the regression
coefficient β2. Thus the performance of MI may vary considerably for different levels
of an ordinal exposure.
CHAPTER 7
Rounding ordinal variables: non-linear relationship
7.1 Introduction
In the previous chapter we considered rounding methods for ordinal variables
under MVNI. We now extend our approach to ordinal variables in the case of a
non-linear exposure-outcome relationship.
To date, very few studies have examined rounding methods for ordinal variables
in this context. A recent study by Lee et al. [29] concluded that methods that
impute an ordinal exposure variable as continuous tended to ‘flatten’ a non-linear
exposure-outcome relationship. However, they noted that methods that imputed an
ordinal variable as a set of indicator variables preserved the non-linear relationship
but not the proportion of observations in each category. They concluded that further
work was needed to develop a method that would preserve the non-linear association
as well as the marginal distribution of the ordinal variable. The method is expected
to be an indicator-based method in order to preserve the non-linear association.
Note that Lee et al. [29] examined MVNI-based methods only and did not include
FCS in their study.
We observe that there are two types of indicator-based rounding methods. The
first type examines each missing case in isolation, for example projected distance-
based rounding (PDBR) and distance-based rounding (DBR). The second type ex-
amines all the imputed values for the indicator variable. Our new method, indicator-
based proportional rounding (IBPR) introduced in the previous chapter is of the
second type. The advantage of IBPR is that it preserves the proportions in the
observed data and thus produces unbiased estimates of proportions if the complete
cases are AAR.
We compare the performance of the MVNI-based rounding methods with FCS
(ordinal logistic regression) for the case of a v-shaped relationship between an ordinal
exposure variable with three categories and a binary outcome. If there are more than
three categories, this relationship is described as u-shaped [29]. The methods are
compared for MCAR and MAR mechanisms.
The outline of this chapter is as follows. In Section 7.2, we describe the data set
and substantive analysis model. Section 7.3 outlines the method and in Section 7.4
we summarise the results, followed by a discussion in Section 7.5.
7.2 Substantive analysis model
The data set used in this study was derived from the National Health and Nutrition Examination Survey (NHANES III) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4.
To create a non-linear (v-shaped) relationship between hbp and weight, we deleted
400 subjects of normal weight with high blood pressure. For computational ease, we
drew a simple random sample from the remaining data to create a subsample with
5000 subjects. This subsample consists of 60.54% overweight, 31.98% normal weight
and 7.48% underweight subjects as shown in Figure 7.1. Thus the underweight cat-
egory has a very low prevalence in this data set. When represented as a continuous
variable, weight has a mean of 1.5306 with a standard deviation of 0.6315. A graph
of the proportion of observations with high blood pressure by weight category based
on the subsample of 5000 observations is given in Figure 7.2. The proportion of
observations with high blood pressure was 14.71% for underweight subjects, 10.57%
for normal weight subjects and 24.05% for overweight subjects.
The substantive analysis is a logistic regression,
logit Pr(hbp) = β0 + β1 normal + β2 overweight + error, (7.1)
which calculates the log odds of high blood pressure for normal and overweight subjects compared with the reference group of underweight subjects. The parameters of interest are the coefficients β1 and β2 with corresponding odds ratios e^β1 and e^β2.
A graph of the odds of high blood pressure by weight category is shown in
Figure 7.3. The odds ratio for normal weight subjects is 0.6854, while the odds
ratio for overweight subjects is 1.8366. Thus normal weight subjects have lower
odds of high blood pressure, and overweight subjects have higher odds of high blood
pressure, compared to underweight subjects.
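These odds ratios can be recovered, approximately, from the category-wise proportions of high blood pressure reported above (the reported percentages are rounded, so agreement with the fitted 0.6854 and 1.8366 is not exact):

```python
# V-shaped relationship: odds ratio below 1 for normal weight, above 1
# for overweight, relative to the underweight reference group.
def odds(p):
    return p / (1 - p)

o_under = odds(0.1471)
or_normal = odds(0.1057) / o_under   # ~0.685, the dip of the 'v'
or_over = odds(0.2405) / o_under     # ~1.836, the rise of the 'v'
```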
7.3 Method
Simulations were performed to compare the methods using the data set with 5000
subjects described in Section 7.2. The missingness models, simulation procedure and
evaluation criteria are described in Chapter 6.
Figure 7.1: Proportion by weight category in the data set with n = 5000.
Figure 7.2: Proportion of observations with high blood pressure by weight category in
the data set with n = 5000.
Figure 7.3: Odds of high blood pressure by weight category in the data set with n = 5000.
7.4 Results
MCAR mechanism
Complete case analysis produced the lowest bias for both regression coefficients.
However, standard errors were inflated as a result of the reduction in sample size
(Table 7.1).
All of the indicator-based methods produced lower RMSEs than complete case
analysis. DBR produced the lowest RMSE for both regression coefficients. In con-
trast, all of the continuous methods (including FCS) produced large biases and
RMSEs for both regression coefficients, substantially overestimating the odds of
high blood pressure for normal weight and overweight subjects. Graphs comparing
RMSEs are given in Figures 7.4 and 7.5.
Complete case analysis, IBPR, CPR and ordinal rounding produced the best
estimates of proportions, in terms of Euclidean distance (Table 7.3). In contrast,
crude rounding produced the largest Euclidean distance. A graph comparing the
Euclidean distances for each method is given in Figure 7.8.
MAR mechanism
All of the indicator-based methods produced lower biases and RMSEs than complete
case analysis for the regression coefficient β1 (Table 7.2). IBPR produced the lowest
bias and RMSE for this coefficient. In contrast, all of the continuous methods
(including FCS) produced large biases and RMSEs for β1, overestimating the odds
of high blood pressure for normal weight subjects.
The continuous methods performed better than the indicator-based methods in
estimating the regression coefficient β2. The indicator-based methods substantially
underestimated the odds of high blood pressure for overweight subjects. This is
in contrast to the results obtained for an MCAR mechanism where the indicator-
based methods performed better than the continuous methods for both regression
coefficients. Graphs comparing RMSEs are given in Figures 7.6 and 7.7.
In terms of Euclidean distance, MIBR produced the best estimates of propor-
tions, while crude rounding was the worst-performing method (Table 7.4 and Fig-
ure 7.9).
7.5 Discussion
The results show that for an MCAR mechanism and a non-linear exposure-
outcome relationship, the best estimates of regression coefficients are obtained using
indicator-based rounding methods such as PDBR, DBR and IBPR. The continuous
methods produced large biases and RMSEs for both regression coefficients. FCS
was comparable to the continuous MVNI-based methods in terms of bias, RMSE
and estimates of proportions.
A study by Lee et al. [29] concluded that methods that impute an ordinal
exposure variable as continuous tended to ‘flatten’ a non-linear exposure-outcome
relationship. However, methods that impute an ordinal variable as a set of indicator
variables preserved the non-linear relationship but not the proportion of observa-
tions in each category. We found that for an MCAR mechanism, the continuous
methods distorted the non-linear relationship by overestimating the odds of high
blood pressure for both normal weight and overweight subjects. All these methods
produced odds ratios that were close to 1 for normal weight subjects. IBPR was
the only method that preserved the non-linear relationship as well as the marginal
distribution of the ordinal variable.
For an MAR mechanism, the results were not so clear. While the indicator-
based methods produced the best estimates for the regression coefficient β1, the
continuous methods produced better estimates for β2. For both an MCAR and
an MAR mechanism, FCS produced the lowest RMSE for β1 but had the highest
RMSE for β2. This indicates that the performance of MI may vary considerably for
different levels of an ordinal exposure, as noted in the previous chapter.
The best estimates of proportions in terms of Euclidean distance were obtained
using either FCS or MVNI-based rounding methods that preserve the marginal dis-
tribution of the ordinal variable (MIBR, CPR, IBPR, ordinal rounding and calibra-
tion). Not surprisingly, crude rounding produced the worst estimates of proportions
as it uses fixed rounding thresholds.
In general, for a non-linear exposure-outcome relationship, an incomplete ordinal
variable should be rounded using an indicator-based method. IBPR is recommended
over other indicator-based methods as it preserves the non-linear relationship as well
as the marginal distribution of the ordinal variable. We note that IBPR was superior
to existing indicator-based methods even when one of the categories had a very low
prevalence.
Table 7.1: Estimates of coefficients β1 and β2 for a non-linear exposure-outcome rela-
tionship under MCAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              0.6854       -0.3777   0.1671
Complete case          0.6897       -0.3716   0.2329    0.0061   0.1634
Indicator-based
  DBR                  0.7231       -0.3243   0.1936    0.0534   0.1104
  IBPR                 0.7221       -0.3256   0.2061    0.0521   0.1196
  PDBR                 0.7589       -0.2759   0.2152    0.1018   0.1606
Continuous
  FCS                  0.9328       -0.0695   0.2127    0.3082   0.3234
  MIBR                 0.9958       -0.0042   0.2098    0.3735   0.3868
  Calibration          0.9957       -0.0043   0.2101    0.3734   0.3871
  CPR                  0.9966       -0.0034   0.2094    0.3743   0.3876
  Ordinal              0.9969       -0.0031   0.2097    0.3746   0.3879
  Crude                1.0297        0.0293   0.2124    0.4070   0.4214

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8366        0.6079   0.1521
Complete case          1.8480        0.6141   0.2119    0.0062   0.1465
Indicator-based
  DBR                  1.7153        0.5396   0.1765   -0.0683   0.1126
  PDBR                 1.7473        0.5581   0.1997   -0.0498   0.1277
  IBPR                 1.6998        0.5305   0.1902   -0.0774   0.1287
Continuous
  MIBR                 2.0910        0.7377   0.2011    0.1298   0.1761
  CPR                  2.0918        0.7380   0.2008    0.1301   0.1763
  Calibration          2.0906        0.7375   0.2013    0.1296   0.1764
  Ordinal              2.0924        0.7383   0.2010    0.1304   0.1764
  Crude                2.1074        0.7455   0.2067    0.1376   0.1884
  FCS                  2.2965        0.8314   0.2045    0.2235   0.2559
Table 7.2: Estimates of coefficients β1 and β2 for a non-linear exposure-outcome relationship under MAR.

Normal (β1)        Odds Ratio  Coefficient      SE     Bias    RMSE
Full data              0.6854      -0.3777  0.1671
Complete case          0.6101      -0.4942  0.2380  -0.1165  0.1959
Indicator-based
  IBPR                 0.6768      -0.3904  0.2047  -0.0127  0.1009
  DBR                  0.6490      -0.4324  0.1900  -0.0547  0.1047
  PDBR                 0.7227      -0.3248  0.2161   0.0529  0.1288
Continuous
  FCS                  0.8889      -0.1177  0.2099   0.2600  0.2756
  MIBR                 0.9356      -0.0666  0.2065   0.3111  0.3238
  CPR                  0.9457      -0.0558  0.2080   0.3219  0.3348
  Calibration          0.9460      -0.0555  0.2081   0.3222  0.3351
  Ordinal              0.9467      -0.0548  0.2077   0.3229  0.3355
  Crude                1.0272       0.0268  0.2121   0.4045  0.4170

Overweight (β2)    Odds Ratio  Coefficient      SE     Bias    RMSE
Full data              1.8366       0.6079  0.1521
Complete case          1.4858       0.3960  0.2165  -0.2119  0.2582
Indicator-based
  PDBR                 1.4874       0.3970  0.2008  -0.2109  0.2411
  IBPR                 1.4226       0.3525  0.1898  -0.2554  0.2752
  DBR                  1.3818       0.3234  0.1737  -0.2845  0.2982
Continuous
  Ordinal              1.7466       0.5577  0.2032  -0.0502  0.1229
  CPR                  1.7483       0.5586  0.2036  -0.0493  0.1230
  Calibration          1.7474       0.5582  0.2033  -0.0497  0.1237
  MIBR                 1.7379       0.5527  0.2014  -0.0552  0.1238
  Crude                1.7795       0.5763  0.2102  -0.0316  0.1264
  FCS                  1.9165       0.6505  0.2051   0.0426  0.1296
Table 7.3: Estimates of proportions in each category for a non-linear exposure-outcome relationship under MCAR.

Method             Underweight  Normal  Overweight  Distance
Full data               0.0748  0.3198      0.6054
Complete case           0.0750  0.3196      0.6054    0.0002
Indicator-based
  IBPR                  0.0750  0.3196      0.6054    0.0002
  PDBR                  0.0644  0.3357      0.5999    0.0198
  DBR                   0.0970  0.3199      0.5831    0.0315
Continuous
  Ordinal               0.0749  0.3197      0.6054    0.0002
  CPR                   0.0747  0.3197      0.6057    0.0003
  MIBR                  0.0747  0.3196      0.6057    0.0004
  Calibration           0.0747  0.3196      0.6057    0.0004
  FCS                   0.0749  0.3194      0.6057    0.0005
  Crude                 0.0636  0.3721      0.5643    0.0675
Table 7.4: Estimates of proportions in each category for a non-linear exposure-outcome relationship under MAR.

Method             Underweight  Normal  Overweight  Distance
Full data               0.0748  0.3198      0.6054
Complete case           0.0705  0.3246      0.6049    0.0064
Indicator-based
  IBPR                  0.0705  0.3246      0.6049    0.0064
  PDBR                  0.0609  0.3371      0.6020    0.0224
  DBR                   0.0932  0.3214      0.5854    0.0272
Continuous
  MIBR                  0.0740  0.3198      0.6061    0.0011
  FCS                   0.0709  0.3247      0.6044    0.0064
  CPR                   0.0705  0.3247      0.6048    0.0066
  Ordinal               0.0713  0.3255      0.6032    0.0070
  Calibration           0.0711  0.3255      0.6035    0.0071
  Crude                 0.0611  0.3741      0.5648    0.0692
[Bar chart omitted. RMSE increases left to right: DBR, IBPR, PDBR, CCA, FCS, MIBR, Cal, CPR, Ord, Crude.]
Figure 7.4: RMSEs for β1 for a non-linear relationship under MCAR.
[Bar chart omitted. RMSE increases left to right: DBR, PDBR, IBPR, CCA, MIBR, CPR, Cal, Ord, Crude, FCS.]
Figure 7.5: RMSEs for β2 for a non-linear relationship under MCAR.
[Bar chart omitted. RMSE increases left to right: IBPR, DBR, PDBR, CCA, FCS, MIBR, CPR, Cal, Ord, Crude.]
Figure 7.6: RMSEs for β1 for a non-linear relationship under MAR.
[Bar chart omitted. RMSE increases left to right: Ord, CPR, Cal, MIBR, Crude, FCS, PDBR, CCA, IBPR, DBR.]
Figure 7.7: RMSEs for β2 for a non-linear relationship under MAR.
[Bar chart omitted. Distance increases left to right: CCA, IBPR, Ord, CPR, MIBR, Cal, FCS, PDBR, DBR, Crude.]
Figure 7.8: Euclidean distances for a non-linear relationship under MCAR.
[Bar chart omitted. Distance increases left to right: MIBR, CCA, IBPR, FCS, CPR, Ord, Cal, PDBR, DBR, Crude.]
Figure 7.9: Euclidean distances for a non-linear relationship under MAR.
CHAPTER 8
Discussion and Conclusion
The aim of this study was to evaluate existing methods and develop new methods
of rounding categorical variables under MVNI. In Chapter 5 we introduced our new
method, proportional rounding, and compared its performance with existing round-
ing methods for binary variables. The results highlighted the clear benefits of using
a rounding method in conjunction with MVNI when imputing a binary confound-
ing variable. Adaptive rounding, proportional rounding and calibration produced
similar results and performed better than simple rounding, particularly when esti-
mating proportions. Calibration was the most difficult method to implement as it
involves duplicating the data set and performing two sets of imputations. Proportional
rounding has a similar intuitive appeal to calibration but takes, on average,
one third of the computation time.
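The threshold idea behind proportional rounding can be illustrated as follows. This Python sketch assumes the method chooses the rounding cut-off so that the imputed proportion of 1s matches the proportion observed among the complete cases; it is one reading of the Chapter 5 definition, not the thesis's actual Stata implementation:

```python
def proportional_round(imputed, complete_cases):
    """Round continuous imputations to 0/1 so that the proportion of 1s
    matches the proportion observed among the complete cases.
    Sketch only: the Chapter 5 definition may differ in detail."""
    p = sum(complete_cases) / len(complete_cases)  # observed proportion of 1s
    k = round(p * len(imputed))                    # number of 1s to assign
    # assign 1 to the k largest continuous imputations
    order = sorted(range(len(imputed)), key=lambda i: imputed[i], reverse=True)
    rounded = [0] * len(imputed)
    for i in order[:k]:
        rounded[i] = 1
    return rounded

observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]       # complete cases: 30% ones
imputations = [0.8, 0.4, -0.1, 0.55, 0.2, 1.1]  # continuous MVNI draws
rounded = proportional_round(imputations, observed)
```

Simple rounding, by contrast, always cuts at 0.5, which helps explain its poorer estimates of proportions.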
In Chapters 6 and 7 we compared existing rounding methods for ordinal vari-
ables with our new one-stage methods, continuous proportional rounding (CPR),
indicator-based proportional rounding (IBPR) and ordinal rounding. In contrast
to two-stage methods, such as mean indicator-based rounding (MIBR) and calibra-
tion, our new methods require only one set of imputations. The results indicated
that for a positive exposure-outcome relationship, the best estimates of regression
coefficients are obtained from methods that impute the ordinal variable as a single
continuous variable. CPR and ordinal rounding performed as well as or better than
existing continuous methods in terms of bias, RMSE and estimates of proportions.
The main advantages of the new methods are their ease of implementation and
increased computational speed compared to calibration and MIBR.
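The marginal-preserving idea shared by these continuous methods can be sketched as a quantile cut: rank the continuous imputations and place cut-points so that the imputed category proportions match the complete-case proportions. This Python sketch, with an assumed helper name `quantile_cut`, illustrates the general principle rather than reproducing the Chapter 6 algorithms:

```python
def quantile_cut(imputed, cc_props):
    """Assign ordinal categories 0..K-1 by ranking the continuous
    imputations and cutting at the complete-case category proportions.
    Sketch of the marginal-preserving principle only."""
    n = len(imputed)
    order = sorted(range(n), key=lambda i: imputed[i])
    # cumulative cut ranks: how many of the smallest values fill each category
    cuts, cum = [], 0.0
    for p in cc_props[:-1]:
        cum += p
        cuts.append(round(cum * n))
    cats = [0] * n
    cat = 0
    for rank, i in enumerate(order):
        while cat < len(cuts) and rank >= cuts[cat]:
            cat += 1
        cats[i] = cat
    return cats

# Complete-case proportions chosen to resemble Table 7.3's marginals
cats = quantile_cut([0.3, 1.9, 2.4, 0.9, 2.1, 1.4, 2.8, 1.1, 2.2, 2.6],
                    [0.1, 0.3, 0.6])
```

Crude rounding instead uses fixed thresholds, which is consistent with its poor estimates of proportions in Tables 7.3 and 7.4.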
In general, when the exposure-outcome relationship is non-linear, the best estimates
of regression coefficients are obtained using indicator-based rounding methods.
Our new method IBPR is recommended over existing methods because it preserves
the non-linear relationship as well as the marginal distribution of the ordinal
variable.
We note that CPR, IBPR and ordinal rounding performed well compared to
existing methods even when one of the ordinal categories had a very low prevalence
(less than 10%).
Our results showed that the performance of ordinal logistic regression (FCS) was
comparable to that of the continuous MVNI-based rounding methods for substantial
missingness in an ordinal exposure variable. However, MVNI is often easier to
implement in a general missing data setting. We note that FCS is susceptible to the
problem of perfect prediction when imputing categorical variables. MVNI does not
have this problem since the imputations are produced under a multivariate normal
distribution.
In this study, we examined our new rounding methods for covariates with two or
three categories. Enders [15, p.261] notes that “at an intuitive level, it is reasonable
to expect the effects of rounding to diminish as the number of ordinal response
options increases”. This is a fruitful area for further research.
A limitation of our new methods is that they assume the complete cases reasonably
approximate the true proportions in the data set; that is, the complete cases are
available at random (AAR). However, since AAR can hold for an MCAR, MAR
or MNAR mechanism, our new methods do not require an MCAR mechanism to
produce valid estimates of proportions. Although AAR may be regarded as a fairly
restrictive assumption, we note that even in settings where AAR did not hold, our
new methods performed well compared with existing methods.
Our simulation studies were based on a real data set and were designed to model
realistic missing data scenarios. However, we acknowledge that it may be difficult
to draw general conclusions on the basis of simulation studies.
Our findings confirmed the results of previous research, which showed that mul-
tiple imputation is not always superior to complete case analysis. While MVNI
had substantial benefits over complete case analysis for missingness in
a binary confounding variable, the results were less clear for missingness in an
ordinal variable of interest. Under an MAR mechanism, we found inconsistencies in the
performance of multiple imputation across levels of the ordinal exposure variable.
The reasons for this are not yet clear.
Further work is required to determine the settings in which multiple imputation
is likely to perform better than complete case analysis, particularly for missingness
in a covariate of interest.
Bibliography
[1] Aitkin, M., and Aitkin, I. A hybrid EM/Gauss-Newton algorithm for max-
imum likelihood in mixture distributions. Statistics and Computing 6 (1996),
127–130.
[2] Allison, P. Missing data. Sage, Newbury Park, CA, 2002.
[3] Allison, P. Imputation of categorical variables with PROC MI. In 2005 SAS
Users Group International Conference (2005).
[4] Andridge, R., and Little, R. A review of hot deck imputation for survey
non-response. International Statistical Review 78 (2010), 40–64.
[5] Arnold, B., Castillo, E., and Sarabia, J. Conditional specification of
statistical models. Springer-Verlag, New York, 1999.
[6] Bernaards, C., Belin, T., and Schafer, J. Robustness of a multivariate
normal approximation for imputation of incomplete binary data. Statistics in
Medicine 26 (2007), 1368–1382.
[7] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of
Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL, 2011.
[8] Carpenter, J., and Kenward, M. Multiple imputation and its application.
John Wiley & Sons, Chichester, UK, 2013.
[9] Cohen, J. Statistical power analysis for the behavioral sciences. Academic
Press, New York, 1977.
[10] Collins, L., Schafer, J., and Kam, C. A comparison of inclusive and
restrictive strategies in modern missing data procedures. Psychological Methods
6 (2001), 330–351.
[11] Demirtas, H. Rounding strategies for multiply imputed binary data. Biomet-
rical Journal 51 (2009), 677–688.
[12] Demirtas, H. A distance-based rounding strategy for post-imputation ordinal
data. Journal of Applied Statistics 37 (2010), 489–500.
[13] Demirtas, H., and Schafer, J. On the performance of random-coefficient
pattern-mixture models for non-ignorable drop-out. Statistics in Medicine 22
(2003), 2553–2575.
[14] Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from in-
complete data via the EM algorithm (with discussion). Journal of the Royal
Statistical Society 39 (1977), 1–38.
[15] Enders, C. Applied missing data analysis. The Guilford Press, New York,
2010.
[16] Galati, J., and Seaton, K. MCAR is not necessary for the complete cases to
constitute a simple random subsample of the target sample. Statistical Methods
in Medical Research (2013). DOI: 10.1177/0962280213490360.
[17] Galati, J., Seaton, K., Lee, K., Simpson, J., and Carlin, J. Round-
ing non-binary categorical variables following multivariate normal imputation:
evaluation of simple methods and implications for practice. Journal of Statisti-
cal Computation and Simulation (2012). DOI: 10.1080/00949655.2012.727815.
[18] Graham, J. Missing data analysis and design. Springer, New York, 2012.
[19] Graham, J., Hofer, S., and MacKinnon, D. Maximising the usefulness of
data obtained with planned missing value patterns: an application of maximum
likelihood procedures. Multivariate Behavioural Research 31 (1996), 197–218.
[20] Graham, J., Olchowski, A., and Gilreath, T. How many imputations
are really needed? Some practical clarifications of multiple imputation theory.
Prevention Science 8 (2007), 206–213.
[21] Graham, J., Taylor, B., Olchowski, A., and Cumsille, P. Planned
missing data designs in psychological research. Psychological Methods 11
(2006), 323–343.
[22] Heckman, J. Sample selection bias as a specification error. Econometrica 47
(1979), 153–161.
[23] Hedeker, D., and Gibbons, R. Application of random-effects pattern-
mixture models for missing data in longitudinal studies. Psychological Methods
2 (1997), 64–78.
[24] Horton, N., Lipsitz, S., and Parzen, M. A potential for bias when
rounding in multiple imputation. The American Statistician 57 (2003), 229–
232.
[25] Horvitz, D., and Thompson, D. A generalization of sampling without
replacement from a finite universe. Journal of the American Statistical Associ-
ation 47 (1952), 663–685.
[26] Hosmer, D., and Lemeshow, S. Applied logistic regression. John Wiley &
Sons, Hoboken, NJ, 2000.
[27] Lee, K., and Carlin, J. Multiple imputation for missing data: fully condi-
tional specification versus multivariate normal imputation. American Journal
of Epidemiology 171 (2010), 624–632.
[28] Lee, K., and Carlin, J. Recovery of information from multiple imputation:
a simulation study. Emerging Themes in Epidemiology 9 (2012).
[29] Lee, K., Galati, J., Simpson, J., and Carlin, J. Comparison of methods
for imputing ordinal data using multivariate normal imputation: a case study
of non-linear effects in a large cohort study. Statistics in Medicine 31 (2012),
4164–4174.
[30] Little, R. Missing data adjustments in large surveys. Journal of Business
and Economic Statistics 6 (1988), 287–296.
[31] Little, R. Pattern-mixture models for multivariate incomplete data. Journal
of the American Statistical Association 88 (1993), 125–134.
[32] Little, R., and Rubin, D. Statistical analysis with missing data. John Wiley
& Sons, Hoboken, NJ, 1987.
[33] Little, R., and Rubin, D. Statistical analysis with missing data (2nd ed.).
John Wiley & Sons, Hoboken, NJ, 2002.
[34] Louis, T. Finding the observed information matrix when using the EM algo-
rithm. Journal of the Royal Statistical Society Series B 44 (1982), 226–233.
[35] McLachlan, G., and Krishnan, T. The EM algorithm and extensions.
John Wiley & Sons, Hoboken, NJ, 2008.
[36] Meng, X.-L. Multiple imputation inferences with uncongenial sources of in-
put. Statistical Science 9 (1994), 538–558.
[37] Meng, X.-L., and Rubin, D. Using EM to obtain asymptotic variance-
covariance matrices: the SEM algorithm. Journal of the American Statistical
Association 86 (1991), 899–909.
[38] Molenberghs, G., and Kenward, M. Missing data in clinical studies.
John Wiley & Sons, Hoboken, NJ, 2007.
[39] Moons, K., Donders, R., Stijnen, T., and Harrell Jr., F. Using the
outcome for imputation of missing predictor values was preferred. Journal of
Clinical Epidemiology 59 (2006), 1092–1101.
[40] Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solen-
berger, P. A multivariate technique for multiply imputing missing values
using a sequence of regression models. Survey Methodology 27 (2001), 85–95.
[41] Redner, R., and Walker, H. Mixture densities, maximum likelihood and
the EM algorithm. SIAM Review 26 (1984), 195–239.
[42] Rubin, D. Inference and missing data. Biometrika 63 (1976), 581–592.
[43] Rubin, D. Multiple imputations in sample surveys — a phenomenological
Bayesian approach to nonresponse. Proceedings of the Survey Research Methods
Section of the American Statistical Association (1978), 30–34.
[44] Rubin, D. Multiple imputation for nonresponse in surveys. John Wiley &
Sons, Hoboken, NJ, 1987.
[45] Rubin, D. Multiple imputation after 18+ years. Journal of the American
Statistical Association 91 (1996), 473–489.
[46] Schafer, J. Analysis of incomplete multivariate data. Chapman & Hall/CRC,
Boca Raton, FL, 1997.
[47] Schafer, J. Multiple imputation: a primer. Statistical methods in medical
research 8 (1999), 3–15.
[48] Schafer, J. Multiple imputation in multivariate problems when the imputa-
tion and analysis models differ. Statistica Neerlandica 57 (2003), 19–35.
[49] Schafer, J., and Graham, J. Missing data: our view of the state of the
art. Psychological Methods 7 (2002), 147–177.
[50] Schenker, N., and Taylor, J. Partially parametric techniques for multiple
imputation. Computational Statistics & Data Analysis 22 (1996), 425–446.
[51] Seaman, S., and White, I. Review of inverse probability weighting for
dealing with missing data. Statistical Methods in Medical Research 22 (2011),
278–295.
[52] Spratt, M., Carpenter, J., Sterne, J., Carlin, J., Heron, J., Henderson, J.,
and Tilling, K. Strategies for multiple imputation in longitudinal studies.
American Journal of Epidemiology 172 (2010), 478–487.
[53] StataCorp. Stata Multiple Imputation Reference Manual: Release 12. Stata-
Corp LP, 2011.
[54] StataCorp. Stata: Release 12. Statistical Software. StataCorp LP, 2011.
[55] Sterne, J., White, I., Carlin, J., Spratt, M., Royston, P., Kenward, M.,
Wood, A., and Carpenter, J. Multiple imputation for missing data in
epidemiological and clinical research: potential and pitfalls. British Medical
Journal 338 (2009), b2393.
[56] Tanner, M., and Wong, W. The calculation of posterior distributions
by data augmentation (with discussion). Journal of the American Statistical
Association 82 (1987), 528–550.
[57] van Buuren, S. Multiple imputation of discrete and continuous data by fully
conditional specification. Statistical Methods in Medical Research 16 (2007),
219–242.
[58] White, I., Daniel, R., and Royston, P. Avoiding bias due to perfect
prediction in multiple imputation of incomplete categorical variables. Compu-
tational Statistics & Data Analysis 54 (2010), 2267–2275.
[59] White, I., Royston, P., and Wood, A. Multiple imputation using chained
equations: issues and guidance for practice. Statistics in Medicine 30 (2011),
377–399.
[60] Yucel, R., He, Y., and Zaslavsky, A. Using calibration to improve
rounding in imputation. The American Statistician 62 (2008), 125–129.
[61] Yucel, R., He, Y., and Zaslavsky, A. Gaussian-based routines to impute
categorical variables in health surveys. Statistics in Medicine 30 (2011), 3447–
3460.