Improved rounding methods for
binary and ordinal variables under
multivariate normal imputation
Milena A. Jacobs
This thesis is presented for the degree of
Doctor of Philosophy
of The University of Western Australia
Department of Mathematics & Statistics.
August, 2015
Abstract
Missing data are common in epidemiological studies. Multivariate normal imputa-
tion (MVNI) is a popular method of handling missing data that imputes missing
values assuming a multivariate normal distribution. This presents a dilemma when
imputing categorical variables as these are not normally distributed. Should the
continuous imputations be rounded, and if so, which rounding method should be
used?
The objective of this study is to evaluate and compare existing methods and
develop new methods of rounding categorical variables under MVNI. We focus on
missingness in covariates rather than outcome variables. This is because MVNI
generally has little or no benefit over complete case analysis if missingness is in an
outcome variable only.
A number of different rounding methods have been proposed for binary variables,
including simple rounding, adaptive rounding and calibration. However, no studies
to date have compared adaptive rounding with calibration. We performed a large
simulation study in Stata to compare the above rounding methods with unrounded
MVNI, and with a new method that we developed called proportional rounding.
Proportional rounding produced similar results to adaptive rounding and calibration
but was faster and easier to implement.
To date, several rounding methods have been proposed for ordinal variables.
Distance-based rounding (DBR) and projected distance-based rounding (PDBR)
are indicator-based methods, while crude rounding, calibration and mean indicator-
based rounding (MIBR) are continuous methods. Previous studies have demon-
strated the inadequacy of DBR, PDBR and crude rounding for rounding categorical
variables with up to seven categories. Calibration and MIBR perform well in some
settings but they are two-stage methods that are time-consuming to implement, par-
ticularly for large data sets. An alternative method of imputing ordinal variables is
fully conditional specification (FCS). There have been no studies to date comparing
FCS with MVNI-based rounding methods for ordinal exposure variables.
We performed a comprehensive simulation study in Stata to compare FCS with
MVNI-based rounding methods for ordinal variables and with our new methods,
continuous proportional rounding (CPR) and indicator-based proportional rounding
(IBPR). These were also compared with ordinal rounding, another new method we
developed. CPR, IBPR and ordinal rounding performed as well as or better than
the other rounding methods in terms of bias, RMSE and estimates of proportions.
The main advantages of the three new methods are their computational speed and
ease of implementation compared to calibration and MIBR.
Epidemiological studies often examine the effect of levels of an ordinal expo-
sure variable on an outcome. It is therefore important to handle missing data in a
way that preserves relationships between the variables in the data set and leads to
statistically valid inferences. Currently, there are no methods for rounding ordinal
variables that preserve marginal proportions as well as associations for a non-linear
exposure-outcome relationship. Our new method IBPR is recommended over exist-
ing methods as it preserves the non-linear relationship and the marginal distribution
of the ordinal variable.
Key Words: binary, categorical, fully conditional specification, missing data,
missingness mechanisms, multiple imputation, MVNI, ordinal, rounding.
Contents
Abstract iii
List of Tables ix
List of Figures xi
Glossary xiii
Acknowledgements xv
Preface xvii
0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
0.2 Original work in this thesis . . . . . . . . . . . . . . . . . . . . . . . . xvii
0.3 Thesis organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii
0.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
1 Introduction to Missing Data 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Types of missing data . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Missing data patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Missing data mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5.1 Missing at random (MAR) . . . . . . . . . . . . . . . . . . . . 5
1.5.2 Missing completely at random (MCAR) . . . . . . . . . . . . 8
1.5.3 Missing not at random (MNAR) . . . . . . . . . . . . . . . . . 9
1.6 Available at random (AAR) . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Ignorability and the MAR assumption . . . . . . . . . . . . . . . . . 11
1.8 Planned missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Methods of handling missing data 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Complete case analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Pairwise deletion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Single imputation methods . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Mean imputation . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Hot deck imputation . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Obtaining standard errors from the EM algorithm . . . . . . . 22
2.5.2 Using a hybrid method to accelerate convergence . . . . . . . 23
2.6 Multiple imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Fully conditional specification . . . . . . . . . . . . . . . . . . . . . . 25
2.8 Predictive mean matching . . . . . . . . . . . . . . . . . . . . . . . . 27
2.9 Inverse probability weighting . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 Methods for data missing not at random . . . . . . . . . . . . . . . . 29
2.10.1 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.10.2 Pattern mixture model . . . . . . . . . . . . . . . . . . . . . . 31
2.10.3 Issues associated with MNAR data . . . . . . . . . . . . . . . 33
3 Multiple Imputation 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Imputation phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Analysis and pooling phases . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Multivariate normal imputation . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 The I-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 The P-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 Prior distributions . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.6 Obtaining the m imputed data sets . . . . . . . . . . . . . . . 46
3.5 Comparison with ML estimation . . . . . . . . . . . . . . . . . . . . . 46
3.6 Specifying the imputation model . . . . . . . . . . . . . . . . . . . . . 47
3.7 Number of imputations . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4 Exploratory Data Analysis 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 The NHANESIII data set . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Relationship between high blood pressure and other variables . . . . . 58
5 Rounding methods for binary variables 61
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Rounding methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.1 Simple Rounding . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.2 Adaptive rounding . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.3 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Proportional rounding: a new rounding method . . . . . . . . . . . . 68
5.4 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5.1 Missingness models . . . . . . . . . . . . . . . . . . . . . . . . 70
5.5.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5.3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6 Rounding methods for ordinal variables 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Existing indicator-based methods . . . . . . . . . . . . . . . . . . . . 85
6.2.1 Projected distance-based rounding . . . . . . . . . . . . . . . 86
6.2.2 Distance-based rounding . . . . . . . . . . . . . . . . . . . . . 86
6.3 Comparison of DBR and PDBR . . . . . . . . . . . . . . . . . . . . . 88
6.4 Existing continuous methods . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.1 Crude rounding . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.4.2 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4.3 Mean indicator-based rounding . . . . . . . . . . . . . . . . . 92
6.5 Continuous proportional rounding . . . . . . . . . . . . . . . . . . . . 93
6.6 Indicator-based proportional rounding . . . . . . . . . . . . . . . . . 94
6.7 Ordinal rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.8 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 96
6.9 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.9.1 Missingness Models . . . . . . . . . . . . . . . . . . . . . . . . 99
6.9.2 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.9.3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . 100
6.10 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Rounding ordinal variables: non-linear relationship 113
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2 Substantive analysis model . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8 Discussion and Conclusion 127
Bibliography 131
List of Tables
4.1 Summary statistics for the full data set (n = 16963). . . . . . . . . . 52
4.2 High blood pressure by sex. . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 High blood pressure by race. . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 High blood pressure by smoking category. . . . . . . . . . . . . . . . 60
5.1 The original, duplicated and stacked data sets for calibration prior to
imputation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 True values of race coefficient β5, its standard error and proportion p
of overweight subjects for each data set. . . . . . . . . . . . . . . . . 70
5.3 Comparison of rounding methods for binary variables under MCAR. . 79
5.4 Comparison of rounding methods for binary variables under MAR. . . 80
5.5 Comparison of rounding methods for binary variables under MNAR. . 81
6.1 Estimates of coefficients β1 and β2 under MCAR. . . . . . . . . . . . 104
6.2 Estimates of coefficients β1 and β2 under MAR. . . . . . . . . . . . . 105
6.3 Estimates of proportions in each category under MCAR. . . . . . . . 106
6.4 Estimates of proportions in each category under MAR. . . . . . . . . 106
7.1 Estimates of coefficients β1 and β2 for a non-linear exposure-outcome
relationship under MCAR. . . . . . . . . . . . . . . . . . . . . . . . . 120
7.2 Estimates of coefficients β1 and β2 for a non-linear exposure-outcome
relationship under MAR. . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3 Estimates of proportions in each category for a non-linear exposure-
outcome relationship under MCAR. . . . . . . . . . . . . . . . . . . . 122
7.4 Estimates of proportions in each category for a non-linear exposure-
outcome relationship under MAR. . . . . . . . . . . . . . . . . . . . . 122
List of Figures
1.1 An example of MAR-linear missingness . . . . . . . . . . . . . . . . . 7
1.2 An example of MAR-convex missingness . . . . . . . . . . . . . . . . 8
4.1a Histogram of the variable age. . . . . . . . . . . . . . . . . . . . . . . 54
4.1b Boxplot of the variable age. . . . . . . . . . . . . . . . . . . . . . . . 54
4.2a Histogram of the variable weight (in kilograms). . . . . . . . . . . . . 55
4.2b Boxplot of the variable weight (in kilograms). . . . . . . . . . . . . . 55
4.3a Histogram of the variable height (in cm). . . . . . . . . . . . . . . . . 56
4.3b Boxplot of the variable height (in cm). . . . . . . . . . . . . . . . . . 56
4.4a Histogram of BMI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4b Boxplot of BMI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Boxplots of BMI by high blood pressure category. . . . . . . . . . . . 59
4.6 Boxplot of age by high blood pressure category. . . . . . . . . . . . . 59
5.1 Adaptive rounding thresholds for 0 < ω < 1. . . . . . . . . . . . . . . 67
5.2 Overview of simulations comparing methods for binary variables. . . . 73
6.1 Proportion by weight category in the full data set (n = 16963). . . . . 97
6.2 Proportion of observations with high blood pressure by weight cate-
gory in the full data set (n = 16963). . . . . . . . . . . . . . . . . . . 98
6.3 Odds of high blood pressure by weight category in the full data set
(n = 16963). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 RMSEs for β1 under MCAR. . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 RMSEs for β2 under MCAR. . . . . . . . . . . . . . . . . . . . . . . . 107
6.6 RMSEs for β1 under MAR. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.7 RMSEs for β2 under MAR. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.8 Euclidean distances under MCAR. . . . . . . . . . . . . . . . . . . . 109
6.9 Euclidean distances under MAR. . . . . . . . . . . . . . . . . . . . . 109
7.1 Proportion by weight category in the data set with n = 5000. . . . . . 115
7.2 Proportion of observations with high blood pressure by weight cate-
gory in the data set with n = 5000. . . . . . . . . . . . . . . . . . . . 116
7.3 Odds of high blood pressure by weight category in the data set with
n = 5000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 RMSEs for β1 for a non-linear relationship under MCAR. . . . . . . . 123
7.5 RMSEs for β2 for a non-linear relationship under MCAR. . . . . . . . 123
7.6 RMSEs for β1 for a non-linear relationship under MAR. . . . . . . . . 124
7.7 RMSEs for β2 for a non-linear relationship under MAR. . . . . . . . . 124
7.8 Euclidean distances for a non-linear relationship under MCAR. . . . . 125
7.9 Euclidean distances for a non-linear relationship under MAR. . . . . 125
Glossary
AAR available at random
CCA complete case analysis
CPR continuous proportional rounding
DBR distance-based rounding
FCS fully conditional specification
IBPR indicator-based proportional rounding
MAR missing at random
MCAR missing completely at random
MI multiple imputation
MIBR mean indicator-based rounding
ML maximum likelihood
MNAR missing not at random
MVNI multivariate normal imputation
PDBR projected distance-based rounding
Acknowledgements
First and foremost, I would like to thank my supervisor, Dr R. Nazim Khan, for his
valued advice and support. In addition I would like to thank the people that offered
me support and encouragement throughout my thesis, especially my partner Symon
Aked and my parents Melanie and Vlad.
A special thanks goes to Dr Robin K. Milne for his valuable comments that
helped to improve this manuscript. I would also like to acknowledge Prof Nicholas
de Klerk, from the Telethon Kids Institute, who encouraged me to study missing
data problems. In addition I would like to thank those that helped me in times of
stress, in particular the contributions of Mango.
This study was supported by the following scholarships: Australian Postgraduate
Award (APA), UWA Safety Net Top-Up Scholarship, Bruce and Betty Green Post-
graduate Research Top-Up Scholarship, and the Telethon Kids Institute AREST CF
Postgraduate Top-Up Scholarship.
Preface
0.1 Introduction
Multivariate normal imputation (MVNI) is a method of multiple imputation
that accommodates a general missing data pattern with missingness across different
types of variables. It ‘fills in’ or imputes missing values assuming a multivariate
normal distribution for the data. There are two important issues to consider when
using MVNI to impute categorical variables. The first is how the categorical variable
will be imputed. Nominal (unordered) variables are imputed as a set of indicator
variables. However, ordinal variables may be imputed either as a single ‘continuous’
variable or as a set of indicator variables.
Since MVNI assumes multivariate normality, all the imputed values are on a
continuous scale regardless of whether an indicator-based or continuous approach is
used. Therefore the second issue is how each imputed value is assigned to one of
the relevant categories, a process referred to as ‘rounding’. Although it is possible
to use the unrounded imputed values, this is not viable if the substantive analysis
involves estimating the relationship between the levels of an ordinal variable and an
outcome [29].
The objective of this study is to evaluate and compare existing methods and
develop new methods of rounding categorical variables under MVNI. To compare
the methods, we performed large scale simulation studies in Stata using data derived
from the NHANESIII data set [26, Chapter 6].
0.2 Original work in this thesis
The original contributions in this thesis are as follows.
1. The major original contribution is the development of three new methods for
rounding categorical variables under MVNI:
(i) continuous proportional rounding (CPR), a method for binary and ordi-
nal variables;
(ii) indicator-based proportional rounding (IBPR), a method for ordinal or
nominal variables; and
(iii) ordinal rounding, a method for ordinal variables.
The key advantages of these compared to existing methods are their ease of
implementation and computational speed.
None of the previous methods for rounding ordinal variables preserve both
marginal proportions and associations for a non-linear exposure-outcome re-
lationship. Using simulations, we show that IBPR preserves the non-linear
relationship as well as the marginal distribution of the ordinal variable.
2. We perform large scale simulation studies to compare adaptive rounding with
the calibration method for rounding binary variables.
3. We compare fully conditional specification (FCS) with MVNI-based rounding
methods when the substantive analysis involves estimating the relationship
between the levels of an ordinal exposure and an outcome.
4. A comprehensive survey of the literature and methods is also presented.
All the simulations in this study were performed using Stata [54] statistical soft-
ware, which has convenient inbuilt functions for performing large scale simulations
and multiple imputation.
0.3 Thesis organisation
The thesis is organised into eight chapters. The first chapter provides an intro-
duction to missing data, including types of missing data, missing data patterns and
missingness mechanisms. In Chapter 2 we discuss methods of missing data handling,
from traditional methods such as complete case analysis and pairwise deletion to
modern methods such as multiple imputation and maximum likelihood estimation.
Chapter 3 provides a detailed description of multiple imputation and, in particular,
MVNI. The data set and variables used in the study are described in Chapter 4.
The original work in this thesis is contained in Chapters 5, 6 and 7. In Chap-
ter 5 we examine methods for rounding binary variables under MVNI and introduce
our new method, proportional rounding. In Chapter 6 we compare MVNI-based
rounding methods for ordinal variables with fully conditional specification (FCS),
and introduce our new methods CPR, IBPR and ordinal rounding. A comparison
of FCS with MVNI-based rounding methods for a non-linear exposure-outcome re-
lationship is presented in Chapter 7. Finally, Chapter 8 is devoted to discussion and
conclusions.
0.4 Publications
A paper on CPR and ordinal rounding is undergoing final editing for submission.
A second paper on IBPR is under preparation. A third survey paper on rounding
methods is planned.
CHAPTER 1
Introduction to Missing Data
1.1 Introduction
Missing data are encountered in many research contexts, including medical stud-
ies and the social and behavioural sciences. Multivariate data sets often contain
substantial missing data for reasons such as attrition, nonresponse and errors in
data collection. Two main problems are associated with missing data: biased pa-
rameter estimates and loss of efficiency [8, p.9]. Loss of efficiency is a result of fewer
observations being available for analysis. The extent of the loss depends on the
type of analysis being undertaken as well as the proportion of missing data [8, p.9].
Missing data may also lead to biased parameter estimates if the observed values are
not representative of the full data set. It is therefore important to handle missing
data in a way that preserves relationships between the variables in the data set and
leads to statistically valid inferences. The primary analysis of interest is generally
referred to as the substantive analysis to distinguish it from the model(s) used to
handle missing data.
The outline of this chapter is as follows. In Section 1.2 we provide an overview
of two broad types of missing data: item nonresponse and wave nonresponse. Sec-
tion 1.3 defines the notation. In Section 1.4 we discuss missing data patterns, which
may be broadly classified into six different types. Section 1.5 describes the three
types of missing data mechanisms: missing at random (MAR), missing completely
at random (MCAR) and missing not at random (MNAR). In Section 1.6 we discuss
a special condition known as available at random (AAR). The concept of ‘ignorabil-
ity’ under an MAR mechanism is described in Section 1.7. The chapter concludes
with a discussion in Section 1.8 of planned missing data.
1.2 Types of missing data
Two types of missing data are described in the literature: item nonresponse
and wave nonresponse [18, p.4]. Item nonresponse occurs when a subject does not
respond to an item in a survey. There are many reasons for this type of nonresponse.
The respondent may not know the answer to the question(s) or may have run out
of time to complete the survey. Respondents may be uncomfortable disclosing the
answer to sensitive questions, for example questions about drug and alcohol use or
infidelity. The data may also be missing as a result of data collection/recording
errors or equipment malfunction.
Wave nonresponse relates to longitudinal data, collected at two or more time
points called waves. This type of nonresponse is a result of a subject not partic-
ipating in the survey at a particular wave, perhaps due to relocation or personal
circumstances. Sometimes the nonresponse may be related to the treatment itself,
for example in a drug study where the respondent experiences an adverse reaction.
There are two types of wave nonresponse. In the first, the respondent is absent
from a wave but returns to complete subsequent waves. The second type is known
as attrition or drop out, where the respondent is absent from a wave and does not
return to the study; this type is generally more problematic [18, p.9].
1.3 Notation
Let X = (xij) denote an n × k data set (n cases with k variables), where xij
corresponds to the value of variable j for case i. Let vectors Xmis and Xobs denote
the missing and observed components of X respectively. We denote the parameters
of the substantive analysis model by β.
For example, suppose X is a data set with 3 cases and 2 variables, given by
X =
x11 x12
x21 x22
x31 x32
.

Suppose that variable X1 is fully observed and variable X2 is incomplete with
missing data on x12 and x32. Then Xobs = (x11, x21, x22, x31) and Xmis = (x12, x32).
Define the n × k missing data indicator matrix R such that Rij = 1 if xij is
observed and 0 otherwise, i = 1, 2, . . . , n and j = 1, 2, . . . , k. In the example above,
the missing data matrix R is given by
R =
1 0
1 1
1 0
Let Ri be the row vector corresponding to row i of R. Then Ri is the missing
data indicator vector for case i. The complete cases have Ri = 1 = (1, 1, . . . , 1), a
k-vector of ones. The incomplete cases have Rij = 0 for at least one j = 1, 2, . . . , k.
In the example above, case 1 has missingness on variable X2 so R1 = (1, 0).
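For illustration only (the thesis itself uses Stata), the indicator matrix R for the 3 × 2 example above can be computed from a data array with missing entries coded as NaN. The observed values in this Python/NumPy sketch are arbitrary placeholders, not data from the study:

```python
import numpy as np

# The 3 x 2 example above: X2 is missing (NaN) for cases 1 and 3.
# The observed entries are arbitrary placeholder values.
X = np.array([[1.2, np.nan],
              [0.7, 3.4],
              [2.1, np.nan]])

# R_ij = 1 if x_ij is observed, 0 otherwise.
R = (~np.isnan(X)).astype(int)

# Complete cases are the rows of R equal to a vector of ones.
complete = np.all(R == 1, axis=1)
```

Here R recovers the matrix shown above, with its first row (1, 0) flagging the missingness on x12.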
1.4 Missing data patterns
A missing data pattern refers to the arrangement of observed and missing values
in a data set. If X consists of k variables, there are potentially up to 2^k missing
data patterns. For example, if X consists of two variables, X1 and X2, there are
potentially 4 missing data patterns:
1. cases that are complete for X1 and X2;
2. cases with missingness on X1 only;
3. cases with missingness on X2 only;
4. cases with missingness on both X1 and X2.
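The distinct rows of the indicator matrix R enumerate the patterns actually present in a data set. A small Python/NumPy sketch (the data are illustrative, not from the thesis; for k = 2, all four possible patterns happen to occur here):

```python
import numpy as np

# Illustrative 5 x 2 data set in which every pattern for k = 2 occurs.
X = np.array([[1.0, 2.0],
              [np.nan, 2.0],
              [1.0, np.nan],
              [np.nan, np.nan],
              [3.0, 4.0]])

R = (~np.isnan(X)).astype(int)

# Each distinct row of R is one missing data pattern;
# counts gives the number of cases exhibiting each pattern.
patterns, counts = np.unique(R, axis=0, return_counts=True)
```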
In general, missing data patterns are classified into six different types [15, p.4].
These are discussed below.
1. Univariate — Data are missing for only one variable. This was one of the first
missing data patterns to be addressed in the literature.
2. Unit nonresponse — This occurs in sample surveys where a subset of indi-
viduals does not complete the questionnaire. The incomplete variables are
the unanswered items and the fully observed variables are the survey design
variables measured for both respondents and nonrespondents [33, p.5].
3. Monotone — The variables can be arranged so that if variable Xj, j =
1, . . . , k− 1, is missing for a case then variables Xj+1, . . . , Xk are also missing
for that case. This pattern describes attrition in longitudinal studies where
subjects drop out before the end of the study. Monotone missing data patterns
simplify missing data handling since they do not require iterative estimation
algorithms.
4. General — A ‘haphazard’ pattern of missingness across variables.
5. Planned — An example is the three-form questionnaire design described by
Graham et al. [19], discussed in Section 1.8.
6. Latent variable — In a latent variable analysis, there is a set of observed
‘manifest’ variables and a set of unobserved ‘latent’ variables. This is a special
type of missing data pattern since the latent variables are unobservable but
are conceptualised as ‘missing data’.
1.5 Missing data mechanisms
The missing data mechanism describes the relationship between the probability
of missingness and the variables in the data set. That is, it specifies the conditional
distribution of R given X, denoted by p(R | X, ξ), where ξ represents some un-
known model parameters [33, p.12]. The notation p() refers to a probability mass
function or probability density function, as appropriate. Rubin [42] describes three
different types of missingness mechanisms: missing at random (MAR), missing com-
pletely at random (MCAR) and missing not at random (MNAR).
1.5.1 Missing at random (MAR) Here the missingness may depend on
the observed data but not on the missing values themselves, that is
p(R |X, ξ) = p(R |Xobs, ξ). (1.1)
For example, less educated respondents may be less likely to answer survey questions
about political preferences. The missingness is therefore dependent on one or more of
the observed variables (education) but not on the incomplete variable itself (political
preference).
Reading speed is a classic example of MAR missingness [18, p.13]. Slower readers
may leave items blank because they have run out of time to complete the survey.
However, reading speed is a variable that can be measured and incorporated into
the missing data handling procedure to adjust for any bias due to the nonresponse.
It is important to note that data are MAR only if there is no relationship between
the incomplete variable and the probability of missingness after controlling for the
other observed variables in the data set. It is not usually possible to determine if
data are MAR without knowing the values of the missing data [15, p.6].
Schafer [46, pp.20–22] describes some situations where an MAR mechanism is
known to hold.
1. Double sampling in sample surveys. Here characteristics X1, X2, . . . , Xp are
recorded for all subjects in the sample, while characteristics Xp+1, . . . , Xk
are recorded only for a subsample of subjects. If this subsample is chosen
based entirely on the observed values X1, X2, . . . , Xp, then the missing values
Xp+1, . . . , Xk for the subjects not included in the subsample will be MAR.
2. Nonresponse follow-up. On follow-up, if responses are obtained from a random
sample of subjects who had previously not responded, then the missingness
mechanism for the remaining subjects that were not followed up is MAR.
3. Experiments with unbalanced designs. In an unbalanced experimental design,
the sample sizes for the treatment combinations are not all equal. The ‘missing’
data have a probability of missingness equal to one and are therefore MAR.
4. Medical studies with multiple tests where not all tests are administered to all
subjects. In some medical studies, not all tests are given to all subjects. The
missing data will be MAR, provided that all information that is used to select
the samples is included in the observed data.
5. Matrix sampling of questionnaire items. If a questionnaire is divided into
sections that are given to subjects in a randomized manner, then data will be
missing for the sections of the questionnaire that were not given to some of
the subjects. The missing data will be MAR, provided that all the variables
used in the sampling process are included in the observed data.
Collins et al. [10] describe three types of MAR missingness. The first type is
called MAR-linear, in which the probability of missingness on the variable Y is a
linear function of another measured variable Z. For example,
Pr(Ymis | Z) =
0.2 if Z=1,
0.4 if Z=2,
0.6 if Z=3,
0.8 if Z=4.
That is, Pr(Ymis | Z) = 0.2Z as shown in Figure 1.1.
Figure 1.1: An example of MAR-linear missingness
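The MAR-linear mechanism above can be simulated directly: the probability that Y is missing depends only on the fully observed Z through Pr(Ymis | Z) = 0.2Z. A minimal Python/NumPy sketch (the thesis uses Stata; this version and its variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

Z = rng.integers(1, 5, size=n)      # fully observed cause, values 1..4
Y = Z + rng.normal(size=n)          # incomplete variable, related to Z

# MAR-linear: Pr(Y missing | Z) = 0.2 Z, depending on Z only.
missing = rng.random(n) < 0.2 * Z
Y_obs = np.where(missing, np.nan, Y)

# Empirical missingness rates by level of Z approximate 0.2, 0.4, 0.6, 0.8.
rates = [missing[Z == z].mean() for z in (1, 2, 3, 4)]
```

Because Z is fully observed, it can be included in the imputation model to adjust for the nonresponse, exactly as in the reading-speed example.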
The second type is called MAR-convex, in which the probability of missingness
on Y is a non-linear function of Z. For example,
Pr(Ymis | Z) =
0.8 if Z=1,
0.2 if Z=2,
0.2 if Z=3,
0.8 if Z=4,
as shown in Figure 1.2.
The third type is referred to as MAR-sinister, in which the probability of miss-
ingness on Y is a function of the correlation rXZ between X (another measured
variable) and Z (the cause of missingness). For example,
Pr(Ymis | X,Z) =
0.8 if rXZ is high,
0.2 if rXZ is low.
The interpretation of a ‘high’ or ‘low’ correlation depends upon the substantive
analysis.
Figure 1.2: An example of MAR-convex missingness
It is important to note that if the substantive analysis model and/or the missing
data analysis model do not incorporate the causes of missingness then the missing-
ness is not MAR.
MAR missingness is sometimes referred to as ignorable missingness [42]. How-
ever, as Graham [18, p.15] points out, this does not mean that the causes of missing-
ness can be ignored, only that the distribution of the missing data can be ignored.
If the variables related to the missingness are incorporated into the missing data
handling procedure, then there is no need to estimate the parameters of the missing
data distribution [42].
1.5.2 Missing completely at random (MCAR) In an MCAR mechanism,
the missingness is unrelated to any of the variables in the data set, that is
p(R |X, ξ) = p(R | ξ). (1.2)
For example, data may be missing as a result of equipment breakages, unexpected
personal events or administrative errors, none of which are related to the data. The
observed data is therefore a simple random sample of the full data set and the
missingness does not result in bias. Sometimes researchers employ planned missing
data designs to intentionally produce MCAR missingness. An example of this is the
three-form design, discussed in Section 1.8. Note that MCAR is a special case of
MAR.
1.5.3 Missing not at random (MNAR) Here the missingness is dependent
on unmeasured variables, that is
p(R |X, ξ) = p(R |Xobs,Xmis, ξ). (1.3)
For example, high income respondents may be less likely to answer survey questions
related to income, so missingness on the income variable is related to income itself.
This will result in bias since data is missing from the upper tail of the income
distribution. In general, it is not possible to determine if data are MNAR from the
observed data alone [15, p.8].
An MNAR mechanism can occur in two ways [15, p.14]. In the first, direct
MNAR, the probability of missingness is directly related to the incomplete vari-
able. In the second, indirect MNAR, the probability of missingness is related to
the incomplete variable indirectly through mutual correlation with an unmeasured
variable. A direct MNAR mechanism may produce substantial bias [10]. However,
bias is an issue for an indirect MNAR mechanism only if the correlation between
the unmeasured variable and the incomplete variable is “relatively strong” (abso-
lute value greater than 0.40) [10]. Note that an indirect MNAR mechanism becomes
an MAR mechanism when the unmeasured variable is included in the analysis [15,
p.15].
1.6 Available at random (AAR)
Until recently, it was generally accepted that an MCAR mechanism was necessary
for the complete cases to represent a simple random sample of the target sample
[42]. However, Galati & Seaton [16] proved that a less stringent condition, which
they refer to as available at random (AAR), is sufficient.
Using the notation in Section 1.3, the distribution of the complete cases is ob-
tained from the joint distribution p(X,R) by conditioning on the event Ri = 1,
that is,
p(X | Ri = 1), (1.4)
where 1 is a 1 × k vector of ones representing the response pattern for a complete
case. The MCAR mechanism implies that the complete cases form a simple random
sample of the target sample. A sufficient condition for this is that [16]
p(X = x | Ri = 1) = p(X = x), (1.5)
for all x. Now
p(X = x | Ri = 1) = p(X = x,Ri = 1) / p(Ri = 1)
                  = p(Ri = 1 |X = x) p(X = x) / p(Ri = 1). (1.6)
Substituting (1.5) in (1.6) yields
p(Ri = 1 |X = x) = p(Ri = 1). (1.7)
If the condition in (1.7) holds, the complete cases are referred to as available at
random (AAR) [16] with respect to the joint model p(X,R). That is, the probability
of a case being complete does not depend on X.
AAR is a less stringent condition than MCAR because it involves only one con-
straint (on missing data pattern Ri = 1) while MCAR involves constraints on up
to 2^k − 1 missing data patterns [16]. If there is only one incomplete variable in X
then MCAR and AAR are equivalent [16].
We note that, regardless of the number of incomplete variables, AAR and MCAR
will be equivalent if there are only two missing data patterns and some cases are
complete. For example, suppose we have variables X = (X1,X2,X3) and only two
missing data patterns (1, 1, 1) and (0, 0, 1). If AAR holds, then Pr(Ri = (1, 1, 1)) = α,
where α is a constant that does not depend on X. Since there are only two missing
data patterns, Pr(Ri = (0, 0, 1)) = 1 − α, which is also constant with respect to X. In this case,
AAR and MCAR are equivalent. This argument can be extended to a data set with
more than three variables.
Galati & Seaton [16] further demonstrated that AAR can hold for an MCAR,
MAR or MNAR mechanism provided that the probability of being a complete case
is constant. Thus if AAR holds, the complete cases form a simple random sample
of the target sample regardless of the missingness mechanism [16].
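The claim that AAR complete cases form a simple random sample of the target sample can be checked by simulation. The Python sketch below (the choice of three variables, two missing data patterns and α = 0.6 mirrors the example above but is otherwise arbitrary) makes the probability of being a complete case constant and compares complete-case summaries with full-data summaries.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X1 = rng.normal(0.0, 1.0, n)
X2 = 0.5 * X1 + rng.normal(0.0, 1.0, n)
X3 = rng.normal(2.0, 1.0, n)          # observed under both patterns

# Two missing data patterns: (1, 1, 1) with constant probability alpha,
# otherwise (0, 0, 1).  Because Pr(complete) does not depend on X, AAR holds.
alpha = 0.6
complete = rng.random(n) < alpha

# Complete cases behave like a simple random sample: their means and
# correlations match the full data (differences should be near zero).
print(round(X1[complete].mean() - X1.mean(), 3))
print(round(np.corrcoef(X1[complete], X2[complete])[0, 1]
            - np.corrcoef(X1, X2)[0, 1], 3))
```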
1.7 Ignorability and the MAR assumption
Little & Rubin [33, p.119] showed that if there is an MAR mechanism and
the parameters β and ξ are distinct, then likelihood-based inferences for β (the
parameters of the substantive analysis model) are not affected by ξ (the parameters
of the missing data distribution). This is referred to as ignorable missingness. A
loose definition of ‘distinct’ is that the value of β provides little information about
ξ and vice-versa [46, p.11]. A more precise definition of distinctness is given by
Schafer [46, p.11] as follows. From a frequentist perspective, the parameters are
distinct if the joint parameter space of (β, ξ) is the Cartesian cross-product of the
individual parameter spaces for β and ξ. From a Bayesian perspective, any joint
prior distribution applied to (β, ξ) must factor into independent marginal priors for
β and ξ [46, p.11].
Schafer [46, p.12] provides a concise proof of ignorability as follows. The joint
probability distribution of the observed data is given by
p(R,Xobs | β, ξ) = ∫ p(R,X | β, ξ) dXmis
                 = ∫ p(R |X, ξ) p(X | β) dXmis. (1.8)
Note that the integral is replaced by summation for discrete distributions.
Under an MAR assumption, p(R |X, ξ) = p(R |Xobs, ξ), so
p(R,Xobs | β, ξ) = p(R |Xobs, ξ) ∫ p(X | β) dXmis
                 = p(R |Xobs, ξ) p(Xobs | β). (1.9)
Thus the likelihood of the observed data under MAR factorises into two separate
components: a function depending on the parameters of interest β, and a function
depending on ξ, whose elements are regarded as ‘nuisance’ parameters. If both MAR and
distinctness hold, the parameters ξ of the missing data distribution are ignorable
for inferences on β [33, p.119].
Van Buuren [57, p.223] describes ignorability as “...the belief on the part of the
user that the available data are sufficient to correct for the effects of the missing
data”. In general, there is no definitive way to determine if data are MAR, but
an MAR assumption may be made more plausible by including, in the missing data
handling procedure, variables that are known to be correlated with the causes of
missingness and/or the incomplete variables [15, pp.16–17]. This is known as an
inclusive analysis strategy. Variables not of substantive interest but included in
the missing data handling procedure and/or analysis model are known as auxiliary
variables.
Ideally, missing data would be anticipated and planned for in the design of the
study to support the MAR assumption [18, p.38]. For example, variables that
may explain potential missingness should be included in the questionnaire. For
longitudinal studies, Schafer & Graham [49] recommend that respondents be asked
to report their likelihood of dropping out. In some cases, the missing data can be
‘converted’ to MAR. For example, missing data corresponding to nonresponse in
surveys may not be MAR. However, such missing data can be converted to MAR
by following up a random sample of nonrespondents [49].
1.8 Planned missing data
The planned missing data design intentionally produces MCAR missingness. It
allows researchers to collect data on all the variables of interest while reducing
the burden on respondents [15, p.23]. Missing data handling procedures, such as
multiple imputation (MI) and maximum likelihood estimation (MLE), can then be
used to analyse the data.
Graham et al. [19] developed the three-form design, which divides the question-
naire items into four sets denoted by X,A,B,C. The items in X are the questions
that are central to the research hypotheses. Three questionnaires (forms) are pre-
pared, each containing X and only two of A, B and C.
Graham [18, p.291] recommends having the same number of items in each of the
four sets, including X. For example, a questionnaire containing 200 items may be
split into four sets X, A, B and C, each containing 50 items. Each subject would
only answer 150 of the 200 items but the researcher would have information on all
200 items.
The order of the sets in each form is important [18, p.291]. Usually X is placed
first in each form since it contains the questions that are central to the research.
However, Graham [18, p.291] notes that it may sometimes be beneficial to place
some items belonging to X further along in the questionnaire. In each form, a
different set should be presented last so that respondents who do not complete the
form do not always leave the same items blank.
The main drawback of the planned missing data approach is a potential loss
of statistical power [15, p.24]. Graham et al. [21] discuss the impact of planned
missing data designs on power and the ways in which this may be mitigated. One
way to improve power is to slightly increase the number of variables in X, since this
set is common to all three forms. This increases the number of variable pairs with
complete data. Another way in which power may be improved is by considering
effect size (correlations between pairs of variables). Variables expected to produce a
small effect size have larger sample size requirements and should be placed in X to
maximise power. On the other hand, variables that are expected to produce a large
effect size have smaller sample size requirements and should be placed in A, B or
C. Using Cohen’s [9] guidelines, |ρ| = 0.10 is a small effect, |ρ| = 0.30 is a medium
effect and |ρ| = 0.50 is a large effect.
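A three-form design of this kind is easy to set up programmatically. The Python sketch below (hypothetical: 200 items, 900 subjects, random form assignment) builds the observation matrix for forms XAB, XAC and XBC and confirms that X items are observed for every subject while each A, B, C item is observed for two thirds of subjects, the missingness being MCAR by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# 200 hypothetical items split evenly into the four sets X, A, B, C.
sets = {"X": range(0, 50), "A": range(50, 100),
        "B": range(100, 150), "C": range(150, 200)}

# Each form contains X plus two of A, B, C.
forms = {1: ("X", "A", "B"), 2: ("X", "A", "C"), 3: ("X", "B", "C")}

n = 900
assignment = rng.integers(1, 4, size=n)        # forms assigned at random (MCAR)
observed = np.zeros((n, 200), dtype=bool)
for form, members in forms.items():
    idx = assignment == form
    for s in members:
        observed[np.ix_(idx, list(sets[s]))] = True

print(observed[:, :50].mean())   # X items observed for every subject: 1.0
print(observed[:, 50:].mean())   # each A, B, C item observed for 2/3 of subjects
```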
CHAPTER 2
Methods of handling missing data
2.1 Introduction
Missing data problems have been studied for almost a century [15]. Traditionally,
researchers used ‘ad hoc’ methods to handle missing data, such as deleting the
incomplete cases (deletion-based methods) or ‘filling in’ the missing values using
single imputation. In the 1970s, there were two major developments in missing data
theory: maximum likelihood (ML) estimation [14] and multiple imputation (MI)
[43]. These methods are currently regarded as ‘state of the art’ methods of handling
missing data [49].
In 1977, Dempster, Laird & Rubin (DLR) [14] published their seminal paper
on the Expectation-Maximisation (EM) algorithm, a maximum likelihood (ML) es-
timation method for a wide range of incomplete data problems, including missing
data, truncated distributions, censored and grouped data. The EM algorithm may
also be applied to other missing data paradigms, such as mixtures, log linear models
and latent variable models. Prior to the EM algorithm, ML estimates (MLEs) were
obtained using methods such as the Newton-Raphson, Fisher’s scoring and quasi-
Newton methods. The main advantage of the EM algorithm over Newton-type
methods is that it reformulates an incomplete data problem in terms of a complete
data problem which is computationally tractable [35, p.2].
A year before the DLR paper, Rubin [42] had published a paper describing
a methodological framework for modern missing data theory. This provided the
foundation for the development of multiple imputation by Rubin in 1978 [43] and
subsequent publications by Rubin [44], and Little & Rubin [32]. In 1987, Tanner
& Wong [56] published their work on data augmentation, which was later used to
implement multiple imputation in statistical software packages.
This chapter describes methods used to handle missing data and the advantages
and disadvantages of each method. In Sections 2.2–2.4 we describe traditional meth-
ods of handling missing data: complete case analysis, pairwise deletion and single
imputation methods. In Sections 2.5–2.9 we provide an overview of MAR-based
methods of missing data handling. Finally, in Section 2.10 we discuss two methods
for MNAR data: the selection model and the pattern mixture model.
2.2 Complete case analysis
Complete case analysis (CCA) involves discarding all cases that have missing
data and performing the analysis using only the cases that are fully observed. This
is the default method in most statistical software packages and a simplistic approach
still used by many practitioners. While easy to implement, it may exclude a large
proportion of the data set, resulting in biased parameter estimates and a substantial
loss of precision and power [55]. The extent of bias and loss of precision depends
on the fraction of complete cases, the pattern of missing data, the degree to which
complete and incomplete cases differ, and on the parameters of interest [33, p.42].
In general, CCA will be unbiased if at least one of the following holds:
1. The complete cases represent a simple random sample of the data set. Until
recently, it was believed that an MCAR mechanism was required for this to
hold. However, Galati & Seaton [16] demonstrated that a weaker condition,
which they call available at random (AAR), was sufficient (refer to Section 1.6
in Chapter 1).
2. The missing data occur only in an outcome variable that is measured once per
individual, provided that all the variables associated with the missingness
can be included as covariates [55].
3. The missing data occur in predictor variables and the reasons for missingness
are unrelated to the outcome variable [33, p.43].
The population mean µ for a variable with missing data may be expressed as
µ = πCC µCC + (1− πCC)µIC , (2.1)
where µCC is the population mean of the complete cases, µIC is the population mean
for the incomplete cases and πCC is the proportion of complete cases. The bias in
the complete case sample mean is then [33, p.43]
µCC − µ = µCC − πCC µCC − (1− πCC)µIC
        = (1− πCC)(µCC − µIC). (2.2)
If the complete cases are AAR, they do not differ systematically from the incomplete
cases (µCC = µIC) so the bias will be zero.
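Equations (2.1) and (2.2) can be verified numerically. The values below are hypothetical, chosen only to illustrate the identity.

```python
# Numerical check of (2.1) and (2.2) with hypothetical population values.
pi_cc = 0.7      # proportion of complete cases
mu_cc = 10.0     # population mean of the complete cases
mu_ic = 14.0     # population mean of the incomplete cases

mu = pi_cc * mu_cc + (1 - pi_cc) * mu_ic     # equation (2.1)
bias = mu_cc - mu                            # bias of the complete-case mean

# The bias matches (1 - pi_cc) * (mu_cc - mu_ic), as in (2.2).
print(round(mu, 3), round(bias, 3))
```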
2.3 Pairwise deletion
A variation of complete case analysis, known as pairwise deletion or available-
case analysis, eliminates cases with missing data depending on the analysis being
performed. This usually results in more data being retained than under complete
case analysis. For example, suppose there are two incomplete variables X and Y . To
calculate σ2X , the variance of X, pairwise deletion uses all the cases with complete
data on X, and similarly for σ2Y . However, to calculate the covariance, σXY , only
the cases with complete data on both X and Y are used. Thus a different set of
cases may be used to calculate each element of a covariance matrix. In contrast,
complete case analysis would use only the cases that have complete data on both X
and Y to calculate σ2X, σ2Y and σXY.
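The difference between the two strategies can be seen in a small numerical example. The following Python sketch (hypothetical data; three cases missing on each of X and Y) computes the pairwise deletion estimates and notes the differing case counts.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Hypothetical missingness: x missing for the first 3 cases, y for the last 3.
x_mis = np.array([True] * 3 + [False] * 9)
y_mis = np.array([False] * 9 + [True] * 3)
both = ~x_mis & ~y_mis                       # cases complete on both x and y

# Pairwise deletion: each quantity uses every case observed for it.
var_x_pw = np.var(x[~x_mis], ddof=1)         # 9 cases
var_y_pw = np.var(y[~y_mis], ddof=1)         # 9 cases
cov_xy_pw = np.cov(x[both], y[both])[0, 1]   # only the 6 jointly observed cases

# Complete case analysis uses the same 6 cases for every quantity.
var_x_cc = np.var(x[both], ddof=1)

print((~x_mis).sum(), both.sum())            # 9 6
```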
Pairwise deletion has several limitations.
1. It may produce biased parameter estimates if the complete cases are not AAR.
2. Using different sets of cases may produce nonpositive definite matrices, and in
particular, correlations with absolute values greater than 1. This may cause
problems in estimating model parameters, such as in multivariate regression
models that use a covariance matrix as input data [15, p.41].
3. Inconsistent sample sizes may cause problems when calculating standard errors
[15, p.41].
2.4 Single imputation methods
These methods impute or ‘fill in’ each missing value with a single replacement
value. The imputed data set is then analysed using standard complete-data statisti-
cal methods. In contrast to deletion-based methods, single imputation methods do
not discard incomplete observations. However, if the filled in data set is regarded
as truly ‘complete’, the estimated variance will not take into account the uncer-
tainty associated with the missing data [4]. Consequently, single imputation will
underestimate standard errors if corrective measures are not undertaken. Exam-
ples of single imputation methods are mean imputation and hot deck imputation,
discussed below.
2.4.1 Mean imputation Also known as mean substitution, this method re-
places each missing value for a variable with the arithmetic mean of the complete
cases for that variable. Since all the imputed values are equal and a measure of
central location, mean substitution underestimates variances, covariances and cor-
relations. This in turn produces biased parameter estimates, the bias increasing with
the rate of missing data [15, p.43]. Enders [15, p.43] concludes that “...simulation
studies suggest that mean imputation is possibly the worst missing data handling
method available. Consequently, in no situation is mean imputation defensible, and
you should absolutely avoid this approach”.
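The variance shrinkage under mean substitution is easy to demonstrate. In the Python sketch below (hypothetical data with roughly 30% MCAR missingness), about 30% of the imputed data set sits exactly at the mean, so the standard deviation falls by a factor of roughly sqrt(0.7).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=50.0, scale=10.0, size=10_000)
mis = rng.random(x.size) < 0.3               # ~30% MCAR missingness

x_imp = x.copy()
x_imp[mis] = x[~mis].mean()                  # mean substitution

# The imputed data set understates the spread: its standard deviation is
# well below that of the observed values.
print(round(x[~mis].std(ddof=1), 2), round(x_imp.std(ddof=1), 2))
```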
2.4.2 Hot deck imputation This method replaces missing values for a non-
respondent (the recipient) with observed values from a respondent (the donor) [4].
The donor is similar to the recipient with respect to a set of common characteristics.
A set of potential donors is referred to as the donor pool.
Andridge & Little [4] divide hot deck imputation methods into two groups de-
pending on how donors are selected:
1. random hot deck methods, where the donor is randomly selected from the donor
pool; and
2. deterministic hot deck methods, where a donor is selected based on some cri-
teria such as ‘nearest neighbour’.
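A minimal random hot deck can be sketched as follows (in Python; the region and income values are hypothetical, and a single observed class variable defines the donor pool).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical survey: income (in $1000s) missing for two respondents;
# donors are matched on an observed class variable, region.
region = np.array([0, 0, 0, 1, 1, 1, 1, 0])
income = np.array([30.0, 32.0, np.nan, 55.0, np.nan, 60.0, 58.0, 31.0])

imputed = income.copy()
for i in np.where(np.isnan(income))[0]:
    # donor pool: observed incomes in the recipient's region
    pool = income[(region == region[i]) & ~np.isnan(income)]
    imputed[i] = rng.choice(pool)            # random hot deck: random donor

print(imputed)
```

Because each imputed value is drawn from observed values in the same class, the imputations are always plausible, in line with the properties noted below.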
Hot deck imputation was originally developed by the United States Census Bu-
reau to deal with missing data in large data sets that were available for public use.
The term ‘hot deck’ originates from computer punch cards that were used for storing
data. A ‘hot deck’ is one that is currently being processed, whereas a ‘cold deck’ has
already been processed. In the context of missing data, hot deck imputation draws
donors from observed values in the same data set, whereas cold deck imputation
draws donors from an external data set.
Hot deck imputation uses information from donors to ‘fill in’ missing values in
order to produce a complete or ‘rectangular’ data set, which may then be analysed
using standard complete-data methods. Thus the information in the incomplete
cases is retained. Moreover, the imputed values are plausible since they are drawn
from the observed values in the data set.
Being a non-parametric method, hot deck imputation does not make distribu-
tional assumptions and is therefore less sensitive to model misspecification [4]. It
preserves all the complex relationships, such as interactions, in the data set [8, p.181],
and is invariant to transformations of the marginal distributions of the incomplete
variables [4].
Hot deck imputation will underestimate standard errors unless corrective mea-
sures are undertaken [15, p.49]. Andridge & Little [4] review three main approaches
for obtaining valid variance estimates from hot deck imputation:
1. explicit variance formulae that incorporate the nonresponse;
2. resampling methods such as the jackknife and the bootstrap; and
3. hot deck multiple imputation (HDMI), where multiple sets of imputations are
created to mimic imputation uncertainty.
Enders [15, p.49] notes that hot deck imputation preserves univariate distribu-
tions in the data set and does not underestimate variability to the same extent
as other single imputation methods. However, it may produce biased estimates of
correlations and regression coefficients [49].
The validity of hot deck imputation rests on identifying appropriate donor pools
and the effective matching of donors to recipients [8, p.181]. Carpenter & Kenward
[8, p.181] note that non-parametric models are “inefficient” compared to parametric
models in the sense that they produce less precise parameter estimates than those
from a correctly specified parametric model. As a non-parametric method, hot deck
imputation may be useful in very large data sets as matching donors to recipients is
easier and loss of precision may be less of a concern [4]. However, for smaller studies,
Carpenter & Kenward [8, p.181] state that parametric imputation is preferred in
most cases.
2.5 Maximum likelihood estimation
Unlike complete case analysis, maximum likelihood (ML) estimation includes
the information available from the cases with missing data in estimating the model
parameters. ML estimation generally requires the use of an iterative optimisation al-
gorithm such as the Expectation-Maximisation (EM) algorithm, developed by Demp-
ster, Laird & Rubin (DLR) [14]. The EM algorithm reformulates the incomplete
data problem in terms of a complete data problem that is more easily solved [35,
p.2]. As the name suggests, there are two steps in each iteration: the Expectation
or E-step and the Maximisation or M-step.
Suppose we have a data set X = (Y ,Z), where Y is the observed data and Z is
the missing data. Assume that X has a probability density function p(x | β), where
β = (β1, . . . , βk) is a vector of parameters. The E-step calculates Q, the conditional
expectation of the complete data log-likelihood given the observed data Y and the
current parameter estimates. The M-step then obtains the parameter estimates that
maximise Q from the E-step.
The E and M-steps are defined as follows at iteration t + 1, t = 0, 1, 2, . . . [35,
p.19]. Set t = 0 and select initial parameter values β(0).
E-step
Calculate
Q(β;β(t)) = E[ln p(X | β) | Y = y,β(t)]. (2.3)
M-step
Select β(t+1) such that
Q(β(t+1);β(t)) ≥ Q(β;β(t)). (2.4)
DLR [14] proved that the log-likelihood L(β) is non-decreasing with each iteration,
that is
L(β(t+1)) ≥ L(β(t)). (2.5)
The iterations continue until |L(β(t+1))−L(β(t))| reaches an arbitrarily small value,
at which point the algorithm is said to have converged. In general, the EM algorithm
is numerically stable and has good global convergence properties; that is, it converges
to a local maximum from any arbitrary starting point in the parameter space [35,
p.28].
It should be noted that the EM algorithm does not ‘fill in’ or ‘impute’ missing
values. Instead, the E-step replaces missing values with their conditional expecta-
tions, which contribute to the calculation of the sufficient statistics. The M-step
then uses the sufficient statistics to generate parameter estimates.
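As a concrete example of these E and M-steps, the Python sketch below applies the EM algorithm to a bivariate normal sample in which the second variable is missing for around 40% of cases. The E-step replaces the missing sufficient statistics by their conditional expectations (including the conditional variance term), and the M-step re-estimates µ and Σ; the simulation settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
data = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
y1, y2 = data[:, 0].copy(), data[:, 1].copy()
mis = rng.random(n) < 0.4                    # y2 missing (MCAR) for ~40% of cases
y2[mis] = np.nan

mu = np.zeros(2)                             # starting values
cov = np.eye(2)
for _ in range(200):
    # E-step: conditional expectations of the missing sufficient statistics
    b = cov[0, 1] / cov[0, 0]                # slope of the regression of y2 on y1
    e_y2 = np.where(mis, mu[1] + b * (y1 - mu[0]), y2)
    resid_var = cov[1, 1] - b * cov[0, 1]    # conditional variance of y2 given y1
    e_y2sq = np.where(mis, e_y2 ** 2 + resid_var, y2 ** 2)
    # M-step: re-estimate the parameters from the completed sufficient statistics
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = (y1 ** 2).mean() - mu[0] ** 2
    s12 = (y1 * e_y2).mean() - mu[0] * mu[1]
    s22 = e_y2sq.mean() - mu[1] ** 2
    cov = np.array([[s11, s12], [s12, s22]])

print(np.round(mu, 2), np.round(cov, 2))     # close to the true parameters
```

Note that, as discussed above, no missing value is ever 'filled in': only the sufficient statistics are completed.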
The EM algorithm produces unbiased parameter estimates when the missingness
mechanism is MAR [15, p.87]. For an MCAR mechanism (a special case of MAR),
it increases statistical power compared to complete case analysis since it uses all of
the information available from the observed data. When the missingness mechanism
is MNAR, it has been found to produce biased parameter estimates, although this
is usually limited to a subset of model parameters [15, p.87].
The EM algorithm has evolved considerably since it was first developed and
there are now many different EM-type algorithms with various applications. EM-
type algorithms are used in a wide range of complex missing data problems including
structural equation models with missing data [15, p.104]. The focus of recent re-
search has been mainly on Markov chain Monte Carlo (MCMC) versions of EM-type
algorithms [35].
McLachlan & Krishnan [35, p.29] discuss two disadvantages of the EM algorithm.
The first is that the EM algorithm does not automatically produce an estimate of
the covariance matrix of the maximum likelihood estimates. The second is that
convergence can be slow, particularly when there is a high proportion of missing
data. We discuss these issues below.
2.5.1 Obtaining standard errors from the EM algorithm The standard
errors of the maximum likelihood estimates β may be calculated directly as follows
[15, p.97]. First, the Hessian matrix is computed from the second-order derivatives
of the observed data log likelihood. The observed information matrix I(β;y) is
the negative of the Hessian matrix. The inverse of the observed information matrix
estimates the covariance matrix for the maximum likelihood estimates β.
If the second-order derivatives are difficult to obtain analytically, standard errors
may be estimated using Meng & Rubin’s [37] Supplemented EM algorithm (SEM),
Louis’s method [34] or bootstrapping. Details may be found in McLachlan & Kr-
ishnan [35].
2.5.2 Using a hybrid method to accelerate convergence Hybrid meth-
ods combine the EM algorithm with a Newton-type method to accelerate conver-
gence. To take advantage of its global convergence properties, the EM algorithm is
performed for a few iterations, followed by the Newton-Raphson or another
Newton-type method with rapid local convergence [41]. Redner & Walker [41]
found that 95% of the change in log-likelihood from initial to maximum value gen-
erally occurred in the first five iterations of the EM algorithm.
Aitkin & Aitkin [1] developed a hybrid EM/GN (Gauss-Newton) algorithm, EM-
GN5, which is a faster alternative to the EM algorithm for finite mixture distribu-
tions. The algorithm begins with five EM iterations then switches to GN until
convergence or until the log-likelihood decreases. The method was illustrated in the
context of a two-component normal mixture [35].
The EM-GN5 algorithm took 70% of the time required for the EM algorithm
to converge, consistently over all initial values, and provided asymptotic standard
errors [1]. However, the log-likelihood generally decreased when the GN step was
first applied and sometimes required a large number of EM controlling steps before
the log-likelihood increased. It then rapidly converged to the same maximum as the
EM algorithm. Aitkin & Aitkin provide an interesting analogy to describe this [1,
p.130]:
“...we formed the impression of a traveller following the narrow EM path up a
hazardous mountain with chasms on all sides. When in sight of the summit, the
GN path leapt to the top, but when followed earlier, it caused repeated falls into
the chasms, from which the traveller had to be pulled back on to the EM track”.
2.6 Multiple imputation
Multiple imputation (MI) [43] consists of three distinct phases. The procedure
starts with the imputation phase, where the missing values are ‘filled in’ to produce
m ≥ 2 completed data sets. Next, in the analysis phase, the m completed data sets
are analysed separately using standard complete-data statistical procedures. Finally,
in the pooling phase, the results obtained from the analysis phase are combined
using Rubin’s rules [44] to produce overall parameter estimates. The three phases
are described in detail in Chapter 3.
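The pooling phase can be sketched directly. In the Python fragment below, the m point estimates and within-imputation variances are hypothetical; the pooled estimate is the average of the m estimates, and the total variance combines within- and between-imputation components as in Rubin's rules.

```python
import numpy as np

# Hypothetical results of the analysis phase: m = 5 point estimates q and
# their squared standard errors u from the five completed data sets.
q = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
u = np.array([0.25, 0.24, 0.26, 0.25, 0.25])

m = len(q)
q_bar = q.mean()                   # pooled point estimate
w = u.mean()                       # within-imputation variance
b = q.var(ddof=1)                  # between-imputation variance
t = w + (1 + 1 / m) * b            # total variance of the pooled estimate

print(round(q_bar, 3), round(t, 3))
```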
Note that the ‘filled-in’ values in the imputation phase are not of interest in
themselves; they are simply a means of recovering missing information in order
to obtain unbiased parameter estimates and valid statistical inferences. It is also
worth noting that all of the variables in the imputation model are treated as input
variables — there is no distinction between explanatory and outcome variables in
the imputation phase.
Suppose X = (X1, . . . ,Xk) is a vector of k random variables with k-variate
distribution p(X | β). The general procedure for creating imputations X∗ for Xmis
is as follows [44].
1. Calculate the posterior distribution p(β | Xobs) of β based on the observed
data Xobs.
2. Draw β∗ from p(β |Xobs).
3. Draw X∗ from p(Xmis |Xobs,β = β∗).
Steps 2 and 3 are performed m times to create m sets of imputations.
There are many different types of MI algorithms, including fully conditional
specification (FCS) [40], predictive mean matching [30] and multivariate normal
imputation (MVNI) [44]. This study focuses on MVNI, which will be described in
detail in the next chapter. MVNI accommodates a general missing data pattern,
is straightforward to implement and is available in a range of statistical software
packages. In the following sections we outline some other methods of MI. Note that
the MI methods differ only with respect to the imputation phase. The analysis and
pooling phases are the same for all MI methods.
It is important to note that MI (or any other method of handling missing data)
is not always superior to complete case analysis. Lee & Carlin [28] caution that
potential gains from MI may be mitigated by bias from an incorrectly specified
imputation model, particularly for high rates of missingness. Some guidelines for
specifying the imputation model are discussed in Section 3.6 in Chapter 3.
2.7 Fully conditional specification
Also known as multiple imputation with chained equations (MICE), fully condi-
tional specification (FCS) is a semiparametric method that generates imputations
based on a sequence of univariate imputation models, one for each incomplete vari-
able [40]. A variable with missing data is regressed on some or all of the other
variables and the missing values replaced by simulated draws from the correspond-
ing posterior predictive distribution. FCS accommodates a general missing data
pattern with missingness across different types of variables.
For each incomplete variable, the univariate imputation model will depend on the
type of variable being imputed. For example, normal linear regression is generally
used to impute continuous variables, while a logistic regression model is suitable
for imputing binary variables. An advantage of FCS is that a different type of
imputation model may be specified for each incomplete variable.
FCS uses the following iterative estimation algorithm [8, p.86].
1. First, the variables X1, . . . ,Xk are ordered so that the missingness pattern
is as close to monotone as possible (refer to Section 1.4 in Chapter 1). Stata
imputes variables in order from most to least observed [53, p.160].
2. To start the algorithm, the missing values for each incomplete variable Xj are
filled in by drawing, with replacement, from the observed values for Xj.
3. For each j = 1, . . . , k in turn, perform the following steps.
(a) Regress the observed values of Xj on the remaining variables, with miss-
ing values set at their current imputed values. If Xj is binary, use a
logistic regression model. If Xj is continuous, use a linear regression
model.
(b) Using the regression model in (a), impute the missing values of Xj.
Performing steps (a) and (b) for j = 1, . . . , k is referred to as a cycle [59].
To stabilise the estimates, a fixed number of cycles are performed to produce
a single imputed data set. Van Buuren [57] recommends between 5 and 20
cycles in most cases. White et al. [59] state that around 10 or 20 cycles are
generally sufficient to produce a single imputed data set, although more might
be required if the incomplete variables are strongly correlated.
4. Step 3 is performed m times to produce m imputed data sets.
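The algorithm above can be sketched compactly. The Python fragment below imputes one continuous and one binary variable over ten cycles; to keep it short it substitutes a linear-probability fit for the logistic model in step 3(a) and omits the posterior draw of the regression parameters, so it is illustrative rather than a proper FCS implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
x = rng.normal(size=n)                                    # continuous variable
z = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)  # related binary variable

x_mis = rng.random(n) < 0.2                               # ~20% missing on each
z_mis = rng.random(n) < 0.2
x_obs, z_obs = ~x_mis, ~z_mis
xi, zi = x.copy(), z.copy()

# Step 2: initialise by drawing with replacement from the observed values.
xi[x_mis] = rng.choice(x[x_obs], x_mis.sum())
zi[z_mis] = rng.choice(z[z_obs], z_mis.sum())

for _ in range(10):                                       # ten cycles
    # Steps 3(a)-(b) for x: linear regression of x on z, impute with noise.
    X = np.column_stack([np.ones(n), zi])
    beta, *_ = np.linalg.lstsq(X[x_obs], x[x_obs], rcond=None)
    sigma = np.std(x[x_obs] - X[x_obs] @ beta, ddof=2)
    xi[x_mis] = X[x_mis] @ beta + rng.normal(0.0, sigma, x_mis.sum())
    # Steps 3(a)-(b) for z: a linear-probability fit stands in for the
    # logistic model; missing values are drawn as Bernoulli variables.
    Zm = np.column_stack([np.ones(n), xi])
    gamma, *_ = np.linalg.lstsq(Zm[z_obs], z[z_obs], rcond=None)
    p = np.clip(Zm[z_mis] @ gamma, 0.0, 1.0)
    zi[z_mis] = (rng.random(z_mis.sum()) < p).astype(float)

print(round(float(xi.mean()), 2), round(float(zi.mean()), 2))
```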
White et al. [59] describe the procedure for imputing binary variables using FCS
as follows. Suppose that Z is an incomplete binary variable whose missing values
are imputed from a set of variables X using the logistic regression model
logit Pr(Z = 1 |X;β) = βX.
Let β̂ be the estimated parameter vector from this regression model, with estimated
variance-covariance matrix V . Let β∗ represent a draw from the posterior distribution
of β, approximated by MVN(β̂,V ). For each missing value Zi, let
p∗i = [1 + exp (−β∗Xi)]−1.
Draw an imputed value Z*_i = 1 if u_i < p*_i, and 0 otherwise, where u_i is a random
draw from a uniform distribution on (0, 1). White et al. [59] note that problems
can occur when one or more observations has a fitted probability of exactly 0 or 1,
which causes difficulty in drawing β∗. This is known as perfect prediction and can
occur when imputing binary, ordinal or nominal variables under FCS.
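The drawing step described by White et al. can be sketched as follows. The coefficient vector β* and the covariate rows are illustrative stand-ins only; in practice β* would be drawn from MVN(β̂, V) after fitting the logistic regression to the observed values of Z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values: a perturbed coefficient draw beta* and the covariate
# rows X_i of the cases with Z missing (in practice beta* is a draw from
# MVN(beta_hat, V) based on the fitted logistic regression).
beta_star = np.array([-0.5, 1.2])
X_mis = np.column_stack([np.ones(1000), rng.normal(size=1000)])

p_star = 1.0 / (1.0 + np.exp(-(X_mis @ beta_star)))  # p*_i
u = rng.random(p_star.size)                          # u_i ~ Uniform(0, 1)
Z_imp = (u < p_star).astype(int)                     # Z*_i = 1 if u_i < p*_i
```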
Another criticism of FCS is that the conditional densities do not always form
a multivariate joint conditional distribution. This is referred to as incompatibility
of conditionals [5]. To what extent incompatibility of the conditionals affects the
quality of the imputations is largely unknown. However, Van Buuren [57, p.228]
remarks that “in imputation, the objective is to augment the data and preserve the
relations in the data. In that case, the joint distribution is more like a nuisance
factor that has no intrinsic value”.
2.8 Predictive mean matching
Predictive mean matching (PMM) is a semiparametric imputation method developed by
Little [30] that combines normal linear regression with nearest neighbour imputation. It matches a
missing value to the observed value with the closest predicted mean or linear predic-
tion. Suppose we have an incomplete variable y = (y1, . . . , yn), with normal linear
regression model
yi | xi ∼ N(x′iβ, σ2), (2.6)
where xi = (xi1, . . . , xip)′ are the values of the predictors of y for observation i,
β = (β1, . . . , βp)′ are the unknown regression coefficients and σ2 is an unknown
variance.
The predictive mean matching algorithm applies the following steps.
1. Fit the regression model in (2.6) to the observed data to produce parameter
estimates β̂ and σ̂².
2. Draw new parameter values β* and σ²* from their joint posterior distribution
under the noninformative improper prior p(β, σ²) ∝ 1/σ².
3. For each incomplete case yj, perform the following steps.
i. Calculate the absolute difference |ŷ_j − ŷ_{c_d}| between the linear prediction
ŷ_j for y_j and the linear prediction ŷ_{c_d} for each complete case y_{c_d}, for
d = 1, 2, . . . , l, where l is the number of complete cases in y.
ii. Determine the k minimum absolute differences and denote the corre-
sponding complete cases by yc1 , . . . , yck , where k is arbitrarily chosen.
iii. Randomly draw an imputed value for yj from yc1 , . . . , yck .
4. Repeat steps 2 and 3 above to produce m imputed data sets.
Choosing the number k of nearest neighbours is a trade-off between bias and variance
[50]. The smaller the value of k, the higher the variability of the MI estimates. On
the other hand, a large value of k may increase bias. An advantage of predictive
mean matching is that it produces plausible imputed values since they are drawn
from the observed values in the data set. Note that PMM may be used for the
conditional specifications within FCS.
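Steps 3(i)–(iii) of the algorithm can be sketched as follows. The predictions and donor pool are toy values, and the function is an illustrative sketch rather than any particular package's PMM implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def pmm_impute(y_hat_mis, y_hat_obs, y_obs, k=5, rng=rng):
    """Match each incomplete case to its k nearest complete cases by
    linear prediction and draw the imputation from their observed values."""
    imputed = np.empty(len(y_hat_mis))
    for j, pred in enumerate(y_hat_mis):
        d = np.abs(pred - y_hat_obs)             # step 3(i): absolute differences
        donors = np.argsort(d)[:k]               # step 3(ii): k nearest donors
        imputed[j] = y_obs[rng.choice(donors)]   # step 3(iii): random donor draw
    return imputed

# Toy example: linear predictions for 3 incomplete and 8 complete cases.
y_obs = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y_hat_obs = y_obs + 0.1
y_hat_mis = np.array([1.6, 3.1, 4.4])
imp = pmm_impute(y_hat_mis, y_hat_obs, y_obs, k=3)
```

Because each imputation is drawn from the observed donor values, the imputed values are always plausible, as noted above.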
2.9 Inverse probability weighting
In inverse probability weighting (IPW), the complete cases are weighted by the
inverse of their probability of being a complete case [25]. To illustrate this method,
consider a generalised linear model with outcome variable Y regressed on a set
of covariates X. The parameter estimates β are the values that solve the score
equations [51]
∑_{i=1}^{n} U_i(β) = 0, (2.7)
where U_i(β) is the first derivative with respect to β of the log-likelihood contribution of case i.
Let C_i = 1 if case i is complete and C_i = 0 otherwise, and let C = (C_1, C_2, . . . , C_n). In
the IPW approach, the parameter estimates β are the solution of the IPW score
equations [51]
∑_{i=1}^{n} C_i w_i U_i(β) = 0, (2.8)
where wi is the weight for case i. Generally, a logistic regression model is fitted with
C as the response variable and predictors taken from X, Y and Z, where Z is a set
of measured variables not included in the substantive analysis model. The weights
wi are then taken as the inverse of the fitted probability that case i is complete [51].
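As a sketch of the approach, the following toy example shows the IPW solution of (2.8) for a simple mean, compared with the biased complete case estimate. It is illustrative only: the missingness model here is saturated in a single binary covariate, so the fitted completeness probabilities are just within-group rates (a logistic regression of C on X would give the same fitted values).

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: binary covariate X fully observed; Y missing more often when X = 1.
n = 20_000
x = rng.integers(0, 2, n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
c = rng.random(n) < np.where(x == 1, 0.5, 0.9)   # C_i = 1 if case i is complete

# Missingness model: observed completeness rate within each level of X.
p_hat = np.array([c[x == 0].mean(), c[x == 1].mean()])[x]
w = 1.0 / p_hat                                  # inverse probability weights

# For the mean, solving sum_i C_i w_i (y_i - mu) = 0 gives a weighted average.
mean_cc = y[c].mean()                            # complete case estimate (biased)
mean_ipw = np.sum(c * w * y) / np.sum(c * w)     # IPW estimate of E[Y]
```

Here E[Y] = 2.75; the complete case mean under-represents the X = 1 group, while the IPW estimate reweights the complete cases to restore it.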
IPW specifies a ‘missingness model’ while MI specifies an imputation model.
MI has two main advantages over IPW [51]. The first is that MI can use partially
observed variables. This is in contrast to IPW, which can only use fully observed
variables unless there is monotone missingness or a relational Markov model (RMM)
is used. Second, MI is generally more efficient than IPW. The advantage of IPW
is that it is, arguably, easier to understand than MI and simpler to use. Interested
readers may refer to [51] for further details.
2.10 Methods for data missing not at random
In this section, we outline methods of missing data handling when the data are
MNAR. Recall that an MNAR mechanism means that the missingness is dependent
on unmeasured variables. This type of missingness is also referred to as non-ignorable
missingness.
Under an MNAR mechanism, the data and the probability of missingness have
a joint distribution. Alternative factorisations of this joint distribution produce
two types of MNAR models: the selection model and the pattern mixture model
[15, p.290]. The selection model consists of the substantive analysis model and a
model that predicts the probability of missingness. The pattern mixture model,
on the other hand, groups the data set by missingness pattern and estimates the
substantive analysis model separately for each pattern.
It is important to note that both types of MNAR models depend on assumptions
that are not possible to verify. Consequently, Enders [15, p.327] states that the most
useful application of MNAR models is for sensitivity analysis. By applying different
models to the data, the sensitivity of the parameter estimates to various assumptions
can be determined.
2.10.1 Selection model Suppose we have a data set with variables X =
(X1, . . . ,Xk) and missingness indicator R. A selection model for this data set is
given by [8, p.17]
p(X,R) = p(R |X) p(X), (2.9)
where p(X,R) is the joint distribution of the missingness and the data, p(R | X)
is the conditional distribution of the missingness given the data, and p(X) is the
substantive analysis model.
An example of a classic selection model is the Heckman selection model [22],
which corrects bias in a regression model with MNAR missingness on the outcome
variable. Suppose that a researcher is interested in the factors determining wages
but only has wage data for those who are in paid employment. The wage data are
MNAR since people who are not in paid employment are excluded from the sample.
For example, women with low wages may decide not to work outside the home. The
regression equation for wages is
Wi = βXi + εi, (2.10)
where Wi is the wage, Xi are the explanatory variables, β are the regression coef-
ficients and εi is the error term for the ith subject. The propensity for missingness
on W is
R∗i = γZi + ζi, (2.11)
where R∗i is the latent propensity for missingness, Zi are the explanatory variables,
γ are the regression coefficients and ζi is the error term for the ith subject. The
binary missingness indicator Ri is a manifest indicator for R∗i and is estimated using
the probit regression model
p(R∗i > 0) = p(Ri = 1 | Zi) = Φ(γZi), (2.12)
where Φ is the cumulative standard normal distribution function. The error terms
ε and ζ have a bivariate normal distribution and are assumed to be independent of
the explanatory variables X and Z. The correlation between ε and ζ captures the
dependency between the outcome variable W and the propensity for missingness
R∗. A non-zero correlation between the error terms implies that missingness is
related to the outcome variable after controlling for the explanatory variables in the
substantive analysis model.
The parameters in equations 2.10 and 2.12 may be estimated using Heckman’s
two-step method based on ordinary least squares regression [22] or using maximum
likelihood estimation. Note that the selection model is sensitive to departures from
the bivariate normality assumption for the error terms [15, pp.293–294].
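A numerical sketch of the two-step method is given below. It is illustrative Python, not part of the thesis: the data are simulated under equations (2.10)–(2.11) with correlated errors, and the first-stage probit coefficients γ = (0.5, 1.0) are taken as known to keep the sketch short, whereas in practice they are estimated from (2.12). The second step augments the wage regression with the inverse Mills ratio λ_i = φ(γZ_i)/Φ(γZ_i).

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)

def phi(a):                           # standard normal density
    return np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)

def Phi(a):                           # standard normal CDF
    return np.array([0.5 * (1 + erf(v / sqrt(2))) for v in a])

# Simulate wages with MNAR selection and correlated errors (rho = 0.6).
n = 50_000
x = rng.normal(size=n)                # wage-equation covariate
z = rng.normal(size=n)                # selection-equation covariate
eps, zeta = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T
w = 1.0 + 2.0 * x + eps               # wage equation (2.10)
sel = (0.5 + 1.0 * z + zeta) > 0      # selection equation (2.11): wage observed

# Step 2: include the inverse Mills ratio as an extra regressor, with the
# first-stage coefficients gamma = (0.5, 1.0) taken as known for brevity.
lam = phi(0.5 + z) / Phi(0.5 + z)
A = np.column_stack([np.ones(sel.sum()), x[sel], lam[sel]])
beta0_hat, beta1_hat, rho_sigma_hat = np.linalg.lstsq(A, w[sel], rcond=None)[0]

# Naive OLS on the complete cases: its intercept absorbs E[eps | selected]
# and is therefore biased upwards in this example.
naive = np.linalg.lstsq(np.column_stack([np.ones(sel.sum()), x[sel]]),
                        w[sel], rcond=None)[0]
```

The coefficient on λ estimates ρσ_ε (0.6 in this simulation), and including λ removes the selection bias from the intercept that the naive complete case regression exhibits.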
In order to reduce bias from MNAR missingness, the selection model must cor-
rectly specify the conditional distribution of the missingness given the data. Enders
[15, p.296] states that “...in many realistic scenarios, the model can produce esti-
mates that are even worse than those of MAR-based missing data handling meth-
ods.” Since the causes of missingness are generally unknown, it is not possible to
evaluate the performance of a selection model for a real life data set with missing
data.
2.10.2 Pattern mixture model A pattern mixture model uses the alterna-
tive factorisation of the joint distribution in (2.9) [8, p.17],
p(X,R) = p(X | R) p(R), (2.13)
where p(X | R) is the conditional distribution of the data given R, and p(R) is the
distribution of the missingness.
A pattern mixture model estimates parameters separately for each missing data
pattern then calculates a weighted average for each parameter to produce a final set
of parameter estimates. The ‘weight’ for a missing data pattern is the proportion of
cases in that pattern. Suppose we have two incomplete variables X1 and X2 with
three missing data patterns: (1) cases with observed values for both X1 and X2,
(2) cases with an observed value for X1 only, and (3) cases with an observed value
for X2 only. Estimating the parameters is straightforward for the first missing data
pattern since both variables are fully observed. However, the other patterns have
missingness in one of the variables. The model is said to be underidentified since
there is a set of inestimable parameters [15, p.299]. Estimating these parameters
requires assumptions, known as identifying restrictions. For example, Little’s [31]
complete case missing variable restriction replaces the inestimable parameters with
the parameter estimates from the complete cases.
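The weighted averaging across patterns can be sketched in a few lines; all values below are hypothetical, and in practice the pattern-specific estimates for incompletely observed patterns would first be filled in using an identifying restriction.

```python
import numpy as np

# Hypothetical per-pattern estimates of a mean and cases per pattern.
theta = np.array([3.2, 2.6, 3.9])    # estimate within each missing data pattern
n_pat = np.array([600, 250, 150])    # cases in each pattern
pi = n_pat / n_pat.sum()             # pattern 'weights' (proportions of cases)

theta_pooled = np.sum(pi * theta)    # weighted average across patterns
```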
Pattern mixture models may be used to model drop-out in longitudinal studies.
Hedeker & Gibbons [23] describe a pattern mixture model for psychiatric drug trial
data where cases with missing values are combined into a single missing data pattern.
This simplifies the model and avoids the need for parameter substitution methods.
Since the parameter estimates from a pattern mixture model are weighted aver-
ages of the estimates from each missing data pattern, additional steps are required
to obtain standard errors [15, p.309]. The delta method [23] is used to calculate ap-
proximate standard errors for parameter estimates obtained from a pattern mixture
model. The mathematical details of the delta method are beyond the scope of this
thesis but interested readers may refer to Hedeker & Gibbons [23], and Molenberghs
& Kenward [38].
The advantage of pattern mixture modelling over selection modelling is that the
former does not make distributional assumptions. However, its potential to reduce
bias depends on the appropriateness of the identifying restrictions [15, pp.300–301].
The sensitivity of the parameter estimates may be examined using a range of values
for the inestimable parameters.
2.10.3 Issues associated with MNAR data Most methods of handling
missing data assume that the data are MAR and that the missing data distribution
is ignorable. Thus the parameters of the missing data distribution are ignored when
performing an MAR-based analysis such as MI or ML estimation. Under an MNAR
mechanism, the parameters of the missing data distribution contain unique informa-
tion about the substantive model parameters [15, p.290]. Ignoring the missing data
distribution will therefore produce biased parameter estimates for MNAR data.
The MNAR-based methods described above aim to model the joint distribution
of the data and the probability of missingness. However, both of these methods rely
on assumptions that cannot be verified. The selection model makes distributional
assumptions while the pattern mixture model assumes values for inestimable pa-
rameters. Enders [15, p.287] notes that violation of these assumptions can produce
estimates that are even worse than those from an MAR-based analysis. Demirtas
& Schafer [13, p.2573] state that “the best way to handle drop-out is to make it
ignorable” and argue that an ignorability-based (MAR) analysis that includes good
predictors of attrition is often more plausible than an MNAR-based analysis.
CHAPTER 3
Multiple Imputation
3.1 Introduction
Multiple imputation (MI) was introduced by Rubin [43] in 1978 and is considered
a ‘state of the art’ method of missing data handling [49]. It was originally developed
to handle missing data in complex surveys for creating large public-use data sets
[45], but is now used in a variety of research contexts.
MI consists of three distinct phases [44]:
1. The imputation phase: each missing value is replaced with m ≥ 2 imputed
values to produce m completed or ‘imputed’ data sets.
2. The analysis phase: each imputed data set is analysed separately using stan-
dard complete-data methods.
3. The pooling phase: the results obtained from the analysis phase are combined
using Rubin’s rules [44] to produce overall parameter estimates.
MI assumes that the data are missing at random (MAR) and that the missing
data distribution is ignorable [33], as defined in Section 1.7.
Although Rubin’s original justification for MI uses frequentist arguments, Rubin
[44] recommends creating the imputations using a Bayesian approach. This involves
specifying a parametric model for the complete data, a prior distribution for the
model parameters, and then making m draws from the conditional distribution of
the missing data given the observed data [47]. Schafer [46, pp.105–106] describes
multiple imputations as “Bayesianly proper” if they are independent realisations
of p(Xmis |Xobs), the posterior predictive distribution of the missing data under a
complete-data model and prior.
In practice, multiple imputation is performed using algorithms such as data aug-
mentation [56], which produce imputed values with stationary distribution p(Xmis |Xobs).
Thus multiple imputation can be described as a three stage approximation to a full
Bayesian analysis [8, p.48].
An appeal of MI as a method of missing data handling is that the imputation
phase is separate from the analysis phase. This has two main advantages. First, it
allows the imputation and analysis phases to be performed by different individuals.
Thus an incomplete data set, once imputed, may be used by different end users for a
wide range of statistical analyses. According to Rubin [45], data collectors generally
know more about the reasons for missingness and are better equipped to handle
missing data than end users. Second, it allows auxiliary variables not necessarily of
interest in the analysis phase to be included in the imputation phase. The auxiliary
variables are predictors of the incomplete variables and/or predictors of missingness
and help improve the quality of the imputations.
Schafer [47] discusses inconsistencies between the model of the imputer and the
analyst. If the model of the imputer is more general (makes fewer assumptions)
than that of the analyst, then the inferences obtained under MI will be valid, albeit
with some loss of power. If the model of the imputer is less general than that of
the analyst, and the additional assumptions by the imputer are plausible, then the
MI estimates may be more precise [36]. Rubin [45] refers to this as superefficiency.
However, MI estimates may be biased if the additional assumptions are not plausible.
Therefore the imputer should aim to preserve distributional features that will be used
in the analysis [47].
The outline of this chapter is as follows. We describe the imputation phase in
Section 3.2, followed by the analysis and pooling phases in Section 3.3. In Section 3.4
we present a detailed description of multivariate normal imputation (MVNI). A
comparison of MI and maximum likelihood (ML) estimation is given in Section 3.5.
Finally, in Sections 3.6 and 3.7 we discuss some important considerations in the
implementation of MI: specifying the imputation model and determining the number
of imputations to perform.
3.2 Imputation phase
The first phase of MI, the imputation phase, involves replacing each missing
value with m ≥ 2 imputed values to create m completed data sets. From a Bayesian
perspective, the imputation phase alternates between two steps [44]:
1. random draws of the parameters β from their conditional distributions given
the observed data Xobs and the imputed values X∗; and
2. random draws of the missing values Xmis from their conditional distribution
given the observed data Xobs and the parameters β.
The process repeats steps 1 and 2 until convergence and produces p(β | Xobs), the
posterior distribution of the parameters given the observed data, and p(Xmis |Xobs),
the posterior distribution of the missing values given the observed data [44].
3.3 Analysis and pooling phases
In the analysis phase, each of the m imputed data sets is analysed using standard
complete data statistical methods. The results are then combined using Rubin’s
rules [44] in the pooling phase. For a single (scalar) parameter β, performing the
imputation and analysis phases produces m completed data sets with corresponding
estimates β_1, . . . , β_m and variances (squared standard errors) σ²_1, . . . , σ²_m. An overall
estimate of β is obtained by averaging the estimates from the m imputed data sets
using Rubin’s rules [44], giving
β_MI = (1/m) ∑_{j=1}^{m} β_j. (3.1)
The variance of the parameter estimate is
V(β_MI) = W + (1 + 1/m) B, (3.2)
where W is the average within-imputation variance given by
W = (1/m) ∑_{j=1}^{m} σ²_j, (3.3)
and B is the between-imputations variance given by
B = (1/(m − 1)) ∑_{j=1}^{m} (β_j − β_MI)². (3.4)
The within-imputation variance in (3.3) represents the ‘natural variability’ of the
dataset (had there been no missing data). The between-imputations variance in
(3.4) measures the variability of a parameter estimate across the m imputed data
sets and represents the additional sampling error due to the missing data. If the
number of imputations m is large, then from (3.2)
V(β_MI) ≈ W + B. (3.5)
The fraction of missing information (FMI) is the proportion of the total sampling
variance that is due to the missing data. For large m, this is given by [15, p.225]
FMI ≈ B / (W + B). (3.6)
The FMI depends on the missing data rate and the correlations among the variables
[15, p.204]. When the variables are uncorrelated, the FMI is approximately equal
to the missingness rate. However, when the variables are correlated, the FMI will
be less than the missingness rate. This is because correlation between the variables
offsets some of the loss of information. Including auxiliary variables that are highly
predictive of the incomplete variables mitigates information loss and decreases FMI.
The relative increase in variance (RIV) is the proportional increase in the sam-
pling variance due to the missing data and is given by [15, p.226]
RIV = FMI / (1 − FMI). (3.7)
It is worth noting that for large m,
RIV ≈ B / W. (3.8)
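Rubin's rules reduce to a few lines of arithmetic. The following sketch pools hypothetical estimates and squared standard errors from m = 5 imputed data sets; the function and variable names are our own, and the FMI and RIV are computed in their large-m forms.

```python
import numpy as np

def pool(estimates, variances):
    """Combine m completed-data estimates and squared standard errors
    using Rubin's rules (equations 3.1-3.4) and the large-m FMI and RIV."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = q.size
    beta_mi = q.mean()                               # (3.1) pooled estimate
    W = u.mean()                                     # (3.3) within-imputation
    B = q.var(ddof=1)                                # (3.4) between-imputations
    T = W + (1 + 1 / m) * B                          # (3.2) total variance
    fmi = B / (W + B)                                # (3.6), large-m form
    riv = fmi / (1 - fmi)                            # (3.7); equals B / W
    return beta_mi, T, fmi, riv

est, T, fmi, riv = pool([1.9, 2.1, 2.0, 2.2, 1.8],
                        [0.04, 0.05, 0.04, 0.05, 0.04])
```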
Carpenter & Kenward [8, p.43] highlight three important points regarding mul-
tiple imputation.
1. Rubin’s rules [44] are generic and do not require model-specific calculations.
2. Rubin’s rules [44] should be applied to estimators that are normally or asymp-
totically normally distributed.
3. Multiple imputation has good frequentist properties for a relatively small num-
ber of imputations.
3.4 Multivariate normal imputation
Multivariate normal imputation (MVNI) [44] uses data augmentation [56], a
Bayesian iterative Markov chain Monte Carlo (MCMC) procedure, to impute missing
values assuming a multivariate normal distribution for the data. MCMC methods
generate pseudorandom draws from probability distributions through the use of
Markov chains [7]. The target distribution is the density f(.), which is often difficult
to draw from directly. Instead, we construct a Markov chain
M0,M1, . . . ,Mt, . . .
with a stationary distribution that converges to the target distribution f(.). For each
t ≥ 0, M_{t+1} is sampled from a distribution p(M_{t+1} | M_t) that depends only on
the current element M_t and not on any earlier element in the chain. Thus,
p(Mt+1 |M0,M1, . . . ,Mt) = p(Mt+1 |Mt).
If the value of t is large enough, Mt approximates a random draw from the target
distribution.
MVNI accommodates a general missing data pattern with a haphazard pattern
of missingness across variables in the data set. It replaces missing values by drawing
from the posterior predictive distribution of the missing data given the observed
data. Each iteration of MVNI consists of two steps: an imputation step (I-step) and
a posterior step (P-step). The I-step produces the imputations, while the P-step
generates the parameter estimates that are needed to produce the imputations in the
next iteration. Data augmentation starts with initial estimates of the mean vector
and covariance matrix. These are generally maximum likelihood (EM) estimates.
The I-steps and P-steps are then repeated until convergence is achieved, producing
a single imputed data set. The m imputed data sets are drawn from the data
augmentation chain(s) and used in the subsequent analysis and pooling phases.
Data augmentation was originally developed to approximate the posterior distri-
bution p(β |Xobs) of the model parameters β in missing data problems. Augmenting
the observed data Xobs with the unobserved data Xmis produces a posterior distri-
bution p(β |Xobs,Xmis) that is easier to simulate from. Little & Rubin [33, p.201]
describe data augmentation as the Bayesian analogue of the EM algorithm where
the I-step corresponds to the E-step and the P-step corresponds to the M-step. As in
the EM algorithm, DA involves the application of complete-data methods to missing
data problems.
Since MVNI assumes a multivariate normal distribution for the data, the im-
puted values produced are on a continuous scale. This leads to the issue of handling
imputed values for variables that are clearly not normally distributed, such as cate-
gorical variables. This issue is central to this thesis and will be examined further in
Chapter 5.
3.4.1 The I-step From a Bayesian perspective, the I-step in data augmen-
tation replaces missing values with draws from the posterior predictive distribution
of the missing data given the observed data and the current parameter estimates.
Thus
X∗t ∼ p(Xmis |Xobs,β∗t−1), t = 1, 2, . . . , (3.9)
where X∗t denotes the imputed values at iteration t, Xmis is the missing data, Xobs
is the observed data and β∗t−1 represents the current parameter estimates. The
posterior distribution in (3.9) approximates p(Xmis |Xobs).
In essence, the I-step performs what is known as stochastic regression imputation
[15, p.190]. It uses current draws of the mean vector and covariance matrix to
generate a set of regression equations that predict the incomplete variables from the
observed variables. For a multivariate analysis with a single incomplete variable X1
and observed variables X2 and X3, the imputation regression equation is
X1i = β0 + β1X2i + β2X3i + zi, i = 1, . . . , n,
where X1i is the imputed value for observation i, β0, β1 and β2 are the current
values of the regression coefficients and z_i is a normally distributed random residual
with a mean of zero and variance σ²_{X1|X2,X3}. The imputed values for the incomplete
variable are calculated by substituting the values for the observed variables into the
imputation regression equation and adding a normally distributed residual term.
This residual term adds variability to the imputed data.
In the case of two or more incomplete variables, each missing data pattern will
have its own regression equation. For a multivariate analysis with two incomplete
variables X1 and X2 and fully observed variable X3, the imputation regression
equations are
X1i = β0 + β1X2i + β2X3i + zi,
X2i = β0 + β1X1i + β2X3i + zi.
The residual distribution is multivariate normal and is given by Z_i ∼ N(0, Σ_{X1,X2|X3}),
where ΣX1,X2|X3 is the residual covariance matrix from the multivariate regression
of the incomplete variables X1 and X2 on the fully observed variable X3.
3.4.2 The P-step This step uses Monte Carlo simulation to generate new pa-
rameter estimates from their conditional posterior distribution given the augmented
data, which consists of the observed data and the imputed data from the preceding
I-step. Thus
β∗t ∼ p(β |Xobs,X∗t ), t = 1, 2, . . . . (3.10)
At convergence, the posterior distribution in (3.10) gives p(β |Xobs).
At iteration t, the P-step uses the augmented data from the preceding I-step to
calculate the sample means µt and the sample sum of squares and cross products
matrix Λt [15, p.193]. These define the posterior distribution of the covariance
matrix at iteration t, given by
p(Σ | µt,X) ∼ W−1(n− 1, Λt), (3.11)
where X is the augmented data from the preceding I-step, W−1 is the inverse
Wishart distribution and n − 1 is the degrees of freedom for sample size n. A
new covariance matrix, Σ∗t , is then drawn from this posterior using Monte Carlo
simulation.
The posterior distribution of the mean vector at iteration t is
p(µ |X,Σ) ∼ N(µt, n−1Σ∗t ). (3.12)
Monte Carlo simulation is used to draw a new mean vector µ∗t from this posterior
distribution. The new estimates of the mean vector and covariance matrix are used
to calculate new parameter estimates β∗t , which are used in the I-step at the next
iteration.
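The two conditional draws in (3.11) and (3.12) can be sketched as follows. This is an illustrative numpy-only sketch, not the thesis's Stata implementation: the 'augmented data' are a random stand-in, and the inverse Wishart draw uses the sum-of-outer-products construction for the Wishart, which assumes integer degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(4)

def p_step(X_aug, rng=rng):
    """One P-step: draw (mu*, Sigma*) from their posteriors given the
    augmented data, following equations (3.11) and (3.12)."""
    n, k = X_aug.shape
    mu_t = X_aug.mean(axis=0)                 # sample means
    dev = X_aug - mu_t
    Lambda_t = dev.T @ dev                    # sums of squares / cross products
    # Sigma* ~ W^{-1}(n - 1, Lambda_t): draw W ~ Wishart(n - 1, Lambda_t^{-1})
    # as a sum of outer products of normal draws, then invert.
    L = np.linalg.cholesky(np.linalg.inv(Lambda_t))
    A = rng.standard_normal((n - 1, k)) @ L.T
    Sigma_star = np.linalg.inv(A.T @ A)
    Sigma_star = (Sigma_star + Sigma_star.T) / 2   # symmetrise round-off
    # mu* ~ N(mu_t, Sigma*/n), equation (3.12)
    mu_star = rng.multivariate_normal(mu_t, Sigma_star / n)
    return mu_star, Sigma_star

X_aug = rng.normal(size=(200, 3))             # stand-in augmented data
mu_star, Sigma_star = p_step(X_aug)
```

The drawn (μ*, Σ*) would then define the regression equations used by the I-step at the next iteration.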
It should be noted that the parameter values generated by the P-step may vary
considerably from one iteration to the next [15, p.199]. However, since the parameter
estimates at iteration t are used to generate the imputations at iteration t + 1, the
parameter values and imputations for successive iterations will be correlated.
3.4.3 Prior distributions In the Bayesian paradigm, the posterior distri-
bution is proportional to the product of the prior distribution and the likelihood
function. The posterior distribution of β is given by
p(β |Xobs,Xmis) ∝ p(β)L(Xobs,Xmis | β), (3.13)
where p(β) is the prior distribution and L(Xobs,Xmis | β) is the likelihood function.
Noninformative priors assign an equal probability to every possible value of the pa-
rameter, while informative priors assign different probabilities to values depending
on the (subjective) belief regarding their relative probabilities.
Multiple imputation generally uses a noninformative prior [15, p.186]. Thus the
posterior distribution is determined solely by the likelihood function. The prior
distribution for the mean is Jeffreys’ prior p(µ) = 1, while the prior distribution for
the covariance matrix is also a Jeffreys’ prior of the form [15, p.184]
p(Σ) ∝ |Σ|^{−(k+1)/2}. (3.14)
This is a conjugate prior based on the inverse Wishart distribution, where | Σ | is
the determinant of Σ and k is the number of variables.
3.4.4 Convergence Data augmentation starts with initial estimates (µ0, Σ0)
of the mean vector and covariance matrix. These are usually taken as EM estimates.
The initial estimates could also be taken as the complete case ML estimates. The
I-step and the P-step are then applied successively to create an MCMC sequence
{(X∗t ,β∗t ) : t = 1, 2, . . . },
where X∗t is the set of imputed values from the I-step and β∗t is the set of parameter
estimates from the P-step at iteration t. Iterations continue until the sequence
stabilises or ‘converges’ to a stationary distribution. The sequence of imputed values
converges to p(Xmis |Xobs) and the sequence of parameter estimates converges to
p(β |Xobs).
Convergence of the MCMC sequence depends on the FMI and RIV (refer to
Section 3.3) and the initial parameter estimates [15]. Using EM estimates as initial
values usually leads to more rapid convergence [15, p.204]. Convergence is assessed
by examining the sequence of parameter estimates, since these are often easier to
work with than the sequence of imputations. Trace plots or time series plots that
graph the parameter estimates against the iteration number are usually examined.
The burn-in period b is the minimum number of iterations required to achieve sta-
tionarity. This is the point where the sequence of parameter estimates has stabilised.
Note that parameters tend to converge at different rates due to different rates of
missingness among variables. The value of b is constrained by the parameter that is
the slowest to converge.
Another method that is used to assess convergence of data augmentation is the
worst linear function (WLF) of the parameters [46]. This is a weighted sum of the
parameter estimates from the P-step at iteration t,
WLF_t = νᵀβ*_t, (3.15)
where β∗t is a column vector of the parameter estimates and ν is a column vec-
tor of weights that represents the convergence rates of the corresponding maximum
likelihood (EM) estimates. Parameters that converge quickly are given a smaller
weighting, while parameters that converge more slowly are given a larger weighting.
A trace plot of the worst linear function provides a conservative estimate of conver-
gence. Stata [54] has an option called mcmconly that allows the user to obtain the
WLF estimates without performing multiple imputations.
As well as assessing convergence, dependence in the sequence of imputed values
also needs to be examined. This is because Bayesianly proper imputations [46,
pp.105–106] must be independent. The first step is to determine the number of
iterations k such that the imputations at iteration t + k are independent of the
imputations at iteration t. This may be done by examining an autocorrelation plot
to determine the lag k at which the autocorrelations for all parameter values have
fallen to zero. The value of k is determined from the parameter that is the slowest
to achieve serial independence.
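Reading the lag k off an autocorrelation plot can be approximated programmatically. The sketch below is illustrative only: an AR(1) series stands in for a parameter trace, and the 0.1 cut-off for 'fallen to zero' is an arbitrary tolerance of our own choosing.

```python
import numpy as np

def first_negligible_lag(series, max_lag=50, tol=0.1):
    """Smallest lag k at which the sample autocorrelation of a parameter
    trace has fallen (in absolute value) below tol, or None if it never does."""
    x = np.asarray(series, float) - np.mean(series)
    denom = np.dot(x, x)
    for k in range(1, max_lag + 1):
        if abs(np.dot(x[:-k], x[k:]) / denom) < tol:
            return k
    return None

rng = np.random.default_rng(5)
# AR(1) trace mimicking a slowly mixing parameter sequence.
trace = np.empty(5000)
trace[0] = 0.0
for t in range(1, 5000):
    trace[t] = 0.8 * trace[t - 1] + rng.normal()
k = first_negligible_lag(trace)
```

As noted above, k would be taken from whichever parameter trace is slowest to achieve serial independence.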
3.4.5 Convergence issues Sometimes data augmentation fails to converge
for reasons including [15, pp.255–256]:
1. the number of variables is close to or greater than the number of observations;
2. groups of variables are concurrently missing;
3. values of a variable may be completely missing for certain values of another
variable.
Convergence issues can sometimes be alleviated by deleting the variables that are
causing the problem. Another option is to use the ridge prior distribution for the
covariance matrix, a semi-informative prior that smooths the correlation elements
in the covariance matrix towards zero. The ridge prior has an inverse Wishart
distribution with two parameters: degrees of freedom df_p and an estimate Λ̄ of the
sum of squares and cross products matrix. The prior sum of squares and cross products
matrix at iteration t is [15, p.257]
Λ̄_t = df_p Σ̄_t, (3.16)
where Σ̄_t is a covariance matrix with correlation elements equal to zero and variance
elements obtained using the augmented data in the preceding I-step. The posterior
distribution of the covariance matrix with a ridge prior is [15, p.258]
p(Σ | µ, X) ∼ W^{−1}(df_p + n − 1, Λ_t + Λ̄_t). (3.17)
The degrees of freedom are df_p + n − 1 and the sum of squares and cross products
matrix is Λ_t + Λ̄_t, where Λ_t is the data-based matrix from (3.11). This is in contrast
to the posterior in (3.11), which has parameters n − 1 and Λ_t.
The ridge prior alleviates convergence problems by effectively increasing the sample size by dfp
and decreasing correlations between the variables. However, it also adds bias to the
parameter estimates and imputed values. To minimise bias, it is recommended that
dfp be as small as possible and this is determined on a case-by-case basis [15, p.258].
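The construction of Λt in (3.16) can be sketched as follows (an illustrative plain-Python function, representing matrices as nested lists; the function name is mine, not from the thesis):

```python
def ridge_prior_lambda(cov, df_p):
    """Build the ridge prior's sum of squares matrix (3.16): keep the
    variance (diagonal) elements of the covariance matrix, set the
    covariance (off-diagonal) elements to zero -- i.e. smooth all
    correlations towards zero -- and scale by the degrees of freedom df_p."""
    p = len(cov)
    return [[df_p * cov[i][j] if i == j else 0.0 for j in range(p)]
            for i in range(p)]
```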
3.4.6 Obtaining the m imputed data sets The aim of the imputation
phase is to generate m imputed data sets that represent independent, random draws
from the distribution of missing values. Once convergence of the MCMC sequence
is achieved, the imputed data sets are drawn from the sequence of imputed values
in the data augmentation chain(s). Two methods are currently used.
The first is sequential data augmentation in which the m imputed data sets are
drawn from the imputed values at iterations b, b+ k, b+ 2k, . . . , b+ (m− 1)k, where
b is the burn-in period and k is the number of iterations required to achieve serial
independence.
The second method is parallel data augmentation. This method generates m
data augmentation chains and draws each imputed data set from the last iteration
in each chain. The number of iterations is determined from the greater of b and
k. Of the two methods, sequential data augmentation is easier to implement and
is used in statistical software packages such as Stata [54]. Provided there are no
problems with convergence, the two methods are likely to produce similar results
[15, p.212].
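The two draw schedules can be sketched as follows (illustrative Python; function names are mine):

```python
def sequential_draws(b, k, m):
    """Iterations of a single chain from which the m imputed data sets
    are taken: b, b + k, b + 2k, ..., b + (m - 1)k."""
    return [b + j * k for j in range(m)]

def parallel_chain_length(b, k):
    """Under parallel data augmentation each of the m chains runs for
    the greater of b and k iterations; the final iteration is used."""
    return max(b, k)
```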
3.5 Comparison with ML estimation
If the sample size and number of imputations are large, the comparability of MI
and ML estimation depends on the variables that are included in the imputation and
analysis models as well as the relative complexity of the models [48]. The imputation
and analysis models are said to be congenial if they estimate the same number of
parameters and use the same variables [36]. If the imputation and analysis models
are congenial then MI and ML estimation will produce similar parameter estimates
and standard errors [48].
If the imputation and analysis models are uncongenial [36] but use the same set
of variables, then the parameter estimates produced by MI and ML estimation will
be similar, although standard errors under MI may be slightly higher [48]. However,
if the imputation model includes auxiliary variables that are not part of the analysis
model then MI and ML estimation will produce different results.
Schafer [47, p.7] notes that for smaller samples, MI may be better at identifying
certain features in the data set, such as skewness and multiple modes. This is
because it approximates the observed data posterior density by a finite mixture of
normal densities as opposed to a single normal density.
3.6 Specifying the imputation model
The imputation phase ‘fills in’ the missing values so that the data can be anal-
ysed using standard statistical methods. The imputation model should therefore
include features of the data that are of interest to the substantive analysis, such
as interactions between the variables [47]. In general, the imputation model should
include a larger set of variables than the substantive analysis model. Rubin [45]
recommends including as many variables as possible in the imputation model. How-
ever, including too many variables can lead to estimation problems. In particular,
the number of variables should not exceed the number of observations [15, p.201].
In general, the imputation model should include variables that predict the in-
complete variable(s) and/or predict the probability of missingness. White et al. [59]
note that including predictors of the incomplete variables improves the quality of
the imputations and reduces standard errors in addition to making the MAR as-
sumption more plausible. Spratt et al. [52] found that including variables related
to the variable with the most missing data had the greatest effect on estimates and
standard errors, while variables related only to the probability of missingness had
the smallest effect. Thus the most useful auxiliary variables are those that are highly
correlated with the incomplete variables (|r| > 0.40) [15, p.133].
To avoid bias, all the variables in the substantive analysis model must be included
in the imputation model [46, p.140]. In particular, when imputing missing values
for covariates, the outcome must be included in the imputation model, as otherwise
the resulting regression coefficients will be biased towards zero [39].
When specifying the imputation model, it is important to address skewness in
continuous variables. A simulation study by Lee & Carlin [27] concluded that ig-
noring skewness in continuous variables led to large biases for the corresponding
regression parameter estimates. One approach for dealing with skewness is using a
log transformation [27]. An alternative approach involves using a log transformation
with an offset chosen such that the observed values of the transformed variable have
zero skewness. This is referred to as the “log-skew()” transformation [27].
3.7 Number of imputations
An important issue in multiple imputation is determining the number m of im-
putations to perform. MI standard errors decrease as the number of imputations
increases — an infinite number of imputations produces the lowest possible standard
error [15, p.212]. The relative efficiency (RE) is the variance of an estimate based on
an infinite number of imputations divided by the variance based on m imputations.
Rubin [44] showed that this is approximately
RE = (1 + FMI/m)−1, (3.18)
where FMI is the fraction of missing information (3.6). For example, if FMI = 0.3,
the standard error of an estimate with m = 3 imputations is √(1 + 0.3/3) = 1.0488
times as large as the standard error of an estimate with infinitely many imputations.
On that basis, early literature stated that a small number of imputations, such as 3 or
5, would be adequate for statistical efficiency [46, pp.106–107]. However, subsequent
research [52, 59] indicates that a greater number of imputations may be necessary.
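Equation (3.18) and the corresponding standard error inflation factor can be computed as follows (an illustrative Python sketch of the formula above; function names are mine):

```python
def relative_efficiency(fmi, m):
    """RE = (1 + FMI/m)^(-1), equation (3.18)."""
    return 1.0 / (1.0 + fmi / m)

def se_inflation(fmi, m):
    """Factor by which the SE from m imputations exceeds the SE from
    infinitely many imputations: sqrt(1 + FMI/m)."""
    return (1.0 + fmi / m) ** 0.5
```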
Simulation studies by Spratt et al. [52] showed that when only 5 or 10 imputations
were performed, variability due to imputation was large enough to affect
statistical inference. They recommended that at least 25 imputations be performed
to reduce the effect of random sampling from multiple imputation. White et al.
[59] suggest a rule of thumb that m should be at least equal to the percentage of
incomplete cases to ensure an adequate level of reproducibility. However, they state
that this rule may not be universally appropriate.
Graham et al. [20] concluded that the number of imputations has a greater effect
on statistical power than on relative efficiency. Performing more than 10 imputations
improved statistical power and 20 imputations were comparable to ML estimation
in terms of statistical power. Performing more than 20 imputations only improved
power if the FMI was very high. On that basis, 20 imputations may be regarded as
sufficient for most purposes. However, Enders [15, p.214] notes that it is possible to
use larger values of m while adding little to total processing time.
CHAPTER 4
Exploratory Data Analysis
4.1 Introduction
The aim of this study is to develop new methods for rounding categorical vari-
ables under MVNI and compare their performance with existing methods. To com-
pare the methods, we performed large scale simulation studies in Stata [54] with
missingness imposed on an otherwise complete data set. In Section 4.2, we describe
the data set used in this study and provide summary statistics for each variable.
Section 4.3 explores the relationship between the outcome variable and the other
variables in the data set.
4.2 The NHANESIII data set
The data set used in this study was derived from the National Health and Nutrition
Examination Survey (NHANESIII) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. This was the
third in a series of surveys designed by the NCHS to collect health and nutrition
data on the population of the United States. Data were collected from physical
examinations and clinical and laboratory tests. For the purposes of this study, we
considered only adults aged 20 years or older comprising 17030 observations and 16
variables. From this, 67 subjects with incomplete records were deleted resulting in a
data set of 16963 subjects aged 20 years or older, with complete data on the variables
age, sex, race, height, weight, smoking category and high blood pressure. Summary
statistics for this data set are contained in Table 4.1. The outcome variable in our
study is high blood pressure, defined by an average systolic blood pressure of more
than 140 mmHg.
Table 4.1: Summary statistics for the full data set (n = 16963).

Variable   Description                Range          LQ      Median   UQ      Mean    Std Dev
age        age (years)                20.0–90.0      32.0    45.0     65.0    48.8    19.694
weight     body weight (kg)           21.8–241.3     62.3    72.8     84.6    74.8    17.946
height     standing height (cm)       118.6–206.5    159.0   166.1    173.2   166.2   9.934
BMI        Body Mass Index (kg/m2)    11.7–79.4      23.0    26.1     29.9    27.0    5.833

Variable   Description            Category         Proportion
sex        gender                 1 = male         0.4674
                                  0 = female       0.5326
race       race                   1 = Caucasian    0.6821
                                  0 = other        0.3179
smoke      smoking status         1 = never        0.4933
                                  2 = former       0.2497
                                  3 = current      0.2570
hbp        high blood pressure    1 = yes          0.2049
                                  0 = no           0.7951
Age
The age of the subjects in the data set is a continuous variable ranging from 20
to 90 years. The histogram in Figure 4.1a shows a right skewed and multimodal
distribution. The boxplot in Figure 4.1b also indicates right skewness. Note that
the data are left truncated since all ages are 20 or more.
Weight
The continuous variable weight represents the body weight (in kilograms) of subjects
in the data set. The weight of subjects ranges from 21.8 kg (for an 80 year old
female) to 241.3 kg (for a 33 year old male). The histogram in Figure 4.2a looks
fairly symmetric but with a slightly longer right tail. The boxplot in Figure 4.2b
shows many large values in the right tail indicating right skewness. The single low
value of 21.8kg is also visible in the boxplot.
Height
The continuous variable height represents the standing height (in centimetres) of
subjects in the data set. The height of subjects ranges from 118.6 cm (for the 80
year old female with the lowest body weight) to 206.5 cm (for a 30 year old male).
The histogram in Figure 4.3a and boxplot in Figure 4.3b show that the distribution
of height is symmetric.
Body Mass Index (BMI)
We created a new variable BMI representing Body Mass Index (BMI) as follows:
BMI = weight (kg) / (height (m))².
The BMI of subjects ranges from 11.7 to 79.4, with a mean of 27.0 and a median
of 26.1. The histogram in Figure 4.4a is skewed to the right, while the boxplot in
Figure 4.4b has many large values in the right tail.
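The calculation can be sketched as follows, assuming height is converted from the centimetres recorded in Table 4.1 to metres (an illustrative Python function, not part of the thesis's Stata code):

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kilograms divided by the square of
    height in metres (heights in Table 4.1 are recorded in centimetres)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2
```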
Figure 4.1a: Histogram of the variable age.
Figure 4.1b: Boxplot of the variable age.
Figure 4.2a: Histogram of the variable weight (in kilograms).
Figure 4.2b: Boxplot of the variable weight (in kilograms).
Figure 4.3a: Histogram of the variable height (in cm).
Figure 4.3b: Boxplot of the variable height (in cm).
Figure 4.4a: Histogram of BMI.
Figure 4.4b: Boxplot of BMI.
Sex
The binary variable sex indicates the gender of subjects in the data set (sex=1 if
the subject is male and 0 if the subject is female).
Race
The binary variable race indicates the race of subjects in the data set (race=1 if a
subject is Caucasian and 0 otherwise).
Smoke
The variable smoke is a nominal variable that describes the smoking status of sub-
jects in the data set and consists of three categories as follows.
1. smoke=1 (never) if the subject did not smoke more than 100 cigarettes in their
lifetime,
2. smoke=2 (former) if the subject smoked more than 100 cigarettes in their
lifetime but does not currently smoke,
3. smoke=3 (current) if the subject smoked more than 100 cigarettes in their
lifetime and currently smokes.
High blood pressure
The binary variable hbp indicates the blood pressure status of subjects in the data set
(hbp=1 if the subject has high blood pressure and 0 otherwise). This is the outcome
variable of interest in our study. Note that high blood pressure for an individual
was defined as an average systolic blood pressure greater than 140 mmHg.
4.3 Relationship between high blood pressure and other variables
High blood pressure and BMI
Subjects with high blood pressure have a slightly higher median BMI. However,
subjects without high blood pressure have a higher range of BMI values, as shown
in Figure 4.5.
Figure 4.5: Boxplots of BMI by high blood pressure category.
High blood pressure and age
Subjects with high blood pressure tend to be substantially older, as shown in
Figure 4.6.
Figure 4.6: Boxplot of age by high blood pressure category.
Two way tables
The frequency of high blood pressure by sex, race and smoking category is shown
in Tables 4.2–4.4.
Table 4.2: High blood pressure by sex.
sex
Female Male Total
hbp No 7208 6279 13487
Yes 1826 1650 3476
Total 9034 7929 16963
Table 4.3: High blood pressure by race.
race
Caucasian Other Total
hbp No 9136 4351 13487
Yes 2435 1041 3476
Total 11571 5392 16963
Table 4.4: High blood pressure by smoking category.
smoke
Never Former Current Total
hbp No 6718 3083 3686 13487
Yes 1649 1153 674 3476
Total 8367 4236 4360 16963
CHAPTER 5
Rounding methods for binary variables
5.1 Introduction
Multivariate normal imputation (MVNI) [44] is a popular method of handling
missing data since it accommodates a general missing data pattern (a haphazard
pattern of missingness across variables). However, it presents a dilemma when im-
puting discrete variables, such as binary or categorical variables, which are clearly
not normally distributed. When imputing a binary outcome, the continuous impu-
tations must be rounded to either 0 or 1, as in a logistic regression analysis. When
imputing a binary covariate, it is possible to use the continuous unrounded imputa-
tions; however, this may result in implausible values, for example a value of −0.65
for a sex variable. Since rounding is not strictly necessary for binary covariates,
should the continuous imputations be rounded, and if so, which method should be
used?
Until recently, the advice has been to round the imputed values for a binary
variable to the nearer of 0 or 1, essentially using a fixed threshold of 0.5. This
method is known as simple rounding or crude rounding [46]. Previous studies [3, 24]
compared unrounded MVNI with simple rounding and concluded that rounding
produced biased parameter estimates. However, other rounding methods have not
been evaluated in comparison with unrounded MVNI.
In contrast to simple rounding, adaptive rounding [6] does not use a fixed thresh-
old, but applies a rounding threshold to each imputed data set based on the normal
approximation to the binomial distribution. Another rounding method, known as
coin flipping [6], is based on a Bernoulli draw where the imputed value represents
the probability of a 1 (imputed values less than 0 or greater than 1 are rounded
to the nearer of 0 or 1). Simulation studies by Bernaards et al. [6] suggest that
adaptive rounding is superior to both simple rounding and coin flipping.
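Coin flipping as described above can be sketched as follows (an illustrative Python function; the clipping of out-of-range values follows the description in Bernaards et al. [6]):

```python
import random

def coin_flip_round(imputed, rng=random):
    """Coin flipping: treat the imputed value as the probability of a one
    and take a Bernoulli draw; imputed values below 0 or above 1 are
    first rounded to the nearer of 0 or 1."""
    p = min(max(imputed, 0.0), 1.0)
    return 1 if rng.random() < p else 0
```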
Yucel et al. [60] proposed a two-stage rounding method, known as calibration,
which uses a subset of the imputed values to determine a rounding threshold that
reproduces the proportions of zeros and ones in the observed data. Under an MCAR
mechanism, by construction calibration produces unbiased estimates of means. How-
ever, under an MAR mechanism, “relationships of imputed values to other variables
are biased, engendering biases in means” [60, p.128]. Their simulations suggested
that with modest amounts of missing data these biases are likely to be tolerable.
When there is a large amount of missing data, there are more imputed values to
be rounded and hence more potential for bias. Although calibration is intuitively
appealing, the two stage process is time consuming to implement, particularly for
large data sets. To the best of our knowledge, there have been no studies to date
comparing calibration with adaptive rounding.
Demirtas [11] compared simple rounding and adaptive rounding with regression-
based rounding methods that incorporate information from other variables in the
data set. According to Demirtas [11, p.677], “a good rule should be driven by
borrowing information from other variables in the system rather than relying on
the marginal characteristics”. However, Lee et al. [29] note that regression-based
rounding is not a general approach, since the analyst must determine the variables to
be included in the regression model on a case-by-case basis. They state that a good
rounding method should preserve associations in the data as well as the marginal
distribution of the categorical variable.
We introduce our new method, which we call proportional rounding, where the
imputed values are rounded so that the overall proportions of zeros and ones match
those observed in the complete cases. Unlike regression-based methods, proportional
rounding is a general approach. Similarly to calibration, it preserves the marginal
proportions in the observed data and will therefore produce unbiased estimates of
marginal proportions if the complete cases are available at random (AAR). This is
because if AAR holds, the observed data represents a simple random sample of the
full data set [16]. As discussed in Chapter 1, AAR is a weaker assumption than
MCAR and can apply to an MCAR, MAR or MNAR mechanism. The advantage
of proportional rounding over calibration is that duplication of the data set is not
required and imputation is performed only once, making implementation faster and
easier.
According to Lee & Carlin [28], MVNI is more likely to be of benefit when
missingness is in a confounding variable than when missingness is in a covariate
of interest. For this reason we will impose missingness on a binary confounding
variable and examine the effect on the covariate of interest.
In this chapter, we compare the performance of unrounded MVNI with simple
rounding, adaptive rounding, calibration and proportional rounding using a simula-
tion study. Simulations are performed for three missing data mechanisms and five
different sample sizes in the context of a logistic regression analysis with substantial
missingness in a binary confounding variable.
The outline of this chapter is as follows. In Section 5.2 we provide a descrip-
tion of existing rounding methods for binary variables under MVNI. In Section 5.3
we introduce proportional rounding, our new method. The data set used in this
study and the substantive analysis model are described in Section 5.4. Section 5.5
describes the method, and includes the missingness models and evaluation criteria.
In Section 5.6 we summarise the results, followed by a discussion in Section 5.7.
5.2 Rounding methods
5.2.1 Simple Rounding Imputed values are rounded to 0 if they are less
than 0.5; otherwise they are rounded to 1 [46, p.148]. The disadvantage of simple
rounding is that it uses a fixed threshold that does not take into account the marginal
distribution of the binary variable.
5.2.2 Adaptive rounding Introduced by Bernaards et al. [6], this method
uses the normal approximation to the binomial distribution to calculate a rounding
threshold for each imputed data set. If ωj is the mean of the (unrounded) imputed
binary variable for imputed data set j = 1, . . . ,m, then the corresponding rounding
threshold cj is given by
cj = ωj − Φ−1(ωj) √(ωj(1 − ωj)), (5.1)
where Φ−1 is the inverse of the standard normal cumulative distribution function. Imputed
values that exceed the threshold are rounded to one, while the rest are rounded to
zero. Note that a rounding threshold must be calculated for each imputed data set.
Figure 5.1 shows the adaptive rounding thresholds for different values of ω.
According to Bernaards et al. [6], when a category is relatively rare, the adaptive
rounding threshold would reflect greater variability in the imputations than simple
rounding. Note that even if ω < 0.5, the adaptive rounding threshold can be greater
than 0.5 (refer to Figure 5.1). Similarly, even if ω > 0.5, the threshold can be less
than 0.5. We note that since adaptive rounding is based on the mean of the imputed
binary variable ω, any bias in the imputation model will affect the calculation of the
rounding threshold.
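The threshold calculation in (5.1) can be sketched as follows (an illustrative Python sketch using the standard library's statistics.NormalDist for the inverse normal CDF; the thesis's implementation was in Stata):

```python
from statistics import NormalDist

def adaptive_threshold(imputed_values):
    """Adaptive rounding threshold (5.1) for one imputed data set:
    c_j = w - InvPhi(w) * sqrt(w * (1 - w)), where w is the mean of the
    unrounded imputed binary variable and InvPhi is the inverse of the
    standard normal cumulative distribution function."""
    w = sum(imputed_values) / len(imputed_values)
    return w - NormalDist().inv_cdf(w) * (w * (1.0 - w)) ** 0.5

def adaptive_round(imputed_values):
    """Values exceeding the threshold are rounded to one, the rest to zero."""
    c = adaptive_threshold(imputed_values)
    return [1 if v > c else 0 for v in imputed_values]
```

For a mean of 0.3 the threshold is about 0.540, illustrating that the threshold can exceed 0.5 even when ω < 0.5.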
5.2.3 Calibration This is a two-stage approach that applies the following
steps [60].
Stage 1
1. Create a copy of the data set and in this delete the observed values of the
incomplete binary variable. This leaves no observed values for the binary
variable in the duplicated data set.
2. Vertically ‘stack’ the original and the duplicated data sets to create a single
stacked data set.
3. Impute the missing values in the entire stacked data set to create m imputed
data sets.
The following steps are performed for each imputed data set.
4. Identify the subset of imputed values in the duplicated data set that correspond
to observed values in the original data set.
5. For this subset of imputed values, identify a rounding threshold that produces
the same proportion of zeros and ones as in the observed data.
Stage 2
1. Restore the original data set and impute the missing values to create m im-
puted data sets.
2. For each imputed data set, use the rounding threshold obtained in stage 1 to
round the imputed values for the binary variable.
Thus imputation is performed twice, first to determine the rounding threshold, and
second to impute the missing values in the original data set. Note that a rounding
threshold must be calculated for each imputed data set.
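The threshold search in steps 4 and 5 of stage 1 can be sketched as follows. This is an illustrative order-statistic rule that reproduces the observed proportions; the exact search used by Yucel et al. [60] may differ.

```python
def calibration_threshold(imputed_for_observed, observed_values):
    """Stage 1 (steps 4-5): choose a threshold so that rounding the
    imputed values that correspond to observed cases reproduces the
    observed proportions of zeros and ones."""
    prop_zeros = sum(1 for v in observed_values if v == 0) / len(observed_values)
    ordered = sorted(imputed_for_observed)
    n_zeros = round(prop_zeros * len(ordered))
    if n_zeros == 0:
        return float("-inf")  # everything rounds to one
    return ordered[n_zeros - 1]  # largest imputed value that rounds to zero

def threshold_round(values, c):
    """Stage 2: round to 0 at or below the threshold, else to 1."""
    return [0 if v <= c else 1 for v in values]
```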
Table 5.1 shows the original, duplicated and stacked data sets prior to imputation
for a variable with n = 5 consisting of 3 observed values (cases 1–3) and 2 missing
values (cases 4 & 5). The observed values are denoted by an asterisk (∗), while the
missing values are denoted by a dash (–). In the duplicated data, all 5 values are
designated as ‘missing’. The stacked data therefore has n = 10 with 3 observed
values and 7 missing values. When the stacked data set is imputed, all the missing
values will be replaced by imputed values. This means that cases 1–3 will also have
imputed values in the duplicated part of the stacked data set.
Re-imputation of the missing values in stage 2 is required for the following rea-
sons. Firstly, in stage 1, imputed values will be calculated for all subjects with
missing values in the stacked data set. This includes subjects that have observed
values for the incomplete binary variable in the original data set. This affects the
calculation of the sample mean and covariance matrix for the incomplete binary
variable in the posterior step (P-step) of MVNI, and subsequently, the calculation
of the imputed values in the imputation step (I-step). Secondly, the stacked data set
contains twice the number of observations as the original data set. Using a sample
size of 2n instead of n affects the posterior distributions for the mean vector and
covariance matrix in the P-step, and hence the calculation of the imputed values in
the I-step. It is therefore necessary to re-impute the missing values in stage 2 using
the original data set, as described above.
Note that the rounding thresholds calculated in stage 1 are based on the imputed
values obtained using the ‘stacked’ data set. These will be different to the imputed
values obtained in stage 2, for the reasons given above. The lack of ‘correspondence’
between the imputed values in stages 1 and 2 is another drawback of the calibration
method.
Table 5.1: The original, duplicated and stacked data sets for calibration prior to impu-
tation.
ID Original data Duplicated data Stacked data
1    ∗    –    ∗
2    ∗    –    ∗
3    ∗    –    ∗
4    –    –    –
5    –    –    –
1              –
2              –
3              –
4              –
5              –
Figure 5.1: Adaptive rounding thresholds for 0 < ω < 1.
5.3 Proportional rounding: a new rounding method
In our new method, the imputed values are rounded so that the overall proportion
of zeros and ones matches the observed proportions in the complete cases. The steps
in this method are as follows.
1. Determine the proportion p of ones and proportion 1−p of zeros in the complete
cases.
2. Calculate the required number of zeros, n0 = (1−p)×number of missing values.
Round this value to the nearest integer.
3. Impute the missing values to create m imputed data sets.
The following steps are performed for each imputed data set.
4. Sort the imputed values in ascending order.
5. Round the first n0 (sorted) imputed values to zero and the rest to one.
Note that there is no need to calculate any rounding thresholds. The only cal-
culation that is necessary is the required number of zeros and this will be the same
for each imputed data set. This makes proportional rounding considerably easier to
implement than adaptive rounding and calibration.
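The steps above can be sketched in Python as follows (illustrative only; the thesis's simulations were implemented in Stata):

```python
def proportional_round(imputed_values, observed_values):
    """Proportional rounding (steps 1-5): round so that the proportion
    of zeros among the imputed values matches the complete cases."""
    p = sum(1 for v in observed_values if v == 1) / len(observed_values)
    n0 = round((1 - p) * len(imputed_values))  # required number of zeros
    order = sorted(range(len(imputed_values)), key=lambda i: imputed_values[i])
    rounded = [1] * len(imputed_values)
    for i in order[:n0]:  # the n0 smallest imputed values become zeros
        rounded[i] = 0
    return rounded
```

Because n0 depends only on the complete cases, it is computed once and reused for every imputed data set.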
Proportional rounding assumes that the observed proportions reasonably ap-
proximate the true proportions in the data set; that is, the complete cases are AAR.
As discussed in Chapter 1, Galati & Seaton [16] demonstrated that AAR can hold
for an MCAR, MAR or MNAR mechanism provided that the probability of being
a complete case does not depend on the data. If AAR holds, the complete cases
constitute a simple random sample of the data set [16]. Thus proportional rounding
does not require an MCAR mechanism to produce unbiased estimates of marginal
proportions.
5.4 Substantive analysis model
The data set used in this study was derived from the National Health and Nutri-
tion Survey (NHANESIII) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4. The binary variable overweight was generated as
follows:
overweight = 1 if BMI > 25, and 0 otherwise.
Of the 16963 subjects, 59.23% are overweight. Thus the mean of the binary variable
overweight is 0.5923.
The substantive analysis is a logistic regression,
logit Pr(hbp) = β0 + β1 age + β2 smoke1 + β3 smoke2 + β4 overweight + β5 race + error, (5.2)
which calculates the log odds of high blood pressure based on a subject’s age (age),
smoking habits (smoke), overweight status (overweight) and race. Gender (sex )
was not a significant predictor of high blood pressure after adjusting for the other
covariates, so it was excluded from the model. Note that all the variables in this
regression model are categorical with the exception of the continuous predictor age.
Thus the variables in this study are not jointly multivariate normally distributed.
For illustration, the question we have chosen is: Are Caucasians more likely or
less likely to develop high blood pressure compared to other races, after adjusting for
all the other covariates? This can be answered by considering the coefficient β5
in (5.2). For the purposes of this analysis, missingness was imposed on the binary
covariate overweight, a confounding variable in this study.
5.5 Method
The rounding methods were compared for five data sets with sample sizes n=16963,
5000, 1000, 500 and 200, comprising the full data set and four subsamples. Each
subsample was obtained by drawing a simple random sample from the full dataset
of 16963 subjects described in Section 5.4. For each of the above data sets, the true
proportion p of overweight subjects was calculated and logistic regression was used
to obtain the true value of the race coefficient β5 and its standard error, shown in
Table 5.2 below.
Table 5.2: True values of race coefficient β5, its standard error and proportion p of
overweight subjects for each data set.
Data set β5 SE p
n = 16963   −0.4574   0.0490   0.5923
n = 5000    −0.3365   0.0910   0.5974
n = 1000    −0.7024   0.2025   0.5860
n = 500     −0.6715   0.2926   0.5880
n = 200     −1.3956   0.4694   0.5550
Missingness on the variable overweight was imposed for three different missing-
ness mechanisms: MCAR, MAR and MNAR, described in Subsection 5.5.1.
For each combination of five data sets, three missingness mechanisms and six
methods, we performed 1000 simulation replicates to produce a total of 90000 sim-
ulation runs. An overview of the simulations is provided in Figure 5.2.
5.5.1 Missingness models The model for each missingness mechanism is
described below.
MAR mechanism
Missingness was imposed on the binary variable overweight using a logistic regression
model, with the probability of missingness dependent on age, sex, race and hbp but
not overweight itself. All of these variables are observed variables in our analysis,
in accordance with an MAR mechanism.
The MAR missingness model was
logit Pr(overweight missing) = 2− 0.025× age− sex− race+ hbp. (5.3)
The model coefficients above were chosen to create a substantial association between
the variables and missingness as well as a reasonable amount of missingness [27].
According to the model, subjects who are young, female, non-Caucasian and have
high blood pressure have the highest probability of missingness on the variable
overweight. For example, the probability of missingness for a 40 year old Caucasian
female with high blood pressure is calculated as follows:
logit Pr(overweight missing) = 2 − 0.025 × 40 − 0 − 1 + 1 = 1,
Pr(overweight missing) = e^1 / (1 + e^1) = 0.731.
The missingness rates using this model ranged from 45% to 52%, so on average
around half of the observations were missing.
MNAR mechanism
Under an MNAR mechanism, missingness is dependent on the variable overweight.
The missingness model was
logit Pr(overweight missing) = −2 + 2.8624× overweight. (5.4)
Thus an overweight subject has a probability of missingness of 0.703, while a non-
overweight subject has a probability of missingness of only 0.119. The missingness
rates using this model ranged from 44% to 47%, so on average around half of the
observations were missing.
MCAR mechanism
The probability of missingness for the variable overweight was set to 48% for each
subject so that it was comparable to the average missingness rates for the MAR and
MNAR mechanisms above. Note that since there is only one incomplete variable
(overweight), AAR and MCAR are equivalent [16].
Imposing missingness
To determine whether an observation for overweight was to be declared missing, a
pseudo-random number between 0 and 1 was generated from a uniform distribution.
If the number was less than the probability of missingness, calculated as above for
each missingness model, the observation was declared missing.
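The procedure can be sketched as follows, using the MAR model (5.3) as an example (illustrative Python; the thesis used Stata's pseudo-random number generator):

```python
import math
import random

def mar_missing_prob(age, sex, race, hbp):
    """Probability of missingness on overweight under the MAR model (5.3):
    logit Pr(missing) = 2 - 0.025*age - sex - race + hbp."""
    logit = 2 - 0.025 * age - sex - race + hbp
    return math.exp(logit) / (1 + math.exp(logit))

def impose_missing(prob, rng=random):
    """Declare the observation missing if a Uniform(0,1) pseudo-random
    draw falls below its missingness probability."""
    return rng.random() < prob
```

For the worked example in the text (a 40 year old Caucasian female with high blood pressure), the probability of missingness is 0.731.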
5.5.2 Simulations The following steps were performed for each simulation
replicate i = 1, . . . , 1000 for each combination of sample size, missingness mechanism
and rounding method.
1. Impose missingness on the variable overweight as appropriate, depending on
the missingness mechanism.
2. Impute the missing overweight values using MVNI to create 30 imputed data
sets. The variables hbp, age, sex, race and smoke were included in the imputa-
tion model. These are all the variables in the substantive analysis model (5.2)
and/or variables that are predictors of missingness [15, p.201].
3. Round the imputed values to either 0 or 1 using the rounding method.
4. Use the command mi estimate to combine the imputed data sets using Ru-
bin’s rules [44] and obtain an estimated regression coefficient β5i, its corre-
sponding standard error SEi and estimated proportion pi of overweight sub-
jects.
The estimates from the 1000 simulation replicates were averaged to produce β5, its
standard error SE and estimated proportion of overweight subjects p.
It is important that the simulations are performed in a way that ensures com-
parability between the rounding methods. We set the random seed in Stata at the
Figure 5.2: Overview of simulations comparing methods for binary variables.
beginning of each set of 1000 simulation replicates to ensure that the pseudo-random
numbers generated for each sample size were the same for each method. This en-
sures that any differences in the results are due to the methods themselves and not
due to simulation error.
5.5.3 Evaluation criteria Using the notation in Subsection 5.5.2, the follow-
ing criteria were used to compare the methods.
1. Bias. For the parameter β5, this is defined as E(β5) − β5. In this study,
E(β5) is estimated by (1/1000) ∑_{i=1}^{1000} β5i, the mean race coefficient across the 1000
simulation replicates.

2. Standard error (SE). This is calculated as (1/1000) ∑_{i=1}^{1000} SEi, the average standard
error for the race coefficient over the 1000 simulation replicates [27].

3. Standard deviation (s) of β5 across the 1000 simulation replicates, defined as

s = √[(1/1000) ∑_{i=1}^{1000} (β5i − E(β5))²].

4. Root mean square error (RMSE). In this study, this is defined as

RMSE = √[(1/1000) ∑_{i=1}^{1000} (β5i − β5)²].

According to Demirtas [11, p.684], RMSE “. . . is arguably the best criterion
for evaluating (a parameter estimate) in terms of combined accuracy and precision”.
Note that for a parameter estimate, RMSE can be written as

RMSE = √(bias² + s²).

5. p̄ − p. This is the difference between the estimated proportion p̄ and the true
proportion p of overweight subjects, where p̄ = (1/1000) ∑_{i=1}^{1000} pi.
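The four criteria can be computed as follows (an illustrative Python sketch; the function name and the toy replicate values are ours, not thesis results):

```python
import math

def summarise(estimates, std_errors, beta_true):
    """Bias, average SE, empirical SD (s) and RMSE over simulation replicates."""
    n = len(estimates)
    mean_beta = sum(estimates) / n
    bias = mean_beta - beta_true
    avg_se = sum(std_errors) / n
    s = math.sqrt(sum((b - mean_beta) ** 2 for b in estimates) / n)
    rmse = math.sqrt(sum((b - beta_true) ** 2 for b in estimates) / n)
    return bias, avg_se, s, rmse

# Toy replicates (hypothetical numbers, not thesis results)
bias, avg_se, s, rmse = summarise([-0.45, -0.47, -0.44, -0.46],
                                  [0.05, 0.05, 0.06, 0.05], beta_true=-0.4574)
# The decomposition RMSE^2 = bias^2 + s^2 holds by construction
assert abs(rmse ** 2 - (bias ** 2 + s ** 2)) < 1e-12
```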
5.6 Results
The results are summarised in Tables 5.3–5.5. For all missingness mechanisms
and sample sizes, complete case analysis produced inflated standard errors and RM-
SEs compared to MVNI. Standard errors and RMSEs increased as the sample size
decreased. The differences between the methods were more pronounced for the
smaller sample sizes.
MCAR mechanism
Rounding resulted in a slightly lower RMSE than unrounded MVNI for all sample
sizes except n = 16963 (Table 5.3). Adaptive rounding, calibration and proportional
rounding produced the lowest RMSEs, except for the smallest sample size (n =
200). All three rounding methods, which use the marginal distribution of the binary
variable, produced very similar results in terms of bias and RMSE.
Complete case analysis and proportional rounding had the lowest values of p̄ − p
and therefore produced the best estimates of the proportion overweight for each
sample size (except n = 200). This was expected since both complete case analysis
and proportional rounding are based on the complete cases, which are a simple
random sample of the full data set under an MCAR missingness mechanism. As
noted previously, since there is only one incomplete variable, MCAR and AAR are
equivalent.
MAR mechanism
Rounding resulted in lower RMSEs compared to unrounded MVNI but only for
n ≤ 1000 (Table 5.4). No method was uniformly superior in terms of bias and
RMSE. Adaptive rounding, calibration and proportional rounding produced very
similar results.
For sample sizes n ≥ 5000, complete case analysis and proportional rounding
had the lowest values of p̄ − p and thus produced the best estimates of proportions.
For sample sizes n ≤ 1000, no method was uniformly superior for estimating propor-
tions but adaptive rounding produced better estimates than proportional rounding.
This is because adaptive rounding uses the mean of the imputed binary variable
(observed and imputed values) to calculate the rounding threshold, whereas propor-
tional rounding is based on the proportions observed in the complete cases.
Under the MAR missingness model in (5.3), the complete cases are not AAR so
proportional rounding was expected to exhibit some bias in estimating proportions.
However, this was limited to the smaller sample sizes (n ≤ 1000).
MNAR mechanism
Rounding resulted in a slightly lower RMSE than unrounded MVNI for all sample
sizes except n = 16963 (Table 5.5). However, no method was uniformly superior in
terms of bias and RMSE.
All of the methods substantially underestimated the proportion of overweight
subjects. This is because under the MNAR missingness model in (5.4) the complete
cases are much less likely to be overweight. Not surprisingly, complete case analysis
and proportional rounding produced the worst estimates of proportions since they
are based on the proportions in the complete cases.
5.7 Discussion
This study highlights the advantages of MVNI over complete case analysis when
there is substantial missingness in a binary confounding variable. The results show
that there are clear benefits to using a rounding method in conjunction with MVNI
when imputing a binary variable.
Adaptive rounding, proportional rounding and calibration produced similar re-
sults and performed slightly better than simple rounding. This is because they
utilise the marginal distribution of the binary variable, in contrast to simple round-
ing which uses a fixed rounding threshold. For an MNAR mechanism, no method
was uniformly superior but complete case analysis was the worst-performing method
in terms of bias, RMSE and estimates of proportion. Thus MVNI has considerable
advantages over complete case analysis even when the data are MNAR.
Although complete case analysis and proportional rounding produced identical
estimates of the proportion overweight, proportional rounding produced substan-
tially lower standard errors, biases and RMSEs due to the recovery of the missing
cases under MVNI. As a rounding method, proportional rounding is very straight-
forward to implement and has intuitive appeal. In contrast to calibration, there is no
need to duplicate the data set or perform two sets of imputations. There is also no
need to calculate any rounding thresholds. For the full data set with 16963 subjects
and 48% MCAR missingness, proportional rounding took, on average, one third of
the time to implement compared to calibration. This gives proportional rounding a
significant advantage over calibration in terms of computational efficiency. In con-
trast to simple rounding, proportional rounding uses the marginal distribution of
the binary variable.
To the best of our knowledge, this is the first study to compare adaptive round-
ing with the calibration method. The results of our simulation study show that
the performance of these two methods is very similar in terms of bias, RMSE and
estimates of proportions.
In this study, we imputed missing data in a single binary variable using MVNI.
Another multiple imputation method that could have been used is fully conditional
specification (FCS) [40], described in Chapter 2. In the case of a missing binary
variable, a logistic regression imputation model would be used. Lee & Carlin [27]
concluded that FCS and MVNI produced similar results and that MVNI performed
as well as FCS when imputing binary variables. An advantage of FCS is that a
separate regression imputation model can be specified for each incomplete variable.
However, this may result in inconsistencies between imputation models. An advan-
tage of MVNI over FCS is that it is easier to assess convergence [15, p.276].
In summary, adaptive rounding, proportional rounding and calibration produced
similar results and performed slightly better than simple rounding, particularly when
estimating proportions. However, proportional rounding was the fastest and sim-
plest method to implement as well as having intuitive appeal.
Table 5.3: Comparison of rounding methods for binary variables under MCAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0679   -0.0012   0.0477   0.0477    0.0000
Unrounded MVNI           0.0491    0.0005   0.0035   0.0035    0.0000
Simple rounding          0.0491    0.0005   0.0035   0.0035    0.0000
Adaptive rounding        0.0490   -0.0017   0.0034   0.0030   -0.0001
Calibration              0.0490   -0.0018   0.0034   0.0030    0.0036
Proportional rounding    0.0490   -0.0017   0.0034   0.0030    0.0000

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1264   -0.0018   0.0912   0.0912   -0.0002
Unrounded MVNI           0.0913    0.0011   0.0080   0.0080   -0.0002
Simple rounding          0.0912   -0.0021   0.0074   0.0071   -0.0093
Adaptive rounding        0.0911   -0.0020   0.0068   0.0065   -0.0004
Calibration              0.0911   -0.0020   0.0068   0.0065   -0.0005
Proportional rounding    0.0911   -0.0020   0.0068   0.0065   -0.0002

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.2840   -0.0128   0.2020   0.2016   -0.0001
Unrounded MVNI           0.2042    0.0109   0.0264   0.0240    0.0003
Simple rounding          0.2031    0.0049   0.0199   0.0193   -0.0083
Adaptive rounding        0.2032    0.0042   0.0199   0.0194   -0.0005
Calibration              0.2031    0.0039   0.0197   0.0193    0.0137
Proportional rounding    0.2032    0.0042   0.0198   0.0193   -0.0001

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4153   -0.0022   0.2872   0.2871   -0.0002
Unrounded MVNI           0.2970    0.0156   0.0433   0.0404   -0.0002
Simple rounding          0.2954    0.0057   0.0342   0.0337   -0.0097
Adaptive rounding        0.2947    0.0052   0.0320   0.0315   -0.0019
Calibration              0.2947    0.0049   0.0318   0.0315   -0.0011
Proportional rounding    0.2947    0.0050   0.0319   0.0315   -0.0002

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.7077   -0.1267   0.6421   0.6295    0.0020
Unrounded MVNI           0.4851    0.0018   0.1010   0.1010    0.0025
Simple rounding          0.4738    0.0105   0.0668   0.0659   -0.0054
Adaptive rounding        0.4780    0.0055   0.0806   0.0804   -0.0006
Calibration              0.4774    0.0031   0.0781   0.0780    0.0393
Proportional rounding    0.4778    0.0055   0.0799   0.0797    0.0020

*This is the standard deviation of β5 across the 1000 simulation replicates.
Table 5.4: Comparison of rounding methods for binary variables under MAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0792    0.2572   0.2642   0.0604    0.0001
Unrounded MVNI           0.0490   -0.0096   0.0102   0.0034   -0.0015
Simple rounding          0.0490   -0.0117   0.0121   0.0027   -0.0088
Adaptive rounding        0.0490   -0.0104   0.0108   0.0028   -0.0013
Calibration              0.0490   -0.0099   0.0102   0.0028    0.0024
Proportional rounding    0.0490   -0.0102   0.0106   0.0027    0.0001

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1489    0.3001   0.3230   0.1196   -0.0001
Unrounded MVNI           0.0914   -0.0113   0.0141   0.0084   -0.0012
Simple rounding          0.0912   -0.0150   0.0165   0.0068   -0.0082
Adaptive rounding        0.0912   -0.0127   0.0144   0.0068   -0.0012
Calibration              0.0912   -0.0127   0.0143   0.0067   -0.0009
Proportional rounding    0.0911   -0.0126   0.0142   0.0066   -0.0001

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.3318    0.2976   0.3951   0.2598   -0.0120
Unrounded MVNI           0.2037   -0.0059   0.0216   0.0208   -0.0047
Simple rounding          0.2025   -0.0110   0.0191   0.0156   -0.0152
Adaptive rounding        0.2027   -0.0103   0.0195   0.0166   -0.0065
Calibration              0.2029   -0.0084   0.0188   0.0168    0.0014
Proportional rounding    0.2026   -0.0121   0.0199   0.0158   -0.0120

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4960    0.3782   0.5326   0.3751   -0.0121
Unrounded MVNI           0.2973    0.0122   0.0411   0.0393   -0.0023
Simple rounding          0.2945   -0.0030   0.0293   0.0291   -0.0082
Adaptive rounding        0.2943   -0.0009   0.0287   0.0287   -0.0054
Calibration              0.2943   -0.0022   0.0276   0.0275   -0.0096
Proportional rounding    0.2940   -0.0037   0.0271   0.0269   -0.0121

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.9409    0.1045   0.9921   0.9866    0.0050
Unrounded MVNI           0.5024    0.0238   0.1327   0.1305    0.0068
Simple rounding          0.4768    0.0236   0.0733   0.0694   -0.0133
Adaptive rounding        0.4840    0.0199   0.0932   0.0911    0.0026
Calibration              0.4838    0.0392   0.0940   0.0855    0.0524
Proportional rounding    0.4823    0.0161   0.0870   0.0855    0.0050

*This is the standard deviation of β5 across the 1000 simulation replicates.
Table 5.5: Comparison of rounding methods for binary variables under MNAR.
Method                   SE       Bias      RMSE     s*        p̄ − p

n = 16963: β5 = −0.4574 (SE 0.0490)
Complete case analysis   0.0708   -0.0558   0.0702   0.0425   -0.2635
Unrounded MVNI           0.0491    0.0003   0.0038   0.0037   -0.2593
Simple rounding          0.0491   -0.0016   0.0037   0.0033   -0.2468
Adaptive rounding        0.0490   -0.0021   0.0037   0.0031   -0.2600
Calibration              0.0490   -0.0021   0.0037   0.0031   -0.2596
Proportional rounding    0.0490   -0.0022   0.0038   0.0031   -0.2635

n = 5000: β5 = −0.3365 (SE 0.0910)
Complete case analysis   0.1323   -0.0585   0.1024   0.0840   -0.2643
Unrounded MVNI           0.0914    0.0005   0.0082   0.0082   -0.2590
Simple rounding          0.0913   -0.0021   0.0076   0.0073   -0.2469
Adaptive rounding        0.0912   -0.0025   0.0071   0.0066   -0.2597
Calibration              0.0912   -0.0026   0.0071   0.0066   -0.2623
Proportional rounding    0.0912   -0.0027   0.0071   0.0066   -0.2643

n = 1000: β5 = −0.7024 (SE 0.2025)
Complete case analysis   0.2940   -0.0304   0.1874   0.1849   -0.2628
Unrounded MVNI           0.2042    0.0107   0.0272   0.0250   -0.2600
Simple rounding          0.2033    0.0060   0.0208   0.0199   -0.2460
Adaptive rounding        0.2033    0.0048   0.0210   0.0204   -0.2592
Calibration              0.2034    0.0054   0.0211   0.0204   -0.2486
Proportional rounding    0.2033    0.0045   0.0210   0.0205   -0.2628

n = 500: β5 = −0.6715 (SE 0.2926)
Complete case analysis   0.4405   -0.1460   0.3035   0.2660   -0.2634
Unrounded MVNI           0.2975    0.0093   0.0413   0.0403   -0.2583
Simple rounding          0.2965    0.0040   0.0348   0.0346   -0.2448
Adaptive rounding        0.2959    0.0023   0.0329   0.0328   -0.2570
Calibration              0.2958    0.0018   0.0327   0.0326   -0.2610
Proportional rounding    0.2958    0.0015   0.0328   0.0328   -0.2634

n = 200: β5 = −1.3956 (SE 0.4694)
Complete case analysis   0.6761    0.0412   0.4672   0.4654   -0.2596
Unrounded MVNI           0.4804    0.0068   0.0856   0.0853   -0.2366
Simple rounding          0.4723    0.0169   0.0604   0.0580   -0.2263
Adaptive rounding        0.4752    0.0127   0.0700   0.0688   -0.2361
Calibration              0.4761    0.0106   0.0689   0.0681   -0.2122
Proportional rounding    0.4738    0.0149   0.0693   0.0677   -0.2596

*This is the standard deviation of β5 across the 1000 simulation replicates.
CHAPTER 6
Rounding methods for ordinal variables
6.1 Introduction
The previous chapter considered rounding for binary variables under multivariate
normal imputation (MVNI). We now extend our approach to ordinal variables with
more than two categories. Under MVNI, an ordinal variable may be imputed as
either a single continuous variable or as a set of indicator variables. In either case,
the imputed values are then assigned or ‘rounded’ to one of the ordinal categories.
Note that it is not possible to use the unrounded imputed values if the substantive
analysis involves estimating the relationship between the levels of an ordinal variable
and an outcome [29]. For this reason, we will not consider unrounded MVNI in this
chapter.
Crude rounding [46], calibration [61] and mean indicator-based rounding (MIBR)
[29] impute an ordinal variable as a single continuous variable. We refer to these
methods as continuous methods. Schafer [46, p.148] recommends crude rounding
when the variable has similar proportions in each category and the percentage of
missing data is not very high. However, this method was shown to introduce bias into
the marginal distribution of the categorical variable [24, 61]. Calibration and MIBR
perform well in some settings but are two-stage methods that are computationally
intensive and time-consuming to implement, particularly for large data sets [29].
Both distance-based rounding (DBR) [12] and projected distance-based rounding
(PDBR) [2] impute an ordinal variable as a set of indicator variables. We refer to
these methods as indicator-based methods. Demirtas [12] showed that DBR was
slightly better than crude rounding at estimating the proportions in each category but
noted that it performs best when the number of categories is small.
Galati et al. [17] compared PDBR with DBR and crude rounding. They demon-
strated, both empirically and theoretically, that PDBR was superior to both DBR
and crude rounding. However, they noted that none of the three rounding methods
take into account the marginal distribution of the ordinal variable and all introduce
bias into the marginal distribution.
We extend our new method for rounding binary variables, proportional rounding,
to ordinal variables with more than two categories. Under proportional rounding,
an ordinal variable may be imputed as either a single continuous variable, a method
we call continuous proportional rounding (CPR), or as a set of indicator variables,
a method we call indicator-based proportional rounding (IBPR). IBPR can also be
used for nominal variables. By construction, proportional rounding preserves the
proportions in the observed data and should therefore produce unbiased estimates of
proportions when the complete cases are AAR. As noted previously, the advantage
of proportional rounding over calibration is that duplication of the data set is not
required. CPR and IBPR are one-stage methods so they require only one set of
imputations, in contrast to two-stage methods such as calibration and MIBR.
In addition, we introduce an alternative new method, which we call ordinal
rounding. This is a one-stage continuous method that is suitable for ordinal variables
only.
Ordinal variables may also be imputed using fully conditional specification (FCS)
[40], a method of multiple imputation using chained equations, described in Chap-
ter 2. Under FCS, an ordinal variable is imputed using ordinal logistic regression.
Note that each imputed value is one of the ordinal categories so no rounding is
necessary under FCS.
An advantage of FCS is that a separate regression imputation model may be
specified for each type of incomplete variable. However, this may result in incon-
sistencies between imputation models [27]. We compare the performance of the
MVNI-based rounding methods with FCS when there is an ordinal exposure and a
binary outcome for an MCAR and an MAR mechanism. To the best of our knowl-
edge, there have been no studies to date comparing MVNI-based rounding methods
with FCS in the context of estimating the relationship between the levels of an
ordinal exposure and an outcome.
The outline of this chapter is as follows. In Section 6.2 we describe existing
indicator-based methods, followed by a comparison of DBR and PDBR in Sec-
tion 6.3. An overview of existing continuous methods is given in Section 6.4. In
Sections 6.5–6.7 we describe our new methods CPR, IBPR and ordinal rounding.
The data set used in this study and the substantive analysis model are described
in Section 6.8. Section 6.9 describes the method, including the missingness models
and evaluation criteria. In Section 6.10 we summarise the results, followed by a
discussion in Section 6.11.
6.2 Existing indicator-based methods
In general, an ordinal variable with k levels can be represented by k−1 indicator
(‘dummy’) variables, one for each ordinal level excluding the reference group. Each
indicator variable is a binary variable denoting membership of the corresponding
category. An observation that belongs to the reference group will have a value of
‘0’ for each of the indicator variables. Otherwise, an observation that belongs to
category j will have a value of ‘1’ for the indicator variable corresponding to category
j and a value of ‘0’ for each of the other indicator variables.
Indicator-based methods impute an incomplete ordinal variable as a set of k− 1
indicator variables. Thus after imputation, each missing observation will have a set
of k − 1 imputed values. The imputed value corresponding to the reference group
is calculated by subtracting the sum of the k − 1 imputed values from 1. Note that
since the imputed values are generated from a multivariate normal distribution, it
is possible to have imputed values that are less than 0 or greater than 1.
Example: Consider an ordinal variable weight with three categories: under-
weight, normal and overweight, with underweight as the reference group. This ordi-
nal variable has two indicator variables: one for normal and another for overweight.
Each observation may be represented as a vector (In, Iow), where In has a value of
1 if the subject is normal weight and 0 otherwise; similarly, Iow has a value of 1 if
the subject is overweight and 0 otherwise. If the imputed values for a missing ob-
servation are (0.3, 0.1) then the imputed value for the reference group underweight
is 1− (0.3 + 0.1) = 0.6.
6.2.1 Projected distance-based rounding Proposed by Allison [2], PDBR
assigns an incomplete observation to the category with the highest imputed value.
In the example above, the missing observation would be assigned to the underweight
category as this has the highest imputed value of 0.6. PDBR can be used to round
nominal (unordered) or ordinal variables. It is unclear how PDBR would assign an
observation in the event that two or more indicators have the same imputed value.
6.2.2 Distance-based rounding This method was proposed by Demirtas
[12] and can be used to round nominal or ordinal variables. In DBR, a missing
observation is assigned to the ordinal category with the smallest Euclidean distance
to its imputed values. Let w = (w1, . . . , wk−1) be the vector of imputed values for
an incomplete observation and let vj = (I1j, . . . , I(k−1)j) be the vector corresponding
to category j where
Iij = 1 if i = j, and Iij = 0 if i ≠ j.
The Euclidean distance dj from the set of imputed values to category j is

dj = √[∑_{i=1}^{k−1} (wi − Iij)²], j = 1, . . . , k. (6.1)
A three-level ordinal variable has three indicator vectors: v1 = (1, 0), v2 = (0, 1)
and the reference group v3 = (0, 0). In the example above, the vector of imputed
values is (0.3, 0.1) and the unit vectors representing normal and overweight are (1, 0)
and (0, 1) respectively. The Euclidean distance to each of the weight categories is
calculated as follows:
d_underweight = √((0.3 − 0)² + (0.1 − 0)²) = 0.32,
d_normal = √((0.3 − 1)² + (0.1 − 0)²) = 0.71,
d_overweight = √((0.3 − 0)² + (0.1 − 1)²) = 0.95.
The missing observation would be assigned to the category underweight since the
imputed values have the smallest Euclidean distance to this category. It is unclear
how DBR would assign an observation in the event that two or more indicators have
the same Euclidean distance.
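The worked example can be checked in code. The sketch below (Python, with our own function names) implements both assignment rules for a three-category variable, with indices 0, 1 and 2 standing for normal, overweight and the reference group underweight:

```python
import math

def pdbr(w):
    """PDBR: assign to the category with the highest imputed value,
    where the reference-group value 1 - sum(w) is appended last."""
    vals = list(w) + [1 - sum(w)]
    return max(range(len(vals)), key=lambda j: vals[j])

def dbr(w):
    """DBR: assign to the category whose indicator vector is closest
    (Euclidean distance) to the vector of imputed values."""
    k = len(w) + 1
    def indicator(j):
        return [1.0 if i == j else 0.0 for i in range(k - 1)]  # reference -> zeros
    return min(range(k), key=lambda j: math.dist(w, indicator(j)))

labels = ["normal", "overweight", "underweight"]  # reference group last
w = (0.3, 0.1)
print(labels[pdbr(w)], labels[dbr(w)])  # underweight underweight
```

Both rules assign the example observation to the reference group, matching the hand calculation.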
We show that for a binary variable, PDBR and DBR are equivalent to simple
(crude) rounding for w ≠ 0.5. Suppose that a binary variable has a missing obser-
vation with an imputed value of w. The Euclidean distance to 0 is √((w − 0)²) = |w|
and the Euclidean distance to 1 is √((w − 1)²) = |w − 1|. Under DBR, the imputed
value is rounded to 0 if |w| < |w − 1|, i.e. if w < 0.5, and rounded to 1 if
|w − 1| < |w|, i.e. if w > 0.5. This is equivalent to simple rounding for binary
variables (for w ≠ 0.5).
A similar argument can be made for PDBR. If a missing observation has an
imputed value of w, then the imputed value corresponding to the reference group
is 1 − w. Since PDBR rounds to the category with the highest imputed value, the
missing observation is rounded to 0 if 1 − w > w, i.e. if w < 0.5, and rounded to 1
if w > 1 − w, i.e. if w > 0.5. Thus PDBR is also equivalent to simple rounding for
binary variables (for w ≠ 0.5).
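The equivalence can be checked numerically (an illustrative Python sketch; function names are ours):

```python
def simple_round(w):
    """Simple rounding with a fixed threshold of 0.5."""
    return 1 if w > 0.5 else 0

def dbr_binary(w):
    """DBR for a binary variable: round to whichever of 0 or 1 is closer."""
    return 0 if abs(w) < abs(w - 1) else 1

def pdbr_binary(w):
    """PDBR for a binary variable: round to the larger of the imputed value w
    (category 1) and the derived reference value 1 - w (category 0)."""
    return 1 if w > 1 - w else 0

# Imputed values may fall outside [0, 1]; the tie at w = 0.5 is excluded
for w in (-0.2, 0.1, 0.49, 0.51, 0.9, 1.3):
    assert simple_round(w) == dbr_binary(w) == pdbr_binary(w)
```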
6.3 Comparison of DBR and PDBR
Galati et al. [17] compared DBR and PDBR from a theoretical standpoint. We
illustrate their argument in more detail with the following example. Suppose we
have a variable with k = 3 categories and w = (w1, w2) is the vector of k − 1 = 2
imputed values for an incomplete observation. Let category 1 be represented by
the unit vector (1,0), category 2 by the unit vector (0,1) and the reference group
(category 3) by the origin (0,0). Under DBR, the squared Euclidean distance from
w to the reference group denoted by the origin (0,0) is given by
d² = w1² + w2². (6.2)
The squared Euclidean distance to category 1 represented by the unit vector (1,0)
is given by
d² = (w1 − 1)² + w2² = −2w1 + 1 + w1² + w2². (6.3)
The squared Euclidean distance to category 2 represented by the unit vector (0,1)
is given by
d² = w1² + (w2 − 1)² = −2w2 + 1 + w1² + w2². (6.4)
Under DBR, an incomplete observation is assigned to the category 1 if the
squared Euclidean distance to (1,0) is less than each of the squared distances to
(0,0) and (0,1). Using (6.2-6.4) we have
−2w1 + 1 + w1² + w2² < w1² + w2²,
∴ w1 > 1/2, (6.5)
and
−2w1 + 1 + w1² + w2² < −2w2 + 1 + w1² + w2²,
∴ w1 > w2. (6.6)
Thus an incomplete observation will be assigned to category 1 if w1 > 0.5 and
w1 > w2. That is, it will be assigned to category 1 if w1 is the maximum of w1, w2, w3
where w3 = 1−w1 −w2 is the imputed value corresponding to the reference group.
Note that under PDBR, if w1 is the maximum of w1, w2, w3 then the incomplete
observation will also be assigned to category 1. Thus if DBR assigns an observation
to category 1 then PDBR will also assign the observation to category 1.
A similar argument can be made for assigning observations to category 2. Under
DBR, an incomplete observation is assigned to category 2 if the squared Euclidean
distance to (0,1) is less than each of the squared distances to (1,0) and (0,0). Using
(6.2-6.4) we have
−2w2 + 1 + w1² + w2² < w1² + w2²,
∴ w2 > 1/2, (6.7)
and
−2w2 + 1 + w1² + w2² < −2w1 + 1 + w1² + w2²,
∴ w2 > w1. (6.8)
Thus an incomplete observation will be assigned to category 2 if w2 is the maximum
of w1, w2, w3. On that basis, PDBR will also assign the observation to category 2.
Under DBR, an incomplete observation is assigned to the reference group when
the squared distance to (0,0) is less than each of the squared distances to (1,0) and
(0,1). Using (6.2-6.4) we have
w1² + w2² < −2w1 + 1 + w1² + w2²,
∴ w1 < 1/2, (6.9)
and
w1² + w2² < −2w2 + 1 + w1² + w2²,
∴ w2 < 1/2. (6.10)
Thus an incomplete observation is assigned to the reference group if w1 < 0.5 and
w2 < 0.5. However, this does not mean that PDBR will assign the observation to
the reference group since w3 is not necessarily the maximum of w1, w2, w3.
The above arguments can be extended to variables with k ≥ 3 categories and
may be summarised as follows [17]:
1. DBR and PDBR assign an observation to the same category if neither of them
assigns it to the reference group.
2. DBR and PDBR differ only with respect to rounding imputed values to the
reference group.
Galati et al. [17] also demonstrated that DBR biases the rounding of imputed
values towards the reference group and that the bias increases with the number of
categories k. In general, if there are k categories the average vector of imputed
values is (1/k, 1/k, . . . , 1/k). As k increases, the value of 1/k approaches 0, with the
result that observations are more likely to be assigned to the reference group [17].
For this reason, DBR performs best when the number of categories is small.
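Property 2 above can be verified numerically over a grid of imputed-value vectors. The sketch below (Python, with our own function names) asserts that whenever DBR and PDBR disagree, the reference group is involved:

```python
import itertools
import math

def pdbr(w):
    """PDBR: category with the highest imputed value (reference appended last)."""
    vals = list(w) + [1 - sum(w)]
    return max(range(len(vals)), key=lambda j: vals[j])

def dbr(w):
    """DBR: category whose indicator vector is closest to w (reference last)."""
    k = len(w) + 1
    def indicator(j):
        return [1.0 if i == j else 0.0 for i in range(k - 1)]  # reference -> zeros
    return min(range(k), key=lambda j: math.dist(w, indicator(j)))

REF = 2  # index of the reference group for a three-category variable
grid = [x / 20 for x in range(-4, 25)]  # includes values outside [0, 1]
for w in itertools.product(grid, repeat=2):
    a, b = dbr(w), pdbr(w)
    # if the two methods disagree, the reference group must be involved
    assert a == b or REF in (a, b)
```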
6.4 Existing continuous methods
These methods impute an incomplete ordinal variable as a single continuous
variable. Thus after imputation, each missing observation will have only one corre-
sponding imputed value. Note that continuous methods cannot be used for nominal
(unordered) categorical variables.
6.4.1 Crude rounding In crude rounding, the imputed values are rounded
to the nearest category [46, p.148]. Using the example in Section 6.2, if we denote
underweight by ‘0’, normal by ‘1’ and overweight by ‘2’, the imputed values would
be rounded as follows:

rounded value = 0 (underweight) if imputed value < 0.5,
rounded value = 1 (normal) if 0.5 ≤ imputed value < 1.5,
rounded value = 2 (overweight) if imputed value ≥ 1.5.
Note that crude rounding is a fixed threshold method that does not take into account
the marginal distribution of the ordinal variable.
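The rounding rule translates directly into code (a sketch; the function name is ours):

```python
def crude_round(imputed_value):
    """Crude rounding for the three-level weight example:
    0 = underweight, 1 = normal, 2 = overweight."""
    if imputed_value < 0.5:
        return 0
    elif imputed_value < 1.5:
        return 1
    return 2

print([crude_round(v) for v in (-0.3, 0.4, 0.8, 1.49, 1.5, 2.7)])
# [0, 0, 1, 1, 2, 2]
```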
6.4.2 Calibration The calibration method for rounding binary variables [60]
is readily extended to ordinal variables [61]. This two-stage approach applies the
following steps [61].
Stage 1
1. Create a copy of the data set and delete from it the observed values of the
incomplete ordinal variable. This leaves no observed values for the ordinal
variable in the duplicated data set.
2. Vertically ‘stack’ the original and the duplicated data sets to create a single
stacked data set.
3. Impute the ordinal variable in the stacked data set as a single continuous
variable.
The following steps are performed for each imputed data set.
4. Identify the subset of imputed values in the duplicated data set that correspond
to observed values in the original data set.
5. For this subset of imputed values, determine rounding thresholds that produce
the same proportions in each category as in the observed data.
Stage 2
1. Restore the original data set and impute the ordinal variable as a single con-
tinuous variable.
2. For each imputed data set, use the corresponding rounding thresholds obtained
in stage 1 to round the imputed values for the ordinal variable.
Note that rounding thresholds must be calculated for each imputed data set. The
disadvantages of calibration are that duplication of the data is required and imputa-
tion must be performed twice, making it time-consuming to implement. As noted in
Chapter 5, the rounding thresholds calculated in stage 1 are based on the imputed
values obtained using the ‘stacked’ data set. These will be different to the imputed
values obtained in stage 2. As stated previously, the lack of ‘correspondence’ be-
tween the imputed values in stages 1 and 2 is a further drawback of the calibration
method.
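One way the stage-1 threshold calculation might be implemented is via order statistics of the imputed values. The sketch below is our own illustration of the idea, not the exact procedure of Yucel et al.; all names and data are hypothetical:

```python
def calibration_thresholds(imputed, observed_props):
    """Stage-1 sketch: choose cut-points on the imputed continuous values
    so that rounding reproduces the observed category proportions.
    observed_props lists proportions for categories 0..k-1, lowest first."""
    ordered = sorted(imputed)
    n = len(ordered)
    thresholds, cum = [], 0.0
    for p in observed_props[:-1]:  # k - 1 cut-points for k categories
        cum += p
        thresholds.append(ordered[min(n - 1, round(cum * n))])
    return thresholds

def round_with_thresholds(value, thresholds):
    """Category = number of thresholds the value reaches."""
    return sum(value >= t for t in thresholds)

# Hypothetical stage-1 imputed values; observed proportions 0.5 / 0.3 / 0.2
imputed = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 1.4, 1.6]
cuts = calibration_thresholds(imputed, [0.5, 0.3, 0.2])
rounded = [round_with_thresholds(v, cuts) for v in imputed]
print(cuts, [rounded.count(c) for c in (0, 1, 2)])  # [0.6, 1.4] [5, 3, 2]
```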
Yucel et al. [61] presented an indicator-based approach to calibration for nominal
(unordered) variables. However, they did not specify how the method should be
implemented in order to prevent missing observations being unassigned or assigned
to more than one category.
6.4.3 Mean indicator-based rounding Lee et al. [29] proposed an alterna-
tive two-stage approach for rounding ordinal variables as follows. In the first stage,
k − 1 indicator variables are imputed using MVNI. The mean of each indicator vari-
able is then calculated for the entire imputed data set (consisting of the observed
and imputed values). The mean of the indicator variable for category j = 1, 2, . . . , k
represents an estimate of the proportion of observations in category j.
In the second stage, the original data set is restored and the ordinal variable is
imputed as a single continuous variable. The imputed values are rounded so that
the proportion of observations in category j is equal to the corresponding indicator
mean. The above steps are
performed for each imputed data set. Note that MIBR assumes that the imputation
model accurately estimates the first-order moments of the multivariate distribution
[29].
Lee et al. [29] showed that MIBR preserves the marginal distribution of the
ordinal variable. However, a disadvantage of the method is that imputations must
be performed twice.
6.5 Continuous proportional rounding
CPR is similar to proportional rounding for binary variables, except that there
are k > 2 categories. The preliminary step involves determining the proportion
p1, p2, . . . , pk of observations in each category for the complete cases, where k is the
highest ordinal category. The corresponding number n1, n2, . . . , nk of imputed values
required in each category is then calculated. For category j, the required number is
nj = pj × number of missing values, j = 1, 2, . . . , k,
where nj is rounded to the nearest integer. The following steps are then applied.
1. Impute the ordinal variable as a single continuous variable.
The following steps are performed for each imputed data set.
2. Sort the imputed values for the continuous variable in descending order.
3. Round the first nk imputed values to the highest ordinal category, the next
nk−1 imputed values to the second highest ordinal category and so on until
the last n1 imputed values have been rounded to the lowest ordinal category.
Note that there is no need to calculate any rounding thresholds. The only calcu-
lations that are necessary are the required number of ones in each category, which
will be the same for each imputed data set.
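The steps above can be sketched as follows (an illustrative Python version; the thesis implements CPR in Stata, and the function and variable names are ours):

```python
# Sketch of continuous proportional rounding (CPR) for one imputed data set.
def cpr_round(imputed, complete_cases, k):
    """imputed: continuous imputed values for the missing observations;
    complete_cases: observed ordinal values coded 1..k.
    Returns a rounded category for each imputed value."""
    m = len(imputed)
    # Required number of 'ones' per category from the observed proportions,
    # rounded to the nearest integer (if the n_j do not sum to m exactly,
    # a real implementation would adjust them; ignored in this sketch).
    n = {j: round(complete_cases.count(j) / len(complete_cases) * m)
         for j in range(1, k + 1)}
    # Sort the imputed values in descending order, remembering positions.
    order = sorted(range(m), key=lambda i: imputed[i], reverse=True)
    rounded = [None] * m
    pos = 0
    for j in range(k, 0, -1):       # the n_k largest go to the top category
        for i in order[pos:pos + n[j]]:
            rounded[i] = j
        pos += n[j]
    return rounded

cats = cpr_round([2.9, 0.2, 1.4, 2.1], [1, 2, 3, 3], 3)
```

With observed proportions 0.25/0.25/0.5, the two largest imputed values are rounded to category 3, the next largest to category 2 and the smallest to category 1, so `cats` is `[3, 1, 2, 3]`.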
In contrast to MIBR, estimates of proportions are based on the proportions ob-
served in the complete cases rather than the post-imputation indicator means. Thus
the estimates of the proportions in each category are unaffected by any bias in the
imputation model. However, proportional rounding assumes that the observed pro-
portions reasonably approximate the marginal distribution of the ordinal variable.
For an MCAR mechanism, there is little benefit to using MIBR over CPR since the
observed proportions represent unbiased estimates of proportions.
6.6 Indicator-based proportional rounding
In IBPR, the indicator variable corresponding to the category with the highest
proportion of observations is rounded first, followed by the indicator variables corre-
sponding to the other categories, in order of size. Rounding each indicator variable
in turn avoids the issue of missing observations being unassigned or assigned to more
than one category.
First, the proportion p1 ≤ p2 ≤ . . . ≤ pk of observations in each category is cal-
culated for the complete cases, where k is the category with the highest proportion
of observed values. The corresponding number n1 ≤ n2 ≤ . . . ≤ nk of ones required
in each category is then calculated. For category j, the required number of ones is
nj = pj × (number of missing values), j = 1, 2, . . . , k,
where nj is rounded to the nearest integer. The following steps are then applied.
1. Impute the ordinal variable as a set of k − 1 indicator variables.
2. For each missing observation, calculate the imputed value for the reference
category by subtracting the sum of the k − 1 imputed values from 1. There
are now k ‘filled in’ indicator variables.
The following steps are performed for each imputed data set.
3. Set j = k, the category with the highest proportion of observed values.
4. For the indicator variable corresponding to category j, sort the imputed values
in descending order.
5. Assign the first nj of these imputed values to category j. Thus the largest
nj imputed values for the indicator variable corresponding to category j are
assigned to category j.
6. If j > 1 decrement j to j − 1 (the category with the next highest proportion
of observations) and return to step 4.
Note that there is no need to calculate any rounding thresholds. The only calcu-
lations that are necessary are the required number of ones in each category, which
will be the same for each imputed data set.
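Steps 3–6 can be sketched as follows (illustrative Python, names ours; the required counts n_j come from the observed proportions as described above, and observations already assigned to a category are skipped so that none is assigned twice):

```python
# Sketch of the IBPR assignment loop for one imputed data set.
def ibpr_round(indicators, n_required):
    """indicators: dict j -> imputed indicator values for the missing
    observations (all k 'filled in' indicators, including the reference);
    n_required: dict j -> number of ones required in category j."""
    m = len(next(iter(indicators.values())))
    assigned = [None] * m
    # Categories in decreasing order of observed proportion (= required count).
    for j in sorted(n_required, key=n_required.get, reverse=True):
        # Not-yet-assigned observations, by imputed indicator value, descending.
        free = [i for i in range(m) if assigned[i] is None]
        free.sort(key=lambda i: indicators[j][i], reverse=True)
        for i in free[:n_required[j]]:
            assigned[i] = j
    return assigned

assigned = ibpr_round(
    {1: [0.1, 0.0, 0.2, 0.9],
     2: [0.2, 0.8, 0.5, 0.4],
     3: [0.9, 0.1, 0.7, 0.3]},
    {3: 2, 2: 1, 1: 1},
)
```

Category 3 (the highest observed proportion) takes the two observations with the largest category-3 indicator values, then categories 2 and 1 fill in from what remains, giving `[3, 2, 3, 1]`.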
DBR and PDBR deal with each missing case in isolation and do not take into ac-
count the marginal distribution of the ordinal variable. On the other hand, IBPR ex-
amines all of the imputed values for an indicator variable and preserves the observed
proportions in the data. The observed proportions represent unbiased estimates of
the marginal proportions when the complete cases are AAR.
Since the ordering of the values of the categorical variable has not been used,
IBPR can also be employed for nominal variables.
6.7 Ordinal rounding
We introduce another new approach, which we call ordinal rounding. This
method may be used to round ordinal variables but is not suitable for nominal
variables.
For an ordinal variable X considered as a continuous variable, let x̄ be the mean and s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² the variance of the complete cases. Let p1, p2, . . . , pk be the observed proportion in each category for the complete cases, where k is the highest ordinal category. The ordinal variable is imputed as a single continuous
variable and the rounding threshold for each category is calculated as follows.
1. Put j = k.
2. For category j, the threshold is
tj = x̄ − Φ⁻¹(pj + pj+1 + · · · + pk) s,
where Φ⁻¹ is the inverse of the standard normal cumulative distribution function. Imputed values greater than tj that have not already been assigned to a category are rounded to category j.
3. Decrement j to j − 1.
4. If j > 1 return to step 2.
For example, an ordinal variable with three categories will have two rounding
thresholds t3 and t2. Imputed values greater than t3 are assigned to the highest
ordinal category while imputed values between t2 and t3 are assigned to the second
(middle) category. The remaining imputed values are assigned to the lowest ordinal
category. Note that the rounding threshold for each category is the same for each
imputed data set.
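The threshold construction can be sketched with the standard library's NormalDist (illustrative Python; the thesis uses Stata, and the helper names and label-based generalisation are ours). The example uses the complete-data summaries reported for weight in Section 6.8 (mean 1.5204, SD 0.6272, proportions 0.0719/0.3358/0.5923 on the 0/1/2 coding):

```python
from statistics import NormalDist

# Sketch of ordinal rounding: thresholds from the complete-case mean, SD
# and category proportions, then assignment from the top category down.
def ordinal_thresholds(xbar, s, p, cats):
    """cats: ordered category labels (lowest first); p: label -> proportion.
    Returns t_c = xbar - s * Phi^{-1}(tail proportion at and above c)."""
    inv = NormalDist().inv_cdf
    return {c: xbar - s * inv(sum(p[d] for d in cats[idx:]))
            for idx, c in enumerate(cats) if idx > 0}

def round_ordinal(x, t, cats):
    """Assign x to the highest category whose threshold it exceeds,
    defaulting to the lowest category."""
    for c in reversed(cats[1:]):
        if x > t[c]:
            return c
    return cats[0]

t = ordinal_thresholds(1.5204, 0.6272,
                       {0: 0.0719, 1: 0.3358, 2: 0.5923}, [0, 1, 2])
```

The two thresholds come out near 1.37 (overweight) and 0.60 (normal), so an imputed value of 1.5 is rounded to overweight and a value of 1.0 to normal.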
Ordinal rounding is based on the information available from the complete cases
and should therefore produce unbiased estimates if the complete cases are AAR.
When the variable has only two categories (k = 2), ordinal rounding is similar to
adaptive rounding except that it uses the mean of the complete cases instead of the
mean of the imputed binary variable.
6.8 Substantive analysis model
The data set used in this study was derived from the National Health and Nutrition Examination Survey (NHANES III) conducted by the National Center for Health Statistics (NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4.
For the purposes of this analysis, weight was divided into three categories based
on Body Mass Index (BMI):
weight = 0 (underweight) if BMI < 20
         1 (normal)      if 20 ≤ BMI ≤ 25
         2 (overweight)  if BMI > 25.
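This coding can be expressed as a small helper (Python for illustration; the thesis performs the recoding in Stata):

```python
# BMI-to-weight-category coding used in the substantive analysis.
def weight_category(bmi):
    if bmi < 20:
        return 0   # underweight
    if bmi <= 25:
        return 1   # normal
    return 2       # overweight
```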
Of the 16963 subjects, 59.23% were overweight, 33.58% normal weight and 7.19%
underweight as shown in Figure 6.1. The ordinal variable weight is therefore asym-
metrical with the underweight category having a very low prevalence and the over-
weight category predominating. When represented as a continuous variable, weight
has a mean of 1.5204 with a standard deviation of 0.6272. The proportion of obser-
vations with high blood pressure was 14.27% for underweight subjects, 16.89% for
normal weight subjects and 23.29% for overweight subjects, as shown in Figure 6.2.
Figure 6.1: Proportion by weight category in the full data set (n = 16963).
The substantive analysis is a logistic regression
logit Pr(hbp) = β0 + β1 normal + β2 overweight + error, (6.11)
which calculates the log odds of high blood pressure for normal and overweight subjects compared with the reference group of underweight subjects. The parameters of interest are the coefficients β1 and β2 with corresponding odds ratios e^β1 and e^β2.
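The coefficient-to-odds-ratio relationship can be checked directly (a minimal sketch using the full-data estimates quoted in Table 6.1):

```python
import math

# Odds ratios are obtained by exponentiating the logistic regression
# coefficients: OR = exp(beta).
beta1, beta2 = 0.1990, 0.6007       # full-data estimates from Table 6.1
or1 = math.exp(beta1)               # ~1.2202 (normal vs underweight)
or2 = math.exp(beta2)               # ~1.8234 (overweight vs underweight)
```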
The odds ratio of high blood pressure for normal weight subjects is 1.2202, while for overweight subjects it is 1.8234, relative to the underweight reference group, as shown in Figure 6.3. This indicates that there is a positive relationship between high blood pressure and weight category in this data set.
6.9 Method
From the full data set with 16963 subjects, the true proportion p of subjects in
each category was calculated and logistic regression was used to obtain the ‘true
values’ of the regression coefficients β1 and β2.
Figure 6.2: Proportion of observations with high blood pressure by weight category in
the full data set (n = 16963).
Figure 6.3: Odds of high blood pressure by weight category in the full data set (n =
16963).
Note that, due to time constraints, simulations were performed using the full data
set only. Missingness was imposed on the ordinal variable weight for an MCAR and
an MAR missingness mechanism, as described in Subsection 6.9.1 below.
6.9.1 Missingness Models The MCAR and MAR missingness models are
the same as those in Chapter 5. In the MCAR model, the probability of missingness
for the variable weight was set to 48% for each subject. As noted previously, since
there is only one incomplete variable, AAR and MCAR are equivalent.
In the MAR model, missingness was imposed on weight using a logistic regression
model, with the probability of missingness dependent on age, sex, race and hbp. The
MAR missingness model was
logit Pr(weight missing) = 2 − 0.025 × age − sex − race + hbp. (6.12)
To determine if a weight observation was declared missing, a pseudo-random number
between 0 and 1 was generated from a uniform distribution. If the number was less
than the probability of missingness, as calculated for each missingness model, the
observation was declared missing. The average missingness rate was 46.5% for the
MAR model compared to 48% for the MCAR model. Thus, on average, just under
half of the observations were missing.
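The mechanism can be sketched as follows (illustrative Python; the thesis imposes missingness in Stata, and the record fields here are ours):

```python
import math
import random

# Missingness probability from the MAR model (6.12), via the inverse logit.
def missingness_prob(age, sex, race, hbp):
    lp = 2 - 0.025 * age - sex - race + hbp
    return 1 / (1 + math.exp(-lp))

# Declare weight missing when a uniform draw falls below that probability.
def impose_mar(records, rng):
    for r in records:
        if rng.random() < missingness_prob(r['age'], r['sex'],
                                           r['race'], r['hbp']):
            r['weight'] = None
    return records
```

For example, a 40-year-old with sex = race = 1 and no high blood pressure has linear predictor 2 − 1 − 1 − 1 = −1 and hence a missingness probability of about 0.27.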
6.9.2 Simulations The following steps were performed for each simulation
replicate i = 1, . . . , 1000.
1. Impose missingness on the weight variable.
2. Depending on the method, impute the missing values as either a single con-
tinuous variable or as a set of indicator variables to create 30 imputed data
sets. The outcome variable hbp was included in the imputation model, as well
as the auxiliary variables age, sex, race and smoke.
3. Where applicable, round the imputed values to one of the ordinal categories
using the rounding method.
4. Use the command mi estimate to combine the imputed data sets and obtain
estimates of the regression coefficients, corresponding standard errors and pro-
portions in each category.
The estimates from the 1000 simulation replicates were averaged to produce overall
estimates of the regression coefficients, corresponding standard errors and propor-
tions in each category for each method and missingness mechanism.
For comparability and reproducibility, we used the same random seed in Stata
at the beginning of each set of 1000 simulation replicates.
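Step 4 relies on Rubin's rules, which `mi estimate` applies internally; the pooling itself can be sketched as follows (illustrative Python, not the Stata implementation):

```python
import math

# Rubin's rules: combine estimates q_i and squared standard errors u_i
# from M imputed data sets into a pooled estimate and standard error.
def rubins_rules(estimates, variances):
    M = len(estimates)
    qbar = sum(estimates) / M                              # pooled estimate
    ubar = sum(variances) / M                              # within-imputation
    b = sum((q - qbar) ** 2 for q in estimates) / (M - 1)  # between-imputation
    return qbar, math.sqrt(ubar + (1 + 1 / M) * b)

qbar, se = rubins_rules([1.0, 1.2, 1.1], [0.04, 0.04, 0.04])
```

Note how the pooled standard error exceeds the average within-imputation standard error: the between-imputation term reflects the extra uncertainty due to the missing data.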
6.9.3 Evaluation criteria The following criteria were used to compare the
methods.
1. Bias. For the regression coefficient βj, this is defined as E(β̂j) − βj for j = 1, 2, where βj is the true value of the regression coefficient obtained from the full data set. In this study, E(β̂j) is estimated by (1/1000) ∑ᵢ₌₁¹⁰⁰⁰ β̂ij for j = 1, 2, over the simulation replicates i = 1, . . . , 1000.
2. Standard error (SE). For each regression coefficient, this is calculated as the
average standard error over the 1000 simulation replicates [27].
3. Root mean square error (RMSE). For the regression coefficient βj, j = 1, 2, this is defined as √E[(β̂j − βj)²], estimated by
√((1/1000) ∑ᵢ₌₁¹⁰⁰⁰ (β̂ij − βj)²)
over the 1000 simulation replicates.
4. Distance. This is the Euclidean distance between the actual and estimated
proportions, given by
√((p1 − E(p̂1))² + (p2 − E(p̂2))² + (p3 − E(p̂3))²),
where p1, p2, p3 are the true proportions in the full data set and E(p̂1), E(p̂2), E(p̂3) are estimated by the corresponding average proportions (over the 1000 simulation replicates) of underweight, normal weight and overweight subjects respectively.
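The four criteria reduce to a few lines of code (illustrative helpers; names are ours):

```python
import math

# Evaluation criteria over the simulation replicates.
def bias(estimates, true_value):
    return sum(estimates) / len(estimates) - true_value

def rmse(estimates, true_value):
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates)
                     / len(estimates))

def distance(p_true, p_avg):
    """Euclidean distance between true and average estimated proportions."""
    return math.sqrt(sum((t - a) ** 2 for t, a in zip(p_true, p_avg)))
```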
6.10 Results
The results of the simulations are presented in Tables 6.1–6.4. The continuous
methods, including FCS, produced better estimates of the regression coefficients
than the indicator-based methods in terms of bias and RMSE (Tables 6.1 and 6.2).
This suggests that continuous methods produce better estimates of regression co-
efficients for a positive exposure-outcome relationship. Our results show that the
performance of FCS is comparable to the continuous MVNI-based rounding meth-
ods.
All of the indicator-based methods underestimated the odds of high blood pres-
sure for normal weight and overweight subjects. DBR was the worst method overall
for estimating regression coefficients and odds ratios, in terms of bias and RMSE.
The best methods for estimating proportions were FCS and the MVNI-based rounding methods that use the marginal distribution of the ordinal variable, namely CPR, IBPR, calibration, MIBR and ordinal rounding (Tables 6.3 and 6.4). The worst method for estimating proportions was crude rounding, followed by DBR and PDBR (in that order). Graphs comparing the Euclidean distances for each method are given in Figures 6.8 and 6.9.
Crude rounding underestimated the proportions in the underweight and over-
weight categories and overestimated the proportions in the normal category. This
is because, in the full data set, the proportion of normal weight subjects is 0.3358,
while the continuous variable weight has a mean of 1.5204 and a standard deviation
of 0.6272. If weight is imputed as a normally distributed continuous variable, we
would expect roughly 43.5% of the imputed values to be between 0.5 and 1.5 (the
rounding cut-offs for the normal category). Thus crude rounding was biased towards
the normal category in this scenario.
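The 43.5% figure can be verified with the standard library:

```python
from statistics import NormalDist

# If weight is imputed as N(1.5204, 0.6272^2), the probability of an
# imputed value falling between the crude cut-offs 0.5 and 1.5 is well
# above the true normal-weight proportion of 0.3358.
w = NormalDist(mu=1.5204, sigma=0.6272)
p_normal_band = w.cdf(1.5) - w.cdf(0.5)   # ~0.435
```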
Consistent with the findings of Galati et al. [17], DBR was biased towards the
reference group and overestimated the proportion of underweight subjects.
We note that there were larger biases, RMSEs and Euclidean distances for an
MAR mechanism compared to an MCAR mechanism despite a slightly lower average
missingness rate (46.5% for the MAR model compared to 48% for the MCAR model).
Overall, the results show that ordinal rounding, IBPR and CPR are competitive
with existing methods under an MCAR and an MAR mechanism. Note that the
new methods performed well even when the complete cases were not AAR (under
the MAR missingness model in Subsection 6.9.1).
MCAR mechanism
Complete case analysis produced the lowest bias for both regression coefficients but
standard errors were inflated as a result of the reduction in sample size (Table 6.1).
The continuous methods and PDBR produced lower RMSEs than complete case
analysis for the regression coefficient β1 (normal). FCS produced the lowest RMSE
overall for β1. For the regression coefficient β2 (overweight), only the continuous
methods produced lower RMSEs when compared with complete case analysis. Crude
rounding produced the lowest RMSE overall for β2. Graphs comparing RMSEs for
β1 and β2 are given in Figures 6.4 and 6.5.
MAR mechanism
The continuous methods and PDBR produced lower RMSEs than complete case
analysis for the regression coefficient β1 (Table 6.2). CPR produced the lowest bias
and RMSE overall for β1.
All the methods produced large biases and RMSEs for the regression coefficient
β2. Crude rounding was the only method to produce a (very slightly) lower RMSE
than complete case analysis. Graphs comparing RMSEs for β1 and β2 are given in
Figures 6.6 and 6.7.
In terms of Euclidean distance, MIBR produced the best estimates of propor-
tions, while crude rounding was the worst-performing method (Table 6.4).
Table 6.1: Estimates of coefficients β1 and β2 under MCAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.2202        0.1990   0.0892
Complete case          1.2212        0.1998   0.1239    0.0008   0.0864
Indicator-based
  PDBR                 1.1536        0.1429   0.1148   -0.0561   0.0856
  IBPR                 1.0815        0.0783   0.1082   -0.1207   0.1324
  DBR                  1.0508        0.0496   0.0997   -0.1494   0.1567
Continuous
  FCS                  1.2039        0.1856   0.1094   -0.0134   0.0495
  Ordinal              1.2487        0.2221   0.1098    0.0231   0.0565
  MIBR                 1.2487        0.2221   0.1099    0.0231   0.0566
  CPR                  1.2488        0.2222   0.1098    0.0232   0.0566
  Calibration          1.2490        0.2224   0.1097    0.0234   0.0567
  Crude                1.2845        0.2504   0.1128    0.0514   0.0761

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8234        0.6007   0.0852
Complete case          1.8260        0.6021   0.1184    0.0014   0.0842
Indicator-based
  PDBR                 1.6516        0.5017   0.1105   -0.0990   0.1181
  IBPR                 1.5725        0.4526   0.1039   -0.1481   0.1582
  DBR                  1.5205        0.4190   0.0950   -0.1817   0.1879
Continuous
  Crude                1.8199        0.5988   0.1130   -0.0019   0.0683
  Calibration          1.7740        0.5732   0.1091   -0.0275   0.0691
  CPR                  1.7725        0.5724   0.1092   -0.0283   0.0695
  Ordinal              1.7724        0.5723   0.1092   -0.0284   0.0695
  FCS                  1.8077        0.5921   0.1094   -0.0086   0.0697
  MIBR                 1.7724        0.5723   0.1093   -0.0284   0.0697
Table 6.2: Estimates of coefficients β1 and β2 under MAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.2202        0.1990   0.0892
Complete case          1.1627        0.1507   0.1285   -0.0483   0.0985
Indicator-based
  PDBR                 1.1480        0.1380   0.1171   -0.0610   0.0877
  IBPR                 1.0504        0.0492   0.1096   -0.1498   0.1590
  DBR                  0.9756       -0.0247   0.0994   -0.2237   0.2284
Continuous
  CPR                  1.1809        0.1663   0.1109   -0.0327   0.0567
  Calibration          1.1787        0.1644   0.1103   -0.0346   0.0570
  Ordinal              1.1783        0.1641   0.1103   -0.0349   0.0574
  Crude                1.2643        0.2345   0.1142    0.0355   0.0619
  MIBR                 1.1617        0.1499   0.1091   -0.0491   0.0662
  FCS                  1.1484        0.1384   0.1092   -0.0606   0.0746

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8234        0.6007   0.0852
Complete case          1.5709        0.4516   0.1233   -0.1491   0.1697
Indicator-based
  PDBR                 1.4822        0.3935   0.1135   -0.2072   0.2159
  IBPR                 1.3831        0.3243   0.1061   -0.2764   0.2813
  DBR                  1.2744        0.2425   0.0952   -0.3582   0.3609
Continuous
  Crude                1.5699        0.4510   0.1160   -0.1497   0.1631
  FCS                  1.5395        0.4315   0.1118   -0.1692   0.1798
  CPR                  1.5222        0.4201   0.1118   -0.1806   0.1903
  Calibration          1.5143        0.4150   0.1116   -0.1857   0.1951
  Ordinal              1.5141        0.4149   0.1114   -0.1858   0.1952
  MIBR                 1.4972        0.4036   0.1101   -0.1971   0.2056
Table 6.3: Estimates of proportions in each category under MCAR.
Method            Underweight   Normal   Overweight   Distance
Full data              0.0719   0.3358       0.5923
Complete case          0.0719   0.3357       0.5924     0.0001
Indicator-based
  IBPR                 0.0719   0.3357       0.5924     0.0001
  PDBR                 0.0606   0.3512       0.5882     0.0195
  DBR                  0.0936   0.3351       0.5712     0.0303
Continuous
  CPR                  0.0719   0.3358       0.5923     0.0001
  MIBR                 0.0719   0.3358       0.5922     0.0001
  Calibration          0.0720   0.3358       0.5923     0.0001
  Ordinal              0.0720   0.3358       0.5922     0.0002
  FCS                  0.0720   0.3360       0.5920     0.0003
  Crude                0.0624   0.3834       0.5542     0.0618
Table 6.4: Estimates of proportions in each category under MAR.
Method            Underweight   Normal   Overweight   Distance
Full data              0.0719   0.3358       0.5923
Complete case          0.0672   0.3407       0.5921     0.0068
Indicator-based
  IBPR                 0.0672   0.3407       0.5921     0.0068
  PDBR                 0.0573   0.3547       0.5880     0.0243
  DBR                  0.0902   0.3387       0.5711     0.0282
Continuous
  MIBR                 0.0714   0.3378       0.5908     0.0025
  CPR                  0.0672   0.3405       0.5924     0.0066
  Calibration          0.0684   0.3428       0.5888     0.0086
  Ordinal              0.0685   0.3428       0.5887     0.0086
  FCS                  0.0683   0.3430       0.5887     0.0088
  Crude                0.0597   0.3870       0.5532     0.0656
[Bar chart of RMSE values; methods in ascending order: FCS, Ord, MIBR, CPR, Cal, Crude, PDBR, CCA, IBPR, DBR.]
Figure 6.4: RMSEs for β1 under MCAR.
[Bar chart of RMSE values; methods in ascending order: Crude, Cal, CPR, Ord, FCS, MIBR, CCA, PDBR, IBPR, DBR.]
Figure 6.5: RMSEs for β2 under MCAR.
[Bar chart of RMSE values; methods in ascending order: CPR, Cal, Ord, Crude, MIBR, FCS, PDBR, CCA, IBPR, DBR.]
Figure 6.6: RMSEs for β1 under MAR.
[Bar chart of RMSE values; methods in ascending order: Crude, CCA, FCS, CPR, Cal, Ord, MIBR, PDBR, IBPR, DBR.]
Figure 6.7: RMSEs for β2 under MAR.
[Bar chart of Euclidean distances; methods in ascending order: CCA, IBPR, CPR, MIBR, Cal, Ord, FCS, PDBR, DBR, Crude.]
Figure 6.8: Euclidean distances under MCAR.
[Bar chart of Euclidean distances; methods in ascending order: MIBR, CPR, IBPR, CCA, Cal, Ord, FCS, PDBR, DBR, Crude.]
Figure 6.9: Euclidean distances under MAR.
6.11 Discussion
The results show that for a positive exposure-outcome relationship, the best
estimates of the regression coefficients are obtained from methods that impute the
ordinal variable as a single continuous variable rather than as a set of indicators.
Our new one-stage methods, CPR and ordinal rounding, produced comparable
results to MIBR and calibration but were easier to implement and faster to run.
The worst method overall was DBR, which substantially underestimated regression
coefficients and odds ratios, particularly for an MAR mechanism.
The best estimates of proportions in terms of Euclidean distance were obtained
using either FCS or MVNI-based rounding methods that utilise the marginal distri-
bution of the ordinal variable. Not surprisingly, crude rounding produced the worst
estimates of proportions as it uses fixed rounding thresholds [17, 29].
We note that CPR and ordinal rounding performed well compared to existing
methods when applied to an asymmetrical ordinal variable where one of the cate-
gories (underweight) had a very low prevalence.
In general, the performance of FCS was comparable to the continuous MVNI-
based methods. However, FCS may be difficult to implement in a general missing
data setting in which missingness occurs across different types of variables [29]. We
note that FCS is also susceptible to perfect prediction when imputing categorical
variables, although this problem did not occur in our study. Perfect prediction occurs when one or more explanatory variables completely separate the levels of the categorical outcome; that is, the explanatory variable(s) perfectly predict the outcome variable.
To mitigate this issue, Stata includes an option known as augmented regression, in
which a few observations with small weightings are added to the data during estima-
tion [53, p.138]. Perfect prediction is a common problem when imputing categorical
variables, particularly for smaller sample sizes, and is an important consideration
when using FCS. White et al. [58] provide a detailed discussion of methods used
to handle perfect prediction, including bootstrapping, penalised regression methods
and augmented regression.
We note that MI is not always superior to complete case analysis when esti-
mating the relationship between the levels of an ordinal exposure and an outcome.
For an MCAR mechanism, all of the continuous methods produced lower RMSEs
than complete case analysis for both regression coefficients. However, for an MAR
mechanism, the MI methods produced large biases and RMSEs for the regression
coefficient β2. Thus the performance of MI may vary considerably for different levels
of an ordinal exposure.
CHAPTER 7
Rounding ordinal variables: non-linear relationship
7.1 Introduction
In the previous chapter we considered rounding methods for ordinal variables
under MVNI. We now extend our approach to ordinal variables in the case of a
non-linear exposure-outcome relationship.
To date, very few studies have examined rounding methods for ordinal variables
in this context. A recent study by Lee et al. [29] concluded that methods that
impute an ordinal exposure variable as continuous tended to ‘flatten’ a non-linear
exposure-outcome relationship. However, they noted that methods that imputed an
ordinal variable as a set of indicator variables preserved the non-linear relationship
but not the proportion of observations in each category. They concluded that further
work was needed to develop a method that would preserve the non-linear association
as well as the marginal distribution of the ordinal variable. The method is expected
to be an indicator-based method in order to preserve the non-linear association.
Note that Lee et al. [29] examined MVNI-based methods only and did not include
FCS in their study.
We observe that there are two types of indicator-based rounding methods. The
first type examines each missing case in isolation, for example projected distance-
based rounding (PDBR) and distance-based rounding (DBR). The second type ex-
amines all the imputed values for the indicator variable. Our new method, indicator-
based proportional rounding (IBPR) introduced in the previous chapter is of the
second type. The advantage of IBPR is that it preserves the proportions in the
observed data and thus produces unbiased estimates of proportions if the complete
cases are AAR.
We compare the performance of the MVNI-based rounding methods with FCS
(ordinal logistic regression) for the case of a v-shaped relationship between an ordinal
exposure variable with three categories and a binary outcome. If there are more than
three categories, this relationship is described as u-shaped [29]. The methods are
compared for MCAR and MAR mechanisms.
The outline of this chapter is as follows. In Section 7.2, we describe the data set
and substantive analysis model. Section 7.3 outlines the method and in Section 7.4
we summarise the results, followed by a discussion in Section 7.5.
7.2 Substantive analysis model
The data set used in this study was derived from the National Health and Nutrition Examination Survey (NHANES III) conducted by the National Center for Health Statistics
(NCHS) in the United States between 1988 and 1994 [26, Chapter 6]. A description
of the data is given in Chapter 4.
To create a non-linear (v-shaped) relationship between hbp and weight, we deleted
400 subjects of normal weight with high blood pressure. For computational ease, we
drew a simple random sample from the remaining data to create a subsample with
5000 subjects. This subsample consists of 60.54% overweight, 31.98% normal weight
and 7.48% underweight subjects as shown in Figure 7.1. Thus the underweight cat-
egory has a very low prevalence in this data set. When represented as a continuous
variable, weight has a mean of 1.5306 with a standard deviation of 0.6315. A graph
of the proportion of observations with high blood pressure by weight category based
on the subsample of 5000 observations is given in Figure 7.2. The proportion of
observations with high blood pressure was 14.71% for underweight subjects, 10.57%
for normal weight subjects and 24.05% for overweight subjects.
The substantive analysis is a logistic regression,
logit Pr(hbp) = β0 + β1 normal + β2 overweight + error, (7.1)
which calculates the log odds of high blood pressure for normal and overweight subjects compared with the reference group of underweight subjects. The parameters of interest are the coefficients β1 and β2 with corresponding odds ratios e^β1 and e^β2.
A graph of the odds of high blood pressure by weight category is shown in
Figure 7.3. The odds ratio for normal weight subjects is 0.6854, while the odds
ratio for overweight subjects is 1.8366. Thus normal weight subjects have lower
odds of high blood pressure, and overweight subjects have higher odds of high blood
pressure, compared to underweight subjects.
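These odds ratios can be recovered, approximately, from the category-wise proportions of high blood pressure reported above (the reported percentages are rounded, so agreement with the fitted 0.6854 and 1.8366 is not exact):

```python
# V-shaped relationship: odds ratio below 1 for normal weight, above 1
# for overweight, relative to the underweight reference group.
def odds(p):
    return p / (1 - p)

o_under = odds(0.1471)
or_normal = odds(0.1057) / o_under   # ~0.685, the dip of the 'v'
or_over = odds(0.2405) / o_under     # ~1.836, the rise of the 'v'
```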
7.3 Method
Simulations were performed to compare the methods using the data set with 5000
subjects described in Section 7.2. The missingness models, simulation procedure and
evaluation criteria are described in Chapter 6.
Figure 7.1: Proportion by weight category in the data set with n = 5000.
Figure 7.2: Proportion of observations with high blood pressure by weight category in
the data set with n = 5000.
Figure 7.3: Odds of high blood pressure by weight category in the data set with n = 5000.
7.4 Results
MCAR mechanism
Complete case analysis produced the lowest bias for both regression coefficients.
However, standard errors were inflated as a result of the reduction in sample size
(Table 7.1).
All of the indicator-based methods produced lower RMSEs than complete case
analysis. DBR produced the lowest RMSE for both regression coefficients. In con-
trast, all of the continuous methods (including FCS) produced large biases and
RMSEs for both regression coefficients, substantially overestimating the odds of
high blood pressure for normal weight and overweight subjects. Graphs comparing
RMSEs are given in Figures 7.4 and 7.5.
Complete case analysis, IBPR, CPR and ordinal rounding produced the best
estimates of proportions, in terms of Euclidean distance (Table 7.3). In contrast,
crude rounding produced the largest Euclidean distance. A graph comparing the
Euclidean distances for each method is given in Figure 7.8.
MAR mechanism
All of the indicator-based methods produced lower biases and RMSEs than complete
case analysis for the regression coefficient β1 (Table 7.2). IBPR produced the lowest
bias and RMSE for this coefficient. In contrast, all of the continuous methods
(including FCS) produced large biases and RMSEs for β1, overestimating the odds
of high blood pressure for normal weight subjects.
The continuous methods performed better than the indicator-based methods in
estimating the regression coefficient β2. The indicator-based methods substantially
underestimated the odds of high blood pressure for overweight subjects. This is
in contrast to the results obtained for an MCAR mechanism where the indicator-
based methods performed better than the continuous methods for both regression
coefficients. Graphs comparing RMSEs are given in Figures 7.6 and 7.7.
In terms of Euclidean distance, MIBR produced the best estimates of propor-
tions, while crude rounding was the worst-performing method (Table 7.4 and Fig-
ure 7.9).
7.5 Discussion
The results show that for an MCAR mechanism and a non-linear exposure-
outcome relationship, the best estimates of regression coefficients are obtained using
indicator-based rounding methods such as PDBR, DBR and IBPR. The continuous
methods produced large biases and RMSEs for both regression coefficients. FCS
was comparable to the continuous MVNI-based methods in terms of bias, RMSE
and estimates of proportions.
A study by Lee et al. [29] concluded that methods that impute an ordinal
exposure variable as continuous tended to ‘flatten’ a non-linear exposure-outcome
relationship. However, methods that impute an ordinal variable as a set of indicator
variables preserved the non-linear relationship but not the proportion of observa-
tions in each category. We found that for an MCAR mechanism, the continuous
methods distorted the non-linear relationship by overestimating the odds of high
blood pressure for both normal weight and overweight subjects. All these methods
produced odds ratios that were close to 1 for normal weight subjects. IBPR was
the only method that preserved the non-linear relationship as well as the marginal
distribution of the ordinal variable.
For an MAR mechanism, the results were not so clear. While the indicator-
based methods produced the best estimates for the regression coefficient β1, the
continuous methods produced better estimates for β2. For both an MCAR and
an MAR mechanism, FCS produced the lowest RMSE for β1 but had the highest
RMSE for β2. This indicates that the performance of MI may vary considerably for
different levels of an ordinal exposure, as noted in the previous chapter.
The best estimates of proportions in terms of Euclidean distance were obtained
using either FCS or MVNI-based rounding methods that preserve the marginal dis-
tribution of the ordinal variable (MIBR, CPR, IBPR, ordinal rounding and calibra-
tion). Not surprisingly, crude rounding produced the worst estimates of proportions
as it uses fixed rounding thresholds.
In general, for a non-linear exposure-outcome relationship, an incomplete ordinal
variable should be rounded using an indicator-based method. IBPR is recommended
over other indicator-based methods as it preserves the non-linear relationship as well
as the marginal distribution of the ordinal variable. We note that IBPR was superior
to existing indicator-based methods even when one of the categories had a very low
prevalence.
Table 7.1: Estimates of coefficients β1 and β2 for a non-linear exposure-outcome rela-
tionship under MCAR.
Normal (β1)        Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              0.6854       -0.3777   0.1671
Complete case          0.6897       -0.3716   0.2329    0.0061   0.1634
Indicator-based
  DBR                  0.7231       -0.3243   0.1936    0.0534   0.1104
  IBPR                 0.7221       -0.3256   0.2061    0.0521   0.1196
  PDBR                 0.7589       -0.2759   0.2152    0.1018   0.1606
Continuous
  FCS                  0.9328       -0.0695   0.2127    0.3082   0.3234
  MIBR                 0.9958       -0.0042   0.2098    0.3735   0.3868
  Calibration          0.9957       -0.0043   0.2101    0.3734   0.3871
  CPR                  0.9966       -0.0034   0.2094    0.3743   0.3876
  Ordinal              0.9969       -0.0031   0.2097    0.3746   0.3879
  Crude                1.0297        0.0293   0.2124    0.4070   0.4214

Overweight (β2)    Odds Ratio   Coefficient       SE      Bias     RMSE
Full data              1.8366        0.6079   0.1521
Complete case          1.8480        0.6141   0.2119    0.0062   0.1465
Indicator-based
  DBR                  1.7153        0.5396   0.1765   -0.0683   0.1126
  PDBR                 1.7473        0.5581   0.1997   -0.0498   0.1277
  IBPR                 1.6998        0.5305   0.1902   -0.0774   0.1287
Continuous
  MIBR                 2.0910        0.7377   0.2011    0.1298   0.1761
  CPR                  2.0918        0.7380   0.2008    0.1301   0.1763
  Calibration          2.0906        0.7375   0.2013    0.1296   0.1764
  Ordinal              2.0924        0.7383   0.2010    0.1304   0.1764
  Crude                2.1074        0.7455   0.2067    0.1376   0.1884
  FCS                  2.2965        0.8314   0.2045    0.2235   0.2559
Table 7.2: Estimates of coefficients β1 and β2 for a non-linear exposure-outcome relationship under MAR.

Normal (β1)        Odds Ratio  Coefficient      SE     Bias    RMSE
Full data              0.6854      -0.3777  0.1671
Complete case          0.6101      -0.4942  0.2380  -0.1165  0.1959
Indicator-based
  IBPR                 0.6768      -0.3904  0.2047  -0.0127  0.1009
  DBR                  0.6490      -0.4324  0.1900  -0.0547  0.1047
  PDBR                 0.7227      -0.3248  0.2161   0.0529  0.1288
Continuous
  FCS                  0.8889      -0.1177  0.2099   0.2600  0.2756
  MIBR                 0.9356      -0.0666  0.2065   0.3111  0.3238
  CPR                  0.9457      -0.0558  0.2080   0.3219  0.3348
  Calibration          0.9460      -0.0555  0.2081   0.3222  0.3351
  Ordinal              0.9467      -0.0548  0.2077   0.3229  0.3355
  Crude                1.0272       0.0268  0.2121   0.4045  0.4170

Overweight (β2)    Odds Ratio  Coefficient      SE     Bias    RMSE
Full data              1.8366       0.6079  0.1521
Complete case          1.4858       0.3960  0.2165  -0.2119  0.2582
Indicator-based
  PDBR                 1.4874       0.3970  0.2008  -0.2109  0.2411
  IBPR                 1.4226       0.3525  0.1898  -0.2554  0.2752
  DBR                  1.3818       0.3234  0.1737  -0.2845  0.2982
Continuous
  Ordinal              1.7466       0.5577  0.2032  -0.0502  0.1229
  CPR                  1.7483       0.5586  0.2036  -0.0493  0.1230
  Calibration          1.7474       0.5582  0.2033  -0.0497  0.1237
  MIBR                 1.7379       0.5527  0.2014  -0.0552  0.1238
  Crude                1.7795       0.5763  0.2102  -0.0316  0.1264
  FCS                  1.9165       0.6505  0.2051   0.0426  0.1296
Table 7.3: Estimates of proportions in each category for a non-linear exposure-outcome relationship under MCAR.

Method             Underweight  Normal  Overweight  Distance
Full data               0.0748  0.3198      0.6054
Complete case           0.0750  0.3196      0.6054    0.0002
Indicator-based
  IBPR                  0.0750  0.3196      0.6054    0.0002
  PDBR                  0.0644  0.3357      0.5999    0.0198
  DBR                   0.0970  0.3199      0.5831    0.0315
Continuous
  Ordinal               0.0749  0.3197      0.6054    0.0002
  CPR                   0.0747  0.3197      0.6057    0.0003
  MIBR                  0.0747  0.3196      0.6057    0.0004
  Calibration           0.0747  0.3196      0.6057    0.0004
  FCS                   0.0749  0.3194      0.6057    0.0005
  Crude                 0.0636  0.3721      0.5643    0.0675
Table 7.4: Estimates of proportions in each category for a non-linear exposure-outcome relationship under MAR.

Method             Underweight  Normal  Overweight  Distance
Full data               0.0748  0.3198      0.6054
Complete case           0.0705  0.3246      0.6049    0.0064
Indicator-based
  IBPR                  0.0705  0.3246      0.6049    0.0064
  PDBR                  0.0609  0.3371      0.6020    0.0224
  DBR                   0.0932  0.3214      0.5854    0.0272
Continuous
  MIBR                  0.0740  0.3198      0.6061    0.0011
  FCS                   0.0709  0.3247      0.6044    0.0064
  CPR                   0.0705  0.3247      0.6048    0.0066
  Ordinal               0.0713  0.3255      0.6032    0.0070
  Calibration           0.0711  0.3255      0.6035    0.0071
  Crude                 0.0611  0.3741      0.5648    0.0692
[Bar chart omitted. RMSE increases left to right: DBR, IBPR, PDBR, CCA, FCS, MIBR, Cal, CPR, Ord, Crude.]
Figure 7.4: RMSEs for β1 for a non-linear relationship under MCAR.
[Bar chart omitted. RMSE increases left to right: DBR, PDBR, IBPR, CCA, MIBR, CPR, Cal, Ord, Crude, FCS.]
Figure 7.5: RMSEs for β2 for a non-linear relationship under MCAR.
[Bar chart omitted. RMSE increases left to right: IBPR, DBR, PDBR, CCA, FCS, MIBR, CPR, Cal, Ord, Crude.]
Figure 7.6: RMSEs for β1 for a non-linear relationship under MAR.
[Bar chart omitted. RMSE increases left to right: Ord, CPR, Cal, MIBR, Crude, FCS, PDBR, CCA, IBPR, DBR.]
Figure 7.7: RMSEs for β2 for a non-linear relationship under MAR.
[Bar chart omitted. Distance increases left to right: CCA, IBPR, Ord, CPR, MIBR, Cal, FCS, PDBR, DBR, Crude.]
Figure 7.8: Euclidean distances for a non-linear relationship under MCAR.
[Bar chart omitted. Distance increases left to right: MIBR, CCA, IBPR, FCS, CPR, Ord, Cal, PDBR, DBR, Crude.]
Figure 7.9: Euclidean distances for a non-linear relationship under MAR.
CHAPTER 8
Discussion and Conclusion
The aim of this study was to evaluate existing methods and develop new methods
of rounding categorical variables under MVNI. In Chapter 5 we introduced our new
method, proportional rounding, and compared its performance with existing round-
ing methods for binary variables. The results highlighted the clear benefits of using
a rounding method in conjunction with MVNI when imputing a binary confound-
ing variable. Adaptive rounding, proportional rounding and calibration produced
similar results and performed better than simple rounding, particularly when esti-
mating proportions. Calibration was the most difficult method to implement as it
involves duplicating the data set and performing two sets of imputations. Proportional
rounding has a similar intuitive appeal to calibration but takes, on average,
one third of the computation time.
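The threshold idea behind proportional rounding can be illustrated as follows. This Python sketch assumes the method chooses the rounding cut-off so that the imputed proportion of 1s matches the proportion observed among the complete cases; it is one reading of the Chapter 5 definition, not the thesis's actual Stata implementation:

```python
def proportional_round(imputed, complete_cases):
    """Round continuous imputations to 0/1 so that the proportion of 1s
    matches the proportion observed among the complete cases.
    Sketch only: the Chapter 5 definition may differ in detail."""
    p = sum(complete_cases) / len(complete_cases)  # observed proportion of 1s
    k = round(p * len(imputed))                    # number of 1s to assign
    # assign 1 to the k largest continuous imputations
    order = sorted(range(len(imputed)), key=lambda i: imputed[i], reverse=True)
    rounded = [0] * len(imputed)
    for i in order[:k]:
        rounded[i] = 1
    return rounded

observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]       # complete cases: 30% ones
imputations = [0.8, 0.4, -0.1, 0.55, 0.2, 1.1]  # continuous MVNI draws
rounded = proportional_round(imputations, observed)
```

Simple rounding, by contrast, always cuts at 0.5, which helps explain its poorer estimates of proportions.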
In Chapters 6 and 7 we compared existing rounding methods for ordinal vari-
ables with our new one-stage methods, continuous proportional rounding (CPR),
indicator-based proportional rounding (IBPR) and ordinal rounding. In contrast
to two-stage methods, such as mean indicator-based rounding (MIBR) and calibra-
tion, our new methods require only one set of imputations. The results indicated
that for a positive exposure-outcome relationship, the best estimates of regression
coefficients are obtained from methods that impute the ordinal variable as a single
continuous variable. CPR and ordinal rounding performed as well as or better than
existing continuous methods in terms of bias, RMSE and estimates of proportions.
The main advantages of the new methods are their ease of implementation and
increased computational speed compared to calibration and MIBR.
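The marginal-preserving idea shared by these continuous methods can be sketched as a quantile cut: rank the continuous imputations and place cut-points so that the imputed category proportions match the complete-case proportions. This Python sketch, with an assumed helper name `quantile_cut`, illustrates the general principle rather than reproducing the Chapter 6 algorithms:

```python
def quantile_cut(imputed, cc_props):
    """Assign ordinal categories 0..K-1 by ranking the continuous
    imputations and cutting at the complete-case category proportions.
    Sketch of the marginal-preserving principle only."""
    n = len(imputed)
    order = sorted(range(n), key=lambda i: imputed[i])
    # cumulative cut ranks: how many of the smallest values fill each category
    cuts, cum = [], 0.0
    for p in cc_props[:-1]:
        cum += p
        cuts.append(round(cum * n))
    cats = [0] * n
    cat = 0
    for rank, i in enumerate(order):
        while cat < len(cuts) and rank >= cuts[cat]:
            cat += 1
        cats[i] = cat
    return cats

# Complete-case proportions chosen to resemble Table 7.3's marginals
cats = quantile_cut([0.3, 1.9, 2.4, 0.9, 2.1, 1.4, 2.8, 1.1, 2.2, 2.6],
                    [0.1, 0.3, 0.6])
```

Crude rounding instead uses fixed thresholds, which is consistent with its poor estimates of proportions in Tables 7.3 and 7.4.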
In general, when the exposure-outcome relationship is non-linear, the best estimates
of regression coefficients are obtained using indicator-based rounding methods.
Our new method IBPR is recommended over existing methods because it preserves
the non-linear relationship as well as the marginal distribution of the ordinal
variable.
We note that CPR, IBPR and ordinal rounding performed well compared to
existing methods even when one of the ordinal categories had a very low prevalence
(less than 10%).
Our results showed that the performance of ordinal logistic regression (FCS) was
comparable to that of the continuous MVNI-based rounding methods for substantial
missingness in an ordinal exposure variable. However, MVNI is often easier to
implement in a general missing data setting. We note that FCS is susceptible to the
problem of perfect prediction when imputing categorical variables. MVNI does not
have this problem since the imputations are produced under a multivariate normal
distribution.
In this study, we examined our new rounding methods for covariates with two or
three categories. Enders [15, p.261] notes that “at an intuitive level, it is reasonable
to expect the effects of rounding to diminish as the number of ordinal response
options increases”. This is a fruitful area for further research.
A limitation of our new methods is that they assume the complete cases reasonably
approximate the true proportions in the data set; that is, the complete cases are
available at random (AAR). However, since AAR can hold for an MCAR, MAR
or MNAR mechanism, our new methods do not require an MCAR mechanism to
produce valid estimates of proportions. Although AAR may be regarded as a fairly
restrictive assumption, we note that even in settings where AAR did not hold, our
new methods performed well compared with existing methods.
Our simulation studies were based on a real data set and were designed to model
realistic missing data scenarios. However, we acknowledge that it may be difficult
to draw general conclusions on the basis of simulation studies.
Our findings confirmed the results of previous research, which showed that mul-
tiple imputation is not always superior to complete case analysis. While MVNI
had substantial benefits over complete case analysis for missingness in
a binary confounding variable, the results were less clear for missingness in an
ordinal variable of interest. Under an MAR mechanism, we found inconsistencies in the
performance of multiple imputation across levels of the ordinal exposure variable.
The reasons for this are not yet clear.
Further work is required to determine the settings in which multiple imputation
is likely to perform better than complete case analysis, particularly for missingness
in a covariate of interest.
Bibliography
[1] Aitkin, M., and Aitkin, I. A hybrid EM/Gauss-Newton algorithm for max-
imum likelihood in mixture distributions. Statistics and Computing 6 (1996),
127–130.
[2] Allison, P. Missing data. Sage, Newbury Park, CA, 2002.
[3] Allison, P. Imputation of categorical variables with PROC MI. In 2005 SAS
Users Group International Conference (2005).
[4] Andridge, R., and Little, R. A review of hot deck imputation for survey
non-response. International Statistical Review 78 (2010), 40–64.
[5] Arnold, B., Castillo, E., and Sarabia, J. Conditional specification of
statistical models. Springer-Verlag, New York, 1999.
[6] Bernaards, C., Belin, T., and Schafer, J. Robustness of a multivariate
normal approximation for imputation of incomplete binary data. Statistics in
Medicine 26 (2007), 1368–1382.
[7] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of
Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, FL, 2011.
[8] Carpenter, J., and Kenward, M. Multiple imputation and its application.
John Wiley & Sons, Chichester, UK, 2013.
[9] Cohen, J. Statistical power analysis for the behavioral sciences. Academic
Press, New York, 1977.
[10] Collins, L., Schafer, J., and Kam, C. A comparison of inclusive and
restrictive strategies in modern missing data procedures. Psychological Methods
6 (2001), 330–351.
[11] Demirtas, H. Rounding strategies for multiply imputed binary data. Biomet-
rical Journal 51 (2009), 677–688.
[12] Demirtas, H. A distance-based rounding strategy for post-imputation ordinal
data. Journal of Applied Statistics 37 (2010), 489–500.
[13] Demirtas, H., and Schafer, J. On the performance of random-coefficient
pattern-mixture models for non-ignorable drop-out. Statistics in Medicine 22
(2003), 2553–2575.
[14] Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from in-
complete data via the EM algorithm (with discussion). Journal of the Royal
Statistical Society 39 (1977), 1–38.
[15] Enders, C. Applied missing data analysis. The Guilford Press, New York,
2010.
[16] Galati, J., and Seaton, K. MCAR is not necessary for the complete cases to
constitute a simple random subsample of the target sample. Statistical Methods
in Medical Research (2013). DOI: 10.1177/0962280213490360.
[17] Galati, J., Seaton, K., Lee, K., Simpson, J., and Carlin, J. Round-
ing non-binary categorical variables following multivariate normal imputation:
evaluation of simple methods and implications for practice. Journal of Statisti-
cal Computation and Simulation (2012). DOI: 10.1080/00949655.2012.727815.
[18] Graham, J. Missing data analysis and design. Springer, New York, 2012.
[19] Graham, J., Hofer, S., and MacKinnon, D. Maximising the usefulness of
data obtained with planned missing value patterns: an application of maximum
likelihood procedures. Multivariate Behavioural Research 31 (1996), 197–218.
[20] Graham, J., Olchowski, A., and Gilreath, T. How many imputations
are really needed? Some practical clarifications of multiple imputation theory.
Prevention Science 8 (2007), 206–213.
[21] Graham, J., Taylor, B., Olchowski, A., and Cumsille, P. Planned
missing data designs in psychological research. Psychological Methods 11
(2006), 323–343.
[22] Heckman, J. Sample selection bias as a specification error. Econometrica 47
(1979), 153–161.
[23] Hedeker, D., and Gibbons, R. Application of random-effects pattern-
mixture models for missing data in longitudinal studies. Psychological Methods
2 (1997), 64–78.
[24] Horton, N., Lipsitz, S., and Parzen, M. A potential for bias when
rounding in multiple imputation. The American Statistician 57 (2003), 229–
232.
[25] Horvitz, D., and Thompson, D. A generalization of sampling without
replacement from a finite universe. Journal of the American Statistical Associ-
ation 47 (1952), 663–685.
[26] Hosmer, D., and Lemeshow, S. Applied logistic regression. John Wiley &
Sons, Hoboken, NJ, 2000.
[27] Lee, K., and Carlin, J. Multiple imputation for missing data: fully condi-
tional specification versus multivariate normal imputation. American Journal
of Epidemiology 171 (2010), 624–632.
[28] Lee, K., and Carlin, J. Recovery of information from multiple imputation:
a simulation study. Emerging Themes in Epidemiology 9 (2012).
[29] Lee, K., Galati, J., Simpson, J., and Carlin, J. Comparison of methods
for imputing ordinal data using multivariate normal imputation: a case study
of non-linear effects in a large cohort study. Statistics in Medicine 31 (2012),
4164–4174.
[30] Little, R. Missing data adjustments in large surveys. Journal of Business
and Economic Statistics 6 (1988), 287–296.
[31] Little, R. Pattern-mixture models for multivariate incomplete data. Journal
of the American Statistical Association 88 (1993), 125–134.
[32] Little, R., and Rubin, D. Statistical analysis with missing data. John Wiley
& Sons, Hoboken, NJ, 1987.
[33] Little, R., and Rubin, D. Statistical analysis with missing data (2nd ed.).
John Wiley & Sons, Hoboken, NJ, 2002.
[34] Louis, T. Finding the observed information matrix when using the EM algo-
rithm. Journal of the Royal Statistical Society Series B 44 (1982), 226–233.
[35] McLachlan, G., and Krishnan, T. The EM algorithm and extensions.
John Wiley & Sons, Hoboken, NJ, 2008.
[36] Meng, X.-L. Multiple imputation inferences with uncongenial sources of in-
put. Statistical Science 9 (1994), 538–558.
[37] Meng, X.-L., and Rubin, D. Using EM to obtain asymptotic variance-
covariance matrices: the SEM algorithm. Journal of the American Statistical
Association 86 (1991), 899–909.
[38] Molenberghs, G., and Kenward, M. Missing data in clinical studies.
John Wiley & Sons, Hoboken, NJ, 2007.
[39] Moons, K., Donders, R., Stijnen, T., and Harrell Jr., F. Using the
outcome for imputation of missing predictor values was preferred. Journal of
Clinical Epidemiology 59 (2006), 1092–1101.
[40] Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solen-
berger, P. A multivariate technique for multiply imputing missing values
using a sequence of regression models. Survey Methodology 27 (2001), 85–95.
[41] Redner, R., and Walker, H. Mixture densities, maximum likelihood and
the EM algorithm. SIAM Review 26 (1984), 195–239.
[42] Rubin, D. Inference and missing data. Biometrika 63 (1976), 581–592.
[43] Rubin, D. Multiple imputations in sample surveys — a phenomenological
Bayesian approach to nonresponse. Proceedings of the Survey Research Methods
Section of the American Statistical Association (1978), 30–34.
[44] Rubin, D. Multiple imputation for nonresponse in surveys. John Wiley &
Sons, Hoboken, NJ, 1987.
[45] Rubin, D. Multiple imputation after 18+ years. Journal of the American
Statistical Association 91 (1996), 473–489.
[46] Schafer, J. Analysis of incomplete multivariate data. Chapman & Hall/CRC,
Boca Raton, FL, 1997.
[47] Schafer, J. Multiple imputation: a primer. Statistical methods in medical
research 8 (1999), 3–15.
[48] Schafer, J. Multiple imputation in multivariate problems when the imputa-
tion and analysis models differ. Statistica Neerlandica 57 (2003), 19–35.
[49] Schafer, J., and Graham, J. Missing data: our view of the state of the
art. Psychological Methods 7 (2002), 147–177.
[50] Schenker, N., and Taylor, J. Partially parametric techniques for multiple
imputation. Computational Statistics & Data Analysis 22 (1996), 425–446.
[51] Seaman, S., and White, I. Review of inverse probability weighting for
dealing with missing data. Statistical Methods in Medical Research 22 (2011),
278–295.
[52] Spratt, M., Carpenter, J., Sterne, J., Carlin, J., Heron, J., Henderson, J.,
and Tilling, K. Strategies for multiple imputation in longitudinal studies.
American Journal of Epidemiology 172 (2010), 478–487.
[53] StataCorp. Stata Multiple Imputation Reference Manual: Release 12. Stata-
Corp LP, 2011.
[54] StataCorp. Stata: Release 12. Statistical Software. StataCorp LP, 2011.
[55] Sterne, J., White, I., Carlin, J., Spratt, M., Royston, P., Kenward, M.,
Wood, A., and Carpenter, J. Multiple imputation for missing data in
epidemiological and clinical research: potential and pitfalls. British Medical
Journal 338 (2009), b2393.
[56] Tanner, M., and Wong, W. The calculation of posterior distributions
by data augmentation (with discussion). Journal of the American Statistical
Association 82 (1987), 528–550.
[57] van Buuren, S. Multiple imputation of discrete and continuous data by fully
conditional specification. Statistical Methods in Medical Research 16 (2007),
219–242.
[58] White, I., Daniel, R., and Royston, P. Avoiding bias due to perfect
prediction in multiple imputation of incomplete categorical variables. Compu-
tational Statistics & Data Analysis 54 (2010), 2267–2275.
[59] White, I., Royston, P., and Wood, A. Multiple imputation using chained
equations: issues and guidance for practice. Statistics in Medicine 30 (2011),
377–399.
[60] Yucel, R., He, Y., and Zaslavsky, A. Using calibration to improve
rounding in imputation. The American Statistician 62 (2008), 125–129.
[61] Yucel, R., He, Y., and Zaslavsky, A. Gaussian-based routines to impute
categorical variables in health surveys. Statistics in Medicine 30 (2011), 3447–
3460.