University of Miami Scholarly Repository
Open Access Dissertations / Electronic Theses and Dissertations
2017-05-02
Random Forest Missing Data Approaches
Fei Tang, University of Miami, [email protected]
Recommended Citation: Tang, Fei, "Random Forest Missing Data Approaches" (2017). Open Access Dissertations. 1852. https://scholarlyrepository.miami.edu/oa_dissertations/1852
UNIVERSITY OF MIAMI
RANDOM FOREST MISSING DATA APPROACHES
By
Fei Tang
A DISSERTATION
Submitted to the Faculty of the University of Miami
in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Coral Gables, Florida
May 2017
© 2017 Fei Tang
All Rights Reserved
UNIVERSITY OF MIAMI
A dissertation submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
RANDOM FOREST MISSING DATA APPROACHES
Fei Tang
Approved:
Hemant Ishwaran, Ph.D., Professor of Biostatistics
J. Sunil Rao, Ph.D., Professor of Biostatistics
Lily Wang, Ph.D., Professor of Biostatistics
Guillermo Prado, Ph.D., Dean of the Graduate School
Panagiota V. Caralis, M.D., Professor of Internal Medicine
TANG, FEI (Ph.D., Biostatistics)
Random Forest Missing Data Approaches (May 2017)
Abstract of a dissertation at the University of Miami. Dissertation supervised by Professor Hemant Ishwaran. No. of pages in text: (99)
Random forest (RF) missing data algorithms are an attractive approach for im-
puting missing data. They have the desirable properties of being able to handle
mixed types of missing data, they are adaptive to interactions and nonlinearity, and
they have the potential to scale to big data settings. Currently there are many dif-
ferent RF imputation algorithms, but relatively little guidance about their efficacy.
Using a large, diverse collection of data sets, imputation performance of various
RF algorithms was assessed under different missing data mechanisms. Algorithms
included proximity imputation, on-the-fly imputation, and imputation utilizing mul-
tivariate unsupervised and supervised splitting, the latter class representing a gener-
alization of a promising new imputation algorithm called missForest. Our findings
reveal RF imputation to be generally robust with performance improving with in-
creasing correlation. Performance was good under moderate to high missingness,
and even (in certain cases) when data was missing not at random. Real data analysis
using the RF imputation methods was conducted on the MESA data.
TABLE OF CONTENTS
Page
LIST OF FIGURES ..................................................................................................... iv
LIST OF TABLES ....................................................................................................... vi
Chapter
1 PREFACE ...................................................................................................... 1
2 CART TREES AND RANDOM FOREST ..................................................... 9
3 EXISTING RF MISSING DATA APPROACHES ........................................ 24
4 NOVEL ENHANCEMENTS OF RF MISSING DATA APPROACHES .... 30
5 IMPUTATION PERFORMANCE .................................................................. 42
6 RF IMPUTATION ON VIMP AND MINIMAL DEPTH ............................. 65
7 MESA DATA ANALYSIS ............................................................................ 85
Bibliography .................................................................................................. 96
List of Figures
5.1 Summary values for the 60 data sets. . . . . . . . . . . . . . . . . . 44
5.2 ANOVA effect size for the log-information. . . . . . . . . . . . . . 50
5.3 Relative imputation error, ER(I). . . . . . . . . . . . . . . . . . . 51
5.4 Relative imputation error stratified by correlation. . . . . . . . . . . 52
5.5 Log of computing time versus log-complexity. . . . . . . . . . . . . 57
5.6 Relative log-computing time versus log-complexity. . . . . . . . . . 58
5.7 Log of computing time versus log-dimension. . . . . . . . . . . . . 63
5.8 Mean relative imputation error under different sample sizes. . . . . 64
6.1 Simulation 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Simulation 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3 Simulation 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Simulation 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Simulation 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.6 Importance measurement for all four variables. . . . . . . . . . . . 79
6.7 Minimal Depth measurement for all four variables. . . . . . . . . . 80
6.8 The correlation change due to missForest imputation. . . . . . . . . 81
6.9 Simulation 5 with nodesize of 200. . . . . . . . . . . . . . . . . . 82
6.10 The effect of different imputation methods with the Diabetes data. . 83
6.11 The effect of different imputation methods in GLM. . . . . . . . . . 84
7.1 The percent of missingness for all the variables in MESA data. . . . 88
List of Tables
5.1 Experimental design used for large scale study . . . . . . . . . . . . 43
5.2 Relative imputation error ER(I). . . . . . . . . . . . . . . . . . . . 53
6.1 Summary characteristics for the Simulated models. . . . . . . . . . 68
7.1 Top importance variables with different imputation methods . . . . 89
7.2 Top importance variables explanation . . . . . . . . . . . . . . . . 89
7.3 Prediction error rates using four strategies . . . . . . . . . . . . . . 95
Chapter 1
Preface
Classification and regression trees (CART) is a machine learning method that is
commonly used in data mining. It constructs trees by conducting binary splits on
predictor variables, aiming to produce subsets of the data that are homogeneous
with respect to the outcome. Although CART is a very intuitive way to understand
data, it is known for its instability in prediction, meaning that predictions based on
CART can change substantially with minor perturbations of the data. One method
for reducing the variance of a predictor is bagging, or "bootstrap aggregation".
Random forest (RF; Breiman, 2001) is bagged trees with randomness injected to
minimize the correlation between the trees. It is appreciated for its improvement in
prediction accuracy over a single CART tree and for its prediction stability. In
addition, it is a good method for high dimensional data, especially when complex
relations exist between the predictor variables. As a result, it has gained popularity
in many research fields.
Missing data is a real world problem frequently encountered in medical settings.
Because statistical analyses almost always require complete data, medical
researchers are forced either to impute data or to discard missing values when
missing data are encountered. Of course, simply discarding observations with
missing values (complete case analysis) is usually not a reasonable practice, as
valuable information may be lost and inferential power compromised (Enders,
2010); it can even cause selection bias in some cases. In addition, deleting
observations with missing values may leave very few observations when many
predictor variables contain missing values, especially for high dimensional data.
As a result, it is advisable to impute the data before any analysis is performed.
Many statistical imputation methods have been developed for missing data.
However, in high dimensional and large scale data settings, such as genomic,
proteomic, neuroimaging, and other high-throughput problems, many of these
methods perform poorly: they were never designed to be regularized, or they
cannot be applied due to computational issues. For example, it is recommended
that all variables be included in multiple imputation to make it proper in general
and to avoid creating bias in the estimates of the correlations (Rubin, 1996). This
can lead to overparameterization when there are a large number of variables but
the sample size is moderate, a scenario often seen with modern genomic data.
Computational issues can also arise in the implementation of standard methods.
An example is non-convexity of the log-likelihood due to missing data, which is
problematic and challenging to optimize using traditional methods such as the EM
algorithm. Among the available imputation methods, some are for continuous data
(e.g., KNN imputation; Troyanskaya et al., 2001) and some are for categorical
data (e.g., the saturated multinomial model). MICE (Multivariate Imputation by
Chained Equations; Van Buuren, 2007) handles mixed data types (i.e., data having
both continuous and categorical variables), but depends on tuning parameters or
the specification of a parametric model. High dimensional data often feature
mixed data types with complicated interactions among variables, making it
infeasible to specify any parametric model. In addition, implementation of these
methods can often break down in challenging data settings (Liao et al., 2014).
Another serious issue is that
most methods cannot deal with complex interactions and nonlinearity among
variables, which are common in data from medical research. Standard multiple
imputation approaches do not automatically incorporate interaction effects, and
not surprisingly, this leads to biased parameter estimates when interactions are
present (Doove et al., 2014). Although some techniques, such as fully conditional
specification of the covariate (Bartlett et al., 2015), can be used to try to solve this
problem, they may be difficult and inefficient to implement in settings where the
interactions are expected to be complicated.
For these reasons there has been much interest in using machine learning methods
for missing data imputation. A promising approach is based on Breiman's random
forests (abbreviated hereafter as RF; Breiman (2001)). RF has the desirable
characteristics of being able to: (1) handle mixed types of missing data; (2)
address interactions and nonlinearity; (3) scale to high dimensions, while being
free of data assumptions; (4) avoid overfitting; (5) address settings with more
variables than observations; and (6) yield measures of variable importance that
can potentially be used for variable selection. Currently there are several different
RF missing data algorithms. These include the original RF proximity algorithm
proposed by Breiman (Breiman, 2003), implemented in the randomForest
R-package (Liaw and Wiener, 2002). A different class of algorithms are the
on-the-fly-imputation (OTFI) algorithms implemented in the randomSurvivalForest
R-package (abbreviated as RSF; Ishwaran et al. (2008)), which allow data to be
imputed while simultaneously growing a survival tree. These algorithms have been
adopted within the randomForestSRC R-package (abbreviated as RF-SRC) to
cover not only survival, but also classification, regression, and other settings
(Ishwaran et al., 2016). Recently, Stekhoven and Buhlmann (2012) introduced
missForest, which takes a different approach by recasting the missing data problem
as a prediction problem. Data is imputed by regressing each variable in turn
against all other variables and then predicting missing data for the dependent
variable using the fitted forest. In applications to both atomic and mixed data
settings, missForest was found to outperform well known methods such as
k-nearest neighbors (Troyanskaya et al., 2001) and parametric MICE (multivariate
imputation by chained equations; Van Buuren (2007)). These findings have been
confirmed in independent studies; see, for example, Waljee et al. (2013).
Because of the desirable characteristics of RF mentioned above, some missing
data algorithms have recently been developed that incorporate RF into traditional
imputation methods. For instance, Doove et al. (2014) proposed using random
forests for multiple imputation within the MICE framework. The algorithm is
briefly summarized as follows. To impute $Y$ using fully observed $(x_1, \ldots, x_p)$:

1. Apply random forest to $(y_{obs}, x_{obs})$, using $k$ bootstraps.

2. For a given subject with missing $Y$ and predictor values $x_1, \ldots, x_p$, take the observed values of $Y$ in the terminal nodes of all $k$ trees.

3. Randomly sample one of these observed values of $Y$ as the imputation of the missing $Y$.
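As a rough sketch, the three steps above can be written in Python using scikit-learn's random forest in place of the implementation used by MICE; the function name `rf_mice_draw` and all tuning choices here are illustrative assumptions, not part of the published algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_mice_draw(X_obs, y_obs, X_mis, n_trees=10, seed=None):
    """One stochastic draw in the style of Doove et al. (2014): pool the
    observed y values landing in the same terminal nodes as the missing
    case across all trees, then sample one of them as the imputation."""
    rng = np.random.default_rng(seed)
    rf = RandomForestRegressor(n_estimators=n_trees, bootstrap=True,
                               random_state=0).fit(X_obs, y_obs)
    leaves_obs = rf.apply(X_obs)   # (n_obs, n_trees) terminal-node ids
    leaves_mis = rf.apply(X_mis)   # (n_mis, n_trees)
    out = np.empty(len(X_mis))
    for i, row in enumerate(leaves_mis):
        # donors: observed cases sharing a terminal node in some tree
        pool = np.concatenate([y_obs[leaves_obs[:, t] == row[t]]
                               for t in range(n_trees)])
        out[i] = rng.choice(pool)
    return out
```

Because the imputation is a draw from observed donor values rather than a model prediction, repeating the draw inside MICE yields distinct completed data sets, which is what multiple imputation requires.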
This process was embedded into MICE and repeated to create multiple imputations. The approach is included in van Buuren's MICE package in R. Independently of Doove et al., Shah et al. (2014) also proposed using random forests for imputation, using a somewhat different approach:
1. Take a bootstrap sample $(y_{obs,bs}, x_{obs,bs})$ from $(y_{obs}, x_{obs})$.

2. Apply a standard random forest to $(y_{obs,bs}, x_{obs,bs})$.

3. Impute missing $Y$ values by taking a normal draw centered at the forest prediction, with residual variance equal to the out-of-bag mean squared error.
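A minimal sketch of this procedure, again using scikit-learn as an assumed stand-in for the original implementation (the function name and defaults are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_normal_draw_impute(X_obs, y_obs, X_mis, n_trees=100, seed=None):
    """Shah et al. (2014)-style draw: bootstrap the observed data, fit a
    standard RF, then impute with a normal draw whose variance is the
    forest's out-of-bag mean squared error."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y_obs), len(y_obs))   # step 1: bootstrap sample
    rf = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                               random_state=0)
    rf.fit(X_obs[idx], y_obs[idx])                  # step 2: standard RF
    oob_mse = np.mean((y_obs[idx] - rf.oob_prediction_) ** 2)
    # step 3: normal draw around the forest prediction
    return rng.normal(rf.predict(X_mis), np.sqrt(oob_mse))
```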
A recent comparison of the random forest-based MICE algorithm with standard
MICE showed that both methods produced unbiased estimates of (log) hazard
ratios using real-life data, but random forest MICE was more efficient and
produced narrower confidence intervals. Moreover, with simulated data in which
nonlinearity exists, parameter estimates were less biased using random forest
MICE, and confidence interval coverage was better (Shah et al., 2014). This
suggests that random forest imputation may be useful for imputing complex data
sets. A drawback of the approach, however, is the assumption of conditional
normality and constant variance. Moreover, the out-of-bag error is not the residual
variance; it is the residual variance plus bias (Mendez and Lohr, 2011).
Besides random forests, other machine learning paradigms, incorporating
elements such as Neuro-Fuzzy (NF) networks, Multilayer Perceptron (MLP)
Neural Networks (NN), and Genetic Algorithms (GA) (Abdella et al., 2005), have
been considered for missing data imputation as well. An empirical study showed
that random forests were superior at imputing missing data, in terms of both
accuracy and computation time, compared to auto-associative Neuro-Fuzzy
configurations (Pantanowitz and Marwala, 2008).
Depending on whether the missingness is related to the data values, three types of missingness are distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1976). Formally, let $Z$ denote the $n \times (p+1)$ data matrix, comprising the observed values $Z_{obs}$ and the missing values $Z_{mis}$, and let $R$ be the missing data indicator matrix, with $(i,j)$th element $R_{i,j} = 1$ if $Z_{ij}$ is observed and $R_{i,j} = 0$ if $Z_{ij}$ is missing. The notion of a missing data mechanism is then formalized in terms of the conditional distribution of $R$ given $Z$. MCAR implies the distribution of $R$ depends on neither $Z_{obs}$ nor $Z_{mis}$; MAR allows $R$ to depend on $Z_{obs}$ but not on $Z_{mis}$; while MNAR allows $R$ to depend on both $Z_{obs}$ and $Z_{mis}$. In other words, these three types of missingness can be distinguished as follows:

$$\text{MCAR}: \quad P(R \mid Z_{comp}) = P(R)$$
$$\text{MAR}: \quad P(R \mid Z_{comp}) = P(R \mid Z_{obs})$$
$$\text{MNAR}: \quad P(R \mid Z_{comp}) = P(R \mid Z_{obs}, Z_{mis})$$
The missingness mechanism in regression was illustrated by Little (Little, 1992)
in a univariate missing data example. In a dataset consisting of $X = (X_1, \ldots, X_p)$
and $Y$, only $X_1$ has missing values. The probability that $X_1$ is missing for a case
may be (a) independent of the data values (MCAR), (b) dependent on the value of
$X_1$ for that case (MNAR), or (c) dependent on the values of $X_2, \ldots, X_p$ and $Y$
for that case (MAR). In this study, all three types of missingness were investigated.
In this study, we first proposed and implemented some novel enhancements to RF
imputation algorithms, including multivariate missForest (mForest), which greatly
improves computation speed compared to missForest. We then compared several
RF imputation methods in detail. The goal of the investigation is twofold: (1) to
assess the performance, in terms of imputation accuracy and speed, of random
forest based imputation methods compared with some popular
non-random-forest based methods; and (2) to determine, when the data analyst
intends to handle missing values using a random forest based method and then
carry out variable selection using random forests, which imputation method gives
the best result. In other words, we would like to identify the imputation method
that minimizes the effect of missing values on the results of variable selection. To
answer the first question, we performed a large scale empirical study using several
RF missing data algorithms. Performance was assessed by imputation accuracy
and computational speed. Different missing data mechanisms (missing at random
and missing not at random) were used to assess robustness. In addition to the RF
algorithms described above, we also considered several new algorithms, including
a multivariate version of missForest, referred to as mForest. Despite the superior
performance of missForest, the algorithm is computationally expensive to
implement in high dimensions, as a separate RF must be run for each variable.
The mForest algorithm alleviates this problem by grouping variables and running
a multivariate forest using each group in turn as the set of dependent variables.
This replaces $p$ regressions, where $p$ is the number of variables, with
approximately $1/\alpha$ regressions, where $0 < \alpha < 1$ is the user-specified group
fraction size. Computational savings are found to be substantial for mForest,
without overly compromising accuracy even for relatively large $\alpha$. Other RF
algorithms studied included a new multivariate unsupervised algorithm and
algorithms utilizing random splitting.
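To convey the grouping idea behind mForest, here is a schematic Python sketch for a purely continuous data matrix; it is not the dissertation's randomForestSRC implementation, and the function name, mean initialization, and fixed iteration count are simplifying assumptions. Variables are partitioned into roughly $1/\alpha$ groups, and each group is imputed in turn with one multivariate forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mforest_impute(X, alpha=0.25, n_iter=5, seed=0):
    """Schematic mForest-style imputation for a continuous matrix X with
    np.nan marking missing entries: ~1/alpha multivariate regressions per
    sweep instead of p univariate ones."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])   # crude initial fill
    p = X.shape[1]
    groups = np.array_split(rng.permutation(p), max(1, round(1 / alpha)))
    for _ in range(n_iter):
        for g in groups:                              # group = response block
            others = np.setdiff1d(np.arange(p), g)
            has_miss = miss[:, g].any(axis=1)
            if not has_miss.any() or has_miss.all():
                continue
            obs_rows = np.where(~has_miss)[0]
            mis_rows = np.where(has_miss)[0]
            Yt = X[np.ix_(obs_rows, g)]
            rf = RandomForestRegressor(n_estimators=50, random_state=0)
            rf.fit(X[np.ix_(obs_rows, others)],
                   Yt.ravel() if Yt.shape[1] == 1 else Yt)
            pred = rf.predict(X[np.ix_(mis_rows, others)])
            pred = pred.reshape(len(mis_rows), -1)
            for k, r in enumerate(mis_rows):
                m = miss[r, g]                        # originally missing coords
                X[r, g[m]] = pred[k, m]
    return X
```

Observed entries are never overwritten; only the originally missing cells are updated on each sweep, mirroring the iterative refinement used by missForest.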
The organization of the thesis is as follows.

1. Chapter 2 reviews CART trees and random forests, including the splitting rules, VIMP, and minimal depth, the latter two being measures used for variable selection.

2. Chapter 3 reviews the existing random forest based approaches for handling incomplete data.

3. Chapter 4 introduces the novel enhancements of RF missing data methods.

4. Chapter 5 presents the imputation performance comparison of these methods, in terms of imputation accuracy and speed.

5. Chapter 6 shows how VIMP and minimal depth are affected by different methods of handling missing data.

6. Chapter 7 presents the MESA data analysis.
Chapter 2
CART Trees and Random Forest
Classification and regression trees (CART) is a machine learning method commonly used in data mining. It constructs trees by conducting binary splits on predictor variables, aiming to produce homogeneous subsets of the data. Despite their advantages, CART trees are unstable. Random forest consists of bagged trees with randomness injected by (1) bootstrapping and (2) selecting from a random subset of variables to split on at each node. Commonly used univariate splitting rules include (1) the twoing criterion, (2) the entropy criterion, and (3) the Gini criterion. The multivariate regression splitting rule is based on the weighted sample variance. The multivariate classification splitting rule is an extension of the Gini splitting rule. The multivariate mixed splitting rule is a combination of the weighted sample variance and the weighted multivariate Gini splitting rule. VIMP (Variable Importance Measure) and minimal depth measure how important a variable is for predicting the outcome; they can be used for variable selection.
2.1 Classification and regression trees
Classification trees use recursive partitioning to classify a $p$-dimensional feature $x \in \mathcal{X}$ into one of $J$ class labels for a categorical outcome $Y \in \{C_1, \ldots, C_J\}$. The tree is constructed from a learning sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Many classification rules work by partitioning $\mathcal{X}$ into disjoint regions $R_{n,1}, R_{n,2}, \ldots$. For a given $x \in \mathcal{X}$, such a classifier is defined as

$$\hat{C}_n(x) = \operatorname*{arg\,max}_{1 \le j \le J} \sum_{i=1}^{n} 1_{\{Y_i = C_j\}} 1_{\{X_i \in R_n(x)\}}, \qquad (2.1)$$

where $R_n(x) \in \{R_{n,1}, R_{n,2}, \ldots\}$ is the partition region (cell) containing $x$.

Regression trees are constructed using recursive partitioning similar to classification trees, with the outcome variable being continuous instead of categorical. The split-point is chosen to minimize the weighted sample variance.
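As a concrete illustration of (2.1), a fitted tree assigns $x$ to the terminal region containing it and predicts the majority class of that region. A minimal sketch with scikit-learn (the iris data and depth limit are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns the id of the terminal region R_n(x) containing each x;
# predict() returns the majority class within that region, as in (2.1).
regions = tree.apply(X[:2])
labels = tree.predict(X[:2])
```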
2.2 Splitting rules
The general principle of splitting rule is the reduction of tree impurity, because it
encourages the tree to push dissimilar cases apart.
2.2.1 Univariate splitting
Often used splitting rules include the twoing criterion, the entropy criterion, and the
Gini criterion. The Gini splitting rule is arguably the most popular and is defined
as follows. Let $h$ be a tree node that is being split, and let $p_j(h)$ denote the proportion of class $j$ cases in $h$. Let $s$ be a proposed split for a variable $x$ that splits $h$ into left and right daughter nodes $h_L$ and $h_R$, where

$$h_L := \{x_i : x_i \le s\}, \qquad h_R := \{x_i : x_i > s\}.$$
Let $N = |h|$, $N_L = |h_L|$, and $N_R = |h_R|$ denote the number of cases in $h$, $h_L$, and $h_R$ (note that $N = N_L + N_R$). The Gini node impurity for $h$ is defined as

$$\hat{\phi}(h) = \sum_{j=1}^{J} p_j(h)\,(1 - p_j(h)).$$

The Gini node impurity for $h_L$ is

$$\hat{\phi}(h_L) = \sum_{j=1}^{J} p_j(h_L)\,(1 - p_j(h_L)),$$

where $p_j(h_L)$ is the class frequency for class $j$ in $h_L$; $\hat{\phi}(h_R)$ is defined in a similar way. The decrease in the node impurity is

$$\hat{\phi}(h) - \Big[ p(h_L)\,\hat{\phi}(h_L) + p(h_R)\,\hat{\phi}(h_R) \Big],$$

where $p(h_L) = N_L/N$ and $p(h_R) = N_R/N$ are the proportions of observations in $h_L$ and $h_R$, respectively. The quantity

$$\hat{\theta}(s, h) = p(h_L)\,\hat{\phi}(h_L) + p(h_R)\,\hat{\phi}(h_R)$$

is referred to as the Gini index or Gini splitting rule. The best split on $x$ is the split-point $s$ maximizing the decrease in node impurity, which is equivalent to minimizing the Gini index $\hat{\theta}(s, h)$ with respect to $s$.
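As a concrete illustration of the Gini splitting rule above, the following sketch (the function names are our own) computes $\hat{\phi}(h)$ and $\hat{\theta}(s,h)$ for a univariate split and picks the minimizing split-point:

```python
import numpy as np

def gini(labels):
    """Gini node impurity: sum_j p_j (1 - p_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def gini_index(x, y, s):
    """Weighted daughter impurity theta(s, h) for the split x <= s."""
    left, right = y[x <= s], y[x > s]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(x, y):
    """The best split-point minimizes the Gini index over candidate s."""
    candidates = np.unique(x)[:-1]   # splitting above the max leaves h_R empty
    return min(candidates, key=lambda s: gini_index(x, y, s))
```

For a toy node where the classes separate at $x = 3$, `best_split` returns 3.0 and `gini_index` there is 0, i.e. both daughters are pure.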
The twoing criterion (Breiman et al., 1984, pp. 104-106) is designed to find the grouping of all $J$ classes into two superclasses that leads to the greatest decrease in node impurity when considered as a two-class problem. Under Gini impurity, the best twoing split is the value of $s$ maximizing

$$\hat{\theta}(s, h) = \frac{\hat{P}(h_L)\,\hat{P}(h_R)}{4} \left[ \sum_{j=1}^{J} \big| p_j(h_L) - p_j(h_R) \big| \right]^2.$$

The entropy impurity is

$$\hat{\phi}(h) = -\sum_{j=1}^{J} p_j(h) \log p_j(h).$$

Thus, the best split on $x$ under entropy is the value of $s$ maximizing

$$\hat{\theta}(s, h) = \hat{P}(h_L) \sum_{j=1}^{J} p_j(h_L) \log p_j(h_L) + \hat{P}(h_R) \sum_{j=1}^{J} p_j(h_R) \log p_j(h_R).$$
2.2.2 Multivariate splitting
Regression multivariate splitting
We shall denote the learning data by $(\mathbf{X}_i, \mathbf{Y}_i)_{1 \le i \le n}$, where $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,p})$ is a $p$-dimensional feature and $\mathbf{Y}_i = (Y_{i,1}, \ldots, Y_{i,q})$ is a $q \ge 1$ dimensional response. We shall denote a generic coordinate of the multivariate feature by $X$ and refer to $X$ as a variable (i.e., covariate). For example, $X_i$ refers to the value of the covariate for $\mathbf{X}_i$. Multivariate regression corresponds to the case when the $Y_{i,j}$ are continuous. To define the multivariate regression splitting rule we begin by considering univariate ($q = 1$) regression.
Consider splitting a regression tree $T$ at a node $t$. Let $s$ be a proposed split for a variable $X$ that splits $t$ into left and right daughter nodes $t_L := t_L(s)$ and $t_R := t_R(s)$, where $t_L$ are the cases $\{X_i \le s\}$ and $t_R$ are the cases $\{X_i > s\}$. Regression node impurity is determined by the within-node sample variance. The impurity of $t$ is

$$\hat{\phi}(t) = \frac{1}{N} \sum_{X_i \in t} (Y_i - \bar{Y}_t)^2,$$

where $\bar{Y}_t = \frac{1}{N} \sum_{X_i \in t} Y_i$ is the sample mean for $t$ and $N = |t|$ is the sample size of $t$ (note that $N = n$ only when $t$ is the root node). The within sample variance for a daughter node, say $t_L$, is

$$\hat{\phi}(t_L) = \frac{1}{N_L} \sum_{i \in X(s,t)} (Y_i - \bar{Y}_{t_L})^2, \qquad X(s, t) = \{X_i \in t,\ X_i \le s\},$$

where $\bar{Y}_{t_L}$ is the sample mean for $t_L$ and $N_L$ is the sample size of $t_L$. The decrease in impurity under the split $s$ for $X$ equals

$$\hat{\Delta}(s, t) = \hat{\phi}(t) - \Big[ p(t_L)\,\hat{\phi}(t_L) + p(t_R)\,\hat{\phi}(t_R) \Big],$$

where $p(t_L) = N_L/N$ and $p(t_R) = N_R/N$. The optimal split-point $s_N$ maximizes the decrease in impurity $\hat{\Delta}(s, t)$ (Chapter 8.4; Breiman et al., 1984), which is equivalent to minimizing

$$\hat{D}_W(s, t) = p(t_L)\,\hat{\phi}(t_L) + p(t_R)\,\hat{\phi}(t_R),$$

which can be written as

$$\hat{D}_W(s, t) = \frac{1}{N} \sum_{i \in t_L} (Y_i - \bar{Y}_{t_L})^2 + \frac{1}{N} \sum_{i \in t_R} (Y_i - \bar{Y}_{t_R})^2.$$

In other words, CART seeks the split-point $s_N$ that minimizes the weighted sample variance.
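The weighted sample variance criterion $\hat{D}_W(s,t)$ can be evaluated directly; in this toy sketch (the function name and data are illustrative) the minimizing split-point recovers the change-point in the response:

```python
import numpy as np

def weighted_variance(x, y, s):
    """D_W(s, t): within-daughter sums of squares divided by the node size N."""
    left, right = y[x <= s], y[x > s]
    return (np.sum((left - left.mean()) ** 2)
            + np.sum((right - right.mean()) ** 2)) / len(y)

# toy node: the response mean jumps at x = 5, so the best split sits near 5
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 50))
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.1, 50)
s_best = min(x[:-1], key=lambda s: weighted_variance(x, y, s))
```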
We extend the weighted sample variance rule to the multivariate case $q > 1$ by applying the splitting rule to each coordinate separately. We seek to minimize

$$\hat{D}^*_W(s, t) = \sum_{j=1}^{q} W_j \left\{ \sum_{i \in t_L} (Y_{i,j} - \bar{Y}_{t_L j})^2 + \sum_{i \in t_R} (Y_{i,j} - \bar{Y}_{t_R j})^2 \right\},$$

where $0 < W_j < 1$ are prespecified weights for weighting the importance of coordinate $j$ of the response $\mathbf{Y}$, and $\bar{Y}_{t_L j}$ and $\bar{Y}_{t_R j}$ are the sample means of the $j$-th coordinate in the left and right daughter nodes. Notice that such a splitting rule can only be effective if each of the coordinates of $\mathbf{Y}$ is measured on the same scale; otherwise a coordinate $j$ with, say, enormous values would dominate $\hat{D}^*_W(s, t)$. We can calibrate $\hat{D}^*_W(s, t)$ using the $W_j$, but it is more convenient to assume that each coordinate has been standardized:

$$\frac{1}{N} \sum_{i \in t} Y_{i,j} = 0, \qquad \frac{1}{N} \sum_{i \in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q. \tag{2.2}$$

The standardization is applied prior to splitting the node $t$. With some elementary manipulations, it is easily verified that minimizing $\hat{D}^*_W(s, t)$ is equivalent to maximizing

$$\hat{D}^{\star}_W(s, t) = \sum_{j=1}^{q} W_j \left\{ \frac{1}{N_L} \left( \sum_{i \in t_L} Y_{i,j} \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Y_{i,j} \right)^2 \right\}. \tag{2.3}$$

Rule (2.3) is the multivariate splitting rule used for multivariate continuous outcomes.
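To make rule (2.3) concrete, the following sketch (the function name and uniform weights are our assumptions) standardizes each response coordinate as in (2.2) and scores a candidate split; a split aligned with a common mean shift scores higher than a random one:

```python
import numpy as np

def multivariate_split_score(Y, left_mask, W=None):
    """Evaluate the criterion of (2.3): standardize each response coordinate
    within the node as in (2.2), then sum the per-daughter squared
    coordinate sums, scaled by 1/N_L and 1/N_R."""
    n, q = Y.shape
    W = np.ones(q) if W is None else W                 # uniform weights
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)           # standardization (2.2)
    L, R = Z[left_mask], Z[~left_mask]
    return float(sum(W[j] * (L[:, j].sum() ** 2 / len(L)
                             + R[:, j].sum() ** 2 / len(R))
                     for j in range(q)))

# two-coordinate response with a mean shift between the first and second half
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
aligned = np.arange(60) < 30            # split aligned with the shift
random_mask = rng.permutation(aligned)  # arbitrary split of the same sizes
```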
Categorical (classification) multivariate splitting

Now we consider the effect of Gini splitting when $Y_{i,j}$ is categorical. First consider the univariate case (i.e., the multiclass problem), where the outcome $Y$ is a class label $Y \in \{1, \ldots, K\}$ with $K \ge 2$. Consider growing a classification tree using Gini splitting. Let $\hat{\gamma}_k(t)$ denote the class frequency for class $k$ in a node $t$. The Gini node impurity for $t$ is defined as

$$\hat{\phi}(t) = \sum_{k=1}^{K} \hat{\gamma}_k(t)\,(1 - \hat{\gamma}_k(t)).$$

The decrease in the node impurity from splitting $X$ at $s$ is

$$\hat{\Delta}(s, t) = \hat{\phi}(t) - \Big[ p(t_L)\,\hat{\phi}(t_L) + p(t_R)\,\hat{\phi}(t_R) \Big].$$

The quantity

$$\hat{G}(s, t) = p(t_L)\,\hat{\phi}(t_L) + p(t_R)\,\hat{\phi}(t_R)$$

is the Gini index, which is minimized with respect to $s$.

Let $N_{k,L} = \sum_{i \in t_L} 1_{\{Y_i = k\}}$ and $N_{k,R} = \sum_{i \in t_R} 1_{\{Y_i = k\}}$. With some algebra one can show that minimizing $\hat{G}(s, t)$ is equivalent to maximizing

$$\sum_{k=1}^{K} \frac{N_{k,L}^2}{N_L} + \sum_{k=1}^{K} \frac{N_{k,R}^2}{N_R} = \sum_{k=1}^{K} \left[ \frac{1}{N_L} \left( \sum_{i \in t_L} Z_{i(k)} \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Z_{i(k)} \right)^2 \right], \tag{2.4}$$

where $Z_{i(k)} = 1_{\{Y_i = k\}}$. Notice the similarity to (2.3). When $q = 1$, the regression splitting rule (2.3) can be written as

$$\hat{D}^{\star}_W(s, t) = \frac{1}{N_L} \left( \sum_{i \in t_L} Y_i \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Y_i \right)^2.$$

This is equivalent to each of the summands in (2.4) if we set $Y_i = Z_{i(k)}$. Indeed, (2.4) is equivalent to (2.3) if in (2.3) we replace $j$ with $k$, $q$ with $K$, set $Y_{i,j} = Z_{i(k)}$, and assume uniform weighting $W_1 = \cdots = W_K = 1/K$.

When $q > 1$ we apply Gini splitting to each coordinate of $\mathbf{Y}$, yielding the extended Gini splitting rule

$$\hat{G}^*(s, t) = \sum_{j=1}^{q} \frac{1}{K_j} \sum_{k=1}^{K_j} \left[ \frac{1}{N_L} \left( \sum_{i \in t_L} Z_{i(k),j} \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Z_{i(k),j} \right)^2 \right], \tag{2.5}$$

where $Z_{i(k),j} = 1_{\{Y_{i,j} = k\}}$. Notice that (2.5) is equivalent to (2.3), but with an additional summation over the class labels $k = 1, \ldots, K_j$ for each $j$. The normalization $1/K_j$ employed for a coordinate $j$ is required to standardize the contribution of the Gini split from that coordinate.
Multivariate mixed outcome splitting
The equivalence between Gini splitting for categorical responses and weighted variance splitting for continuous responses now points to a means of splitting multivariate mixed outcomes. Let $Q_R \subseteq \{1, \ldots, q\}$ denote the coordinates of the continuous outcomes in $\mathbf{Y}$ and let $Q_C$ denote the coordinates having categorical outcomes. Notice that $Q_R \cup Q_C = \{1, \ldots, q\}$. The mixed outcome splitting rule is

$$\Theta(s, t) = \sum_{j \in Q_C} W_j \frac{1}{K_j} \sum_{k=1}^{K_j} \left[ \frac{1}{N_L} \left( \sum_{i \in t_L} Z_{i(k),j} \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Z_{i(k),j} \right)^2 \right] + \sum_{j \in Q_R} W_j \left\{ \frac{1}{N_L} \left( \sum_{i \in t_L} Y_{i,j} \right)^2 + \frac{1}{N_R} \left( \sum_{i \in t_R} Y_{i,j} \right)^2 \right\}.$$

The standard practice is to use uniform weighting $W_1 = \cdots = W_q = 1$. Also recall that the $Y_{i,j}$ for $j \in Q_R$ are standardized as in (2.2) prior to splitting $t$.
2.2.3 Survival outcome splitting
Segal, Intrator, and LeBlanc and Crowley use as the prediction rule the Kaplan-Meier estimate of the survival distribution, and as the splitting rule a test for measuring differences between distributions adapted to censored data, such as the log-rank test or the Wilcoxon test. These statistics are weighted versions of the log-rank statistic, where the weights allow flexibility in emphasizing differences between two survival curves at early times (the left tail of the distribution), middle times, or late times (the right tail of the distribution). In particular, an observation at time $t$ is weighted by

$$Q(t) = \hat{S}(t)^{\rho} \big(1 - \hat{S}(t)\big)^{\gamma},$$

where $\hat{S}$ is the Kaplan-Meier estimate of the survival curve for both samples combined. Thus we can obtain sensitivity to early occurring differences by taking $\rho > 0$ and $\gamma \approx 0$, emphasize differences in the middle by taking $\rho \approx 0$ and $\gamma \approx 1$, and emphasize late differences by taking $\rho \approx 0$ and $\gamma > 0$.

Three different survival splitting rules can be used: (i) a log-rank splitting rule, the default splitting rule; (ii) a conservation-of-events splitting rule; and (iii) a log-rank score rule.
Notation
Assume we are at node $h$ of a tree during its growth and that we seek to split $h$ into two daughter nodes. We introduce some notation to help discuss how the various splitting rules determine the best split. Assume that within $h$ there are $n$ individuals. Denote their survival times and 0-1 censoring information by $(T_1, \delta_1), \ldots, (T_n, \delta_n)$. An individual $l$ is said to be right censored at time $T_l$ if $\delta_l = 0$; otherwise the individual is said to have died at $T_l$ if $\delta_l = 1$. In the case of death, $T_l$ is referred to as an event time, and the death as an event. An individual $l$ who is right censored at $T_l$ is known to have been alive at $T_l$, but the exact time of death is unknown. A proposed split at node $h$ on a given predictor $x$ is always of the form $x \le c$ and $x > c$. Such a split forms two daughter nodes (a left and a right daughter) and two new sets of survival data. A good split maximizes the survival differences across the two sets of data. Let $t_1 < t_2 < \cdots < t_N$ be the distinct death times in the parent node $h$, and let $d_{i,j}$ and $Y_{i,j}$ equal the number of deaths and the number of individuals at risk at time $t_i$ in daughter node $j = 1, 2$. Note that $Y_{i,j}$ is the number of individuals in daughter $j$ who are alive at time $t_i$, or who have an event (death) at time $t_i$. More precisely,

$$Y_{i,1} = \#\{T_l \ge t_i,\ x_l \le c\}, \qquad Y_{i,2} = \#\{T_l \ge t_i,\ x_l > c\},$$

where $x_l$ is the value of $x$ for individual $l = 1, \ldots, n$. Finally, define $Y_i = Y_{i,1} + Y_{i,2}$ and $d_i = d_{i,1} + d_{i,2}$. Let $n_j$ be the total number of observations in daughter $j$. Thus, $n = n_1 + n_2$. Note that $n_1 = \#\{l : x_l \le c\}$ and $n_2 = \#\{l : x_l > c\}$.
Log-rank splitting
The log-rank test for a split at the value c for predictor x is
$$L(x, c) = \frac{\displaystyle\sum_{i=1}^{N}\left(d_{i,1} - Y_{i,1}\,\frac{d_i}{Y_i}\right)}{\sqrt{\displaystyle\sum_{i=1}^{N}\frac{Y_{i,1}}{Y_i}\left(1 - \frac{Y_{i,1}}{Y_i}\right)\left(\frac{Y_i - d_i}{Y_i - 1}\right)d_i}}.$$
The value |L(x, c)| is the measure of node separation. The larger the value of |L(x, c)|, the greater the difference between the two groups, and the better the split. In particular, the best split at node h is determined by finding the predictor $x^\star$ and split value $c^\star$ such that $|L(x^\star, c^\star)| \ge |L(x, c)|$ for all x and c.
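To make the computation concrete, the statistic can be evaluated directly from the at-risk counts $Y_{i,1}$, $Y_i$ and death counts $d_{i,1}$, $d_i$. The following Python sketch is our own illustration (the function name and data layout are not from randomForestSRC); it computes L(x, c) for a proposed split point c:

```python
import math

def logrank_split_stat(times, events, x, c):
    """Log-rank statistic L(x, c) for the split {x <= c} vs {x > c}.

    times  : observed times T_l
    events : censoring indicators (1 = death/event, 0 = censored)
    x      : values of the candidate split variable
    c      : proposed split point
    """
    # distinct death times t_1 < ... < t_N in the parent node
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    num, var = 0.0, 0.0
    for ti in death_times:
        # Y_{i,1}: at risk in the left daughter; Y_i: at risk overall
        y1 = sum(1 for t, xv in zip(times, x) if t >= ti and xv <= c)
        y = sum(1 for t in times if t >= ti)
        # d_{i,1}: deaths at t_i in the left daughter; d_i: deaths overall
        d1 = sum(1 for t, e, xv in zip(times, events, x)
                 if t == ti and e == 1 and xv <= c)
        d = sum(1 for t, e in zip(times, events) if t == ti and e == 1)
        num += d1 - y1 * d / y
        if y > 1:
            var += (y1 / y) * (1 - y1 / y) * ((y - d) / (y - 1)) * d
    return num / math.sqrt(var) if var > 0 else 0.0
```

A split that cleanly separates early deaths from late deaths produces a large |L(x, c)|, while a split placing everyone in one daughter yields zero.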
Log-rank score splitting
Another useful splitting rule is the log-rank score test introduced by Hothorn and Lausen (Hothorn et al., 2003). To describe this rule, assume the predictor x has been ordered so that $x_1 \le x_2 \le \cdots \le x_n$. Now, compute the ranks for each survival time $T_l$,
$$a_l = \delta_l - \sum_{k=1}^{\Gamma_l}\frac{\delta_k}{n - \Gamma_k + 1},$$
where $\Gamma_k = \#\{t : T_t \le T_k\}$. The log-rank score test is defined as
$$S(x, c) = \frac{\displaystyle\sum_{x_l \le c} a_l - n_1\bar{a}}{\sqrt{n_1\left(1 - \dfrac{n_1}{n}\right)s_a^2}},$$
where $\bar{a}$ and $s_a^2$ are the sample mean and sample variance of $\{a_l : l = 1, \ldots, n\}$. Log-rank score splitting defines the measure of node separation by |S(x, c)|. Maximizing this value over x and c yields the best split.
2.3 Random forest
Random forest, introduced by Leo Breiman (Breiman, 2001), can be loosely viewed as an ensemble of CART trees. In random forests, the base learner is a binary recursive tree grown using random input selection. Its random feature is formed by (1) selecting a small group of input variables at random to split on at each node, and (2) bootstrapping the original data. It is similar to bagging in that the terminal nodes of the tree contain the predicted values, which are tree-aggregated to obtain the ensemble predictor. In random forest, the bootstrapped sample for each tree is called the in-bag data, while the data that were not sampled are called the out-of-bag (OOB) data. The prediction accuracy of random forest is assessed using the OOB data, which are observations that were not used in constructing the tree. As a result, a more realistic estimate of prediction performance can be provided. Unlike a single CART tree, the trees in a random forest usually are not pruned. Breiman (2001) showed that random forests do not overfit as the number of trees increases.
The random forest algorithm is as follows.
Algorithm 1 Random Forest algorithm
1: Draw a bootstrap sample of the original data.
2: Grow a tree using data from Step 1. When growing the tree, at each node in the tree, determine the optimal split for the node using m < p randomly selected variables. Grow the tree so that each terminal node contains no fewer than n0 ≥ 1 cases.
3: Repeat Steps 1-2, B > 1 times independently.
4: Combine the B trees to form the ensemble predictor.
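The steps above can be sketched in code. The following toy Python implementation is our own illustration, not the actual RF implementation: it uses depth-one trees (stumps) as base learners rather than fully grown trees, but it exercises all four steps: bootstrapping, random feature selection, independent replication, and ensemble aggregation.

```python
import random
from collections import Counter

def best_stump(X, y, feats):
    """Exhaustive one-level split over the candidate features `feats`."""
    best = None  # (misclassification count, feature, threshold, left label, right label)
    for f in feats:
        for c in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= c]
            right = [yi for row, yi in zip(X, y) if row[f] > c]
            if not left or not right:
                continue
            ll = Counter(left).most_common(1)[0][0]
            rl = Counter(right).most_common(1)[0][0]
            err = sum(yi != ll for yi in left) + sum(yi != rl for yi in right)
            if best is None or err < best[0]:
                best = (err, f, c, ll, rl)
    return best

def random_forest(X, y, B=25, mtry=1, seed=0):
    """Steps 1-3 of Algorithm 1 with stumps as base learners."""
    rng = random.Random(seed)
    p, n = len(X[0]), len(X)
    forest = []
    for _ in range(B):                                    # Step 3: B replicates
        idx = [rng.randrange(n) for _ in range(n)]        # Step 1: bootstrap
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(p), mtry)                # m < p random variables
        stump = best_stump(Xb, yb, feats)                 # Step 2: grow the tree
        if stump:
            forest.append(stump)
    return forest

def predict(forest, row):
    """Step 4: combine the trees by majority vote."""
    votes = [(ll if row[f] <= c else rl) for _, f, c, ll, rl in forest]
    return Counter(votes).most_common(1)[0][0]
```

On a linearly separable toy classification problem, the ensemble vote recovers the class labels even though each stump sees only one randomly chosen variable.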
2.4 VIMP
Random forest is appreciated not only for its superior prediction accuracy, but also for its intrinsic ability to deal with complicated interactions. Multiple methods have been developed to perform variable selection using random forest. These methods are based on two measures: the variable importance measure (VIMP), or the minimal depth measure.
Variable importance, or VIMP, equals the amount of change in prediction error, which can be either an increase or a decrease, when a particular variable is noised up, either by permutation or by random assignment of observations to the child nodes when the parent node is split on this given variable.
The permutation importance is assessed by comparing the prediction accuracy, in terms of correct classification rate or mean squared error (MSE), of a tree before and after random permutation of a predictor variable. Permutation destroys the original association with the response, so the accuracy is expected to drop for a relevant predictor. Therefore, when the difference in accuracy before and after permutation is small, the importance of $X_j$ in predicting the response variable is small. In contrast, if the prediction accuracy drops substantially after the permutation, there is a strong association between $X_j$ and the response variable. One algorithm for computing the importance score by permutation is as follows.
Algorithm 2 Importance score calculation by permutation (by forest)
1: for i = 1, ..., n do
2: Compute the OOB accuracy of a RF
3: Permute the predictor variable of interest in the OOB observations
4: Compute the OOB accuracy of the RF
5: end for
6: The importance score is the average difference
Another strategy for calculating VIMP is to calculate the change in prediction accuracy by permuting within each tree (Ishwaran et al., 2007). A given variable $x_v$ is randomly permuted in the out-of-bag (OOB) data, and the noised-up OOB data are dropped down the tree grown from the in-bag data. This is done for each tree in the forest and an out-of-bag estimate of prediction error is computed. The difference between this new prediction error and the original out-of-bag prediction error without the permutation is the VIMP of $x_v$. The R package randomForestSRC that we use in this study uses this strategy. The algorithm based on this strategy is as follows.
Algorithm 3 Importance score calculation by permutation (by trees)
1: for k = 1, ..., N do
2: Compute the OOB accuracy of the kth tree
3: Permute the predictor variable of interest ($x_v$) in the OOB observations
4: Compute the OOB accuracy of the kth tree using the permuted data
5: Compute the change of OOB accuracy for the kth tree
6: end for
7: The importance score is the average change of OOB accuracy over all N trees
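The core permutation step shared by Algorithms 2 and 3 can be sketched for a generic fitted predictor (a simplified illustration; the function name and data layout are ours, and a single predictor function stands in for the forest or the individual trees):

```python
import random

def permutation_importance(predict, X, y, var, n_rep=10, seed=1):
    """Average drop in accuracy after permuting column `var`.

    predict : fitted model, mapping a row to a predicted label
    X, y    : held-out (e.g. OOB) rows and true labels
    """
    rng = random.Random(seed)
    base = sum(predict(row) == t for row, t in zip(X, y)) / len(y)
    drops = []
    for _ in range(n_rep):
        col = [row[var] for row in X]
        rng.shuffle(col)  # noise up the variable by permutation
        Xp = [row[:var] + [v] + row[var + 1:] for row, v in zip(X, col)]
        acc = sum(predict(row) == t for row, t in zip(Xp, y)) / len(y)
        drops.append(base - acc)
    return sum(drops) / n_rep
```

A variable the model relies on yields a positive score; a variable the model ignores yields a score of zero.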
As indicated by its name, the variable importance measure rates a variable's relevance for prediction. Not surprisingly, it has been used for variable selection as well. Variable selection methods using VIMP fall into one of two categories: performance based methods, or test based methods. One example of a performance based method is an algorithm proposed by Diaz et al. (Diaz-Uriate and Alvarez de Andres, 2006), which uses random forest and VIMP to select genes by iteratively fitting random forests. At each iteration, a new forest is built after discarding those variables (genes) with the smallest variable importances; the selected set of genes is the one that yields the same OOB error rate, within a predefined standard error, as the minimum error rate over all forests. In another example, Genuer et al. (Genuer et al., 2010) proposed a strategy involving a ranking of explanatory variables using VIMP and a stepwise ascending variable introduction strategy. The test based variable selection methods apply a permutation test framework to estimate the significance of a variable's importance. For instance, in a method suggested by Hapfelmeier and Ulm, VIMP for a variable is recomputed after the variable is permuted. This procedure is repeated many times to assess the empirical distribution of importances under the null hypothesis that the variable is not predictive of the outcome. A p-value, reflecting the likelihood of the original VIMP within this empirical distribution, can then be calculated. Variables with p-values less than a predefined threshold are selected.
2.5 Minimal depth
Minimal depth, defined as the distance from the root node to the root of the closest maximal v-subtree for a variable v, is another measure that can be used for variable selection (Ishwaran et al., 2010). It measures how far a case travels down a tree before encountering the first split on variable v. A small minimal depth for a variable implies high predictive ability of the variable. The smallest possible minimal depth is 0, which means the variable splits the root node.
Chapter 3
Existing RF Missing Data Approaches
The existing RF missing data approaches are: the proximity approach, the "on-the-fly-imputation" (OTFI) approach, and the missForest approach. A disadvantage of the proximity approach is that OOB (out-of-bag) estimates for prediction error are biased. MissForest appears to be superior in imputation accuracy; however, it is computationally costly. The KNN imputation algorithm is also listed here because trees are adaptive nearest neighbors.
3.1 Missing data approaches in CART
One commonly used algorithm for treating missing data in CART is based on the idea of a surrogate split [Chapter 5.3, Breiman et al. (1984)]. If s is the best split for a variable x, the surrogate split for s is the split $s^\star$ using some other variable $x^\star$ such that $s^\star$ and s are closest to one another in terms of predictive association [Breiman et al. (1984)]. The CART algorithm uses the best surrogate split among those variables not missing for the case to assign a case that has a missing value for the variable used to split a node.
The surrogate splitting method is not suited for random forests, since RF randomly selects variables when splitting a node. A reasonable surrogate split may not exist within a node, as the randomly selected candidate variables may be uncorrelated. Speed is one issue: finding a surrogate split is computationally intensive and may become infeasible when growing a large number of trees for random forests. A further concern is that surrogate splitting alters the interpretation of a variable, which affects measures such as VIMP. For these reasons, a different strategy is required for RF.
3.2 Overall strategies of missing data approaches in RF
Three strategies can and have been used in random forest missing value imputation.
1. Preimpute the data, then grow the forest, and update the original missing
values using certain criteria (such as proximity), based on the grown forest;
iterate for improved results.
2. Impute as the trees are grown; update by summary imputation; iterate for
improved results.
3. Preimpute, grow forest for each variable that has missing values, predict the
missing values using the grown forest, update the missing values with the
predicted values; iterate for improved results.
3.3 Proximity imputation
The proximity approach (Breiman, 2003; Liaw and Wiener, 2002) uses strategy one, while the adaptive tree method (Ishwaran et al., 2008) uses strategy two. A newer imputation method named missForest (Stekhoven and Buhlmann, 2012), which predicts the missing values using a forest grown with the variable that has missing values as the response, was proposed in 2011 and falls into the third strategy.
The proximity approach works as follows. First, the data are roughly imputed by replacing missing values for continuous variables with the median of non-missing values, and by replacing missing values for categorical variables with the most frequently occurring non-missing value. A RF is then fit with the roughly imputed data, and a 'proximity matrix' is calculated from the fitted RF. The proximity matrix, an n × n symmetric matrix whose (i, j) entry records the frequency with which cases i and j occur within the same terminal node, is then used for imputing the data. For continuous variables, the missing values are imputed with the proximity-weighted average of the non-missing data; for integer variables, the missing values are imputed with the integer value having the largest average proximity over non-missing data. The updated data are then used as an input to RF, and the procedure is iterated. The iterations end when a stable solution is reached.
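The proximity update for a single continuous variable can be sketched as follows (a simplified illustration; the function name and data layout are ours, and the proximity matrix is assumed to have been computed from a fitted RF):

```python
def proximity_impute(x, missing, prox):
    """One proximity update of a continuous variable.

    x       : current (roughly imputed) values, length n
    missing : indices of the originally missing entries
    prox    : n x n proximity matrix; prox[i][j] is the fraction of trees
              in which cases i and j share a terminal node
    """
    out = list(x)
    observed = [j for j in range(len(x)) if j not in missing]
    for i in missing:
        w = [prox[i][j] for j in observed]
        total = sum(w)
        if total > 0:
            # proximity-weighted average of the non-missing data
            out[i] = sum(wj * x[j] for wj, j in zip(w, observed)) / total
    return out
```

Cases that frequently co-occur with case i in terminal nodes dominate the weighted average, so the imputed value is drawn toward the most similar observed cases.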
The disadvantage of the proximity approach is that OOB estimates for prediction error are biased, generally on the order of 10-20% (Breiman, 2003). Further, because prediction error is biased, so are other measures based on it, such as VIMP. In addition, the proximity approach does not work when predicting test data with missing values. The adaptive tree imputation method addresses these issues by adaptively imputing missing data as a tree is grown, drawing randomly from the set of non-missing in-bag data within the working node. The imputation procedure is summarized as follows:
1. For each node h, impute missing data by drawing a random value from the in-bag non-missing values prior to splitting.
2. After the splitting, reset the imputed data in the daughter nodes to missing. Proceed as in Step 1 until the tree can no longer be split.
3. The final summary imputed value is the average of the case's imputed in-bag values for a continuous variable, or the most frequent in-bag imputed value for a categorical variable.
Different splitting rules, such as outcome splitting, random splitting, and unsupervised splitting, can be used for random forest. Accordingly, these different splitting rules can be used in the process of random forest based missing data imputation. In random splitting, nsplit, a non-zero positive integer, needs to be defined by the user. A maximum of nsplit split points are chosen randomly for each of the potential splitting variables within a node. This is in contrast to non-random splitting, where all possible split points for each of the potential splitting variables are considered. The splitting rule is applied to the nsplit randomly selected split points, and the node is split on the variable whose random split point yields the best value, as measured by splitting rules such as weighted mean-squared error or the Gini index. Depending on the splitting rule, an outcome variable may or may not be required. In outcome splitting, an outcome variable is required, while in completely random splitting and unsupervised splitting, no outcome variable is required. A comparison of different splitting rules, supervised vs. unsupervised splitting, in random forest based missing data imputation was carried out in this study.
MissForest uses an iterative imputation scheme: a RF is trained on the observed values of each variable, the missing values are then predicted, and the procedure iterates until a stopping criterion is met (Stekhoven and Buhlmann, 2012).
3.4 RF imputation
RF imputation refers to the "on-the-fly-imputation" (OTFI) (or adaptive tree) imputation method (Ishwaran et al., 2008) implemented in the R package randomForestSRC. In this method, missing data are replaced with values drawn randomly from the non-missing in-bag data (the bootstrap samples used to grow the trees, in contrast to out-of-bag data, which are only used for model validation) within the splitting node. That is, missing data are imputed prior to splitting at each node. As a result, the daughter nodes contain no missing data, as the parent node is imputed before splitting. Therefore, imputation is carried out adaptively as a tree is grown, and all the missing values are imputed at the end of each iteration. Weighted mean-squared error splitting is used for continuous outcomes, and Gini index splitting is used for categorical outcomes (Breiman et al., 1984).
3.5 KNN imputation algorithm
The KNN-based imputation was proposed in 2001 and works as follows. Consider a row A that has a missing value in column 1. The KNN imputation method finds the K other rows that do not have a missing value in column 1 and have the least Euclidean distance from row A. A weighted average of the values in column 1 from the K closest rows is then used as an estimate for the missing value in row A. See Algorithm 4 for more details.
Algorithm 4 KNNimpute algorithm
Require: X, an m × n matrix of m rows (genes) and n columns (experiments); rowmax and colmax
1: Check if there exists any column that has more than colmax missing values
2: if Yes then
3: Halt and report error
4: else
5: log-transform the data
6: end if
7: for i = 1, ..., m do
8: if row i has more than rowmax missing values then
9: Use the column average as imputation of the missing values
10: else
11: if row i has at least one but no more than rowmax missing values then
12: posi ← the positions where row i has missing values
13: Find rows that have no missing values in columns posi
14: Calculate the average Euclidean distance of row i from these rows
15: k nearest neighbors for row i ← the k rows with the smallest average Euclidean distance from row i
16: Use the corresponding average value of these k nearest neighbors as the imputed value
17: end if
18: end if
19: end for
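A simplified sketch of this procedure follows (ours, for illustration only; it uses an unweighted mean of the k donors where the original uses a distance-weighted average, and it omits the log-transform and the rowmax/colmax checks):

```python
import math

def knn_impute(X, k=2):
    """Fill each missing entry (None) in row i using the k rows that observe
    that column and are nearest to row i in Euclidean distance over the
    commonly observed columns."""
    Xi = [row[:] for row in X]
    for i, row in enumerate(X):
        for j, v in enumerate(row):
            if v is not None:
                continue
            cands = []
            for l, other in enumerate(X):
                if l == i or other[j] is None:  # donor must observe column j
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                d = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                cands.append((d, other[j]))
            cands.sort(key=lambda t: t[0])
            nearest = cands[:k]
            if nearest:
                Xi[i][j] = sum(v for _, v in nearest) / len(nearest)
    return Xi
```

The imputed value for a gene is thus borrowed from the expression profiles most similar to it across the remaining experiments.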
Chapter 4
Novel Enhancements of RF Missing Data Approaches
Three general strategies have been used for RF missing data imputation:
(A) Preimpute the data; grow the forest; update the original missing values using
proximity of the data. Iterate for improved results.
(B) Simultaneously impute data while growing the forest; iterate for improved
results.
(C) Preimpute the data; grow a forest using in turn each variable that has missing
values; predict the missing values using the grown forest. Iterate for improved
results.
Proximity imputation (Breiman, 2003) uses strategy A, on-the-fly-imputation (Ishwaran et al., 2008) (OTFI) uses strategy B, and missForest (Stekhoven and Buhlmann, 2012) uses strategy C.
Below we detail each of these strategies and describe various algorithms which uti-
lize one of these three approaches. These new algorithms take advantage of new
splitting rules, including random splitting, unsupervised splitting, and multivariate
splitting (Ishwaran et al., 2016).
4.0.1 Strawman imputation
We first begin by describing a "strawman imputation" which will be used throughout as our baseline reference value. While this algorithm is rough, it is also very rapid, and for this reason it was also used to initialize some of our RF procedures. Strawman imputation is defined as follows. Missing values for continuous variables are imputed using the median of non-missing values, and for missing categorical variables, the most frequently occurring non-missing value is used (ties are broken at random).
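Strawman imputation can be sketched in a few lines (an illustration with our own data layout; here ties in the categorical mode are broken deterministically by first occurrence rather than at random):

```python
from collections import Counter
from statistics import median

def strawman_impute(columns, is_numeric):
    """columns: dict of name -> list of values, with None marking missing.

    Numeric columns are filled with the median of the observed values;
    categorical columns with the most frequent observed value.
    """
    out = {}
    for name, col in columns.items():
        obs = [v for v in col if v is not None]
        if is_numeric[name]:
            fill = median(obs)
        else:
            fill = Counter(obs).most_common(1)[0][0]
        out[name] = [fill if v is None else v for v in col]
    return out
```

Because only column-wise summaries are needed, this baseline runs in a single pass over the data.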
4.0.2 Proximity imputation: RFprx and RFprxR
Here we describe proximity imputation (strategy A). In this procedure the data is
first roughly imputed using strawman imputation. A RF is fit using this imputed
data. Using the resulting forest, the n ⇥ n symmetric proximity matrix (n equals
the sample size) is determined where the (i, j) entry records the inbag frequency
that case i and j share the same terminal node. The proximity matrix is used to im-
pute the original missing values. For continuous variables, the proximity weighted
average of non-missing data is used; for categorical variables, the largest average
proximity over non-missing data is used. The updated data are used to grow a new
RF, and the procedure is iterated.
We use RFprx to refer to proximity imputation as described above. However, when implementing RFprx we use a slightly modified version that makes use of random splitting in order to increase computational speed. In random splitting, nsplit, a non-zero positive integer, is specified by the user. A maximum of nsplit split points are chosen randomly for each of the randomly selected mtry splitting variables. This is in contrast to non-random (deterministic) splitting typically used by RF, where all possible split points for each of the potential mtry splitting variables are considered. The splitting rule is applied to the nsplit randomly selected split points, and the tree node is split on the variable whose random split point yields the best value, as measured by the splitting criterion. Random splitting evaluates the splitting rule over a much smaller number of split points and is therefore considerably faster than deterministic splitting.
The limiting case of random splitting is pure random splitting. The tree node is
split by selecting a variable and the split-point completely at random—no splitting
rule is applied; i.e. splitting is completely non-adaptive to the data. Pure random
splitting is generally the fastest type of random splitting. We also apply RFprx using
pure random splitting; this algorithm is denoted by RFprxR .
As an extension to the above methods, we implement iterated versions of RFprx and RFprxR. To distinguish between the different algorithms, we write RFprx.k and RFprxR.k when they are iterated k ≥ 1 times. Thus, RFprx.5 and RFprxR.5 indicate that the algorithms were iterated 5 times, while RFprx.1 and RFprxR.1 indicate that the algorithms were not iterated. However, as this latter notation is somewhat cumbersome, for notational simplicity we will simply use RFprx to denote RFprx.1 and RFprxR to denote RFprxR.1.
4.0.3 On-the-fly-imputation (OTFI): RFotf and RFotfR
A disadvantage of the proximity approach is that OOB (out-of-bag) estimates for
prediction error are biased (Breiman, 2003). Further, because prediction error
is biased, so are other measures based on it, such as variable importance. The
method is also awkward to implement on test data with missing values. The OTFI
method (Ishwaran et al., 2008) (strategy B) was devised to address these issues.
Specific details of OTFI can be found in (Ishwaran et al., 2008, 2016), but for
convenience we summarize the key aspects of OTFI below:
1. Only non-missing data is used to calculate the split-statistic for splitting a tree
node.
2. When assigning left and right daughter node membership if the variable used
to split the node has missing data, missing data for that variable is “imputed”
by drawing a random value from the inbag non-missing data.
3. Following a node split, imputed data are reset to missing and the process
is repeated until terminal nodes are reached. Note that after terminal node
assignment, imputed data are reset back to missing, just as was done for all
nodes.
4. Missing data in terminal nodes are then imputed using OOB non-missing
terminal node data from all the trees. For integer valued variables, a maximal
class rule is used; a mean rule is used for continuous variables.
It should be emphasized that the purpose of the “imputed data” in Step 2 is only
to make it possible to assign cases to daughter nodes—imputed data is not used to
calculate the split-statistic, and imputed data is only temporary and reset to missing
after node assignment. Thus, at the completion of growing the forest, the resulting
forest contains missing values in its terminal nodes and no imputation has been
done up to this point. Step 4 is added as a means for imputing the data, but this step
could be skipped if the goal is to use the forest in analysis situations. In particular,
step 4 is not required if the goal is to use the forest for prediction. This applies
even when test data used for prediction has missing values. In such a scenario, test
data assignment works in the same way as in step 2. That is, for missing test values,
values are imputed as in step 2 using the original grow distribution from the training
forest, and the test case assigned to its daughter node. Following this, the missing
test data is reset back to missing as in step 3, and the process repeated.
This method of assigning cases with missing data, which is well suited for
forests, is in contrast to surrogate splitting utilized by CART (Breiman et al., 1984).
To assign a case having a missing value for the variable used to split a node, CART
uses the best surrogate split among those variables not missing for the case. This
ensures every case can be classified optimally, whether the case has missing values
or not. However, while surrogate splitting works well for CART, the method is not
well suited for forests. Computational burden is one issue. Finding a surrogate
split is computationally expensive even for one tree, let alone for a large number
of trees. Another concern is that surrogate splitting works tangentially to random
feature selection used by forests. In RF, variables used to split a node are selected
randomly, and as such they may be uncorrelated, and a reasonable surrogate split
may not exist. Another concern is that surrogate splitting alters the interpretation of
a variable, which affects measures such as variable importance measures (Ishwaran
et al., 2008).
To denote the OTFI missing data algorithm, we will use the abbreviation RFotf .
As in proximity imputation, to increase computational speed, RFotf is implemented
using nsplit random splitting. We also consider OTFI under pure random split-
ting and denote this algorithm by RFotfR . Both algorithms are iterated in our studies.
RFotf and RFotfR will be used to denote a single iteration, while RFotf.5 and RFotfR.5 denote five iterations. Note that when the OTFI algorithms are iterated, the terminal node imputation executed in Step 4 uses in-bag data rather than OOB data after the first cycle. This is because after the first cycle of the algorithm, no coherent OOB sample exists.
Remark 1. As noted by one of our referees, missingness incorporated in attributes (MIA) is another tree splitting method which bypasses the need to impute data (Twala et al., 2008; Twala and Cartwright, 2010). Again, this only applies if the user is interested in a forest analysis. MIA accomplishes this by treating missing values as a category which is incorporated into the splitting rule. Let X be an ordered or numeric feature being used to split a node. The MIA splitting rule searches over all possible split values s of X of the following form:
Split A: {X ≤ s or X = missing} versus {X > s}.
Split B: {X ≤ s} versus {X > s or X = missing}.
Split C: {X = missing} versus {X = not missing}.
Thus, like OTF splitting, one can see that MIA results in a forest ensemble constructed without having to impute data.
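The three MIA daughter assignments for a candidate split value s can be enumerated directly (a sketch with our own names; None encodes a missing value):

```python
def mia_splits(x, s):
    """Return the (left, right) index sets for MIA splits A, B, and C."""
    obs = [(i, v) for i, v in enumerate(x) if v is not None]
    miss = [i for i, v in enumerate(x) if v is None]
    le = [i for i, v in obs if v <= s]
    gt = [i for i, v in obs if v > s]
    return {
        "A": (le + miss, gt),              # {X <= s or missing} vs {X > s}
        "B": (le, gt + miss),              # {X <= s} vs {X > s or missing}
        "C": (miss, [i for i, _ in obs]),  # {missing} vs {not missing}
    }
```

A splitting rule would then be evaluated on each candidate assignment, so missingness itself can carry predictive signal.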
4.0.4 Unsupervised imputation: RFunsv
RFunsv refers to OTFI using multivariate unsupervised splitting. However, unlike the OTFI algorithm RFotf, the RFunsv algorithm is unsupervised and assumes there is no response (outcome) variable. Instead, a multivariate unsupervised splitting rule (Ishwaran et al., 2016) is implemented. As in the original RF algorithm, at each tree node t, a set of mtry variables are selected as potential splitting variables. However, for each of these, as there is no outcome variable, a random set of ytry variables is selected and defined to be the multivariate response (pseudo-outcomes). A multivariate composite splitting rule of dimension ytry (see below) is applied to each of the mtry multivariate regression problems, and the node t is split on the variable leading to the best split. Missing values in the response are excluded when computing the composite multivariate splitting rule: the split-rule is averaged over non-missing responses only (Ishwaran et al., 2016). We also consider an iterated RFunsv algorithm (e.g., RFunsv.5 implies five iterations, RFunsv implies no iterations).
Here is the description of the multivariate composite splitting rule. We begin by considering univariate regression. For notational simplicity, let us suppose the node t we are splitting is the root node based on the full sample size n. Let X be the feature used to split t, where for simplicity we assume X is ordered or numeric. Let s be a proposed split for X that splits t into left and right daughter nodes $t_L := t_L(s)$ and $t_R := t_R(s)$, where $t_L = \{X_i \le s\}$ and $t_R = \{X_i > s\}$. Let $n_L = \#t_L$ and $n_R = \#t_R$ denote the sample sizes for the two daughter nodes. If $Y_i$ denotes the outcome, the squared-error split-statistic for the proposed split is
$$D(s, t) = \frac{1}{n}\sum_{i \in t_L}\left(Y_i - \bar{Y}_{t_L}\right)^2 + \frac{1}{n}\sum_{i \in t_R}\left(Y_i - \bar{Y}_{t_R}\right)^2,$$
where $\bar{Y}_{t_L}$ and $\bar{Y}_{t_R}$ are the sample means for $t_L$ and $t_R$, respectively. The best split for X is the split-point s minimizing D(s, t). To extend the squared-error splitting rule to the multivariate case q > 1, we apply univariate splitting to each response coordinate separately. Let $Y_i = (Y_{i,1}, \ldots, Y_{i,q})^T$ denote the $q \ge 1$ dimensional outcome. For multivariate regression analysis, an averaged standardized variance splitting rule is used. The goal is to minimize
$$D_q(s, t) = \sum_{j=1}^{q}\left\{\sum_{i \in t_L}\left(Y_{i,j} - \bar{Y}_{t_L j}\right)^2 + \sum_{i \in t_R}\left(Y_{i,j} - \bar{Y}_{t_R j}\right)^2\right\},$$
where $\bar{Y}_{t_L j}$ and $\bar{Y}_{t_R j}$ are the sample means of the j-th response coordinate in the left and right daughter nodes. Notice that such a splitting rule can only be effective if each of the coordinates of the outcome are measured on the same scale, otherwise we could have a coordinate j, with say very large values, whose contribution would dominate $D_q(s, t)$. We therefore calibrate $D_q(s, t)$ by assuming that each coordinate has been standardized according to
$$\frac{1}{n}\sum_{i \in t} Y_{i,j} = 0, \qquad \frac{1}{n}\sum_{i \in t} Y_{i,j}^2 = 1, \qquad 1 \le j \le q.$$
The standardization is applied prior to splitting a node. To make this standardization clear, we denote the standardized responses by $Y^*_{i,j}$. With some elementary manipulations, it can be verified that minimizing $D_q(s, t)$ is equivalent to maximizing
$$D^*_q(s, t) = \sum_{j=1}^{q}\left\{\frac{1}{n_L}\left(\sum_{i \in t_L} Y^*_{i,j}\right)^2 + \frac{1}{n_R}\left(\sum_{i \in t_R} Y^*_{i,j}\right)^2\right\}. \tag{4.1}$$
For multivariate classification, an averaged standardized Gini splitting rule is used. First consider the univariate case (i.e., the multiclass problem) where the outcome $Y_i$ is a class label from the set $\{1, \ldots, K\}$ where $K \ge 2$. The best split s for X is obtained by maximizing
$$G(s, t) = \sum_{k=1}^{K}\left[\frac{1}{n_L}\left(\sum_{i \in t_L} Z_{i(k)}\right)^2 + \frac{1}{n_R}\left(\sum_{i \in t_R} Z_{i(k)}\right)^2\right],$$
where $Z_{i(k)} = 1_{\{Y_i = k\}}$. Now consider the multivariate classification scenario r > 1, where each outcome coordinate $Y_{i,j}$ for $1 \le j \le r$ is a class label from $\{1, \ldots, K_j\}$ for $K_j \ge 2$. We apply Gini splitting to each coordinate, yielding the extended Gini splitting rule
$$G^*_r(s, t) = \sum_{j=1}^{r}\left[\frac{1}{K_j}\sum_{k=1}^{K_j}\left\{\frac{1}{n_L}\left(\sum_{i \in t_L} Z_{i(k),j}\right)^2 + \frac{1}{n_R}\left(\sum_{i \in t_R} Z_{i(k),j}\right)^2\right\}\right], \tag{4.2}$$
where $Z_{i(k),j} = 1_{\{Y_{i,j} = k\}}$. Note that the normalization $1/K_j$ employed for a coordinate j is required to standardize the contribution of the Gini split from that coordinate.
Observe that (4.1) and (4.2) are equivalent optimization problems, with optimization over $Y^*_{i,j}$ for regression and $Z_{i(k),j}$ for classification. As shown in (Ishwaran, 2015) this leads to similar theoretical splitting properties in regression and classification settings. Given this equivalence, we can combine the two splitting rules to form a composite splitting rule. The mixed outcome splitting rule $\Theta(s, t)$ is a composite standardized split rule of mean-squared error (4.1) and Gini index splitting (4.2); i.e.,
$$\Theta(s, t) = D^*_q(s, t) + G^*_r(s, t),$$
where p = q + r. The best split for X is the value of s maximizing $\Theta(s, t)$.
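Assuming the continuous responses have already been standardized and the categorical responses expanded into 0-1 class indicators, the composite statistic for a candidate split can be sketched as follows (an illustration; the names and data layout are ours):

```python
def theta_split_stat(Ystar, Z, left):
    """Composite split statistic Theta(s, t) = D*_q + G*_r.

    Ystar : q standardized continuous response columns, each of length n
    Z     : r categorical coordinates; Z[j][k][i] = 1 if case i has class k
            in coordinate j (each Z[j] holds K_j indicator columns)
    left  : boolean membership list; left[i] is True if case i is in t_L
    """
    nL = sum(left)
    nR = len(left) - nL

    def half(col, side):
        # (1/n_side) * (sum of col over one daughter node)^2
        n = nL if side else nR
        s = sum(v for v, in_left in zip(col, left) if in_left == side)
        return (s * s) / n if n > 0 else 0.0

    d_star = sum(half(col, True) + half(col, False) for col in Ystar)
    g_star = sum(
        (1.0 / len(ind_cols)) *
        sum(half(zk, True) + half(zk, False) for zk in ind_cols)
        for ind_cols in Z
    )
    return d_star + g_star
```

A split that separates the responses cleanly scores higher than one that mixes them, so maximizing this statistic over s recovers the best split.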
Remark 2. As discussed in (Segal and Yuanyuan, 2011), multivariate regression splitting rules patterned after the Mahalanobis distance can be used to incorporate correlation between response coordinates. Let $\bar{Y}_L$ and $\bar{Y}_R$ be the multivariate means for Y in the left and right daughter nodes, respectively. The following Mahalanobis distance splitting rule was discussed in (Segal and Yuanyuan, 2011):
$$M_q(s, t) = \sum_{i \in t_L}\left(Y_i - \bar{Y}_L\right)^T V_L^{-1}\left(Y_i - \bar{Y}_L\right) + \sum_{i \in t_R}\left(Y_i - \bar{Y}_R\right)^T V_R^{-1}\left(Y_i - \bar{Y}_R\right),$$
where $V_L$ and $V_R$ are the estimated covariance matrices for the left and right daughter nodes. While this is a reasonable approach in low dimensional problems, recall that we are applying $D_q(s, t)$ to ytry of the feature variables, which could be large if the feature space dimension p is large. Also, because of missing data in the features, it may be difficult to derive estimators for $V_L$ and $V_R$, which is further complicated if their dimensions are high. This problem becomes worse as the tree is grown, because the number of observations decreases rapidly, making estimation unstable. For these reasons, we use the splitting rule $D_q(s, t)$ rather than $M_q(s, t)$ when implementing imputation.
4.0.5 mForest imputation: mRF_α and mRF
The missForest algorithm recasts the missing data problem as a prediction problem. Data are imputed by regressing each variable in turn against all other variables and then predicting missing data for the dependent variable using the fitted forest. With p variables, this means that p forests must be fit for each iteration, which can be slow in certain problems. Therefore, we introduce a computationally faster version of missForest, which we call mForest. The new algorithm is described as follows. Do a quick strawman imputation of the data. The p variables in the data set are randomly assigned into mutually exclusive groups of approximate size αp, where 0 < α < 1. Each group in turn acts as the multivariate response to be regressed on the remaining variables (of approximate size (1 − α)p). Over the multivariate responses, set imputed values back to missing. Grow a forest using composite multivariate splitting. As in RFunsv, missing values in the response are excluded when using multivariate splitting: the split-rule is averaged over non-missing responses only. Upon completion of the forest, the missing response values are imputed using prediction. Cycle over all of the ≈ 1/α multivariate regressions in turn, thus completing one iteration. Check whether the imputation accuracy of the current imputed data relative to the previously imputed data has increased beyond an ε-tolerance value (see Section 3.3 for measuring imputation accuracy). Stop if it has; otherwise repeat until convergence.
To distinguish between mForest under different α values, we use the notation mRFα to indicate mForest fit under the specified α. Note that the limiting case α = 1/p corresponds to missForest. Although the two algorithms missForest and mRFα for α = 1/p are technically slightly different, we did not find significant differences between them during informal experimentation. Therefore, for computational speed, missForest was taken to be mRFα for α = 1/p and denoted simply as mRF. The key steps of mRFα are sketched in Algorithm 5.
Let X be the n × p matrix with missing values, and let X* and X** be two possible different imputed matrices for X. We define E(X*, X**), the difference in imputation accuracy, as follows. As described above, values of X_j were made missing under various missing data assumptions. Let I_j = (I_{1,j}, ..., I_{n,j}) be a vector of zeroes and ones indicating which values of X_j were artificially made missing. Define I_{i,j} = 1 if X_{i,j} is artificially missing; otherwise I_{i,j} = 0. Let n_j = Σ_{i=1}^{n} I_{i,j} be the number of artificially induced missing values for X_j.

Let N and C be the sets of nominal (continuous) and categorical (factor) variables with more than one artificially induced missing value. That is,

N = {j : X_j is nominal and n_j > 1}
C = {j : X_j is categorical and n_j > 1}.
\[
E(X^{*}, X^{**}) \;=\; \frac{1}{\#N}\sum_{j \in N}
\sqrt{\frac{\sum_{i=1}^{n} I_{i,j}\,\big(X^{**}_{i,j} - X^{*}_{i,j}\big)^{2}/n_j}
           {\sum_{i=1}^{n} I_{i,j}\,\big(X^{*}_{i,j} - \bar{X}^{*}_{j}\big)^{2}/n_j}}
\;+\; \frac{1}{\#C}\sum_{j \in C}
\frac{\sum_{i=1}^{n} I_{i,j}\, 1\{X^{**}_{i,j} \neq X^{*}_{i,j}\}}{n_j},
\]

where \(\bar{X}^{*}_{j} = \sum_{i=1}^{n} \big(I_{i,j} X^{*}_{i,j}\big)/n_j\). Note that standardized root-mean-squared error (RMSE) was used to assess imputation difference for nominal variables, and misclassification error for factors.
Algorithm 5: multivariate missForest algorithm
Require: X, an n × p matrix; stopping criterion δ; grouping factor α, 0 < α ≤ 1
1: Record which variables and which positions have missing values in X
2: p0 ← number of variables that have missing values
3: X_imp ← quick and rough imputation
4: Set diff = ∞
5: while diff ≥ δ do
6:   X_old.imp ← X_imp
7:   Randomly separate the p0 variables into K = K(α) groups of approximately the same size (if α = 1/p0, K = p0)
8:   for i = 1, ..., K do
9:     Let X_i be the columns of X corresponding to group i, and X_(−i) the columns of X excluding group i
10:    Set in X_imp the values of X_i that were originally missing back to NA
11:    Fit a multivariate random forest using the variables in group i as response variables and the rest as predictor variables. Note: ONLY the non-missing values of X_i are used in calculating the composite split rule
12:    X_imp ← final summary imputed values, using the terminal node average for continuous variables and the maximal terminal node class rule for categorical variables
13:  end for
14:  Set diff = E(X_old.imp, X_imp)
15: end while
16: Return the imputed matrix X_imp
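The outer loop of Algorithm 5 can be sketched compactly. The following is a schematic Python rendering, not the randomForestSRC implementation: `fit_predict` is a hypothetical stand-in for the multivariate random forest fit, and for brevity each response column in a group is predicted separately rather than through one composite multivariate forest.

```python
import numpy as np

def strawman_impute(X):
    """Quick and rough initial imputation: fill each column's NAs with its observed mean."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmean(col)
    return X

def mforest(X, fit_predict, alpha=0.25, delta=1e-3, max_iter=10, seed=None):
    """Schematic mForest loop (Algorithm 5). `fit_predict(X_other, y, rows)` is a
    placeholder: it returns predictions for the rows flagged in `rows`, trained on
    the remaining (originally non-missing) rows."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(X)                              # step 1: record missing positions
    if not miss.any():
        return X.copy()
    X_imp = strawman_impute(X)                      # step 3: strawman imputation
    have_miss = np.flatnonzero(miss.any(axis=0))    # the p0 variables with NAs
    K = int(np.ceil(1.0 / alpha))                   # ~1/alpha groups per iteration
    for _ in range(max_iter):
        X_old = X_imp.copy()
        for group in np.array_split(rng.permutation(have_miss), K):
            others = np.setdiff1d(np.arange(X.shape[1]), group)
            for j in group:                         # (the real method regresses the
                rows = miss[:, j]                   #  whole group as one multivariate
                if rows.any():                      #  response)
                    X_imp[rows, j] = fit_predict(X_imp[:, others], X_imp[:, j], rows)
        diff = np.sqrt(np.mean((X_imp[miss] - X_old[miss]) ** 2))
        if diff < delta:                            # stop once successive imputations agree
            break
    return X_imp
```

With a column-mean `fit_predict` this degenerates to iterated mean imputation; in practice one would plug in a multivariate forest regression per group.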
Chapter 5
Imputation Performance
Performance of RF algorithms, including proximity imputation, on-the-fly algorithms, unsupervised imputation, missForest, and multivariate unsupervised missForest, was assessed using a large scale experiment of 60 data sets under different missing data mechanisms and levels of missingness. RF algorithms were found to be robust overall, performing well in the presence of moderate to high missingness, sometimes even with missing-not-at-random data. Performance was especially good for correlated data. In such cases, the most accurate algorithm was missForest; however, multivariate analogs of the algorithm performed nearly as well while reducing computational time by a factor of up to 10. In low to medium correlation settings, less computationally intensive RF algorithms were found to be as effective, some reducing compute times by another factor of up to 10.
5.1 Methods
5.1.1 Experimental design and data
Table 5.1 lists the nine experiments that were carried out. In each experiment, a
pre-specified target percentage of missing values was induced using one of three
different missing mechanisms (Rubin, 1976):
1. Missing completely at random (MCAR). This means the probability that an
observation is missing does not depend on the observed value or the missing
ones.
2. Missing at random (MAR). This means that the probability of missingness
may depend upon observed values, but not missing ones.
3. Not missing at random (NMAR). This means the probability of missingness
depends on both observed and missing values.
Table 5.1: Experimental design used for large scale study of RF missing data algorithms.

Experiment   Missing Mechanism   Percent Missing
EXPT-A       MCAR                25
EXPT-B       MCAR                50
EXPT-C       MCAR                75
EXPT-D       MAR                 25
EXPT-E       MAR                 50
EXPT-F       MAR                 75
EXPT-G       NMAR                25
EXPT-H       NMAR                50
EXPT-I       NMAR                75
Sixty data sets were used, including both real and synthetic data. Figure 5.1 illustrates the diversity of the data. Displayed are data sets in terms of correlation (ρ), sample size (n), number of variables (p), and the amount of information contained in the data (I = log10(n/p)). The correlation, ρ, was defined as the L2-norm of the correlation matrix. If R = (ρ_{j,k}) denotes the p × p correlation matrix for a data set, ρ was defined to equal

\[
\rho \;=\; \left( \binom{p}{2}^{-1} \sum_{j=1}^{p} \sum_{k<j} |\rho_{j,k}|^{2} \right)^{1/2}. \tag{5.1}
\]
This is similar to the usual definition of the L2-norm of a matrix, but we have modified the definition to remove the diagonal elements of R, which equal 1, as well as the contribution from the symmetric lower diagonal values.

Note that in the plot for p there are 10 data sets with p in the thousands; these are a collection of well known gene expression data sets. The right-most plot displays the log-information of a data set, I = log10(n/p). The values on the log scale vary from −2 to 2; thus the information contained in a data set can differ by as much as ≈ 10^4.
[Figure 5.1: four scatter panels plotting, for each of the 60 data sets, the correlation ρ, sample size n, number of variables p, and log-information I = log10(n/p).]

Figure 5.1: Summary values for the 60 data sets used in the large scale RF missing data experiment. The last panel displays the log-information, I = log10(n/p), for each data set.
5.1.2 Inducing MCAR, MAR and NMAR missingness

The following procedures were used to induce missingness in the data. Let the target missingness fraction be 0 < g < 1. For MCAR, data was set to missing randomly, without imposing column or row constraints on the data matrix. Specifically, the data matrix was made into a long vector and ng of the entries selected at random and set to missing.
For MAR, missing values were assigned by column. Let X_j = (X_{1,j}, ..., X_{n,j}) be the n-dimensional vector containing the original values of the jth variable, 1 ≤ j ≤ p. Each coordinate of X_j was made missing according to the tail behavior of a randomly selected covariate X_k, where k ≠ j. The probability of selecting coordinate X_{i,j} was

\[
P\{\text{selecting } X_{i,j} \mid B_j\} \;\propto\;
\begin{cases}
F(X_{i,k}) & \text{if } B_j = 1 \\
1 - F(X_{i,k}) & \text{if } B_j = 0,
\end{cases}
\]

where F(x) = (1 + exp(−3x))^{−1} and the B_j were i.i.d. symmetric 0-1 Bernoulli random variables. With this method, about half of the variables will have higher missingness in those coordinates corresponding to the right tail of a randomly selected variable (the other half will have higher missingness depending on the left tail of a randomly selected variable). A total of ng coordinates were selected from X_j and set to missing. This induces MAR, as missing values for X_j depend only on observed values of another variable X_k.
For NMAR, each coordinate of X_j was made missing according to its own tail behavior. A total of ng values were selected according to

\[
P\{\text{selecting } X_{i,j}\} \;\propto\;
\begin{cases}
F(X_{i,j}) & \text{with probability } 1/2 \\
1 - F(X_{i,j}) & \text{with probability } 1/2.
\end{cases}
\]

Notice that missingness in X_{i,j} depends on both observed and missing values. In particular, missing values occur with higher probability in the right and left tails of the empirical distribution. Therefore, this induces NMAR.
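The three mechanisms can be mimicked in a few lines. The sketch below is an illustrative Python rendering under one reading of the text (for MCAR the target fraction g is applied to all np entries; for MAR the Bernoulli B_j is drawn per column); it is not the thesis code.

```python
import numpy as np

def F(x):
    """Tail weight F(x) = (1 + exp(-3x))^(-1)."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

def make_mcar(X, g, rng):
    """MCAR: flatten the matrix and blank a fraction g of entries at random."""
    X = X.copy()
    idx = rng.choice(X.size, size=int(g * X.size), replace=False)
    X.ravel()[idx] = np.nan
    return X

def make_mar(X, g, rng):
    """MAR: column j goes missing according to a tail of another column k != j."""
    X0, out = X, X.copy()
    n, p = X.shape
    for j in range(p):
        k = rng.choice([c for c in range(p) if c != j])
        w = F(X0[:, k]) if rng.integers(2) else 1.0 - F(X0[:, k])   # the B_j coin
        rows = rng.choice(n, size=int(g * n), replace=False, p=w / w.sum())
        out[rows, j] = np.nan
    return out

def make_nmar(X, g, rng):
    """NMAR: column j goes missing according to its own left or right tail."""
    X0, out = X, X.copy()
    n, p = X.shape
    for j in range(p):
        w = F(X0[:, j]) if rng.integers(2) else 1.0 - F(X0[:, j])
        rows = rng.choice(n, size=int(g * n), replace=False, p=w / w.sum())
        out[rows, j] = np.nan
    return out
```

Note that the selection weights are always computed from the complete original matrix, so later columns are not affected by earlier blanking.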
5.1.3 Measuring imputation accuracy
Accuracy of imputation was assessed using the following metric. As described above, values of X_j were made missing under various missing data assumptions. Let (1_{1,j}, ..., 1_{n,j}) be a vector of zeroes and ones indicating which values of X_j were artificially made missing. Define 1_{i,j} = 1 if X_{i,j} is artificially missing; otherwise 1_{i,j} = 0. Let n_j = Σ_{i=1}^{n} 1_{i,j} be the number of artificially induced missing values for X_j.

Let N and C be the sets of nominal (continuous) and categorical (factor) variables with more than one artificially induced missing value. That is,

N = {j : X_j is nominal and n_j > 1}
C = {j : X_j is categorical and n_j > 1}.

Standardized root-mean-squared error (RMSE) was used to assess performance for nominal variables, and misclassification error for factors. Let X*_j be the n-dimensional vector of imputed values for X_j using procedure I. Imputation error for I was measured using
\[
E(I) \;=\; \frac{1}{\#N}\sum_{j \in N}
\sqrt{\frac{\sum_{i=1}^{n} 1_{i,j}\,\big(X^{*}_{i,j} - X_{i,j}\big)^{2}/n_j}
           {\sum_{i=1}^{n} 1_{i,j}\,\big(X_{i,j} - \bar{X}_{j}\big)^{2}/n_j}}
\;+\; \frac{1}{\#C}\sum_{j \in C}
\frac{\sum_{i=1}^{n} 1_{i,j}\, 1\{X^{*}_{i,j} \neq X_{i,j}\}}{n_j},
\]

where \(\bar{X}_{j} = \sum_{i=1}^{n} \big(1_{i,j} X_{i,j}\big)/n_j\). To be clear regarding the standardized RMSE, observe that the denominator in the first term is the variance of X_j over the artificially induced missing values, while the numerator is the MSE difference of X_j and X*_j over the induced missing values.
As a benchmark for assessing imputation accuracy we used the strawman imputation described earlier, which we denote by s. Imputation error for a procedure I was compared to s using relative imputation error, defined as

\[
E_R(I) \;=\; 100 \times \frac{E(I)}{E(s)}.
\]

A value of less than 100 indicates a procedure I performing better than the strawman.
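The two metrics translate directly into code. A minimal Python sketch under the definitions above (`miss` marks the artificially missing entries; the function names are ours, not from the thesis software):

```python
import numpy as np

def imputation_error(X_true, X_imp, miss, nominal, categorical):
    """E(I): standardized RMSE averaged over nominal variables plus
    misclassification rate averaged over categorical variables, both computed
    only over the artificially missing entries of columns with n_j > 1."""
    e_N = []
    for j in nominal:
        I = miss[:, j]
        if I.sum() > 1:
            x, xh = X_true[I, j], X_imp[I, j]
            mse = np.mean((xh - x) ** 2)
            var = np.mean((x - x.mean()) ** 2)   # variance over the induced NAs
            e_N.append(np.sqrt(mse / var))
    e_C = []
    for j in categorical:
        I = miss[:, j]
        if I.sum() > 1:
            e_C.append(np.mean(X_imp[I, j] != X_true[I, j]))
    return (np.mean(e_N) if e_N else 0.0) + (np.mean(e_C) if e_C else 0.0)

def relative_error(e_proc, e_strawman):
    """E_R(I) = 100 * E(I) / E(strawman); values below 100 beat the strawman."""
    return 100.0 * e_proc / e_strawman
```

By construction, plain mean imputation of a nominal column yields a standardized RMSE of exactly 1, which is why the strawman serves as a natural benchmark.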
5.1.4 Experimental settings for procedures

Randomized splitting was invoked with an nsplit value of 10. For random feature selection, mtry was set to √p. For random outcome selection for RFunsv, we set ytry to equal √p. Algorithms RFotf, RFunsv and RFprx were iterated 5 times, in addition to being run for a single iteration. For mForest, the percentage of variables used as responses was α = 0.05, 0.25. This implies that mRF0.05 used up to 20 regressions per cycle, while mRF0.25 used 4. Forests for all procedures were grown using a nodesize value of 1. The number of trees was set at ntree = 500. Each experimental setting (Table 5.1) was run 100 times independently and the results averaged.
For comparison, k-nearest neighbors imputation (hereafter denoted as KNN)
was applied using the impute.knn function from the R-package impute (Hastie
et al., 2015). For each data point with missing values, the algorithm determines the
k-nearest neighbors using a Euclidean metric, confined to the columns for which
that data point is not missing. The missing elements for the data point are then
imputed by averaging the non-missing elements of its neighbors. The number of neighbors k was set at the default value k = 10. In experimentation we found the method robust to the value of k and therefore opted to use the default setting.
Much more important were the parameters rowmax and colmax which control
the maximum percent missing data allowed in each row and column of the data
matrix before a rough overall mean is used to impute the row/column. The default
values of 0.5 and 0.8, respectively, were too low and led to poor performance in the
heavy missing data experiments. Therefore, these values were set to their maximum
of 1.0, which greatly improved performance. Our rationale for selecting KNN as a comparison procedure was its speed, given the large scale nature of the experiments (a total of 100 × 60 × 9 = 54,000 runs for each method). Another reason was its close relationship to forests: RF is also a type of nearest neighbor procedure, although an adaptive one. We comment later on how adaptivity may give RF a distinct advantage over KNN.
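For concreteness, the neighbor scheme can be sketched as follows. This is an illustrative Python approximation of the behavior described above, not the impute.knn source; unlike impute.knn it has no rowmax/colmax fallback, and distances simply skip coordinates missing in either row.

```python
import numpy as np

def knn_impute(X, k=10):
    """For each row with missing entries: find the k nearest rows by Euclidean
    distance over the columns observed in that row, then fill each missing cell
    with the average of the neighbors' non-missing values in that column."""
    X0 = X.copy()                          # read from the original throughout
    out = X.copy()
    n, _ = X.shape
    miss = np.isnan(X0)
    for i in np.flatnonzero(miss.any(axis=1)):
        obs = ~miss[i]                     # columns observed in row i
        d = np.full(n, np.inf)
        for r in range(n):
            if r != i:
                both = obs & ~miss[r]      # coordinates known in both rows
                if both.any():
                    d[r] = np.sqrt(np.mean((X0[i, both] - X0[r, both]) ** 2))
        nbrs = np.argsort(d)[:k]
        for j in np.flatnonzero(miss[i]):
            vals = X0[nbrs, j]
            vals = vals[~np.isnan(vals)]   # average the neighbors' observed values
            out[i, j] = vals.mean() if vals.size else np.nanmean(X0[:, j])
    return out
```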
5.2 Results
Section 5.2.1 presents the performance of each procedure as measured by relative imputation error, E_R(I), and Section 5.2.2 discusses computational speed.
5.2.1 Imputation Accuracy
In reporting the values for imputation accuracy, we have stratified data sets into low, medium and high-correlation groups, where the correlation, ρ, was defined as in (5.1). Low, medium and high-correlation groups were defined as groups whose ρ value fell into the [0, 50], [50, 75] and [75, 100] percentile ranges for correlation. Results were stratified by ρ because we found it played a very heavy role in imputation performance and was much more informative than other quantities measuring information about a data set. Consider for example the log-information for a data set, I = log10(n/p), which reports the information of a data set by adjusting its sample size by the number of features. While this is a reasonable measure, Figure 5.2 shows that I is not nearly as effective as ρ in predicting imputation accuracy. The figure displays the ANOVA effect sizes for ρ and I from a linear regression in which log relative imputation error was used as the response. In addition to ρ and I, covariates in the regression also included the type of RF procedure. The effect size was defined as the estimated coefficient for the standardized values of ρ and I. The two variables ρ and I were standardized to have a mean of zero and variance of one, which makes it possible to directly compare their estimated coefficients. The figure shows that both values are important for understanding imputation accuracy and that both exhibit the same pattern. Within a specific type of missing data mechanism, say MCAR, the importance of each variable decreases with missingness of data (MCAR 0.25, MCAR 0.5, MCAR 0.75). However, while the pattern of the two measures is similar, the effect size of ρ is generally much larger than that of I. The only exceptions are the MAR 0.75 and NMAR 0.75 experiments, but these two experiments are the least interesting; as discussed below, nearly all methods performed poorly there.
Correlation
Imputation accuracy is summarized in Figure 5.3 and Table 5.2; Figure 5.4 presents the same results in a more compact display. Figure 5.3 and Table 5.2, which are stratified by correlation group, show the importance of correlation for RF imputation procedures. In general, imputation accuracy improves with correlation. Over the high correlation data,
[Figure 5.2: bar plot of ANOVA effect sizes for the log-information and correlation across the nine experiments.]

Figure 5.2: ANOVA effect size for the log-information, I = log10(n/p), and correlation, ρ (defined as in (5.1)), from a linear regression using log relative imputation error, log10(E_R(I)), as the response. In addition to I and ρ, covariates in the regression included the type of RF procedure used. ANOVA effect sizes are the estimated coefficients of the standardized variables (standardized to have mean zero and variance 1).
[Figure 5.3: boxplots of relative imputation error by procedure for each of the nine experiments, colored by low/medium/high correlation group.]

Figure 5.3: Relative imputation error, E_R(I), stratified and averaged by level of correlation of a data set. Procedures are: RFotf, RFotf.5 (on the fly imputation with 1 and 5 iterations); RFotfR, RFotfR.5 (similar to RFotf and RFotf.5 but using pure random splitting); RFunsv, RFunsv.5 (multivariate unsupervised splitting with 1 and 5 iterations); RFprx, RFprx.5 (proximity imputation with 1 and 5 iterations); RFprxR, RFprxR.5 (same as RFprx and RFprx.5 but using pure random splitting); mRF0.25, mRF0.05, mRF (mForest imputation, with 25%, 5% and 1 variable(s) used as the response); KNN (k-nearest neighbor imputation).
52
RFo
.1R
Fo.5
RFo
r.1R
For.5
RFu
.1R
Fu.5
RFp
.1R
Fp.5
RFp
r.1R
Fpr.5
mR
F0.2
5m
RF0
.05
mR
FKN
N
50
60
70
80
90
100
110R
elat
ive Im
puta
tion
Erro
r
low correlation
● ● ●
●
●●
● ●●
● ●● ●
●● ● ●
●
●●
● ● ●● ●
●●
●● ● ●
●
●●
●
● ●● ● ● ●
●
● ● ●MCAR .25MCAR .50MCAR .75
MAR .25MAR .50MAR .75
NMAR .25NMAR .50NMAR .75
RFo
.1R
Fo.5
RFo
r.1R
For.5
RFu
.1R
Fu.5
RFp
.1R
Fp.5
RFp
r.1R
Fpr.5
mR
F0.2
5m
RF0
.05
mR
FKN
N
50
60
70
80
90
100
110
Rel
ative
Impu
tatio
n Er
ror
medium correlation
●
●
● ●●
●
●● ●
●●
●●
●
● ●
●
●
●
●
●●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●● ●
●
RFo
.1R
Fo.5
RFo
r.1R
For.5
RFu
.1R
Fu.5
RFp
.1R
Fp.5
RFp
r.1R
Fpr.5
mR
F0.2
5m
RF0
.05
mR
FKN
N
50
60
70
80
90
100
110
Rel
ative
Impu
tatio
n Er
ror
high correlation
●
●
●
●●
●
●
●●
● ●
● ●
●
●●
●
● ●
●
●●
●
●●
●●
●
● ●
●●
●
●
●●
●
●●
●●
●
.25
.50
.75
.25
.50
.75
.25
.50
.75
50
60
70
80
90
100
110
Rel
ative
Impu
tatio
n Er
ror
MCAR MAR NMAR
low correlation
●●
●
● ●
●●
●
●
●● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
RFo.1RFo.5RFor.1RFor.5RFu.1RFu.5RFp.1
RFp.5RFpr.1RFpr.5mRF0.25mRF0.05mRFKNN
.25
.50
.75
.25
.50
.75
.25
.50
.75
50
60
70
80
90
100
110
Rel
ative
Impu
tatio
n Er
ror
MCAR MAR NMAR
medium correlation
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
.25
.50
.75
.25
.50
.75
.25
.50
.75
50
60
70
80
90
100
110
Rel
ative
Impu
tatio
n Er
ror
MCAR MAR NMAR
high correlation
●
●
●
●●
●
●●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Figure 5.4: Relative imputation error for a procedure over 60 data sets averagedover 100 runs, stratified and averaged by level of correlation of a data set. Top rowdisplays values by procedure; bottom row displays values in terms of the 9 differentexperiments (Table 1). Procedures were: RFo.1, RFo.5 (on the fly imputation with1 and 5 iterations); RFor.1, RFor.5 (similar to RFo.1 and RFo.5, but using purerandom splitting); RFu.1, RFu.5 (multivariate unsupervised splitting with 1 and 5iterations); RFp.1, RFp.5 (proximity imputation with 1 and 5 iterations); RFpr.1,RFpr.5 (same as RFp.1 and RFp.5, but using pure random splitting); mRF0.25,mRF0.05, mRF (mForest imputation, with 25%, 5% and 1 variable(s) used as theresponse); KNN (k-nearest neighbor imputation). See Section 3.4 for more detailsregarding parameter settings of procedures.
Table 5.2: Relative imputation error E_R(I).

Low Correlation
             MCAR              MAR               NMAR
             .25   .50   .75   .25   .50   .75   .25   .50   .75
RFotf        89.0  93.9  96.2  89.5  94.5  97.2  96.5  97.2 100.9
RFotf.5      88.7  91.0  95.9  89.5  88.6  93.5  96.0  92.6  98.8
RFotfR       89.9  94.1  96.8  89.8  94.7  97.8  96.6  97.6 101.7
RFotfR.5     92.3  95.8  95.8  96.5  93.7  94.2 103.2  97.0 102.9
RFunsv       88.3  92.8  96.2  87.9  93.0  97.3  95.4  97.4 101.6
RFunsv.5     85.4  90.1  94.7  85.7  88.6  92.2  97.7  93.0 100.8
RFprx        91.1  92.8  96.7  89.9  88.5  90.4  91.5  92.7  99.2
RFprx.5      90.6  93.9 101.8  89.8  88.7  93.9  95.7  91.1  99.3
RFprxR       90.2  92.6  96.2  89.4  88.8  90.7  94.6  97.5 100.5
RFprxR.5     86.9  92.4 100.0  88.1  88.8  94.8  96.2  94.3 102.8
mRF0.25      87.4  92.9 103.1  88.8  89.3  99.8  96.9  92.3  98.7
mRF0.05      86.1  94.3 105.3  86.0  88.7 102.7  96.8  92.6  99.0
mRF          86.3  94.7 105.6  84.4  88.6 103.3  96.7  92.5  98.8
KNN          91.1  97.4 111.5  94.4 100.9 106.1 100.9 100.0 101.7

Medium Correlation
             MCAR              MAR               NMAR
             .25   .50   .75   .25   .50   .75   .25   .50   .75
RFotf        82.3  89.9  95.6  78.8  88.6  97.0  92.7  92.6 102.2
RFotf.5      76.2  82.1  90.0  83.4  79.1  93.4  99.6  89.1 100.8
RFotfR       83.1  91.4  96.0  80.3  90.3  97.4  92.2  96.1 105.3
RFotfR.5     82.4  84.1  93.1  88.2  84.2  95.1 112.0  97.1 104.5
RFunsv       80.4  88.4  95.9  76.1  87.7  97.5  87.3  92.7 104.7
RFunsv.5     73.2  78.9  89.3  78.8  79.0  92.4  98.8  92.8 104.2
RFprx        82.6  86.3  93.1  80.7  80.5  97.7  88.6  93.8  99.5
RFprx.5      77.1  84.1  93.3  86.5  77.0  92.1  98.1  93.7 101.0
RFprxR       81.2  85.4  93.1  80.4  82.4  96.3  89.2  97.2 101.3
RFprxR.5     76.1  80.8  92.0  82.1  77.7  95.1 102.1  96.6 105.1
mRF0.25      73.8  80.2  91.6  75.3  75.6  90.2  97.6  87.5 102.1
mRF0.05      70.9  80.1  95.2  70.1  76.6  93.0  87.4  87.9 103.4
mRF          69.6  80.1  95.0  71.3  74.6  92.4  86.9  87.8 103.1
KNN          79.8  93.5 105.3  80.2  96.0  98.7  93.9  98.3 102.1

High Correlation
             MCAR              MAR               NMAR
             .25   .50   .75   .25   .50   .75   .25   .50   .75
RFotf        72.3  83.7  94.6  65.5  83.3  98.4  66.5  84.8 100.4
RFotf.5      70.9  72.1  80.9  69.5  70.9  91.0  70.1  70.8  97.3
RFotfR       68.6  81.0  93.6  59.5  87.1  98.9  61.2  88.2 100.3
RFotfR.5     58.4  58.9  64.6  56.7  55.1  88.4  58.4  60.9  97.3
RFunsv       62.1  75.1  91.3  56.8  70.8  97.8  58.1  73.3 100.6
RFunsv.5     54.2  57.5  65.4  54.0  49.4  80.0  55.4  51.7  90.7
RFprx        75.5  82.0  88.5  70.7  72.8  94.3  70.9  74.3 102.0
RFprx.5      70.4  72.0  78.6  69.7  71.2  90.3  70.0  72.2  98.2
RFprxR       61.9  68.1  76.6  58.7  64.1  79.5  60.4  74.6  97.5
RFprxR.5     57.3  58.1  61.9  55.9  54.1  71.9  57.8  60.2  93.7
mRF0.25      57.0  57.9  63.3  55.5  50.4  70.5  56.7  50.7  87.3
mRF0.05      50.7  54.0  61.7  48.3  48.4  74.9  49.9  48.6  85.9
mRF          48.2  49.8  61.3  47.0  47.5  70.2  46.6  47.6  82.9
KNN          52.7  63.2  83.2  52.0  71.1  96.4  53.2  74.9  99.2
mForest algorithms were by far the best. In some cases, they achieved a relative im-
putation error of 50, which means their imputation error was half of the strawman’s
value. Generally there are no noticeable differences between mRF (missForest) and
mRF0.05. Performance of mRF0.25, which uses only 4 regressions per cycle (as op-
posed to p for mRF), is also very good. Other algorithms that performed well in
high correlation settings were RFprxR.5 (proximity imputation with random splitting,
iterated 5 times) and RFunsv.5 (unsupervised multivariate splitting, iterated 5 times).
Of these, RFunsv.5 tended to perform slightly better in the medium and low correlation settings. We also note that while mForest performed well over medium correlation settings, its performance was not superior to other RF procedures in low correlation settings, and was sometimes worse than procedures like RFunsv.5. Regarding the comparison procedure KNN, while its performance also improved with increasing correlation, its performance in the medium and low correlation settings was generally much worse than that of the RF methods.
Missing data mechanism
The missing data mechanism also plays an important role in accuracy of RF pro-
cedures. Accuracy decreased systematically when going from MCAR to MAR and
NMAR. Except for heavy missingness (75%), all RF procedures under MCAR and
MAR were more accurate than strawman imputation. Performance in NMAR was
generally poor unless correlation was high.
Heavy missingness
Accuracy degraded with increasing missingness. This was especially true when
missingness was high (75%). For NMAR data with heavy missingness, procedures
were not much better than the strawman (and sometimes worse), regardless of correlation. However, even with missingness of up to 50%, if correlation was high, RF
procedures could still reduce the strawman’s error by one-half.
Iterating RF algorithms
Iterating generally improved accuracy for RF algorithms, except in the case of
NMAR data, where in low and medium correlation settings, performance some-
times degraded.
5.2.2 Computational speed
Figure 5.5 displays the log of the total elapsed time of each procedure, averaged over all experimental conditions and runs, with results ordered by the log-computational complexity of a data set, c = log10(np). The same results are displayed in a more compact manner in Figure 5.7. The fastest algorithm is KNN, which is generally 3 units faster on the log scale, i.e., 1000 times faster, than the slowest algorithm, mRF (missForest). To clarify these differences, Figure 5.6 displays the computational time of each procedure relative to KNN (obtained by subtracting the KNN log-time from each procedure's log-time). This figure shows that while mRF is 1000 times slower than KNN, the multivariate mForest algorithms, mRF0.05 and mRF0.25, improve speeds by about a factor of 10. After this, the next slowest procedures are the iterated algorithms, followed by the non-iterated algorithms. Some of these latter algorithms, such as RFotf, are 100 times faster than missForest, or only 10 times slower than KNN. These kinds of differences can have a real effect when dealing with big data. We have experienced settings where OTF algorithms take hours to run; the same data would take missForest hundreds of hours, which makes its use in such settings questionable.
5.3 Simulations
In this section we used simulations to study the performance of RF as the sample size n was varied. We wanted to answer two questions: (1) Does the relative imputation error improve with sample size? (2) Do these values converge to the same or different values for the different RF imputation algorithms?

For our simulations, there were 10 variables X1, ..., X10, where the true model was

\[
Y = X_1 + X_2 + X_3 + X_4 + \varepsilon,
\]

where ε was simulated independently from a N(0, 0.5) distribution. Variables X1 and X2 were correlated with a correlation coefficient of 0.96, as were X5 and X6. The remaining variables were uncorrelated. Variables X1, X2, X5, X6 were N(3, 3), variables X3, X10 were N(1, 1), variable X8 was N(3, 4), and variables X4, X7, X9 were exponentially distributed with mean 0.5.
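The simulation model can be reproduced in a few lines. Below is a Python sketch under the assumption that N(m, v) denotes mean m and variance v (the text does not spell out whether the second argument is a variance or a standard deviation), with each correlated pair drawn from a bivariate normal:

```python
import numpy as np

def simulate(n, rho=0.96, seed=None):
    """One draw from the Section 5.3 model: X1..X10 with (X1, X2) and (X5, X6)
    correlated at rho, and Y = X1 + X2 + X3 + X4 + eps, eps ~ N(0, 0.5)."""
    rng = np.random.default_rng(seed)

    def corr_pair(mean, var):
        cov = var * np.array([[1.0, rho], [rho, 1.0]])
        return rng.multivariate_normal([mean, mean], cov, size=n)

    X = np.empty((n, 10))
    X[:, [0, 1]] = corr_pair(3.0, 3.0)      # X1, X2 ~ N(3, 3), corr 0.96
    X[:, [4, 5]] = corr_pair(3.0, 3.0)      # X5, X6 ~ N(3, 3), corr 0.96
    X[:, 2] = rng.normal(1.0, 1.0, n)       # X3  ~ N(1, 1)
    X[:, 9] = rng.normal(1.0, 1.0, n)       # X10 ~ N(1, 1)
    X[:, 7] = rng.normal(3.0, 2.0, n)       # X8  ~ N(3, 4): sd = 2
    for j in (3, 6, 8):
        X[:, j] = rng.exponential(0.5, n)   # X4, X7, X9: exponential, mean 0.5
    eps = rng.normal(0.0, np.sqrt(0.5), n)  # eps ~ N(0, 0.5): sd = sqrt(0.5)
    Y = X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3] + eps
    return X, Y
```

Missingness would then be induced on the columns of X using the MCAR, MAR and NMAR procedures of Section 5.1.2.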
The sample size (n) was chosen to be 100, 200, 500, 1000, and 2000. Data was
made missing using the MCAR, MAR, and NMAR missing data procedures de-
scribed earlier. Percentage of missing data was set at 25%. All imputation param-
eters were set to the same values as used in our previous experiments as described
in Section 3.4. Each experiment was repeated 500 times and the relative imputation error, E_R(I), recorded in each instance. Figure 5.8 displays the mean relative imputation error for each RF procedure, and its standard deviation, for each sample size
[Figure 5.5: one panel per procedure (RFotf, RFotf.5, RFotfR, RFotfR.5, RFunsv, RFunsv.5, RFprx, RFprx.5, RFprxR, RFprxR.5, mRF0.25, mRF0.05, mRF, KNN) plotting log computing time against c = log10(np).]

Figure 5.5: Log of computing time for a procedure versus log-computational complexity of a data set, c = log10(np).
[Figure 5.6: one panel per procedure plotting computing time relative to KNN against c = log10(np).]

Figure 5.6: Relative log-computing time (relative to KNN) versus log-computational complexity of a data set, c = log10(np).
[Figure 5.7: single panel plotting log computing time against log dimension for all procedures.]

Figure 5.7: Log of computing time for a procedure versus log of dimension of a data set, with compute times averaged over runs and experimental conditions.
setting. As can be seen, values improve with increasing n. It is also noticeable that
performance depends upon the RF imputation method. In these simulations, the
missForest algorithm mRF0.1 appears to be best (note that p = 10, so mRF0.1 corre-
sponds to the limiting case missForest). It should also be noted that the performance
of RF procedures decreases systematically as the missing data mechanism becomes
more complex. This mirrors our previous findings.
5.4 Conclusions
Being able to deal with missing data effectively is of great importance to scien-
tists working with real-world data today. A machine learning method such as RF,
known for its excellent prediction performance and ability to handle all forms of
data, represents a potentially attractive solution to this challenging problem. How-
ever, because no systematic study of RF procedures had been attempted in missing
[Figure: panels MCAR, MAR, NMAR; y-axis Imputation Error +/- SD; methods RFotf, RFotf.5, RFotfR, RFotfR.5, RFunsv, RFunsv.5, RFprx, RFprx.5, mRF0.5, mRF0.1; sample sizes n = 100, 200, 500, 1000, 2000.]

Figure 5.8: Mean relative imputation error ± standard deviation from simulations under different sample size values n = 100, 200, 500, 1000, 2000.
data settings, we undertook a large scale experimental study of various RF proce-
dures to determine which methods performed best, and under what types of settings.
What we found was that correlation played a very strong role in the performance of
RF procedures. Imputation performance of all RF procedures improved with increasing
correlation of features. This held even with heavy levels of missing data and
in all but the most complex missing data scenarios. When there is high correlation
we recommend using a method like missForest which performed the best in such
settings. Although it might seem obvious that increasing feature correlation should
improve imputation, we found that in low to medium correlation, RF algorithms did
noticeably better than the popular KNN imputation method. This is interesting be-
cause KNN is related to RF. Both methods are a type of nearest neighbor method,
although RF is more adaptive than KNN and in fact can be more accurately de-
scribed as an adaptive nearest neighbor method. This adaptivity of RF may play a
special role in harnessing correlation in the data that may not necessarily be present
in other methods, even methods that have similarity to RF. Thus, we feel it is worth
emphasizing that correlation is extremely important to RF imputation methods.
In big data settings, computational speed will play a key role. Thus, practically
speaking, users might not be able to implement the best method possible because
computational times will simply be too long. This is the downside of a method
like missForest, which was the slowest of all the procedures considered. As one
solution, we proposed mForest (mRFα), which is a computationally more efficient
implementation of missForest. Our results showed mForest could achieve up to a
10-fold reduction in compute time relative to missForest. We believe these compu-
tational times can be improved further by incorporating mForest directly into the
native C-library of randomForestSRC (RF-SRC). Currently mForest is run as
an external R-loop that makes repeated calls to the impute.rfsrc function in
RF-SRC. Incorporating mForest into the native library, combined with the openMP
parallel processing of RF-SRC, could make it much more attractive. However, even
with all of this, we still recommend some of the more basic OTFI algorithms, like
unsupervised RF imputation, for big data. These algorithms perform
solidly in terms of imputation and are hundreds of times faster than missForest.
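To make the outer-loop structure concrete, here is a minimal pure-Python sketch of an mForest-style cycle: variables are split into random mutually exclusive groups of roughly α·p columns, and each group is re-imputed in turn from the remaining columns. The function names are ours, and a tiny 1-nearest-neighbour predictor stands in for the multivariate forest that mForest actually grows at each step; this illustrates the control flow only, not the randomForestSRC implementation.

```python
import random

def knn1_predict(train, targets, query):
    """Tiny 1-nearest-neighbour stand-in for the multivariate forest fit."""
    best = min(range(len(train)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], query)))
    return targets[best]

def mforest_impute(X, alpha=0.5, max_iter=5, seed=0):
    """Sketch of an mForest (mRF_alpha) outer loop.

    X: list of rows (lists of floats) with None marking missing entries.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    miss = {j: [i for i in range(n) if X[i][j] is None] for j in range(p)}

    # Cold start: fill each missing entry with its column mean.
    for j in range(p):
        obs = [X[i][j] for i in range(n) if X[i][j] is not None]
        m = sum(obs) / len(obs)
        for i in miss[j]:
            X[i][j] = m

    group_size = max(1, int(round(alpha * p)))
    for _ in range(max_iter):
        cols = list(range(p))
        rng.shuffle(cols)  # random mutually exclusive groups of ~alpha*p columns
        for g in range(0, p, group_size):
            ys = cols[g:g + group_size]              # current response group
            xs = [j for j in range(p) if j not in ys]
            for j in ys:
                donors = [i for i in range(n) if i not in miss[j]]
                train = [[X[i][k] for k in xs] for i in donors]
                targ = [X[i][j] for i in donors]
                for i in miss[j]:                    # re-impute the group
                    X[i][j] = knn1_predict(train, targ,
                                           [X[i][k] for k in xs])
    return X
```

With alpha small enough that each group holds a single variable, the cycle reduces to the missForest limit; larger alpha means fewer model fits per iteration, which is the source of the speedup.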
5.5 Discussion
MissForest outperforms OTF imputation for the following reason. When there are
two variables (say x1 and x2, with x1 containing missing values) that are highly
correlated but not important in predicting Y, x2 will not be split on early, so OTF
imputation will not be accurate in imputing x1. missForest will be very accurate in
imputing x1, and unsupervised imputation will fall in between. This is where
missForest gains its advantage.

Although the three random forest imputation algorithms (OTF, unsupervised, and
missForest) represent the three different imputation strategies mentioned above,
they all essentially partition the cases in a way that allows the correlation between
two variables to be exploited. For a variable with missing values, the earlier its
correlated covariate gets split on, the better the imputation accuracy will be.
Since missForest always splits the most correlated covariate first, its imputation
accuracy is always as good as or superior to OTF imputation. With OTF imputation,
the response variable determines when the most correlated covariate gets split on.
The unsupervised algorithm is in between, as the response variable is selected at random.
Although it was shown that correlation can serve as a general guide to impu-
tation accuracy, the real determining factor for the imputation accuracy of a specific
variable is how well the rest of the covariates can predict its missing values collec-
tively. For instance, if a continuous variable needs to be imputed, the 'error rate', or
'percent of variation explained', from the regression tree with this variable as the
outcome is the true indicator of the imputation accuracy. In some cases, the
imputation accuracy may be high for a particular variable even when its correlation
with the rest of the covariates is low.
A user should always look at the pattern of the missing values before any
imputation algorithm is applied, as some pitfalls may exist. For example, if a variable
that is highly predictive of the response variable has high missingness, and it is
independent of the covariates (meaning it cannot be predicted from the covariates),
its imputed values will be far from accurate. As a result, inference made on this
variable with the imputed data set will be biased.
In the next chapter, we will look at some of these pitfalls in detail and offer some
possible solutions.
Chapter 6
RF Imputation on VIMP and Minimal Depth
6.1 Introduction
Random Forest, as a machine learning method, has gained popularity in many re-
search fields. It is appreciated for its improved prediction accuracy over a
single CART tree. In addition, it is a good method for high-dimensional data, espe-
cially when complex relations exist between the predictor variables. Because of its
intrinsic ability to deal with complicated interactions, multiple methods have been
developed to perform variable selection using Random Forest. These methods are
based on two measures: the variable importance measure (VIMP) and the minimal
depth measure.
Missing values in predictor variables are often encountered in data analysis.
Although some empirical studies have been carried out to compare different impu-
tation methods in terms of imputation accuracy (Stekhoven and Buhlmann, 2012)
or variable selection (Genuer et al., 2010), a direct investigation of how the variable
importance and minimal depth measures are affected by different ways of
handling missing values may provide guidance on which method for handling the
missing data problem should be chosen, and shed light on how these different methods
affect the result of variable selection. In this study, we introduced missing values in
a particular variable of the simulated or real-life datasets and looked at how VIMP
and minimal depth were affected by different RF imputation methods.
6.2 Approach
When an investigator intends to analyze a data set containing missing values for
the purpose of variable selection using random forest, one of three approaches
to handling the missing values can be chosen: 1. use the complete case method,
meaning any observation that contains a missing value is deleted; 2.
use the built-in missing data options in the software, if available; 3. impute
the data ahead of the analysis. The complete case method may not be a good choice,
especially when the number of predictors that contain missing values is high, since a
high percentage of observations would be deleted as a result. The built-in methods for
handling missing values are OTF imputation and proximity imputation. We com-
pared the performance of these two methods and some other imputation methods,
namely, the complete case method, proximity imputation, OTF imputation, mul-
tivariate missForest (mForest) imputation, missForest imputation, and KNN imputation, for
correctly estimating the variable importance and minimal depth. The ideal im-
putation method is the one for which the importance scores or minimal depth calculated
from the imputed data are identical to those that would have been
calculated from the data without missing values. Therefore, we defined a measure
called the relative importance score (RelImp) and the relative minimal depth score (RelMdp)
in this study, calculated as
RelImp = (importance score with imputed data) / (importance score with original full data)

RelMdp = (minimal depth with imputed data) / (minimal depth with original full data)
The imputation performance is considered good if RelImp or RelMdp is
equal or close to 1. If RelImp is much greater than one, or if RelMdp is much
less than 1, it implies that a variable may be falsely selected because the missing
value imputation exaggerates its importance. If RelImp is much smaller than one,
or if RelMdp is much greater than one, it indicates that an important variable
may be missed in variable selection because of the missing value imputation.
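In code, the two measures are simple ratios of paired quantities; a minimal sketch (function names are ours):

```python
def rel_imp(vimp_imputed, vimp_full):
    """RelImp: importance score with imputed data over importance with full data."""
    return vimp_imputed / vimp_full

def rel_mdp(depth_imputed, depth_full):
    """RelMdp: minimal depth with imputed data over minimal depth with full data."""
    return depth_imputed / depth_full

# Values near 1 mean the imputation preserved the variable-selection measure.
# RelImp >> 1 or RelMdp << 1: the variable risks being falsely selected;
# RelImp << 1 or RelMdp >> 1: an important variable risks being missed.
print(rel_imp(0.8, 1.0))   # 0.8 -> VIMP shrank to 80% of its full-data value
```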
6.2.1 Simulation Experiment Design
Five simulations with synthesized data were carried out to study the effect of differ-
ent imputation methods on the importance measurement and minimal depth. We
started with the simplest scenario.

Simulation 1: Y = x1 + 0.2 × x2 + 0.1 × x3 + 0.1 × x4

where x1, x2, x3, and e are normally distributed with means of 2, 2, 1, 0 and standard devia-
tions of 2, 2, 1, 0.5, respectively, and x4 follows an exponential distribution with rate 0.5.
There are two important variables in the second simulation, and these two im-
portant variables are uncorrelated and independent of each other.

Simulation 2: Y = x1 + x2 + 0.1 × x3 + 0.1 × x4

where x1, x2, x3, x4 are defined as in simulation 1.
In the 3rd, 4th, and 5th simulations, x1 and x2 have the same variance and mean
as in simulations 1 and 2, but instead of being independent of each other, they are
correlated with a correlation coefficient of 0.75.

Simulation 3: Y = x1 + 0.2 × x2 + 0.1 × x3 + 0.1 × x4

Simulation 4: Y = x1 + x2 + 0.1 × x3 + 0.1 × x4

Simulation 5: Y = x1 + x2 + x3 + 0.5 × x4
Table 6.1: Summary characteristics for the simulated models.

               Correlation between x1 and x2   Important variables
Simulation 1   0                               x1
Simulation 2   0                               x1, x2
Simulation 3   0.75                            x1
Simulation 4   0.75                            x1, x2
Simulation 5   0.75                            x1, x2, x3
Five thousand observations were created for each simulated dataset. We
first created q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent missing val-
ues in x1 under the MCAR missing data mechanism, and then used either complete
case deletion or one of the six RF imputation methods to handle the dataset.
The dataset was then analyzed using regular RF. VIMP and minimal depth for
x1 were recorded; the relative importance score (RelImp) and relative minimal depth
score (RelMdp) were calculated and plotted.
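The data generation and MCAR mechanism just described can be sketched as follows (standard library only; the RF analysis step is omitted, the error term e defined earlier is written explicitly, and the function names are ours):

```python
import random

def simulate_1(n=5000, seed=1):
    """Simulation 1: Y = x1 + 0.2*x2 + 0.1*x3 + 0.1*x4 + e."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.gauss(2, 2)
        x2 = rng.gauss(2, 2)
        x3 = rng.gauss(1, 1)
        x4 = rng.expovariate(0.5)   # exponential with rate 0.5
        e = rng.gauss(0, 0.5)
        y = x1 + 0.2 * x2 + 0.1 * x3 + 0.1 * x4 + e
        data.append({"x1": x1, "x2": x2, "x3": x3, "x4": x4, "y": y})
    return data

def make_mcar(data, var="x1", q=25, seed=2):
    """Set q percent of `var` to None completely at random (MCAR)."""
    rng = random.Random(seed)
    k = round(len(data) * q / 100)
    for i in rng.sample(range(len(data)), k):
        data[i][var] = None
    return data

rows = make_mcar(simulate_1(), var="x1", q=25)
print(sum(r["x1"] is None for r in rows))   # 1250 of 5000 values of x1 missing
```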
For OTF imputation, one iteration was used. For missForest or mForest imputa-
tion, a maximum of five iterations was used if the algorithm did not converge before
the fifth iteration. For all the imputation methods, nodesize, nsplit, and ntree (the num-
ber of trees grown for each RF) were chosen to be 5, 20, and 250, respectively. The α
value in the mRFα imputation method was chosen to be 0.1 as an equivalent of
missForest imputation. As there are only four predictors in the simulation, α was set
to 0.5 in mRFα for mForest imputation.
6.3 Simulation results
6.3.1 Complete case analysis
As shown in figures 6.1 to 6.5, the complete case method resulted in the same VIMP
and minimal depth as the original data with no missingness. Although some ef-
ficiency can be lost, the complete case method gives the same random forest when
the missing mechanism is MCAR. As the percentage of missingness increases, the
standard deviation of the calculated VIMP increases.
6.3.2 When predictive variables are not correlated
In simulations 1 and 2, when the predictor variables are not correlated,
the VIMP for x1 decreases regardless of the imputation method used. As
shown in figures 6.1 and 6.2, the relative importance score (RelImp) for x1 decreases
as the percentage of missingness in x1 increases. The magnitude of the decrease
was identical for all random forest imputation methods.
6.3.3 When the predictive variables are correlated
In simulations 3, 4, and 5, x1, the variable containing missing values, is correlated with
x2. Similar to simulations 1 and 2, the relative importance score (RelImp) for x1
decreases as the percentage of missingness increases. However, as shown in figures
6.3, 6.4, and 6.5, the magnitude of the decrease was smaller for the OTF and mFor-
est imputation methods, followed by unsupervised and proximity imputation. KNN
imputation had the worst performance in terms of preservation of VIMP and min-
imal depth. Generally speaking, better preservation of VIMP and minimal depth
corresponds to better imputation accuracy, as shown in figures 6.3, 6.4, and 6.5.
6.3.4 VIMP change of all the variables
We then looked at the VIMP and minimal depth changes in all four variables in model
5 when the missingness was in x1. That is, x1 and x2 are correlated with r of 0.75,
and Y = x1 + x2 + x3 + 0.5 × x4.

As shown in figure 6.6, when x1 has missing values, analysis after imputation results
in the VIMP of x1 decreasing, the VIMP of x2 increasing, and the VIMP of x3 and x4
remaining unchanged.

The adaptive method does not behave like this; its only loss is the effective sample size
for x1.
6.3.5 Correlation change due to imputation
To better understand the imputation effect, we further looked at the correlation
change when mForest imputation was performed, as mForest has the highest impu-
tation accuracy among all the RF imputation methods. The model from simulation
5 was used.

As shown in figure 6.8, the correlation of x1 and x2 increases, while the correla-
tion of x1 and Y decreases. This explains why the VIMP of x1 decreases. Although
the correlation of x2 and Y stays the same, x2 is split on more often to compensate for
the decrease of VIMP in x1.
6.3.6 The effect of nodesize
Noticing the change in correlations between the variables, we suspected that choos-
ing a larger nodesize might be beneficial. We repeated simulation 5 with nodesize set
to 200 instead of 5. As shown in figure 6.9, the bias in estimating VIMP was
much reduced with the adaptive and mForest algorithms. The OTF algorithm performed
especially well, resulting in very little bias even when the percent of missingness was
high.
6.3.7 How coefficient estimation is affected
We further studied how the coefficient estimates in regression are affected when the
missing values are imputed with RF or KNN imputation. We used simulation
5 described above, that is, Y = x1 + x2 + x3 + 0.5 × x4. Missing values of
q (q = 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80) percent were created by the
MCAR mechanism. The missing values were then imputed with RF or KNN
imputation methods, and a regular GLM was then fit to the imputed dataset
to estimate β1, β2, β3, and β4. The ratio of the estimated β to the true β was
recorded and plotted in figure 6.11.
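The direction of the effect reported for the GLM coefficients (the coefficient of the incomplete variable is attenuated while that of its correlated partner is inflated) can be reproduced with a deliberately crude stand-in imputer. The sketch below uses mean imputation in place of the RF/KNN imputers and solves the two-predictor least-squares fit by hand; the specific numbers are illustrative only.

```python
import math, random

rng = random.Random(0)
n, r, q = 20000, 0.75, 0.5          # sample size, corr(x1, x2), missing fraction

# Generate correlated predictors and the response y = x1 + x2.
x1, x2, y = [], [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    a = z1
    b = r * z1 + math.sqrt(1 - r * r) * z2
    x1.append(a); x2.append(b); y.append(a + b)

# MCAR missingness in x1, then mean imputation (a crude stand-in imputer).
obs = [rng.random() > q for _ in range(n)]
m = sum(v for v, o in zip(x1, obs) if o) / sum(obs)
x1_imp = [v if o else m for v, o in zip(x1, obs)]

# Two-predictor OLS via the 2x2 normal equations (Cramer's rule).
def mean(u): return sum(u) / len(u)
def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (s - mv) for p, s in zip(u, v)) / len(u)

s11, s22, s12 = cov(x1_imp, x1_imp), cov(x2, x2), cov(x1_imp, x2)
c1, c2 = cov(x1_imp, y), cov(x2, y)
det = s11 * s22 - s12 * s12
b1 = (c1 * s22 - s12 * c2) / det
b2 = (s11 * c2 - s12 * c1) / det
print(round(b1, 2), round(b2, 2))   # b1 well below 1, b2 well above 1
```

With r = 0.75 and half of x1 missing, the fitted b1 lands well below its true value of 1 and b2 well above it, mirroring the pattern described above.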
6.4 A real-life dataset

We used a real-life dataset on diabetes to look at how different imputation methods
affected the calculation of VIMP and minimal depth. q = (15, 20, 25, 30, 35, 40, 45,
50, 55, 60, 65, 70, 75, 80, 85) percent of missingness was created in the predictor
variable BMI. The missing values were imputed using the RF imputation methods
or KNN imputation. As shown in figure 6.10, the VIMP for BMI decreased, and the
minimal depth increased, as the percent of missingness increased. Among all the
imputation methods, the adaptive method had only a 10 to 15 percent decrease in
the calculated VIMP. missForest and mForest imputation resulted in a larger decrease
in VIMP compared to the adaptive method, but a much smaller decrease compared
to the unsupervised and proximity imputation methods.
6.5 Conclusion
1. Generally speaking, better imputation accuracy results in better preservation
of VIMP or minimal depth. In the simulation examples in this study, OTF
imputation and mForest imputation had better imputation accuracy, as well
as better preservation of VIMP and minimal depth, compared with unsuper-
vised imputation or proximity imputation. All the RF imputation methods
performed better in accuracy and inference than KNN imputation.

2. In the simulations in this study, OTF imputation was as good as or even slightly
better at preserving the original VIMP or minimal depth than the more
computationally expensive imputation methods such as missForest or mFor-
est.

3. Minimal depth is especially well preserved for important predictors, since an
important predictor splits close to the tree root node and we do not need to
worry about cases being assigned to the wrong daughter node.
4. In some specific settings, mForest imputation may have certain unintended
consequences, as it can alter the correlation among variables. When the per-
cent of missingness is low for a variable, the alteration of the correlation is not
significant; when the percent of missingness is high, the alteration can be
significant. By choosing a larger nodesize, the change in correlation can be
reduced. Therefore, we suggest that a large nodesize be chosen for
imputation when the percentage of missingness is high.
6.6 Discussion
Although RF imputation algorithms are able to impute missing values with superior
accuracy, caution needs to be taken if the purpose of the imputation is further
inference. We can foresee understandable pitfalls in two scenarios. Scenario
1: the variable containing missing values is not correlated with any other predictors.
Then the imputation is equivalent to a random draw from the observed values. There
will be bias in the inference, and the magnitude of the bias depends on the percent
of missingness. Scenario 2: the variable containing missing values is correlated
with only one other predictor. Then the imputation will result in a correlation change
(and a change in the covariance matrix) between variables. Not surprisingly, therefore,
there will be bias in the inference. This bias can be reduced with a larger nodesize.

When the variable with missing values correlates with more than one predictor,
we did not observe much correlation change between the variables in our simula-
tions. This is because the imputed values do not lie on any one particular regression
line. In this case, imputation is beneficial to the analysis.
Since the inference from the analysis can be affected differently depending on the
data structure and the percent of missingness, we suggest that users of the random
forest imputation algorithms study the structure of the data before imputation.
We also suggest that a relatively large nodesize be used in the imputation. Since a
large nodesize also improves computational speed, this is a desirable
characteristic of the random forest imputation algorithms.
[Figure: panels "Simulation 1 Importance", "Simulation 1 Minimal depth", and "Simulation 1 Imputation accuracy" versus Percent Missing, for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.1: Simulation 1: The effect of different imputation methods on the importance measurement and the minimal depth.
[Figure: panels "Simulation 2 Importance", "Simulation 2 Minimal depth", and "Simulation 2 Imputation accuracy" versus Percent Missing, for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.2: Simulation 2: The effect of different imputation methods on the importance measurement and the minimal depth.
[Figure: panels "Simulation 3 Importance", "Simulation 3 Minimal depth", and "Simulation 3 Imputation accuracy" versus Percent Missing, for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.3: Simulation 3: The effect of different imputation methods on the importance measurement and the minimal depth.
[Figure: panels "Simulation 4 Importance", "Simulation 4 Minimal depth", and "Simulation 4 Imputation accuracy" versus Percent Missing, for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.4: Simulation 4: The effect of different imputation methods on the importance measurement and the minimal depth.
[Figure: panels "Simulation 5 Importance", "Simulation 5 Minimal depth", and "Simulation 5 Imputation accuracy" versus Percent Missing, for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.5: Simulation 5: The effect of different imputation methods on the importance measurement and the minimal depth.
[Figure: four panels, "x1" through "x4 Importance - Missingness in x1", plotting Standardized Importance against Percent Missing for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.6: The effect of different imputation methods on the importance measurement for all four variables.
[Figure: four panels, "x1" through "x4 Minimal Depth - Missingness in x1", plotting Standardized Minimal Depth against Percent Missing for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.7: The effect of different imputation methods on the minimal depth measurement for all four variables.
[Figure: three panels plotting correlation (r) against Percent Missing after missForest imputation: x1 with x2; x1 and x2 with Y; x3 and x4 with Y.]

Figure 6.8: The correlation change due to missForest imputation.
[Figure: Standardized Importance against Percent Missing for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.9: Simulation 5 with nodesize of 200.
[Figure: panels "MCAR BMI importance, 500 rep" and "MCAR BMI Minimal Depth, 500 rep", plotting the ratio to the original value against Percent Missing for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.10: The effect of different imputation methods on the importance measurement and the minimal depth with the Diabetes data.
[Figure: four panels, "Beta1" through "Beta4 Estimation - Missingness in x1", plotting the estimated beta against Percent Missing for Complete Case, Adaptive, Unsupervised, Proximity, mf, missForest, and knn.]

Figure 6.11: The effect of different imputation methods on the parameter estimation in GLM.
Chapter 7
MESA Data Analysis
The MESA (Multi-Ethnic Study of Atherosclerosis) study was designed to study the
prevalence, risk factors, and sub-clinical cardiovascular disease (CVD) in a multi-
ethnic cohort, with 10 years of follow-up for incident CHD events. 6814 subjects were
enrolled in the MESA study. 711 variables, including recorded variables and com-
puted variables, such as different versions of the Framingham score, are available as
potential predictors for the outcome. Although multiple outcomes are available in
the MESA dataset, we focus on coronary heart disease (CHD) as the outcome in
this analysis. Thirty-five observations with missing CHD outcome were removed,
resulting in 6779 observations for the analysis.
7.1 How different imputation affects variable selection in MESA data analysis
7.1.1 Variables Included
In the MESA data, there are 711 total variables with at least one observation (8 vari-
ables had no observations and were therefore deleted), including variables from clinic
procedures, questionnaires, and created analytic variables (e.g. body mass index,
hypertension stage, ankle-brachial index), and key Reading Center variables (e.g.
total calcium score, aortic distensibility by MRI, average carotid wall thickness by
ultrasound). These variables cover multiple aspects of the patients, such as med-
ical history, anthropometry, health and life, medication information, neighborhood
information, personal history, CT, ECG, lipids (blood group or NMR lipoprofile-II
spectral analysis), MRI, pulse wave, ultrasound distensibility, ultrasound IMT, and
physical activity.
Out of the 711 variables, 134 have more than 50 percent missing val-
ues, 114 have more than 75 percent missing values, and 58 have
more than 90 percent missing values. The percent of missingness for each
variable is plotted in figure 7.1.
7.1.2 Relative VIMP by different RF imputation method
To select variables that are important in predicting 10-year CHD, we first used
unsupervised imputation, mForest imputation, and OTF imputation to impute the
dataset. Random Survival Forest analysis was then conducted on the imputed
datasets. The top important variables in predicting 10-year CHD and their rela-
tive VIMPs are reported in table 7.1. We also compared these results with the
OTF no-imputation method.
As shown in table 7.1, the total calcium volume score and different versions of the
Framingham CHD risk score were the top-ranked predictors in all the imputed
datasets. Interestingly, with OTF imputation, three variables ranked ahead of the
Framingham scores in terms of importance in predicting the outcome, and with
OTF no-imputation, six variables ranked ahead of the Framingham scores.
7.1.3 VIMP of TNF-alpha
We were particularly interested in how important TNF-alpha is in predicting 10-
year CHD. Although there has been increasing evidence that cytokines in general
and TNF-alpha in particular play an importance role in cardiovascular disease, it
is not clear whether the level of these cytokines are predictive of long-term CHD.
Out of the 134 variables that have more than 50 percent missing values, some of
them are serum cytokines such as TNF-alpha soluble receptors, IL2, ec tl. For TNF-
alpha, 57.7 percent of the measurements were missing. We first carried out Random
Survival Forest (RSF) using only the observations with non-missing TNF-alpha
measurements, and concluded that TNF-alpha was not important in predicting 10-
year CHD. To rule out the possibility that the VIMP of TNF-alpha was affected by
NMAR missing mechanism, we first imputed the dataset using mForest imputation,
then coded all the missing values in TNF-alpha as 0 and carried out RSF analysis.
This is effectively the missingness incorporated in attributes (MIA) approach. As
shown in Table 7.1, the relative VIMP of TNF-alpha compared to the CAC score
was 0.01, indicating
it is not important in predicting the outcome. This is consistent with the result
from the analysis using only observations with TNF-alpha measurements. This
also indicates that the missing mechanism of TNF-alpha is not NMAR. This result
was compared with the results from RSF on the imputed datasets in Table 7.1. From
this analysis, we can confidently conclude that TNF-alpha is not predictive of
10-year CHD risk.
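The MIA-style analysis described above (impute the other variables, code the missing TNF-alpha values as 0, then measure importance relative to the top variable) can be sketched as follows. This is an illustration on synthetic data, with scikit-learn's permutation importance standing in for RSF VIMP; the variable names are hypothetical and the dissertation's actual analysis used random survival forests:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)

# Toy data: x0 is predictive of y; "tnf" is noise with ~58% of values missing,
# mimicking the 57.7% missingness of TNF-alpha in MESA.
n = 500
x0 = rng.normal(size=n)
tnf = rng.normal(size=n)
miss = rng.random(n) < 0.58
y = (x0 + rng.normal(scale=0.5, size=n) > 0).astype(int)

# MIA-style coding: keep observed values, set missing measurements to 0
# (in the text this is done after the other variables are imputed).
tnf_mia = np.where(miss, 0.0, tnf)
X = np.column_stack([x0, tnf_mia])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0).importances_mean

# Relative VIMP: importance scaled by the top variable's importance.
rel = imp / imp.max()
print("relative importance of tnf vs x0:", rel[1])
```

A small relative importance for the MIA-coded variable, as here, is the pattern the text uses to argue the missingness mechanism is not driving the VIMP result.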
[Figure 7.1: scatter plot of percent missing (y-axis, 0–100) versus variable index (x-axis, 0–700)]
Figure 7.1: The percent of missingness for all the variables in MESA data.
Table 7.1: Top importance variables and the rank of importance of tnfri1 with different imputation methods

  Use Observed               MIA for tnfri1             mForest Imputation
  Variable      Importance   Variable      Importance   Variable      Importance
  lncac         1.00         lncac         1.00         lncac         1.00
  Fram. Scores  0.20-0.11    Fram. Scores  0.17-0.07    Fram. Scores  0.16-0.08
  maxstn1c      0.089        zmximt1c      0.051        maxint1c      0.057
  youngm1       0.072        stm1          0.002        zmximt1c      0.051
  totvip1       0.070        age1c         0.027        rdpedis1      0.039
  pkyrs1c       0.051        pkyrs1c       0.026        lptib1        0.038
  chwyrs1c      0.035        maxint1c      0.021        pkyrs1c       0.029
  spp1c         0.034        lptib1        0.020        pprescp1      0.025
  ...(27 variables)...       abi1c         0.019        age1c         0.023
  tnfri1        0.01         ...(10 variables)...       ...(6 variables)...
  ...           ...          tnfri1        0.010        tnfri1        0.015

  Unsupervised               New OTF with Imputation    OTF No Imputation
  Variable      Importance   Variable      Importance   Variable      Importance
  lncac         1.00         lncac         1.00         lncac         1.00
  Fram. Scores  0.16-0.09    Chwday1       0.225        Chwday1       0.197
  lolucdp1      0.087        tfpi1         0.165        vwf1          0.180
  zmximt1c      0.063        vwf1          0.156        aspeage1      0.131
  rdpedis1      0.043        Fram. Scores  0.110-0.094  hrmage1c      0.103
  maxint1c      0.038        pkyrs1c       0.086        tfpi1         0.094
  lptib1        0.035        aspeage       0.078        pkyrs1c       0.079
  pkyrx1c       0.028        stm1          0.064        Fram. Scores  0.078-0.074
  ...(2 variables)...        ...           ...          aspsage1      0.062
  tnfri1        0.023        ...           ...          lptib1        0.062
  ...           ...          ...           ...          mmp31         0.057
Table 7.2: Top importance variables explanation

  Variable Name  Meaning                                          Percent missing
  tnfri1         Tumor Necrosis Factor-a soluble receptors        57.7
  maxstn1c       MAXIMUM CAROTID STENOSIS                         1.4
  youngm1        YOUNGS MODULUS                                   4.2
  pkyrs1c        PACK-YEARS OF CIGARETTE SMOKING                  1.2
  chwyrs1c       CHEWING TOBACCO AMOUNT                           0.4
  spp1c          SEATED PULSE PRESSURE (mmHg)                     0.04
  totvip1        Total Vascular Impedance (pulse wave)            7.0
  lptib1         LEFT POSTERIOR TIBIAL BP (mmHg)                  0.9
  rdpedis1       RIGHT DORSALIS PEDIS BP (mmHg)                   1.0
  abi1c          ANKLE-BRACHIAL INDEX (BP)                        1.1
  maxint1c       INTERNAL CAROTID INTIMAL-MEDIAL THICKNESS (mm)   2.7
  zmximt1c       Z SCORE MAXIMUM IMT                              1.1
  lulocdp1       REASON AABP INCOMPLETE (L): UNABLE TO LOCATE DP  98.9
  cgrday1        CIGARS AVERAGE NUMBER SMOKED PER DAY             90.8
7.2 Ten-year CHD prediction in the absence of CAC score
7.2.1 CHD risk engine
Personalized care based on a patient's risk of developing atherosclerotic cardio-
vascular disease (ASCVD) is the safest and most effective way of treating that
patient. ASCVD risk engines provide a promising way for clinicians to tailor
treatment to risk by identifying patients who are likely to benefit from, or be
harmed by, a particular treatment. Current guidelines on ASCVD prevention match
the intensity of risk-reducing therapy to the patient's absolute risk for new or
recurrent ASCVD events using ASCVD risk engines (Stone et al., 2014). For instance,
the 2013 American College of Cardiology/American Heart Association (ACC/AHA)
guidelines recommend that patients with a calculated pooled cohort risk score of
7.5% or greater should be eligible for statin therapy for primary prevention.
A number of multivariable risk models have been developed for estimating the
risk of initial cardiovascular events in healthy individuals. The original Framing-
ham risk score, published in 1998, was derived from a largely Caucasian popula-
tion of European descent (Wilson et al., 1998) using the endpoints of CHD death,
Nonfatal MI, Unstable angina, and Stable angina. Prediction variables used in
Framingham CHD risk score included age, gender, total or LDL cholesterol, HDL
cholesterol, systolic blood pressure, diabetes mellitus, and current smoking status.
The Framingham General CVD risk score (2008) extended the original Framingham
risk score to include all of the potential manifestations and adverse consequences
of atherosclerosis, such as stroke, transient ischemic attack, claudication, and
heart failure (HF) (D'Agostino et al., 2008). The ACC/AHA pooled
cohort hard CVD risk calculator (2013) was the first risk model to include data
from large populations of both Caucasian and African-American patients (Goff et
al., 2014), developed from several cohorts of patients. The model includes the
same parameters as the 2008 Framingham General CVD model, but in contrast to
the 2008 Framingham model includes only hard endpoints (fatal and nonfatal MI
and stroke). However, while the calculator appears to be well-calibrated in some
populations similar to those for which the calculator was developed (REGARDS),
it has not been as accurate in other populations (Rotterdam) (Kavousi et al., 2014).
Prediction variables used in ACC/AHA pooled cohort hard CVD risk calculator
(2013) were age, gender, total cholesterol, HDL cholesterol, systolic blood pres-
sure, blood pressure treatment, diabetes mellitus, and current smoking. Endpoints
assessed in ACC/AHA pooled cohort hard CVD risk calculator (2013) were CHD
death, nonfatal MI, fatal stroke, and nonfatal stroke. Another well-known risk
score is the JBS risk score (2014), which is based on the QRISK lifetime cardiovascular risk
calculator and extends the assessment of risk beyond the 10-year window, allowing
for the estimation of heart age and assessment of risk over longer intervals.
The MESA risk score (2015) improved the accuracy of 10-year CHD risk estimation
by incorporating the CAC (coronary artery calcium) score in the algorithm,
together with the traditional risk factors. It was shown that inclusion of CAC in
the MESA risk score resulted in significant improvements in risk prediction
(C-statistic 0.80 vs. 0.75; p<0.0001) compared to using only the traditional risk
factors. In addition, external validation in both the HNR (German Heinz Nixdorf
Recall) and DHS (Dallas Heart Study) studies provided evidence of very good
discrimination and calibration. The prediction variables used in the MESA risk
score (2015) were age, gender, ethnicity (non-Hispanic white, Chinese American,
African American, Hispanic), total cholesterol, HDL cholesterol, lipid-lowering
treatment, systolic
blood pressure, blood pressure treatment (yes or no), diabetes mellitus (yes or
no), current smoking (yes or no), family history of myocardial infarction at any
age (yes or no), and coronary artery calcium score. The endpoints assessed in the
MESA risk score (2015) were CHD death, nonfatal MI, resuscitated cardiac arrest,
and coronary revascularization in patients with angina. Although the MESA risk score with
CAC incorporated appears to be superior in risk prediction, one problem is that the
CAC score may not be available for many individuals. In this study, we look at
different strategies for building a risk engine when the CAC score is not
available in the testing data.
7.2.2 Methods
When the CAC score is available in the training data set, but not in the testing
data set, four strategies may be applied:
1. Use the Framingham variables only.
2. Use the full model (all the variables available).
3. Fit the Framingham model with the true CAC score, then predict CAC in the
testing dataset using all the testing variables.
4. New method: replace the true CAC score in the training data with the predicted
CAC score and proceed as in strategy 3.
The detailed steps of this new strategy are as follows.
1. Build a predictive model for CAC using the training data, with the outcome
variables removed. (Recall that the CAC score is available in the training
dataset.)
2. Obtain the fitted CAC score in the training data.
3. Replace the true CAC score with the fitted CAC score in the training data.
4. Build a predictive model for the outcome (10-year CHD) using the updated
training data from step 3.
5. Predict the CAC score in the testing dataset using the model from step 1.
6. Predict the outcome (CHD) using the model from step 4.
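The six steps above can be sketched as follows, assuming a simple classification setting with synthetic data; scikit-learn random forests stand in for the random survival forests used in the actual analysis, and all variable names (such as `cac`) are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for MESA: X holds baseline predictors, cac is the
# coronary artery calcium score, y is the 10-year CHD event indicator.
n, p = 400, 10
X = rng.normal(size=(n, p))
cac = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = (cac + X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te = X[:300], X[300:]
cac_tr = cac[:300]          # CAC is observed in training only
y_tr, y_te = y[:300], y[300:]

# Step 1: model CAC from the training predictors (outcome excluded).
cac_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, cac_tr)

# Steps 2-3: replace the true CAC with its fitted value in training.
Z_tr = np.column_stack([X_tr, cac_model.predict(X_tr)])

# Step 4: model the outcome on the updated training data.
chd_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(Z_tr, y_tr)

# Steps 5-6: predict CAC in the test set, then predict the outcome.
Z_te = np.column_stack([X_te, cac_model.predict(X_te)])
y_pred = chd_model.predict(Z_te)

print("test error rate:", np.mean(y_pred != y_te))
```

Training the outcome model on the *fitted* CAC, rather than the true CAC, is the point of the method: the outcome model then sees CAC values of the same (predicted) kind it will receive at test time.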
7.2.3 Results
Traditional Framingham with CAC score predicts 10-year CHD better than
RSF with all 696 available predictors
Using the 12 traditional Framingham predictors with the CAC score, the error rate
for 10-year CHD was found to be 22.6%, using a node size of 100. We first
investigated whether the error rate could be improved by simply using all 696
available predictors in the MESA dataset. The resulting RSF had an error rate of
0.239, showing no improvement in prediction. This is not surprising, as we showed
that the only strong predictors of 10-year CHD were the CAC score and the
Framingham predictors, the rest being very weak predictors.
When the CAC score is not available, the error rate increased to 29.0% using only
the Framingham predictors.
When the CAC score is not available
One natural scenario is that the CAC score is available in the training dataset,
but not in the testing dataset. Accordingly, one strategy is to include CAC in the
training dataset and predict CAC in the testing dataset, which is a common
situation in the clinic, as not all patients undergo the CT scan needed to obtain
a CAC score. Using this strategy, the prediction error rate was found to be 0.324,
which is worse than simply ignoring CAC in both the training and testing datasets.
This is not surprising, as the percent of variance explained when predicting CAC
was only 32 percent. We therefore used a novel strategy to utilize the known CAC
score in the training dataset, which resulted in improved prediction compared to
simply ignoring the CAC score.
To compare the four strategies for predicting 10-year CHD with the MESA data, the
following simulation was carried out. The observations in the MESA data were
assigned randomly to a training or testing set. The CAC scores in the testing set
were then removed. Each of the four strategies for predicting 10-year CHD was
carried out. The experiment was repeated 100 times for each of the four
strategies, and the average of the 100 error rates was recorded. The RSF
parameters used were 100 for mtry, 600 for ntree, 20 or 100 for nsplit, and 100
for nodesize. We used 696 instead of 711 variables because we deleted all the
variables created from the CAC score except one. The natural log of the CAC score
was calculated and used, to be consistent with the literature. The results are
shown in Table 7.3.
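The repeated random-split experiment can be sketched as follows (synthetic data, fewer repetitions, and a random forest classifier standing in for RSF; in the actual experiment each split was analyzed with all four strategies):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Toy data with one strong predictor, standing in for the MESA analysis set.
n, p = 300, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Repeat the random train/test split and average the test error rate,
# mirroring the 100-repetition design in the text (10 repeats shown here).
errors = []
for rep in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=rep)
    clf = RandomForestClassifier(n_estimators=100, random_state=rep).fit(X_tr, y_tr)
    errors.append(np.mean(clf.predict(X_te) != y_te))

print("average error rate over repeats:", np.mean(errors))
```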
7.2.4 Conclusions
The MESA dataset we used is unique in that it essentially has two very important
variables for predicting the outcome (CHD): the CAC score and the Framingham
score. In reality, the Framingham score is almost always available for a patient,
while the CAC score is often not available, since it takes a CT scan to obtain,
which is expensive and involves radiation. This results in a scenario in which a
very important variable is present in the training dataset but not in the testing
dataset. We showed that simply discarding this variable is not a good strategy. We
proposed a new strategy to utilize the information in this variable in the
training dataset even when it is not available in the testing dataset. We showed
that the prediction error rate improved with this new strategy.
Table 7.3: Prediction error rates using four strategies when CAC score is available in the training, but not test set

  Strategy             p(training)  p(testing)  Error rate    Error rate
                                                (nsplit=20)   (nsplit=100)
  Framingham and CAC   13           13          0.223         0.228
  Full model and CAC   696          696         0.234         0.247
  Strategy 1           12           12          0.283         0.290
  Strategy 2           695          695         0.276         0.293
  Strategy 3           13           13          0.298         0.324
  Strategy 4           13           13          0.263         0.263
Bibliography
Abdella, M. and Marwala, T. (2005). The use of genetic algorithms and neural networks to approximate missing data in database. International Conference on Computational Cybernetics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa, 13-16 April, pp. 207-212.
Aittokallio, T. (2009) Dealing with missing values in large-scale studies: microar-ray data imputation and beyond. Brief Bioinform; 2(2):253–264.
Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. (1984) Classification andRegression Trees, Belmont, California.
Breiman L (2001) Random forests. Machine Learning, 45: 5–32.
Breiman L. (2003) Manual – setting up, using, and understanding random forests V4.0.
Bartlett J.W., Seaman S.R., White I.R., and Carpenter J.R. (2015). Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research, 24(4):462-487.
D’Agostino, R.B. Sr, Vasan, R.S., Pencina, M.J., Wolf, P.A., Cobain, M., Massaro,J.M., Kannel, W.B. (2008) General cardiovascular risk profile for use in primarycare: the Framingham Heart Study. Circulation. 117(6):743.
Devroye L., Gyorfi L., and Lugosi G. (1996) Probabilistic Theory of Pattern Recog-nition, Springer-Verlag.
Diaz-Uriarte R., Alvarez de Andres S. (2006) Gene Selection and classification ofmicroarray data using random forest. BMC Bioinformatics.
Doove, L.L., Van Buuren, S., and Dusseldorp, E. (2014) Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72:92-104.
Enders C. K. (2010) Applied missing data analysis, Guilford Publications, NewYork.
Friedman, J.H. (2001). Greedy function approximation: A gradient boosting ma-chine. Annals of Statistics, 29:1189–1232.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010) Variable selection using random forests. Pattern Recognition Letters, 31(14):2225-2236.
Goff, D.C. Jr, Lloyd-Jones, D.M., Bennett, G., Coady, S., D’Agostino, R.B., Gib-bons, R., Greenland, P., Lackland, D.T., Levy, D., O’Donnell, C.J., Robinson,J.G., Schwartz, J.S., Shero, S.T., Smith, S.C. Jr, Sorlie, P., Stone, N.J., Wil-son, P.W., Jordan, H.S., Nevo, L., Wnek, J., Anderson, J.L., Halperin, J.L.,Albert, N.M., Bozkurt, B., Brindis, R.G., Curtis, L.H., DeMets, D., Hochman,J.S., Kovacs, R.J., Ohman, E.M., Pressler, S.J., Sellke, F.W., Shen, W.K., Smith,S.C. Jr, Tomaselli, G.F. (2014) 2013 ACC/AHA guideline on the assessment ofcardiovascular risk: a report of the American College of Cardiology/AmericanHeart Association Task Force on Practice Guidelines. Circulation. 129(25 Suppl2):S49.
Gyorfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002) A Distribution-FreeTheory of Nonparametric Regression, Springer.
Hastie, T., Tibshirani, R., Narasimhan, B., and Chu, G. (2015) impute: Imputationfor microarray data. R package version 1.34.0, http://bioconductor.org.
Hothorn, T. and Lausen, B. (2003) On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis, 43:121-137.
Ishwaran, H. (2007) Variable importance in binary regression trees and forests.Electronic Journal of Statistics, Vol. 1 519-537.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., and Lauer, M. S. (2008) Randomsurvival forests. Ann. Appl. Stat., 2:841–860.
Ishwaran, H., Kogalur, U.B., Gorodeski, E.Z., Minn, A.J. and Lauer, M.S. (2010).High-dimensional variable selection for survival data. J. Amer. Stat. Assoc, 105,205-217.
Ishwaran, H. (2015) The effect of splitting on random forests. Machine Learning99(1):75–118.
Ishwaran, H. and Kogalur, U.B. (2016). randomForestSRC: Random Forestsfor Survival, Regression and Classification (RF-SRC). R package version 2.0.5http://cran.r-project.org.
Kavousi, M., Leening, M.J., Nanchen, D., Greenland, P., Graham, I.M., Steyerberg,E.W., Ikram, M.A., Stricker, B.H., Hofman, A., Franco, O.H. (2014) Compari-son of application of the ACC/AHA guidelines, Adult Treatment Panel III guide-lines, and European Society of Cardiology guidelines for cardiovascular diseaseprevention in a European cohort. JAMA. 311(14):1416.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest.Rnews, 2/3:18–22, 2002.
Liao, S.G. et al. (2014) Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinformatics, 15:346.
Little, R.J.A. (1992) Regression with missing X's: a review. Journal of the American Statistical Association, 87(420):1227-1237.
Leisch, F. and Dimitriadou, E. (2009) mlbench: Machine Learning Benchmark Problems. R package version 1.1-6.
Loh, P.L. and Wainwright, M.J.. (2011) High-dimensional regression with noisyand missing data: provable guarantees with non-convexity. Advances in NeuralInformation Processing Systems, pp. 2726–2734.
Mendez, G. and Lohr, S. (2011) Estimating residual variance in random forest regression. Computational Statistics & Data Analysis, 55(11):2937-2950.
Meng, X.L. (1995). Multiple-imputation inferences with uncongenial sources ofinput (with discussion), Statistical Science, 10, 538-573.
Pantanowitz, A. and Marwala, T. (2008) Evaluating the impact of missing data imputation through the use of the random forest algorithm. http://arxiv.org/ftp/arxiv/papers/0812/0812.2412.pdf. School of Electrical and Information Engineering, University of the Witwatersrand, Republic of South Africa.
Rubin, D.B. (1976) Inference and missing data. Biometrika, 63(3):581-592.
Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York:John Wiley & Sons.
Rubin, D.B. (1996) Multiple Imputation after 18+ Years (with discussion). Journalof the American Statistical Association, 91:473-489.
Segal, M. and Xiao, Y. (2011) Multivariate random forests. Wiley InterdisciplinaryReviews: Data Mining and Knowledge Discovery 1(1):80–87.
Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., and Hemingway, H. (2014) Comparison of random forest and parametric imputation models for imputing missing data using MICE: a caliber study. American Journal of Epidemiology, 179(6):764-774.
Stekhoven, D.J. and Buhlmann, P. (2012) MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112-118.
Stone, C.J. (1977). Consistent nonparametric regression. Ann. Stat., 8:1348–1360.
Stone, N.J., Robinson, J.G., Lichtenstein, A.H., et al. (2014) ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. J Am Coll Cardiol, 63:2889-2934.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525.
Twala,B., Jones, M.C. and Hand, D.J. (2008) Good methods for coping with miss-ing data in decision trees. Pattern Recognition Letters, 29(7):950–956.
Twala,B., Cartwright, M.C. (2010) Ensemble missing data techniques for softwareeffort prediction. Intelligent Data Analysis. 14(3):299-331.
Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3):219-242.
Waljee, A.K. et al. (2013). Comparison of imputation methods for missing labora-tory data in medicine. BMJ Open, 3(8):e002847.
Wilson, P.W., D'Agostino, R.B., Levy, D., Belanger, A.M., Silbershatz, H., and Kannel, W.B. (1998) Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837.