
    UNIVERSITY OF  CALIFORNIA

    Los Angeles

    Missing Data Imputation for Tree-Based Models

    A dissertation submitted in partial satisfaction

    of the requirements for the degree

    Doctor of Philosophy in Statistics

    by

    Yan He

    2006


    © Copyright by Yan He

    2006


    The dissertation of Yan He is approved.

    Susan Sorenson

    Hongquan Xu

    Mark Hansen

    Richard Berk, Committee Chair

    University of California, Los Angeles

    2006


    To my parents and my husband with love and gratitude 


    TABLE OF CONTENTS

    Acknowledgments
    Abstract
    1 Introduction
    2 Classification and Regression Trees (CART) and Extensions
    2.1 Classification and Regression Trees (CART)
    2.1.1 Splitting A Tree
    2.1.2 Pruning A Tree
    2.1.3 Taking Cost into Account
    2.2 Random Forest (RF)
    2.2.1 The Algorithm
    2.2.2 The Comparative Advantage of Random Forests
    3 Standard Theory on Missing Data
    3.1 Mechanisms That Lead to Missing Data
    3.2 Treatment of Missing Data
    3.2.1 Listwise Deletion
    3.2.2 Single Imputation
    3.2.3 Multiple Imputations through Data Augmentation
    3.2.4 Assessment of Multiple Imputations


    4 Missing Data with CART/RF
    4.1 Missing Data with CART
    4.2 Missing Data with RF
    5 Nonparametric Bootstrap Methods to Impute Missing Data
    5.1 The Simple Bootstrap for Complete Data
    5.2 The Simple Bootstrap Applied to Imputed Incomplete Data
    5.3 The Imputation Algorithm for Tree-Based Models
    6 Empirical Studies
    6.1 Data Sets
    6.1.1 Data of Diabetes
    6.1.2 Data of Domestic Violence
    6.1.3 Data of Dolphin
    6.2 Missing Values in the Three Data Sets
    6.3 Comparison for CART
    6.4 Comparison for Random Forests
    7 Discussion and Future Work
    8 Acknowledgment
    Bibliography


    LIST OF FIGURES

    6.1 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “DV” data; cost ratio = 5:1.
    6.2 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “crime” data; cost ratio = 10:1.
    6.3 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “diabetes” data; cost ratio = 2:1.
    6.4 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “dolphin” data; cost ratio = 10:1.
    6.5 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “DV” data.
    6.6 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “crime” data.
    6.7 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “diabetes” data.
    6.8 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “dolphin” data.


    LIST OF TABLES

    6.1 CART confusion table for “DV” objective: N = 516 complete cases; cost ratio = 5:1.
    6.2 CART confusion table for “DV” objective using “surrogate”: N = 636; cost ratio = 5:1.
    6.3 CART confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 2000; cost ratio = 5:1.
    6.4 Surrogate Example.
    6.5 CART confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 30; cost ratio = 5:1.
    6.6 CART confusion table for “crime” objective: N = 516 complete cases; cost ratio = 10:1.
    6.7 CART confusion table for “crime” objective using “surrogate”: N = 636; cost ratio = 10:1.
    6.8 CART confusion table for “crime” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 30; cost ratio = 10:1.
    6.9 CART confusion table for “diabetes” objective using full data set: N = 768; cost ratio = 2:1.
    6.10 CART confusion table for “diabetes” objective using “surrogate”: N = 768; cost ratio = 2:1.


    6.11 CART confusion table for “diabetes” objective using nonparametric method to impute missing values (Algorithm 2): N = 768; B = 30; cost ratio = 2:1.
    6.12 CART confusion table for “dolphin” objective using full data set: N = 1000; cost ratio = 10:1.
    6.13 CART confusion table for “dolphin” objective using “surrogate”: N = 1000; cost ratio = 10:1.
    6.14 CART confusion table for “dolphin” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 1000; B = 30; cost ratio = 10:1.
    6.15 Prediction Errors & 95% Confidence Intervals for False Positives and False Negatives Using 2000 Bootstrap Samples: CART Model.
    6.16 RF confusion table for “DV” objective: N = 516 complete cases.
    6.17 RF confusion table for “DV” objective using “rfImpute”: N = 671.
    6.18 RF confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 671; B = 30.
    6.19 RF confusion table for “crime” objective: N = 516 complete cases.
    6.20 RF confusion table for “crime” objective using “rfImpute”: N = 671.
    6.21 RF confusion table for “crime” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 671; B = 30.
    6.22 RF confusion table for “diabetes” objective using full data set: N = 768.
    6.23 RF confusion table for “diabetes” objective using “rfImpute”: N = 768.


    6.24 RF confusion table for “diabetes” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 768; B = 30.
    6.25 RF confusion table for “dolphin” objective using full data set: N = 1000.
    6.26 RF confusion table for “dolphin” objective using “rfImpute”: N = 1000.
    6.27 RF confusion table for “dolphin” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 1000; B = 30.
    6.28 Prediction Errors & 95% Confidence Intervals for Misclassification Errors Using 2000 Bootstrap Samples: RF.
    7.1 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with Missing Data.
    7.2 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with No Missing Data: I (Deleting 20% of Cases from Learning Sample).
    7.3 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with No Missing Data: II (Deleting 50% of Cases from Learning Sample).


    ACKNOWLEDGMENTS

    First and foremost, I would like to express my deepest gratitude to my advisor and committee chair, Professor Berk. His guidance, support, and kindness made this work possible. He has my most sincere and hearty appreciation for giving me the freedom to pursue the problems I chose in the way I liked.

    I am also thankful to the other members of my committee, Professors Hansen, Xu

    and Sorenson, for their suggestions and comments and the time they spent in reviewing

    this dissertation.

    I wish to express my appreciation to Professor Lin, a senior western-educated pro-

    fessor, who opened this amazing field to me when I was still a college student. It is

    under his mentorship that I became interested in studying and doing statistical analysis

    to solve real world problems. During my four years in college, Professor Lin offered

    me numerous opportunities to learn cutting-edge researches, and he also invited me

    to actively participate in many national projects, which built my strong background in

    mathematical analysis. It was his reference that made my application to the statistics

    program at UCLA a lot easier.

    I would also like to thank Professor Sun, a super nice professor, whose pecuniary

    support made my joining UCLA a reality.

    A debt of gratitude is owed to Professors Wu, Ferguson and Jan de Leeuw, who

    were always so nice and so patient in answering my questions. I also appreciate the kindness of Mrs. Dean Dacumos, who made the Statistics Department a united community,

    and our life enjoyable.

    To my parents and my husband, for their continuing support and understanding.


    VITA

    1975.10.24 Born, Nantong, P. R. China.

    1994–1998 B.A. in Economics & B.A. in Economic Law, Huazhong University

    of Science and Technology, Wuhan, P. R. China. With High Honors.

    2000–2001 M.A. in Economics, Department of Economics, The Ohio State

    University. Awarded University Fellowship.

    2001–2003 M.S. in Statistics, Department of Statistics, UCLA.

    2003.06–09 Fair Isaac Corporation, Internship.

    2005–present Countrywide Home Loans, CA.


    PUBLICATIONS

    Yan He: Problems and Suggestions for Improving to the Exchange Sterilization Op-

    eration of China’s Central Bank.  The Study of Finance and Economics, April 2000,

    Vol.26 No.4.

    ShaoGong Lin, Qiming Tang, ZhiHong Fan and Yan He: Translated the book  Econo-

    metric Methods  by Jack Johnston & John DiNardo (UCI) (4th edition) into Chinese.

    Published by China Economics Publishing House (ISBN 7-5017-5063-7), 2002.

    Juana Sanchez and Yan He: Examples of the Application of Statistics and Probability

    to Computer Science. Presented at the Joint AMS-MAA  (American Mathematical So-

    ciety - Mathematical Association of America) Annual Meeting. January 7-10, 2004,

    Phoenix, AZ.

    Richard Berk, Yan He and Susan Sorenson: Developing a Practical Forecasting Screener

    for Domestic Violence Incidents.  Evaluation Review, 29(4): 358-382, August 2005.

    Juana Sanchez and Yan He: Internet Data Analysis for the Undergraduate Statistics

    Curriculum.  Journal of Statistics Education, Volume 13(3), 2005.


    ABSTRACT OF THE  DISSERTATION

    Missing Data Imputation for Tree-Based Models

    by

    Yan He

    Doctor of Philosophy in Statistics

    University of California, Los Angeles, 2006

    Professor Richard Berk, Chair

    A wide variety of data can include some form of censoring or missing information. Missing data are a problem for all statistical analyses, and tree-based models such as CART and Random Forests are certainly no exception.

    In recent years, there have been many newly developed tools that can be applied

    to missing data problems: likelihood and estimating function methodology, cross-

    validation, the bootstrap and other simulation techniques, Bayesian and multiple im-

    putations, and the EM algorithm. Although applied successfully to well-defined para-

    metric models, such methods may be inappropriate for tree-based models, which are

    usually considered non-parametric models. CART/RF have built-in algorithms to

    impute missing data, such as surrogate variables or proximity. But these imputation

    methods have no formal rationale, and are unstable, especially for RF models.

    The nonparametric bootstrap method for imputing missing values overcomes all of the drawbacks that are implicit in both single and multiple imputation. It 1) does not depend on the missing-data mechanism, 2) requires no knowledge of either the probability distributions or the model structure, and 3) successfully incorporates the estimates of uncertainty associated with the imputed data. Furthermore, 2000 replications of bootstrap samples provide stable and accurate statistical inferences (Efron, 1994).


    In my dissertation research, the nonparametric bootstrap methods were imple-

    mented to impute missing values before cases were dropped down the tree (CART/RF),

    and the classification results were compared to both complete-data/full-data analysis

    and to the classification results using surrogate variables/proximity. Significant improvement in the ability to predict was found for both CART and RF models.


    CHAPTER 1

    Introduction

    A wide variety of data can include some form of censoring or missing information. Data imputation can then be an important component of the analysis, but crude

    methods for data imputation can lead to substantial bias in the results. For example, a

    “complete-case analysis” simply ignores the missing data and risks substantial bias.

    In recent years, there have been many new computationally intensive tools devel-

    oped that can be applied to missing data problems: likelihood and estimating function

    methodology, cross-validation, the bootstrap and other simulation techniques, Bayesian and multiple imputation, and the EM algorithm. Existing methods have been suc-

    cessfully applied with well-defined parametric models, such as Gaussian regression,

    and loglinear models. But their usefulness has yet to be demonstrated for tree-based

    models, such as Classification and Regression Trees (CART) and random forests (RF),

    which are usually considered non-parametric methods. It is this oversight that I will

    attempt to remedy, in part, in the pages ahead.

    More specifically, parametric models, such as linear regression, can provide useful

    descriptions of simple structures in data. However, sometimes such simple structure

    does not extend across an entire data set and may instead be confined more locally

    within subsets of the data. Then, the structure might be better described by a model that partitions the data into subsets, employing separate submodels for each. Such an alternative can be accomplished by using a tree-based approach, known as CART (Classification and Regression Trees).


    Given a data set, a common strategy for finding a good tree is to use a greedy

    algorithm to grow a tree and then to prune it back to avoid overfitting. Such greedy algorithms typically grow a tree by sequentially choosing splitting rules for nodes on

    the basis of maximizing some fitting criterion. This generates a sequence of trees,

    each of which is an extension of previous trees. A single tree is then selected by prun-

    ing the largest tree according to a model selection criterion such as cost-complexity

    pruning (Breiman et al., 1984), cross-validation, or even multiple tests of whether two

    adjoining nodes should be collapsed into a single node.

    The overfitting problem in CART motivated people to develop bundling methods

    such as bagging and random forests. Bagging predictors is a method for generating

    multiple versions of a predictor and using these to get an aggregated result. In the case

    of CART, the aggregation averages over the trees when predicting a numerical outcome

    and does a plurality vote when predicting a class. The multiple versions are formed

    by making bootstrap replicates of the learning set and using these as new learning data

    sets. Tests on real and simulated data sets using classification and regression trees and

    subset selection in linear regression have shown that bagging can allow for substantial

    gains in accuracy (Breiman, 1996). The vital element is the instability of the prediction

    method. If perturbing the learning set can cause significant changes in the predictor

    constructed, then bagging can improve accuracy.

    Random forests (RF) is a further extension of bagging. A Random forest model is

    a combination of tree predictors such that each tree depends on the values of a random

    vector sampled independently and with the same distribution for all trees in the forest.

    The generalization error for RF converges almost surely to a limit as the number of 

    trees in the forest becomes large (Breiman, 2001). Using a random selection of features

    to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire, 1996), but are more robust with respect to noise.


    Missing data can be a problem for all statistical analyses; CART/bagging/RF are certainly no exception. Missing data can create the same kinds of difficulties they create for conventional linear regression: there is a loss of statistical power with the reduction in sample size, and a real possibility of bias if the observations are not lost at random.

    A general discussion of missing data and excellent treatment are easily found (Lit-

    tle and Rubin, 2002). If the data are really “missing completely at random” (MCAR),

    the only loss is statistical power. And if the number of cases lost is not large, the reduc-

    tion in power is likely to be insignificant. It is, therefore, mandatory that the researcher

    make a convincing argument that the data are missing completely at random. The re-

    sults are then dependent upon the missing completely at random assumption, and may

    be of little statistical interest unless the credibility of that assumption is determined.

    A less strict assumption is that the data are “missing at random” (MAR). One

    can subset the data based on the values of observed variables so that for each such

    subset, the data are missing completely at random. If this assumption is correct, the analysis can be conducted separately for each of the subsets and then reassembled. But again, the assumed mechanism by which the data are missing must be argued convincingly.

    If either of these assumptions can be justified, it will be useful to impute the values

    of the missing data. Imputing missing values for the response variable is usually not

    sensible because the relationship between the response and the predictors can be sys-

    tematically altered. But sometimes it can be very helpful to impute missing data for

    predictors.

    The key problem with any imputation procedure is that when the data are ultimately

    analyzed, including the real data and the imputed data, the statistical procedures ap-

    plied cannot tell which is which and necessarily treat all of the observations alike. The


    imputed values are estimates, and estimates usually come with random error. In addi-

    tion, “the imputed values, which are just fitted values, will have less variability than the

    original value itself” (Berk, 2005). In short, the imputed values will typically be less

    variable than the real thing. The reduced variability can seriously undermine statistical

    inference.

    It is well known that CART/RF have built-in algorithms to impute missing data,

    such as using surrogate variables or proximities. But these imputation methods have

    no formal rationale. Furthermore, since CART/RF are nonparametric rather than parametric models, advanced multiple imputation (MI) methods may not apply at all. In

    short, tools for imputing missing data are likely to be inadequate.

    This thesis will address nonparametric approaches to assessing the accuracy of 

    an estimator in a missing data situation. Three main topics are discussed: bootstrap

    methods for missing data, their relationship to the theory of multiple imputation, and

    comparison to the surrogate variables/proximity method. Two main advantages (Efron,

    1994) of nonparametric bootstrap imputation are: 1) it requires no knowledge of the

    missing-data mechanism other than that it is missing at random or conditionally at

    random; 2) the confidence interval turns out to give convenient and accurate answers.

    The thesis is structured as follows: Chapter 1 introduces basic concepts about tree-

    based models and missing data problem, and motivates this thesis. Chapter 2 intro-

    duces Classification and Regression Trees (CART), as well as random forests (RF).

    Standard theories of missing data and imputation methods are elaborated in Chapter 3,

    which also illustrates the limitation of applying multiple imputation (MI) to tree-based

    models. Chapter 4 explains how CART and RF deal with missing data, and their po-

    tential limitations. Chapter 5 formally introduces nonparametric bootstrap methods to

    impute missing data, and proposes corresponding algorithms in detail. Chapter 6 is

    an empirical study, which applies several imputation methods to various data sets. Here,


    the classification errors from 2000 bootstrapped imputations are compared to the surrogate method for CART models, and the 2000 bootstrapped imputations are compared to the proximity method for RF models. Significant improvement can be found by using

    nonparametric bootstrap methods. Chapter 7 discusses the effectiveness of the non-

    parametric bootstrap methods in their ability to classify, as well as possible limitations.

    Further improvement to the algorithm is also suggested.


    CHAPTER 2

    Classification and Regression Trees (CART) and

    Extensions

    2.1 Classification and Regression Trees (CART)

    We begin with a discussion of the general structure of a CART model. A CART

    model describes the conditional distribution of y given X, where y is the response variable and X is a set of predictors (X = (X_1, X_2, ..., X_p)). This model has two main components: a tree T with b terminal nodes, and a parameter Θ = (θ_1, θ_2, ..., θ_b) ⊂ R^k which associates the parameter value θ_m with the m-th terminal node. Thus a treed model is fully specified by the pair (T, Θ). If X lies in the region corresponding to the m-th terminal node, then y|X has the distribution f(y|θ_m), where we use f to represent a conditional distribution indexed by θ_m. The model is called a regression tree or a classification tree according to whether the response y is quantitative or qualitative, respectively.

    2.1.1 Splitting A Tree

    The binary tree T subdivides the predictor space as follows. Each internal node has an associated splitting rule which uses a predictor to assign observations to either its left or right child node. Each internal node is thus partitioned into two child nodes using its splitting rule. For quantitative predictors, the splitting rule is based on


    a split point s, and assigns observations for which {x_i ≤ s} or {x_i > s} to the left or right child node, respectively. For qualitative predictors, the splitting rule is based on a category subset C, and assigns observations for which {x_i ∈ C} or {x_i ∉ C} to the left or right child node, respectively.

    For a regression tree, the conventional algorithm models the response in each region R_m as a constant c_m. Thus the overall tree model can be expressed as (Hastie, Tibshirani and Friedman, 2001):

        f(x) = \sum_{m=1}^{b} c_m I(X \in R_m)    (2.1)

    where R_m, m = 1, 2, ..., b, constitute a partition of the predictor space and therefore represent the b terminal nodes. If we adopt minimizing the sum of squares \sum_i (y_i - f(X_i))^2 as our criterion to characterize the best split, it is easy to see that the best ĉ_m is just the average of y_i in region R_m:

        \hat{c}_m = \mathrm{ave}(y_i \mid X_i \in R_m) = \frac{1}{N_m} \sum_{X_i \in R_m} y_i    (2.2)

    where N_m is the number of observations falling in node m. The residual sum of squares is then

        Q_m(T) = \frac{1}{N_m} \sum_{X_i \in R_m} (y_i - \hat{c}_m)^2    (2.3)

    which will serve as an impurity measure for regression trees.
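    As a concrete illustration of (2.1)-(2.3), the short Python sketch below (scikit-learn and NumPy on synthetic data; this is only an assumed, illustrative setup and not the software used for the dissertation's analyses) fits a small regression tree and checks that each terminal node's fitted value equals the node average ĉ_m and that its within-node mean squared residual equals Q_m(T).

        # Sketch: terminal-node means of a regression tree equal the per-node
        # averages of y (equation 2.2); the within-node mean squared residual
        # is the node impurity Q_m(T) of equation 2.3.
        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(size=(200, 2))                 # two quantitative predictors
        y = 2.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.3, size=200)

        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
        leaf = tree.apply(X)                           # terminal node id of each case

        for m in np.unique(leaf):
            in_node = leaf == m
            c_hat = y[in_node].mean()                  # node average, \hat{c}_m
            q_m = np.mean((y[in_node] - c_hat) ** 2)   # node impurity, Q_m(T)
            pred = tree.predict(X[in_node])[0]         # fitted value in this node
            print(f"node {m}: N_m={in_node.sum():3d}  c_hat={c_hat:.3f}  "
                  f"pred={pred:.3f}  Q_m={q_m:.3f}")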

    If the response is a factor taking outcomes 1, 2, ..., K, the impurity measure Q_m(T) defined in (2.3) is not suitable. Instead, we represent a region R_m with N_m observations by

        \hat{p}_{mk} = \frac{1}{N_m} \sum_{X_i \in R_m} I(y_i = k)    (2.4)

    which is the proportion of class k (k ∈ {1, 2, ..., K}) observations in node m. We classify the observations in node m to the class k(m) = \arg\max_k \hat{p}_{mk}, the majority class in node m.


    Different measures Q_m(T) of node impurity include the following (Hastie, Tibshirani and Friedman, 2001):

        \text{Misclassification error:} \quad \frac{1}{N_m} \sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{mk(m)}

        \text{Gini index:} \quad \sum_{k \ne k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})

        \text{Cross-entropy or deviance:} \quad -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}    (2.5)

    For binary outcomes, if p is the proportion of the second class, these three measures are 1 − max(p, 1 − p), 2p(1 − p), and −p log p − (1 − p) log(1 − p), respectively. All three definitions of impurity are concave, having minima at p = 0 and p = 1 and a maximum at p = 0.5. Entropy and the Gini index are the most common, and generally “give very similar results except when there are two response categories” (Berk, 2005).
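    The binary versions of the three measures in (2.5) are easy to compare numerically. The following short Python sketch (purely illustrative, not part of the original text) evaluates them over a few values of the class proportion p and shows that all three are zero for pure nodes and largest at p = 0.5.

        # Sketch: the three binary node-impurity measures of equation 2.5.
        import numpy as np

        def misclassification(p):
            return 1.0 - np.maximum(p, 1.0 - p)

        def gini(p):
            return 2.0 * p * (1.0 - p)

        def deviance(p):
            # use the convention 0 * log(0) = 0 so pure nodes have zero impurity
            with np.errstate(divide="ignore", invalid="ignore"):
                out = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
            return np.nan_to_num(out)

        for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
            print(f"p={p:4.2f}  error={misclassification(p):.3f}  "
                  f"gini={gini(p):.3f}  deviance={deviance(p):.3f}")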

    2.1.2 Pruning A Tree

    To be consistent with conventional notation, let us define the impurity of a node τ as I(τ) ((2.3) for a regression tree, and any one of the measures in (2.5) for a classification tree). We then choose the split with maximal impurity reduction

        \Delta I = I(\tau) - p(\tau_L)\, I(\tau_L) - p(\tau_R)\, I(\tau_R)    (2.6)

    where τ_L and τ_R are the left and right child nodes of τ.

    How large should we grow the tree then? Clearly a very large tree might overfit the

    data, while a small tree may not be able to capture the important structure. Tree size is

    a tuning parameter governing the model’s complexity, and the optimal tree size should


    be adaptively chosen from the data. One approach would be to split a node only if the decrease in impurity due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

    The preferred strategy is to grow a large tree  T 0, stopping the splitting process

    when some minimum number of observations in a terminal node (say 10) is reached.

    Then this large tree is pruned using cost-complexity pruning.

    We define a subtree T ⊂ T_0 to be any tree that can be obtained by pruning T_0, that is, by collapsing any number of its internal nodes, and we define \tilde{T} to be the set of terminal nodes of T. As before, we index terminal nodes by m, with node m representing region R_m. Let |T| denote the number of terminal nodes in T (|T| = b). We use |T| instead of b in this section, following the “conventional” notation, and define the risk of trees as

        \text{Regression tree:} \quad R(T) = \sum_{m=1}^{|\tilde{T}|} N_m Q_m(T)

        \text{Classification tree:} \quad R(T) = \sum_{\tau \in \tilde{T}} P(\tau)\, r(\tau)    (2.7)

    where r(τ) measures the impurity of node τ in a classification tree (it can be any one of the measures in (2.5)).

    We define the cost complexity criterion (Breiman et al., 1984)

        R_\alpha(T) = R(T) + \alpha |T|    (2.8)

    where α (> 0) is the complexity parameter. The idea is, for each α, to find the subtree T_α ⊆ T_0 that minimizes R_α(T). The tuning parameter α ≥ 0 “governs the tradeoff between tree size and its goodness of fit to the data” (Hastie, Tibshirani and Friedman, 2001). Large values of α result in a smaller tree T_α, and conversely for smaller values of α. As the notation suggests, with α = 0 the solution is the full tree T_0.


    To find T_α we use weakest-link pruning: we successively collapse the internal node that produces the smallest per-node increase in R(T), and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show that this sequence must contain T_α. See Breiman et al. (1984) and Ripley (1996) for details. Estimation of α (i.e., α̂) is achieved by five- or ten-fold cross-validation. Our final tree is then denoted T_α̂.
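    A minimal Python sketch of this grow-then-prune procedure is given below, using scikit-learn's cost-complexity pruning on a stand-in data set; it is only an illustration of the general idea described above, not the implementation behind the dissertation's results. The full tree T_0 is grown, the weakest-link sequence of candidate values of α is extracted, and α̂ is chosen by 5-fold cross-validation.

        # Sketch: grow T_0, obtain the weakest-link alpha sequence, and pick
        # alpha-hat by cross-validation (cost-complexity pruning).
        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)     # stand-in data set

        full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)   # T_0
        path = full_tree.cost_complexity_pruning_path(X, y)
        alphas = path.ccp_alphas[:-1]                  # drop the root-only tree

        cv_scores = [
            cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                            X, y, cv=5).mean()
            for a in alphas
        ]
        alpha_hat = alphas[int(np.argmax(cv_scores))]
        final_tree = DecisionTreeClassifier(random_state=0,
                                            ccp_alpha=alpha_hat).fit(X, y)
        print("alpha-hat:", alpha_hat, " terminal nodes:", final_tree.get_n_leaves())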

    It follows that, in CART and related algorithms, classification and regression trees

    are produced from data in two stages. In the first stage, a large initial tree is produced

    by splitting one node at a time in an iterative, greedy fashion. In the second stage,

    a small subtree of the initial tree is selected, using the same data set. Whereas the

    splitting procedure proceeds in a top-down fashion, the second stage, known as prun-

    ing, proceeds from the bottom-up by successively removing nodes from the initial tree.

    Theorem 2.1 (Breiman et al., 1984, Section 3.3). For any value of the complexity parameter α, there is a unique smallest subtree of T_0 that minimizes the cost-complexity.

    Theorem 2.2 (Zhang and Singer, 1999, Section 4.2). If α_2 > α_1, the optimal subtree corresponding to α_2 is a subtree of the optimal subtree corresponding to α_1.

    More generally, suppose we end up with m thresholds,

        0 < \alpha_1 < \alpha_2 < \cdots < \alpha_m

    and let α_0 = 0. Also, let the corresponding optimal subtrees be {T_{α_0}, T_{α_1}, T_{α_2}, ..., T_{α_m}}; then

        T_{\alpha_0} \supset T_{\alpha_1} \supset T_{\alpha_2} \supset \cdots \supset T_{\alpha_m}    (2.9)

    where T_{α_0} ⊃ T_{α_1} means that T_{α_1} is a subtree of T_{α_0}. These are the so-called nested optimal subtrees.



    2.1.3 Taking Cost into Account

    We talk about classification trees in this section. In many applications, tree-based methods are used for the purpose of prediction. That is, given the characteristics of a subject, we must predict the outcome of this subject before we know the outcome. For example, physicians in emergency rooms must predict whether a patient with chest pain suffers from a serious disease based on the information available within a few hours of admission. For this purpose, we must first classify a node τ to either class 0 (normal) or class 1 (abnormal); then we predict the outcome of an individual based on the membership of the node to which the individual belongs. Unfortunately, we always

    make mistakes in such a classification procedure, because some of the normal subjects

    will be predicted as diseased and vice versa. These two mistakes are called false-

    positive (predicting a normal condition as abnormal) and false-negative (predicting an

    ill-conditioned outcome as normal), respectively. In any case, to weigh these mistakes,

    we need to assign misclassification costs.

    Let c(i, j) denote the misclassification cost incurred when a class j subject is classified as a class i subject. When i = j, we have a correct classification and the cost should

    naturally be zero, i.e., c(i, i) = 0. If the outcome is binary, i and  j  take the values 0 or

    1. Without loss of generality, we can set c(1, 0) = 1. In other words, one false positive

    error counts as one. The clinicians and the statisticians need to work together to gauge

    the relative cost of  c(0, 1). This is a subjective and difficult, but important, decision.

    In the Domestic Violence (DV) analysis, 671 households reported DV incidents during the study period, and about 21% of those households reported a new call within the 3-month follow-up period. In this instance, the two errors are: 1) false negative, failing to predict a new DV incident for a household where one actually


    occurred; and 2) false positive, predicting a new DV incident for a household where none occurred. Thus, a predictor that produced few false positives but many false negatives might

    be discarded if the undesirable consequences from the false negatives were larger than

    the undesirable consequences from the false positives. Therefore, we needed informa-

    tion from the Los Angeles Sheriff’s Department on the relative consequences of false

    positives and false negatives.

    Information from the Los Angeles Sheriff’s Department led to a general conclusion

    that false negatives were substantially more problematic than false positives. In other

    words, they considered not responding to a call when there actually was a need for

    law enforcement assistance more “costly” than responding to a call that turned out to

    be a false alarm. But the precise figures for these “costs” could not be determined.

    Fortunately, all we needed for statistical analysis was the ratio of false negative costs

    to false positive costs. We then proceeded with a reasonable ratio of the costs of false

    negatives to the costs of false positives of 5 to 1. Consistent with the information

    provided by the Sheriff’s Department, the failure to forecast a new call for service was

    5 times more costly than incorrectly forecasting a new call for service.

    We can now better understand the role of costs using the 21% return-call figure in the DV data. If for every household (671 households) we predicted another call

    within three months, we would be correct about 21% of the time. And, we would

    also be wrong about 79% of the time. Conversely, if for every household, we always

    predicted no calls within three months, we would be correct about 79% of the time.

    And we would also be wrong about 21% of the time. Which is a better strategy: always

    predicting a future call or not? The answer depends on the costs of false negatives

    compared to the costs of false positives.

    If both were equally costly, the best strategy would clearly be to never predict

    a subsequent call. But since the failure to predict future calls was very costly (false


    negatives were 5 times more costly than false positives), the best strategy would clearly

    be to predict a subsequent call. In short, the relative costs of false negatives compared

    to the relative costs of false positives can affect how forecasting is done (Berk, 2005).

    And, it also affects which predictors are likely to be important. Hence, in subsequent

    analysis, we take costs into account.
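    One generic way to fold such a cost ratio into a classification tree is through unequal class weights at fitting time. The Python sketch below is only an illustration under assumed synthetic data (it is not the cost mechanism used for the dissertation's CART analyses); it weights a false negative five times as heavily as a false positive, which typically shifts predictions toward the rarer “new call” class.

        # Sketch: a 5:1 false-negative to false-positive cost ratio expressed
        # as class weights changes the predictions of a classification tree.
        import numpy as np
        from sklearn.metrics import confusion_matrix
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)
        n = 671                                        # size borrowed from the DV example
        x = rng.normal(size=(n, 3))
        # synthetic binary outcome with roughly 21% in class 1 ("new call")
        y = (x[:, 0] + rng.normal(scale=1.5, size=n) > 1.45).astype(int)

        for weights in (None, {0: 1.0, 1: 5.0}):       # equal costs vs. 5:1 ratio
            tree = DecisionTreeClassifier(max_depth=3, class_weight=weights,
                                          random_state=0).fit(x, y)
            cm = confusion_matrix(y, tree.predict(x))
            print(f"class_weight={weights}:")
            print(cm)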

    2.2 Random Forest (RF)

    Significant improvement in classification accuracy can be obtained by growing an

    ensemble of trees and letting them vote for the most popular class (namely, majority

    vote). An early example is bagging (Breiman, 1996), where, to grow each tree, a random sample is selected from the training set. Bagging stands for “Bootstrap Aggregation” and

    may be best understood as nothing more than an algorithm.

    The bagging algorithm for a data set having n observations and a binary response variable can be summarized in the following steps (a code sketch of these steps appears after the list):

    1. Take a random sample of size n  with replacement from the data.

    2. Construct a classification tree as usual but do not prune.

    3. Assign a class to each terminal node as in CART. Drop the out-of-bag data down

    the tree, and store the class attached to each case.

    4. Repeat steps 1-3 a large number of times (say, 1000).

    5. For each observation in the data, count the number of times over trees that it is

    classified in one category and the number of times over trees it is classified in

    the other category.

    6. Assign each observation to a final category by a majority vote over the set of 


    trees. Thus, if  51%  of the time over a large number of trees a given observation

    is classified as a “1”, that becomes its final classification.

    7. Construct the confusion table from these class assignments.
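    The following Python sketch is one way to write the seven steps above (illustrative only, with assumed synthetic data; it is not the implementation used for the dissertation's empirical work).

        # Sketch of the bagging steps: bootstrap samples, unpruned trees,
        # out-of-bag class votes, and a final confusion table.
        import numpy as np
        from sklearn.metrics import confusion_matrix
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        n = 300
        X = rng.normal(size=(n, 5))
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

        n_trees = 1000
        votes = np.zeros((n, 2))                       # out-of-bag votes per case

        for _ in range(n_trees):
            boot = rng.integers(0, n, size=n)          # step 1: sample n with replacement
            oob = np.setdiff1d(np.arange(n), boot)     # cases left out of the sample
            tree = DecisionTreeClassifier().fit(X[boot], y[boot])   # steps 2-3: no pruning
            pred = tree.predict(X[oob])                # drop out-of-bag cases down the tree
            votes[oob, pred] += 1                      # step 5: accumulate the votes

        final = (votes[:, 1] > votes[:, 0]).astype(int)   # step 6: majority vote (ties -> 0)
        print(confusion_matrix(y, final))                 # step 7: confusion table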

    2.2.1 The Algorithm

    Random forests extend the ideas of bagging by allowing random selection of both observations and predictors at each splitting step. Here, a large number

    (say, 1000) of classification trees are constructed, each based on a bootstrap sample of 

    the data. In addition, at each split a random subset of predictors is selected. For each

    tree constructed, data not used to grow the tree are dropped down to evaluate how well

    the tree performs. Finally, overall results are produced by majority vote over the trees.

    For example, if there are fifty predictors, choose seven candidates at random (using roughly the square root of the number of predictors is the usual recommendation) for defining the split. Then choose the best split, as usual, by selecting only from the seven randomly chosen predictors. Repeat this process for each node. Therefore, the random forests algorithm is very much like the bagging algorithm. Again let n be the number of observations, and assume for now that the response variable is binary; a code sketch follows the list of steps below.

    1. Take a random sample of size n  with replacement from the data.

    2. Take a random sample of the predictors without replacement.

    3. Construct the first CART partition of the data using the selected predictors.

    4. Repeat step 2 for each subsequent split until the tree is as large as desired, and do not prune.

    5. Drop the out-of-bag data down the tree, and store the class assigned to each

    observation.


    6. Repeat steps 1-5 a large number of times (e.g., 1000).

    7. Using the observations not used to build the tree for evaluation, count the number

    of times over trees that a given observation is classified in one category and the

    number of times over trees it is classified in the other category.

    8. Assign each case to a category by a majority vote over the set of trees. Thus, if 

    51% of the time over a large number of trees a given case is classified as a “1”,

    that becomes its estimated classification.

    9. Construct the confusion table for these assigned classes.
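    In practice, these steps are usually carried out by an existing random forest implementation. The Python sketch below (scikit-learn on synthetic data, offered only as an illustration of the algorithm just described, not the randomForest setup used later in the dissertation) grows 1000 trees, tries a random sqrt(p) subset of predictors at each split, and builds the confusion table from the out-of-bag votes.

        # Sketch: random forest with 1000 bootstrap trees, sqrt(p) predictors
        # tried at each split, and out-of-bag based class assignments.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import confusion_matrix

        rng = np.random.default_rng(0)
        n = 500
        X = rng.normal(size=(n, 10))
        y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

        rf = RandomForestClassifier(n_estimators=1000,      # number of trees
                                    max_features="sqrt",    # random subset of predictors
                                    oob_score=True,         # evaluate on out-of-bag cases
                                    random_state=0).fit(X, y)

        oob_class = np.argmax(rf.oob_decision_function_, axis=1)  # majority vote, OOB only
        print("OOB accuracy:", rf.oob_score_)
        print(confusion_matrix(y, oob_class))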

    2.2.2 The Comparative Advantage of Random Forests

    It is apparent that random forests are more than bagging. By working with a random

    sample of predictors at each possible split, “the fitted values across trees are more

    independent” (Berk, 2005). As a result, the gains from averaging over a large number

    of trees can be larger. A related benefit is that it is possible to work with a very

    large number of predictors, and even more predictors than observations. It is well

    known that in conventional regression modeling, all of the data mining procedures considered so far have required that the number of predictors be less than the number of observations (usually much less). An obvious gain is that more information can be utilized in the fitting process, and more predictors can contribute.

    The use of multiple trees (often as many as 1000) makes the random forests fitting

    function much more complicated than the CART fitting function. However, the data

    not included in each bootstrap sample are used to evaluate the model performance,

    and “the averaging over trees directly compensates for the overfitting problem to which CART is vulnerable” (Berk, 2005). Therefore, the random forest results can be treated

    as true forecasts.


    Some other features of RF are:

    (i) It is an excellent classifier, comparable in accuracy to many other classifiers.

    (ii) It generates an internal unbiased estimate of the generalization error as the forest building progresses.

    (iii) It has an effective method for estimating missing data.

    (iv) It has a method for balancing error in data sets with unbalanced class populations.

    (v) Generated forests can be saved for future use on other data.

    (vi) It gives estimates of which variables are important in the classification.

    (vii) Output is generated that gives information about the relation between the variables and the classification.

    (viii) It computes proximities between pairs of cases that can be used in clustering, in locating outliers, or, by scaling, to give interesting views of the data.

    (ix) The capabilities of (viii) above can be extended to unlabeled data, leading to unsupervised clustering, data views, and outlier detection. The missing-value replacement algorithm can also be extended to unlabeled data.


    CHAPTER 3

    Standard Theory on Missing Data

    3.1 Mechanisms That Lead to Missing Data

    Missing data are a problem for all statistical analyses. Missing-data mechanisms

    are crucial since the properties of missing-data methods depend very strongly on the

    nature of these mechanisms. The crucial role of the mechanism in the analysis of data

    with missing values was largely ignored until the concept was formalized in the theory

    of Rubin (1976), through the simple device of treating the missing-data indicators as

    random variables and assigning them a distribution.

    Define the full data Y = (y_ij) and the missing-data indicator matrix M = (M_ij), with M_ij indicating whether the corresponding Y_ij is missing or not. The missing-data mechanism is characterized by the conditional distribution of M given Y, say f(M|Y, φ), where φ denotes unknown parameters. If missingness does not depend on the values of the data Y, missing or observed, that is, if

        f(M \mid Y, \phi) = f(M \mid \phi) \quad \text{for all } Y, \phi    (3.1)

    the data are called “missing completely at random” (MCAR).

    Let Y_obs and Y_mis denote the observed and missing components of Y, respectively. An assumption less restrictive than MCAR is that missingness depends only on the observed components of Y (Y_obs), and not on the components that are missing (Y_mis).


    That is,

        f(M \mid Y, \phi) = f(M \mid Y_{obs}, \phi) \quad \text{for all } Y_{mis}, \phi    (3.2)

    This missing-data mechanism is then called “missing at random” (MAR). The third mechanism is called “not missing at random” (NMAR) if the distribution of M depends on the missing values in the data matrix Y:

        f(M \mid Y, \phi) = f(M \mid Y_{obs}, Y_{mis}, \phi) \quad \text{for all } \phi    (3.3)

    Some literature also calls it nonignorable missing data.

    The simplest data structure is a univariate random sample for which some units are missing. Let Y = (y_1, ..., y_n)^T, where y_i denotes the value of a random variable for unit i, and let M = (M_1, ..., M_n), where M_i = 0 if unit i is observed and M_i = 1 if unit i is missing. Suppose the joint distribution of (y_i, M_i) is independent across units, so in particular the probability that a unit is observed does not depend on the values of Y or M for other units. Then (Little and Rubin, 2002),

        f(Y, M \mid \theta, \phi) = f(Y \mid \theta)\, f(M \mid Y, \phi) = \prod_{i=1}^{n} f(y_i \mid \theta) \prod_{i=1}^{n} f(M_i \mid y_i, \phi)    (3.4)

    where f(y_i|θ) denotes the density of y_i indexed by unknown parameters θ, and f(M_i|y_i, φ) is usually the density of a Bernoulli distribution for the binary indicator M_i with probability Pr(M_i = 1|y_i, φ) that y_i is missing.

    If missingness is independent of Y, that is, if Pr(M_i = 1|y_i, φ) = φ, a constant that does not depend on y_i, then the missing-data mechanism is MCAR (or, in this case, equivalently MAR). If the mechanism depends on y_i, the mechanism is NMAR, since it depends on values of y_i that are missing, assuming there are some. NMAR is the most general situation, and valid statistical inferences generally require specifying the correct model for the missing-data mechanism, a distributional assumption for the missing y_i, or both. The resulting estimators and tests are typically very sensitive to these assumptions.


    Let r denote the number of responding units (i.e., units with M_i = 0). An obvious consequence of the missing values in this example is that the sample size is reduced from n to r (Little and Rubin, 2002). We might want to carry out the same analyses on the reduced sample as we intended for the size-n sample. For example, if we assume the values are normally distributed and wish to make inferences about the mean, we might estimate the mean by the sample mean of the r responding units, with standard error s/\sqrt{r}, where s is the sample standard deviation of the responding units. This strategy is valid if the mechanism is MCAR or MAR, since then the observed cases are “a random subsample of all the cases” (Little and Rubin, 2002). However, if the data are NMAR, the analysis based on the responding subsample is generally biased for the parameter of the distribution of Y.
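    The bias just described is easy to see in a small simulation. The Python sketch below (synthetic data, purely illustrative and not part of the original text) deletes values either completely at random or with a probability that increases with y itself (NMAR) and compares the complete-case mean of the responding units with the full-sample mean.

        # Sketch: the complete-case mean is roughly unbiased under MCAR but
        # biased under NMAR, where larger y values are more likely to be missing.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000
        y = rng.normal(loc=10.0, scale=2.0, size=n)    # full data, true mean = 10

        # MCAR: every unit missing with the same probability phi = 0.3
        m_mcar = rng.uniform(size=n) < 0.3

        # NMAR: probability of missingness increases with y itself
        p_nmar = 1.0 / (1.0 + np.exp(-(y - 10.0)))     # logistic in y
        m_nmar = rng.uniform(size=n) < p_nmar

        print("full-sample mean    :", y.mean())
        print("complete-case, MCAR :", y[~m_mcar].mean())   # close to 10
        print("complete-case, NMAR :", y[~m_nmar].mean())   # pulled below 10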

    3.2 Treatment of Missing Data

    There are three approaches in dealing with missing data:

    1. Impute the missing data: that is, filling in the missing values.

    2. Model the probability of missingness: this is a good option if imputation is in-

    feasible; in certain cases it can account for much of the bias that would otherwise

    occur.

    3. Ignore the missing data: a poor choice, but by far the most common one.

    This section gives a brief description of alternative approaches to handling the

    problem of missing data.


    3.2.1 Listwise Deletion

    By far the most common approach is to simply omit those cases with missing data

    and to run analyses on what remains.

    If data are missing for the response variable, the only reasonable strategy is “list-

    wise deletion”. That is, observations with missing response are dropped totally from

    the analysis. If the data are missing completely at random, the only loss is statistical

    power. If not, however, bias of unknown size and direction can be introduced.

    When the data are missing for one or more predictors, we have more options. List-

    wise deletion remains a possible choice, especially if there is not a lot of missing data

    (e.g., less than 5% of the total number of observations). Listwise deletion is also easy

    to implement and understand. However, this method ignores the possible systematic

    difference between the complete cases and incomplete cases, and the resulting infer-

    ence may not be applicable to the population of all cases, especially with a smaller

    number of complete cases.
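    In code, listwise deletion is usually a one-line operation. The short pandas sketch below uses a made-up three-column data frame (the column names are hypothetical) to keep only the complete cases and report how many observations are lost.

        # Sketch: listwise (complete-case) deletion with pandas.
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({
            "age":     [34,   51,     np.nan, 45,     29],
            "income":  [40.0, np.nan, 55.0,   62.0,   np.nan],
            "outcome": [1,    0,      1,      np.nan, 0],
        })

        complete = df.dropna()          # drop every row with at least one missing value
        print(f"kept {len(complete)} of {len(df)} cases")
        print(complete)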

    3.2.2 Single Imputation

    Single imputation refers to filling in a missing value with a single replacement

    value. Imputations are means or draws from a predictive distribution of the missing

    values, and require a method of creating a predictive distribution for the imputation

    based on the observed data. There are two generic approaches to generating this dis-

    tribution (For details see Little and Rubin, 2002, Pages 59-60):

    Explicit Modeling: the predictive distribution is based on a formal statistical model

    (e.g. multivariate normal), and hence the assumptions are explicit.

     Implicit Modeling: the focus is on an algorithm, which implies an underlying

    model; assumptions are implicit, but they still need to be carefully assessed to ensure


    that they are reasonable.

    Explicit modeling methods include:

    (a)  Mean imputation, where means from the responding units in the sample are used

    to fill in missing values. Sometimes, the means may be formed by weighting

    within cells or classes.

    (b)   Regression imputation   replaces missing values by predicted values from a re-

    gression of the missing item on items observed for the unit. Mean imputation

    can actually be regarded as a special case of regression imputation. The proper

    regression model depends on the type of the to-be-imputed variable. A probit

    or logit is used for binary variables, Poisson or other count models for integer-

    valued variables, and OLS or related models for continuous variables. For exam-

    ple, suppose for subject properties, there are some missing data for gross living

    areas (GLA). But gross living areas are strongly related to number of bedrooms,

    number of bathrooms, number of total rooms, and lot size. For the observations

    with no missing data, GLA is regressed on number of bedrooms, number of 

    bathrooms, number of total rooms, and lot size. Then, for the observations that

    have missing GLA data, the values for the four predictors are inserted into the

    estimated regression equation. Predicted values are computed, which are used

    to fill in the holes in the GLA data.

    (c)  Stochastic regression imputation goes one step further, replacing missing values

    by a value predicted by regression imputation plus a residual, which is drawn

    to reflect the uncertainty in the predicted value. For example, the residual for

    Gaussian regression is naturally normal with mean zero and variance equal to

    the residual variance in the regression. With a binary outcome, as in logistic

    regression, the predicted value is a probability of 1 versus 0, and the imputed


    value is then a 1 or 0 drawn with that probability (a sketch of regression and stochastic regression imputation appears after this list).
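    A minimal Python sketch of (b) and (c), built around the GLA example in the text (the column names and the synthetic data are hypothetical, and this is not the procedure used in the dissertation): GLA is regressed on the four observed predictors among the complete cases, the fitted values fill the holes (regression imputation), and adding a normal residual draw gives stochastic regression imputation.

        # Sketch: regression imputation and stochastic regression imputation for
        # a continuous variable (gross living area, "gla") with four predictors.
        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        n = 400
        df = pd.DataFrame({
            "bedrooms":  rng.integers(1, 6, size=n),
            "bathrooms": rng.integers(1, 4, size=n),
            "rooms":     rng.integers(3, 12, size=n),
            "lot_size":  rng.uniform(2000, 12000, size=n),
        })
        df["gla"] = (300 * df["bedrooms"] + 150 * df["bathrooms"]
                     + 80 * df["rooms"] + 0.02 * df["lot_size"]
                     + rng.normal(scale=120, size=n))
        df.loc[rng.uniform(size=n) < 0.2, "gla"] = np.nan      # 20% missing GLA

        predictors = ["bedrooms", "bathrooms", "rooms", "lot_size"]
        obs = df["gla"].notna()
        reg = LinearRegression().fit(df.loc[obs, predictors], df.loc[obs, "gla"])

        fitted = reg.predict(df.loc[~obs, predictors])          # predicted GLA
        sigma = np.std(df.loc[obs, "gla"] - reg.predict(df.loc[obs, predictors]))

        df["gla_regression"] = df["gla"].copy()
        df.loc[~obs, "gla_regression"] = fitted                 # (b) regression imputation
        df["gla_stochastic"] = df["gla"].copy()
        df.loc[~obs, "gla_stochastic"] = fitted + rng.normal(scale=sigma,
                                                             size=len(fitted))
        print(df[["gla", "gla_regression", "gla_stochastic"]].describe())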

    Implicit modeling methods include:

    (d)  Hot deck imputation involves substituting individual values imputed from “sim-

    ilar” responding units. Hot deck imputation is common in survey practice and

    can “involve very elaborate schemes for selecting units that are similar for im-

    putation” (Little and Rubin, 2002). To perform hot deck imputation, all obser-

    vations are divided into groups with similar characteristics, for example, “prop-

    erties priced 400K-800K”. To impute a missing value, the researcher randomly draws a value for that variable from the pool of properties having similar characteristics (a hot deck sketch appears after this list). Creating a large number of subgroups yields some improvement in

    accuracy, but it can also lead to very small sample sizes within some subgroups.

    The primary difficulty of this method is the selection of proper subgroups.

    (e)   Substitution replaces nonresponding units with alternative units not selected into

    the sample. For example, in order to estimate a property value using sales com-

    parison method, we need to find similar sales within 0.5 mile of the subject prop-

    erty. If a similar sale cannot be found, then a similar sale beyond 0.5 mile may

    be substituted. The tendency to treat the resulting sample as complete should

    be taken with caution, since the substituted property may differ systematically

    from properties within 0.5 mile. Hence at the analysis stage, substituted proper-

    ties should be regarded as imputed values of a particular type.

    (f)  Cold deck imputation   replaces a missing value of an item by a constant value

    from an external source, such as a value from a previous realization. In the property valuation example, we sometimes use the historical sales price adjusted to the effective date (usually the evaluation date).


(g)   Composite methods  combine ideas from different methods. For example, hot

    deck and regression imputation can be combined by calculating predicted means

    from a regression but then adding a residual randomly chosen from the empirical

    residuals to the predicted value when forming values for imputation. See, for

example, Schieber (1978) and David et al. (1986).
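To make methods (b), (c), and (d) concrete, here is a minimal R sketch that fills in missing gross living area for the property example above. The data frame props and its columns (gla, bedrooms, baths, rooms, lotsize, price) are hypothetical names introduced only for illustration.

    # Hypothetical property data: props has columns gla, bedrooms, baths,
    # rooms, lotsize, price; some gla values are NA.
    obs <- !is.na(props$gla)                      # rows with GLA observed
    fit <- lm(gla ~ bedrooms + baths + rooms + lotsize, data = props[obs, ])

    # (b) Regression imputation: fill the holes with predicted means.
    pred <- predict(fit, newdata = props[!obs, ])
    props$gla_reg <- props$gla
    props$gla_reg[!obs] <- pred

    # (c) Stochastic regression imputation: add a normal residual to each
    #     prediction to reflect the uncertainty in the imputed value.
    sigma.hat <- summary(fit)$sigma
    props$gla_sto <- props$gla
    props$gla_sto[!obs] <- pred + rnorm(sum(!obs), mean = 0, sd = sigma.hat)

    # (d) Hot deck imputation: draw a donor GLA at random from properties
    #     in the same price band.
    band <- cut(props$price, breaks = c(0, 4e5, 8e5, Inf))
    props$gla_hot <- props$gla
    for (i in which(!obs)) {
      donors <- props$gla[obs & band == band[i]]
      if (length(donors) > 0)
        props$gla_hot[i] <- donors[sample.int(length(donors), 1)]
    }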

    An important limitation of the single imputation methods described so far is that

    standard variance formulas applied to the filled-in data systematically underestimate

    the variance of estimates, even if the model used to generate imputations is correct.

Even if reasonably unbiased estimates can be constructed, single imputation methods ignore the reduced variability of the predicted values and treat the imputed values as fixed. One response is to impute several values for each missing observation, drawn at random, say, from the conditional distribution implied by the regression equation. It is

    then possible to get a better handle on the uncertainty associated with the imputed val-

ues. Multiple imputation is one such example, but its obvious disadvantage prevents it from being used in nonparametric situations. Another example is nonparametric bootstrap imputation, which will be treated shortly.

    3.2.3 Multiple Imputations through Data Augmentation

MI refers to the procedure of imputing each missing value D (D ≥ 2) times. When the D sets of imputations are repeated random draws from the predictive distribution of

    the missing values under a particular model, the  D  complete-data inferences can be

    combined to form one inference that “properly reflects uncertainty due to nonresponse

    under the model” (Little and Rubin, 2002).

    As already indicated in Section 3.2.2, the obvious disadvantage of single imputa-

    tion is that imputing a single value treats that value as known, and thus without special

    adjustments, single imputation cannot reflect the sampling variability under the impu-


tation model for nonresponse. MI shares the advantages of single imputation and recti-

    fies its disadvantages. Specifically, when the D   imputations are repetitions under one

    model for missingness, the resulting D complete-data analyses can be easily combined

    to create an inference that validly reflects sampling variability because of the missing

    values.

    We now turn to the problem of creating the multiple imputations. Standard theory

    suggests that we draw the missing values as

Y_{mis}^{(d)} \sim p(Y_{mis} \mid Y_{obs}), \qquad d = 1, \dots, D. \qquad (3.5)

    that is, from their joint posterior predictive distribution. Unfortunately, it is often dif-

ficult to draw from this predictive distribution in complicated problems, because of the implicit requirement in Equation (3.5) to integrate over the unknown parameter θ. Data

    augmentation accomplishes this by iteratively drawing a sequence of values of the

    parameters and missing data until convergence.

    Data augmentation (Tanner and Wong, 1987) is an iterative two-step method of 

    imputing missing values by simulating the posterior distribution of  θ   that combines

    features of the EM algorithm and multiple imputations. These two steps are: the

imputation (or I) step and the posterior (or P) step. Start with an initial draw θ^(0) from an approximation to the posterior distribution of θ. Given a value θ^(t) of θ, draw at iteration t:

•  I step: draw Y_mis^(t+1) with density p(Y_mis | Y_obs, θ^(t));

•  P step: draw θ^(t+1) with density p(θ | Y_obs, Y_mis^(t+1)).

This procedure is motivated by the fact that the distributions in these two steps are often much easier to draw from than either of the posterior distributions p(Y_mis | Y_obs) and p(θ | Y_obs), or the joint posterior distribution p(θ, Y_mis | Y_obs). The iterative procedure


    can be shown to eventually yield a draw from the joint posterior distribution of  Y mis,

    θ  given Y obs, in the sense that as t  tends to infinity, this sequence converges to a draw

    from the joint distribution of  (θ, Y mis) given Y obs.
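As a toy illustration of the I and P steps, the following R sketch runs data augmentation for a single normally distributed variable with some values missing, under the standard noninformative prior p(µ, σ²) ∝ 1/σ². The data y and the number of iterations are assumptions made purely for the example; real applications (e.g., NORM) handle multivariate data.

    set.seed(1)
    y   <- c(rnorm(80, mean = 10, sd = 2), rep(NA, 20))   # toy data, 20% missing
    mis <- is.na(y)

    # starting values from the observed data
    mu <- mean(y[!mis]);  sigma2 <- var(y[!mis])

    for (t in 1:500) {
      # I step: draw Y_mis^(t+1) from p(Y_mis | Y_obs, theta^(t))
      y[mis] <- rnorm(sum(mis), mean = mu, sd = sqrt(sigma2))

      # P step: draw theta^(t+1) from the complete-data posterior
      # p(theta | Y_obs, Y_mis^(t+1)) under the prior 1/sigma^2
      n      <- length(y)
      sigma2 <- (n - 1) * var(y) / rchisq(1, df = n - 1)
      mu     <- rnorm(1, mean = mean(y), sd = sqrt(sigma2 / n))
    }
    # After a burn-in, (mu, sigma2, y[mis]) is approximately a draw from the
    # joint posterior of (theta, Y_mis) given Y_obs.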

    3.2.4 Assessment of Multiple Imputations

Although multiple imputation has desirable features (for instance, it allows one to get good estimates of standard errors), certain requirements must be met for MI to

have these desirable properties. First, the data must be missing at random (MAR), meaning that the probability of missingness on the data Y depends only on what is observed, and not on the components that are missing (see Equation 3.2). Second, the model used to generate the imputed values must be “correct” in some sense. Third, the model used for the analysis must match up, in some sense, with the model used in

    the imputation. All these conditions have been rigorously described by Rubin (1987,

    1996).

The problem is that it is easy to violate these conditions in practice. First, when

    the data are missing for reasons beyond the control of the investigators, one can never

    be sure whether MAR holds. In fact, to speak of a single “missingness mechanism”

is often misleading, because in most studies missing values occur for a variety of

    reasons; some of these may be entirely unrelated to the data in question, but others

    may be closely related.

    Unfortunately, it is not possible to relax the MAR assumption in any meaningful

way without replacing it with other, equally untestable, assumptions. At present, there are no principled nonignorable missing-data methods readily available to most data analysts. Thus, MI methods based on the MAR assumption should be used with an awareness of their limitations.

Second, in order to generate imputations for the missing values, a probability


model on the full data (observed and missing values) must be imposed. Each of the software packages available in R applies to a different class of multivariate models.

     NORM   uses the multivariate normal distribution.   CAT  is based on loglinear models,

    which have been traditionally used by social scientists to describe associations among

    variables in cross-classified data. The MIX  library relies on the general location model,

    which combines a loglinear model for the categorical variables with a multivariate nor-

    mal regression for the continuous ones. Details of these models are given by Schafer

    (1997).

In reality, data rarely conform to convenient models such as the multivariate normal. In most applications of MI, the model used to generate the imputations will at best be only approximately true. An imputation model should therefore be chosen to be (at

    least approximately) compatible with the real analyses to be performed on the imputed

    datasets. In particular, the imputation model should be “rich enough to preserve the

    associations or relationships among variables that will be the focus of later investiga-

    tion” (Schafer and Olsen, 1998). The precision you lose when you include unimportant

    predictors is usually a relatively small price to pay for the general validity of analyses

    of the resultant multiply imputed data set (Rubin, 1996). Therefore, a rich imputation

    model that preserves a large number of associations is desirable because it may be used

    for a variety of post-imputation analyses.

    Existing software packages, however, sometimes fail for imputation models with

    a large number of variables, especially when there are a large number of categorical

variables, since then the problems of the “curse of dimensionality” and “sparse cells” can easily occur. There is also the possibility of a misspecified imputation model, which typically leads to “overestimated variability, and thus, overcoverage of interval

    estimates” (Little and Rubin, 2002).

    Third, the Bayesian nature of MI requires investigators to specify a prior distri-


    bution for the parameter (θ) of the imputation model. In the Bayesian paradigm, this

    prior distribution quantifies one’s belief or state of knowledge about model parameters

    before any data are seen. Because different prior distributions can lead to different

    results, Bayesian models have been regarded by some statisticians as subjective and

    unscientific. “We tend to view the prior distribution as a mathematical convenience

    that allows us to generate the imputations in a principled fashion” (Schafer and Olsen,

    1998).

The nonparametric bootstrap method avoids all these problems implicit in MI, and thus provides a good alternative for imputing missing values in a broader range of situations. The details of the bootstrap method, together with the imputation algorithm, will be discussed in Chapter 5.


    CHAPTER 4

    Missing Data with CART/RF

    4.1 Missing Data with CART

Missing data are a problem for virtually all statistical procedures, and CART and RF are certainly no exceptions. The imputation methods we have discussed so far can be used for tree-based models directly or with some adjustments. For instance, if the percentage of

    missing data is less than 5% of the total number of observations, listwise deletion

    remains a possible choice.

    A second option is to impute the data outside CART. A simple example would be to

    employ conventional regression in which a predictor with the missing data is regressed

    on other predictors with which it is likely to be related. The resulting regression equa-

    tion can then be used to impute what the missing values might be.

    A third option is to address the missing data problems for predictors within CART

    itself. There are a number of ways this might be done. Here, we consider one of the

    better approaches, and the one readily available in the CART software.

    The first place where missing data come up is when a split is chosen. Recall that

    at each step we choose the split that gives the maximal reduction in impurity:

\Delta I = I(\tau) - p(\tau_L)\,I(\tau_L) - p(\tau_R)\,I(\tau_R) \qquad (4.1)

    where I (τ ) is the value of the parent impurity,  p(τ R) is the probability of a case falling

    in the right daughter node, p(τ L) is the probability of a case falling in the left daughter


    node, I (τ R) is the impurity of the right daughter node and  I (τ L) is the impurity of the

    left daughter node. CART tries to find the predictor and the split rule with which  ∆I 

    is as large as possible.

    Consider the first term on the right hand side (I (τ )). We can easily calculate its

value without any predictors and thus do not have to worry about missing values. How-

    ever, to construct the two daughter nodes, predictors are required. Each predictor is

    evaluated as usual, but using only the predictor values that are not missing. That is,

I(τ_R) and I(τ_L) are computed at the optimal split for each predictor using only the data available. The associated probabilities p(τ_R) and p(τ_L) are estimated for each predictor based on the split actually present.

    We are not done yet. Now, observations have to be assigned to one of the two

    daughter nodes. How can this be done if the predictor values needed are missing?

    CART imputes those missing values using “surrogate variables”.

Suppose there are 10 predictors x1–x10 to be included in the CART analysis, and suppose there are missing values for x1 only, which happens to be the “best” predictor

    chosen to define the “optimal” split. The split necessarily defines two categories for

    x1.

    “The predictor x1  now actually becomes a binary response variable with the two

    classes determined by the split” (Berk, 2005). CART is applied with  x1 as the response

variable and x2–x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the propor-

    tion of cases in x1 that are misclassified. Predictors that do no better than the marginal

    distribution of  x1 are dropped from further consideration.

    The variable with the lowest classification error for  x1 is then used in place of  x1 to

    assign cases with missing values on x1 to one of the two daughter nodes. That is, “the

    predicted classes for  x1  are used when the actual classes for  x1  are missing” (Berk,


2005). If x1, the “best” predictor, has missing data, the “best” surrogate variable, say x2, is used instead. If x2 is also missing, the second-“best” surrogate variable, say x3, is used instead. And so on. If each of the variables x2–x10 has missing data, the majority direction of the x1 split is used. For example, if the split is defined so that x1 ≤ c sends observations to the left and x1 > c sends cases to the right, cases with data missing on x1 that have no surrogate to use are placed along with the majority of cases. To be more specific, there are three options in the real implementation (the rpart library in R), illustrated by the short sketch after the list:

    1. 0 = display only; an observation with a missing value for the primary split rule

    is not sent further down the tree.

    2. 1 = use surrogates, in order, to split subjects missing the primary variable; if all

    surrogates are missing the observation is not split.

    3. 2 = if all surrogates are missing, then send the observations in the majority di-

    rection.
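These options correspond to the usesurrogate argument of rpart.control(). A minimal call sketch follows, in which the data frame dat, its binary response y, and its predictors are placeholders rather than the data sets analyzed later.

    library(rpart)

    # usesurrogate = 2 (the rpart default) uses surrogates in order and sends
    # cases missing all surrogates in the majority direction, i.e., option 3.
    fit <- rpart(y ~ ., data = dat, method = "class",
                 control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

    # summary(fit) lists, for each split, the surrogate variables and how well
    # they agree with the primary split.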

This would seem to be a reasonable response to missing data. There may be alternatives that perform better. But the greatest risk is that “if there are lots

    of missing data and the surrogate variables are used, the correspondence between the

    results and the data, had it been complete, can become very tenuous” (Berk, 2005).

    In practice, the data will rarely be missing completely at random (MCAR) or even

    missing at random (MAR). Then, if too much of the data are manufactured, rather than

    collected, a new kind of generalization error will be introduced. The problem is that

    imputation can fail just when you need it the most.

    Furthermore, a number of statistical difficulties can follow when the response vari-

    able is highly skewed. “The danger with missing data is that the skewing can be made

    worse” (Berk, 2005). Perhaps we should avoid using  surrogate variables, and impute


the missing data using alternative imputation methods, such as the nonparametric bootstrap method.

    4.2 Missing Data with RF

There are two ways in which random forests can impute missing data. Of the two implementations in the “randomForest” library, the “na.roughfix” option is quick and easy to implement (a brief usage sketch follows the list below). To be specific:

    1. For numerical variables, NAs are replaced with column medians.

    2. For factor variables, NAs are replaced with the most frequent levels (breaking

    ties at random).

    3. If a data matrix contains no NAs, it is returned unaltered.
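A minimal usage sketch; the data frame dat, whose first column is the response y, is a placeholder introduced only for illustration.

    library(randomForest)

    x.fixed <- na.roughfix(dat[, -1])   # medians/modes replace NAs in the predictors
    rf <- randomForest(x = x.fixed, y = dat$y, ntree = 500)

    # Equivalently, na.roughfix can be supplied as the na.action:
    rf2 <- randomForest(y ~ ., data = dat, na.action = na.roughfix, ntree = 500)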

    A more advanced algorithm capitalizes on the proximity matrix (“rfImpute()” in

    the randomForest library). We now formally introduce a proximity matrix.

A proximity matrix is an n × n symmetric matrix, which gives an intrinsic measure of the similarity between cases; here n is the number of cases in the data set. All cases in the training set are dropped down each tree. If case i and case j both land in

    the same terminal node, increase the proximity between i  and  j  (element (i, j) of the

    matrix) by one. At the end of the run, the proximities are divided by the number of 

    trees in the run and the proximity between a case and itself is set equal to one. This

    is an intrinsic proximity measure, inherent in the data and the RF algorithm. Thus

    each cell in the proximity matrix shows the proportion of trees over which each pair of 

    observations fall in the same terminal node. The higher the proportion, the more alike

those observations are, and the more “proximate” they are.


The proximity between cases i and j is the element prox(i, j) of this matrix. From the definition, it follows that “the values 1 − prox(i, j) are squared distances in a Euclidean space of high dimension” (Breiman, 2003).

The function “rfImpute()” starts by imputing NAs using “na.roughfix”; then ran-

    domForest() is called with the completed data. The proximity matrix from the random

    forests is used to update the imputations of the NAs. For continuous predictors, the

    imputed value is the weighted average of the non-missing observations, where the

    weights are the proximities. So, cases that are more like the cases with the missing

    data are given greater weight. For categorical predictors, the imputed value is the cat-

    egory with the largest average proximity. Again, cases more like the case with the

    missing data are given greater weight.
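A minimal sketch of this second option (dat and y are again placeholders; the iter and ntree values shown are the function's defaults):

    library(randomForest)

    set.seed(2)
    # rfImpute() starts from the rough fix, then refines the NAs with the
    # proximity matrix over several iterations of forest growing.
    dat.imp <- rfImpute(y ~ ., data = dat, iter = 5, ntree = 300)

    # The returned data frame (response first, imputed predictors after) can
    # be passed directly to randomForest().
    rf <- randomForest(y ~ ., data = dat.imp, ntree = 500)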

This process is relatively slow, requiring up to six iterations of forest growing. Moreover, the use of imputed values “tends to make the OOB measures of fit too optimistic”

    (Breiman, 2003). The computational demands are also quite daunting and may be

    impractical for many data sets until more efficient ways to handle the proximities are

    found.


    CHAPTER 5

    Nonparametric Bootstrap Methods to Impute Missing

    Data

In this chapter, we formally introduce one type of resampling method to impute missing data: the nonparametric bootstrap. A primary advantage of the nonparametric

    bootstrap method is that it does not depend on the missing-data mechanism, which

    rectifies disadvantages of all other imputation methods. It also requires no knowledge

    of either the probability distribution or model structure, and successfully incorporates

    the estimates of uncertainty associated with the imputed data.

    5.1 The Simple Bootstrap for Complete Data

    Let  θ̂  be a consistent estimate of a parameter  θ   based on a random sample  Y   =

(y_1, y_2, · · · , y_n)^T. Let Y^(b) be a sample of size n obtained from the original sample Y by simple random sampling with replacement, and θ̂^(b) be the estimate of θ obtained by

    applying the standard estimation method to Y  (b), where b  indexes the drawn samples,

and b = 1, 2, · · · , B. Then the sequence (θ̂^(1), . . . , θ̂^(B)) represents the set of estimates obtained by repeating this procedure B times. The bootstrap estimate of θ is defined

    as the average of the B  bootstrap estimates:

\hat{\theta}_{boot} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)} \qquad (5.1)


    Large-sample inferences can be derived from the bootstrap distribution of  θ̂(b), which

    are based on the histogram formed by the bootstrap estimates (θ̂(1), . . . , θ̂(B)). In par-

    ticular, the bootstrap estimate of the variance of  θ̂ or  θ̂boot is

\hat{V}_{boot} = \frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat{\theta}^{(b)} - \hat{\theta}_{boot}\bigr)^2 \qquad (5.2)

    It can be shown that “under certain conditions, (a) the bootstrap estimator  θ̂boot  is less

    biased than the original estimator  θ̂, and under quite general conditions (b)  V̂ boot   is

    a consistent estimate of the variance of  θ̂  or  θ̂boot  as n  and  B  tend to infinity” (Efron,

    1987). From property (b), we can see that if the bootstrap distribution is approximately

normal, a 100(1 − α)% bootstrap confidence interval for a scalar θ can be computed as

CI_{norm}(\theta) = \hat{\theta} \pm z_{1-\alpha/2}\sqrt{\hat{V}_{boot}} \qquad (5.3)

where z_{1−α/2} is the 100(1 − α/2) percentile of the normal distribution. Alternatively, if the bootstrap distribution is non-normal, a 100(1 − α)% bootstrap confidence interval can be computed empirically as

CI_{emp}(\theta) = \bigl(\hat{\theta}^{(b,l)}, \hat{\theta}^{(b,u)}\bigr) \qquad (5.4)

where θ̂^(b,l) and θ̂^(b,u) are the (α/2) and (1 − α/2) percentiles of the empirical bootstrap distribution of θ. Stable intervals based on Eq. (5.3) require bootstrap samples on the order of B = 200. Intervals based on Eq. (5.4) require much larger samples, for example B = 2000 or more (Efron, 1994).
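The following R sketch illustrates Equations (5.1)–(5.4) for the simplest of cases, the mean of a toy sample; the data and the value of B are assumptions made only for the example.

    set.seed(3)
    y <- rexp(100, rate = 0.5)        # toy complete data
    B <- 2000

    theta.b <- replicate(B, mean(sample(y, replace = TRUE)))   # theta-hat^(b)

    theta.boot <- mean(theta.b)                                # Eq. (5.1)
    V.boot     <- sum((theta.b - theta.boot)^2) / (B - 1)      # Eq. (5.2)

    theta.hat <- mean(y)
    ci.norm <- theta.hat + c(-1, 1) * qnorm(0.975) * sqrt(V.boot)   # Eq. (5.3)
    ci.emp  <- quantile(theta.b, c(0.025, 0.975))                   # Eq. (5.4)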

    5.2 The Simple Bootstrap Applied to Imputed Incomplete Data

    Suppose there is a simple random sample  Y   = (y1, y2, · · · , yn)T , but some obser-vations yi   are missing. A consistent estimate  θ̂  of an unknown parameter  θ   is com-

puted by first filling in the missing values in Y using some imputation method Imp,


yielding imputed data Ŷ = Imp(Y), and then estimating θ from the imputed data Ŷ.

    Bootstrap estimates (θ̂(1), . . . , θ̂(B)) can be computed as follows:

    For b = 1, . . . , B:

    1. Generate a bootstrap sample Y  (b) with replacement from the original incomplete

    sample Y  .

    2. Fill in the missing data in Y  (b) by applying the imputation procedure  Imp  to the

bootstrap sample Y^(b), so that Ŷ^(b) = Imp(Y^(b)).

3. Compute θ̂^(b) from the imputed complete data Ŷ^(b).

Then Equation (5.2) provides a consistent estimate of the variance of θ̂, and Equations

    (5.3) or (5.4) can be used to generate confidence intervals for an unknown scalar pa-

    rameter.
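A small sketch of these three steps for a scalar parameter (the mean of a partially missing variable y1, imputed by regressing it on a fully observed y2); the toy data frame d and the regression-imputation function are assumptions for illustration.

    set.seed(4)
    n <- 200
    d <- data.frame(y2 = rnorm(n))
    d$y1 <- 1 + 2 * d$y2 + rnorm(n)
    d$y1[sample(n, 40)] <- NA            # make 20% of y1 missing

    impute <- function(dd) {             # Imp(): regression imputation of y1 on y2
      fit <- lm(y1 ~ y2, data = dd)
      dd$y1[is.na(dd$y1)] <- predict(fit, newdata = dd[is.na(dd$y1), ])
      dd
    }

    B <- 2000
    theta.b <- replicate(B, {
      db <- d[sample(n, replace = TRUE), ]   # step 1: bootstrap the incomplete data
      mean(impute(db)$y1)                    # steps 2 and 3: impute, then estimate
    })

    V.boot <- var(theta.b)                          # Eq. (5.2)
    ci.emp <- quantile(theta.b, c(0.025, 0.975))    # Eq. (5.4)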

    A key feature of this procedure is that the imputation procedure is applied B times,

    once to each bootstrap sample. Hence the approach is computationally intensive. A

    simpler procedure would be to apply the imputation procedure  Imp  just once to yield

    one imputed data set  Ŷ  , and then bootstrap the estimation method applied to the filled-

    in data. However, this approach clearly “does not propagate the uncertainty in the

    imputations and hence does not provide valid inferences” (Little and Rubin, 2002). A

    second key feature is that the imputation method must yield a consistent estimate  θ̂

    for the true parameter. This is not required for Equation (5.2) to yield a valid estimate

    of sampling error, but it is required for Equations (5.3) and (5.4) to yield appropriate

    confidence coverage, and for tests to have the nominal size – see in particular Rubin’s

    (1994) discussion of Efron (1994).

    This approach should be applied with caution since it assumes large samples. With

    moderate-sized data sets, it is possible that an imputation procedure that works for the

    full sample may need to be modified for one or more bootstrap samples.


    A principal advantage of the nonparametric bootstrap method is that it does not

    depend on the missing-data mechanism. Its main practical disadvantage is the compu-

    tational expense of the 2000 or so bootstrap replications required for reasonable nu-

merical accuracy if the bootstrap distribution is non-normal (Efron, 1994). Fortunately, this is no longer a major concern given the computing power available today.

    5.3 The Imputation Algorithm for Tree-Based Models

    The nonparametric bootstrap method to impute missing data for tree-based models

can be structured as follows; a minimal R sketch of Algorithm 1 appears after its steps.

    Algorithm 1:

1. Draw B (say, 2000) bootstrap samples.

2. For each bootstrap sample b = 1, 2, · · · , B, impute missing values using the following steps:

•   Replace missing values with the median (if the predictor is quantitative) or the mode (if the predictor is qualitative), a.k.a. a “rough fix”;

•   Each categorical predictor is regressed on the other predictors with which it is likely to be related, using logistic regression.

•   Each continuous predictor is regressed on the other predictors with which it is likely to be related, using Gaussian regression.

•   Each count (integer-valued) predictor is regressed on the other predictors with which it is likely to be related, using Poisson regression.

•  For observations that have missing data, predict each missing field using the corresponding regression equation. The missing values are then filled in

    using the predicted values.


•   Apply CART/RF to the imputed bootstrap sample and obtain its confusion table.

    •  Extract and store false positive and false negative errors from confusion

    tables.

    3. Repeat Step #2 a large number of times (e.g., B  = 2000).

    4. Study the empirical distributions of false positive and false negative errors from

     B runs.

    5. Construct confidence intervals.
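A minimal R sketch of Algorithm 1 follows. Purely for illustration, it assumes a data frame dat whose response y is a binary factor with levels “0” and “1”, with one incomplete continuous predictor x1 and one incomplete binary factor predictor x2; a real implementation would loop the regression-imputation step over every incomplete predictor according to its type.

    library(rpart)
    library(randomForest)   # only for na.roughfix()

    impute.once <- function(d) {
      d.fix <- na.roughfix(d)                       # rough fix: medians / modes
      if (any(is.na(d$x1))) {                       # Gaussian regression for x1
        f1 <- lm(x1 ~ . - y, data = d.fix[!is.na(d$x1), ])
        d.fix$x1[is.na(d$x1)] <- predict(f1, newdata = d.fix[is.na(d$x1), ])
      }
      if (any(is.na(d$x2))) {                       # logistic regression for x2;
        f2 <- glm(x2 ~ . - y, data = d.fix[!is.na(d$x2), ],   # counts would use
                  family = binomial)                           # family = poisson
        p2 <- predict(f2, newdata = d.fix[is.na(d$x2), ], type = "response")
        d.fix$x2[is.na(d$x2)] <- levels(d$x2)[1 + (p2 > 0.5)]
      }
      d.fix
    }

    B  <- 2000
    n  <- nrow(dat)
    fp <- fn <- numeric(B)

    for (b in 1:B) {
      db  <- dat[sample(n, replace = TRUE), ]             # bootstrap sample
      dbi <- impute.once(db)                              # fill in its NAs
      fit <- rpart(y ~ ., data = dbi, method = "class")   # or randomForest()
      tab <- table(observed  = dbi$y,
                   predicted = predict(fit, type = "class"))
      fp[b] <- tab["0", "1"] / sum(tab["0", ])            # false positive rate
      fn[b] <- tab["1", "0"] / sum(tab["1", ])            # false negative rate
    }

    quantile(fp, c(0.025, 0.975))    # empirical 95% intervals for the two errors
    quantile(fn, c(0.025, 0.975))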

Algorithm 2 differs from Algorithm 1 only in how the B runs are summarized; the procedures used to impute missing values remain the same. The difference is that we now obtain an overall estimate of the false positive and false negative errors instead of confidence intervals for them. (A sketch of the vote-aggregation steps appears after the algorithm.)

    Algorithm 2:

1. Draw B (say, 2000) bootstrap samples.

2. For each bootstrap sample b = 1, 2, · · · , B, impute missing values using the following steps:

•   Replace missing values with the median (if the predictor is quantitative) or the mode (if the predictor is qualitative), a.k.a. a “rough fix”;

•   Each categorical predictor is regressed on the other predictors with which it is likely to be related, using logistic regression.

•   Each continuous predictor is regressed on the other predictors with which it is likely to be related, using Gaussian regression.


•   Each count (integer-valued) predictor is regressed on the other predictors with which it is likely to be related, using Poisson regression.

    • For observations that have missing data, predict each missing field using

the corresponding regression equation. The missing values are then filled in

    using the predicted values.

•  Apply CART/RF to the imputed bootstrap sample.

•   Drop the cases in the bth bootstrap sample down the tree. Store the class assigned to each observation in-the-sample along with each observation’s

    predictor values.

    3. Repeat Step #2 a large number of times (e.g., B  = 2000).

4. Using only the class assigned to each observation when that observation is in-the-sample, count the number of times over B replications that the observation is classified in one category and the number of times over B replications it is

    classified in the other category.

    5. Assign each case to a category by a majority vote over B  replications. Thus, if 

    51% of the time a given case is classified as a “1”, that becomes its estimated

    classification.

    6. Construct the confusion table using the assigned class.
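Continuing the Algorithm 1 sketch (same illustrative dat, impute.once(), and B), the vote-aggregation steps 4–6 might look as follows.

    library(rpart)

    n      <- nrow(dat)
    votes1 <- integer(n)   # times a case is classified "1" when in-the-sample
    times  <- integer(n)   # times a case appears in a bootstrap sample

    for (b in 1:B) {
      idx <- sample(n, replace = TRUE)
      dbi <- impute.once(dat[idx, ])
      fit <- rpart(y ~ ., data = dbi, method = "class")
      cls <- predict(fit, type = "class")              # in-sample classes
      for (i in seq_along(idx)) {
        times[idx[i]]  <- times[idx[i]] + 1
        votes1[idx[i]] <- votes1[idx[i]] + (cls[i] == "1")
      }
    }

    # Step 5: majority vote over the B replications; Step 6: confusion table.
    assigned <- factor(ifelse(votes1 / times > 0.5, "1", "0"), levels = c("0", "1"))
    table(observed = dat$y, assigned = assigned)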
