
    UNIVERSITY OF  CALIFORNIA

    Los Angeles

    Missing Data Imputation for Tree-Based Models

    A dissertation submitted in partial satisfaction

    of the requirements for the degree

    Doctor of Philosophy in Statistics

    by

    Yan He

    2006


    © Copyright by Yan He

    2006


    The dissertation of Yan He is approved.

    Susan Sorenson

    Hongquan Xu

    Mark Hansen

    Richard Berk, Committee Chair

    University of California, Los Angeles

    2006


    To my parents and my husband with love and gratitude 


    TABLE OF CONTENTS

    Acknowledgments
    Abstract
    1 Introduction
    2 Classification and Regression Trees (CART) and Extensions
    2.1 Classification and Regression Trees (CART)
    2.1.1 Splitting A Tree
    2.1.2 Pruning A Tree
    2.1.3 Taking Cost into Account
    2.2 Random Forest (RF)
    2.2.1 The Algorithm
    2.2.2 The Comparative Advantage of Random Forests
    3 Standard Theory on Missing Data
    3.1 Mechanisms That Lead to Missing Data
    3.2 Treatment of Missing Data
    3.2.1 Listwise Deletion
    3.2.2 Single Imputation
    3.2.3 Multiple Imputations through Data Augmentation
    3.2.4 Assessment of Multiple Imputations


    4 Missing Data with CART/RF
    4.1 Missing Data with CART
    4.2 Missing Data with RF
    5 Nonparametric Bootstrap Methods to Impute Missing Data
    5.1 The Simple Bootstrap for Complete Data
    5.2 The Simple Bootstrap Applied to Imputed Incomplete Data
    5.3 The Imputation Algorithm for Tree-Based Models
    6 Empirical Studies
    6.1 Data Sets
    6.1.1 Data of Diabetes
    6.1.2 Data of Domestic Violence
    6.1.3 Data of Dolphin
    6.2 Missing Values in the Three Data Sets
    6.3 Comparison for CART
    6.4 Comparison for Random Forests
    7 Discussion and Future Work
    8 Acknowledgment
    Bibliography


    LIST OF FIGURES

    6.1 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “DV” data; cost ratio = 5:1.
    6.2 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “crime” data; cost ratio = 10:1.
    6.3 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “diabetes” data; cost ratio = 2:1.
    6.4 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 CART Bootstrap: “dolphin” data; cost ratio = 10:1.
    6.5 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “DV” data.
    6.6 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “crime” data.
    6.7 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “diabetes” data.
    6.8 Empirical Distributions for False Positive Errors & False Negative Errors from 2000 RF Bootstrap: “dolphin” data.


    LIST OF TABLES

    6.1 CART confusion table for “DV” objective: N = 516 complete cases; cost ratio = 5:1.
    6.2 CART confusion table for “DV” objective using “surrogate”: N = 636; cost ratio = 5:1.
    6.3 CART confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 2000; cost ratio = 5:1.
    6.4 Surrogate Example.
    6.5 CART confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 30; cost ratio = 5:1.
    6.6 CART confusion table for “crime” objective: N = 516 complete cases; cost ratio = 10:1.
    6.7 CART confusion table for “crime” objective using “surrogate”: N = 636; cost ratio = 10:1.
    6.8 CART confusion table for “crime” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 636; B = 30; cost ratio = 10:1.
    6.9 CART confusion table for “diabetes” objective using full data set: N = 768; cost ratio = 2:1.
    6.10 CART confusion table for “diabetes” objective using “surrogate”: N = 768; cost ratio = 2:1.


    6.11 CART confusion table for “diabetes” objective using nonparametric method to impute missing values (Algorithm 2): N = 768; B = 30; cost ratio = 2:1.
    6.12 CART confusion table for “dolphin” objective using full data set: N = 1000; cost ratio = 10:1.
    6.13 CART confusion table for “dolphin” objective using “surrogate”: N = 1000; cost ratio = 10:1.
    6.14 CART confusion table for “dolphin” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 1000; B = 30; cost ratio = 10:1.
    6.15 Prediction Errors & 95% Confidence Intervals for False Positives and False Negatives Using 2000 Bootstrap Samples: CART Model.
    6.16 RF confusion table for “DV” objective: N = 516 complete cases.
    6.17 RF confusion table for “DV” objective using “rfImpute”: N = 671.
    6.18 RF confusion table for “DV” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 671; B = 30.
    6.19 RF confusion table for “crime” objective: N = 516 complete cases.
    6.20 RF confusion table for “crime” objective using “rfImpute”: N = 671.
    6.21 RF confusion table for “crime” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 671; B = 30.
    6.22 RF confusion table for “diabetes” objective using full data set: N = 768.
    6.23 RF confusion table for “diabetes” objective using “rfImpute”: N = 768.


    6.24 RF confusion table for “diabetes” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 768; B = 30.
    6.25 RF confusion table for “dolphin” objective using full data set: N = 1000.
    6.26 RF confusion table for “dolphin” objective using “rfImpute”: N = 1000.
    6.27 RF confusion table for “dolphin” objective using nonparametric bootstrap method to impute missing values (Algorithm 2): N = 1000; B = 30.
    6.28 Prediction Errors & 95% Confidence Intervals for Misclassification Errors Using 2000 Bootstrap Samples: RF.
    7.1 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with Missing Data.
    7.2 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with No Missing Data: I (Deleting 20% of Cases from Learning Sample).
    7.3 Prediction Errors for RF Model by Applying Different Imputation Methods to Test Sets with No Missing Data: II (Deleting 50% of Cases from Learning Sample).


    ACKNOWLEDGMENTS

    First and foremost, I would like to express my deepest gratitude to my advisor and committee chair, Professor Berk. His guidance, support, and kindness made this work possible. He has my most sincere and hearty appreciation for giving me the freedom to pursue the problems I chose in the way I liked.

    I am also thankful to the other members of my committee, Professors Hansen, Xu

    and Sorenson, for their suggestions and comments and the time they spent in reviewing

    this dissertation.

    I wish to express my appreciation to Professor Lin, a senior western-educated pro-

    fessor, who opened this amazing field to me when I was still a college student. It is

    under his mentorship that I became interested in studying and doing statistical analysis

    to solve real world problems. During my four years in college, Professor Lin offered

    me numerous opportunities to learn cutting-edge researches, and he also invited me

    to actively participate in many national projects, which built my strong background in

    mathematical analysis. It was his reference that made my application to the statistics

    program at UCLA a lot easier.

    I would also like to thank Professor Sun, a super nice professor, whose pecuniary

    support made my joining UCLA a reality.

    A debt of gratitude is owed to Professors Wu, Ferguson and Jan de Leeuw, who

    were always so nice and so patient in answering my questions. I also appreciate the kindness of Mrs. Dean Dacumos, who made the Statistics Department a united community,

    and our life enjoyable.

    To my parents and my husband, for their continuing support and understanding.


    VITA

    1975.10.24 Born, Nantong, P. R. China.

    1994–1998 B.A. in Economics & B.A. in Economic Law, Huazhong University

    of Science and Technology, Wuhan, P. R. China. With High Honors.

    2000–2001 M.A. in Economics, Department of Economics, The Ohio State

    University. Awarded University Fellowship.

    2001–2003 M.S. in Statistics, Department of Statistics, UCLA.

    2003.06–09 Fair Isaac Corporation, Internship.

    2005–present Countrywide Home Loans, CA.


    PUBLICATIONS

    Yan He: Problems and Suggestions for Improving to the Exchange Sterilization Op-

    eration of China’s Central Bank.  The Study of Finance and Economics, April 2000,

    Vol.26 No.4.

    ShaoGong Lin, Qiming Tang, ZhiHong Fan and Yan He: Translated the book  Econo-

    metric Methods  by Jack Johnston & John DiNardo (UCI) (4th edition) into Chinese.

    Published by China Economics Publishing House (ISBN 7-5017-5063-7), 2002.

    Juana Sanchez and Yan He: Examples of the Application of Statistics and Probability

    to Computer Science. Presented at the Joint AMS-MAA  (American Mathematical So-

    ciety - Mathematical Association of America) Annual Meeting. January 7-10, 2004,

    Phoenix, AZ.

    Richard Berk, Yan He and Susan Sorenson: Developing a Practical Forecasting Screener

    for Domestic Violence Incidents.  Evaluation Review, 29(4): 358-382, August 2005.

    Juana Sanchez and Yan He: Internet Data Analysis for the Undergraduate Statistics

    Curriculum.  Journal of Statistics Education, Volume 13(3), 2005.


    ABSTRACT OF THE  DISSERTATION

    Missing Data Imputation for Tree-Based Models

    by

    Yan He

    Doctor of Philosophy in Statistics

    University of California, Los Angeles, 2006

    Professor Richard Berk, Chair

    A wide variety of data can include some form of censoring or missing information. Missing data are a problem for all statistical analyses, and tree-based models such as CART and Random Forests are certainly no exception.

    In recent years, there have been many newly developed tools that can be applied

    to missing data problems: likelihood and estimating function methodology, cross-

    validation, the bootstrap and other simulation techniques, Bayesian and multiple im-

    putations, and the EM algorithm. Although applied successfully to well-defined para-

    metric models, such methods may be inappropriate for tree-based models, which are

    usually considered non-parametric models. CART/RF have built-in algorithms to

    impute missing data, such as surrogate variables or proximity. But these imputation

    methods have no formal rationale, and are unstable, especially for RF models.

    The nonparametric bootstrap method for imputing missing values overcomes all of the drawbacks that are implicit in both single and multiple imputation. It 1) does not depend on the missing-data mechanism, 2) requires no knowledge of either the probability distributions or the model structure, and 3) successfully incorporates the estimates of uncertainty associated with the imputed data. Furthermore, 2000 replications of bootstrap samples provide stable and accurate statistical inferences (Efron, 1994).


    In my dissertation research, the nonparametric bootstrap methods were imple-

    mented to impute missing values before cases were dropped down the tree (CART/RF),

    and the classification results were compared to both complete-data/full-data analysis

    and to the classification results using surrogate variables/proximity. Significant improvement in the ability to predict was found for both CART and RF models.


    CHAPTER 1

    Introduction

    A wide variety of data can include some form of censoring or missing information. Data imputation can then be an important component of the analysis, but crude

    methods for data imputation can lead to substantial bias in the results. For example, a

    “complete-case analysis” simply ignores the missing data and risks substantial bias.

    In recent years, there have been many new computationally intensive tools devel-

    oped that can be applied to missing data problems: likelihood and estimating function

    methodology, cross-validation, the bootstrap and other simulation techniques, Bayesian and multiple imputation, and the EM algorithm. Existing methods have been suc-

    cessfully applied with well-defined parametric models, such as Gaussian regression,

    and loglinear models. But their usefulness has yet to be demonstrated for tree-based

    models, such as Classification and Regression Trees (CART) and random forests (RF),

    which are usually considered non-parametric methods. It is this oversight that I will

    attempt to remedy, in part, in the pages ahead.

    More specifically, parametric models, such as linear regression, can provide useful

    descriptions of simple structures in data. However, sometimes such simple structure

    does not extend across an entire data set and may instead be confined more locally

    within subsets of the data. Then, the structure might be better described by a model that partitions the data into subsets, employing separate submodels for each. Such an alternative can be accomplished by using a tree-based approach, known as CART (Classification and Regression Trees).


    Given a data set, a common strategy for finding a good tree is to use a greedy

    algorithm to grow a tree and then to prune it back to avoid overfitting. Such greedy algorithms typically grow a tree by sequentially choosing splitting rules for nodes on

    the basis of maximizing some fitting criterion. This generates a sequence of trees,

    each of which is an extension of previous trees. A single tree is then selected by prun-

    ing the largest tree according to a model selection criterion such as cost-complexity

    pruning (Breiman et al., 1984), cross-validation, or even multiple tests of whether two

    adjoining nodes should be collapsed into a single node.

    The overfitting problem in CART motivated people to develop bundling methods

    such as bagging and random forests. Bagging predictors is a method for generating

    multiple versions of a predictor and using these to get an aggregated result. In the case

    of CART, the aggregation averages over the trees when predicting a numerical outcome

    and does a plurality vote when predicting a class. The multiple versions are formed

    by making bootstrap replicates of the learning set and using these as new learning data

    sets. Tests on real and simulated data sets using classification and regression trees and

    subset selection in linear regression have shown that bagging can allow for substantial

    gains in accuracy (Breiman, 1996). The vital element is the instability of the prediction

    method. If perturbing the learning set can cause significant changes in the predictor

    constructed, then bagging can improve accuracy.

    Random forests (RF) is a further extension of bagging. A Random forest model is

    a combination of tree predictors such that each tree depends on the values of a random

    vector sampled independently and with the same distribution for all trees in the forest.

    The generalization error for RF converges almost surely to a limit as the number of 

    trees in the forest becomes large (Breiman, 2001). Using a random selection of features

    to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire, 1996), but are more robust with respect to noise.


    Missing data can be a problem for all statistical analyses; CART/bagging/RF are certainly no exception. Missing data can create the same kinds of difficulties they create for conventional linear regression: there is a loss of statistical power with the reduction in sample size, and a real possibility of bias if the observations are not lost at random.

    A general discussion of missing data and excellent treatment are easily found (Lit-

    tle and Rubin, 2002). If the data are really “missing completely at random” (MCAR),

    the only loss is statistical power. And if the number of cases lost is not large, the reduc-

    tion in power is likely to be insignificant. It is, therefore, mandatory that the researcher

    make a convincing argument that the data are missing completely at random. The re-

    sults are then dependent upon the missing completely at random assumption, and may

    be of little statistical interest unless the credibility of that assumption is determined.

    A less strict assumption is that the data are “missing at random” (MAR). One

    can subset the data based on the values of observed variables so that for each such

    subset, the data are missing completely at random. If this assumption is correct, the analysis can be conducted separately for each of the subsets and then reassembled. But again, the assumed mechanism by which the data are missing must be argued convincingly.

    If either of these assumptions can be justified, it will be useful to impute the values

    of the missing data. Imputing missing values for the response variable is usually not

    sensible because the relationship between the response and the predictors can be sys-

    tematically altered. But sometimes it can be very helpful to impute missing data for

    predictors.

    The key problem with any imputation procedure is that when the data are ultimately

    analyzed, including the real data and the imputed data, the statistical procedures ap-

    plied cannot tell which is which and necessarily treat all of the observations alike. The


    imputed values are estimates, and estimates usually come with random error. In addi-

    tion, “the imputed values, which are just fitted values, will have less variability than the

    original value itself” (Berk, 2005). In short, the imputed values will typically be less

    variable than the real thing. The reduced variability can seriously undermine statistical

    inference.

    It is well known that CART/RF have built-in algorithms to impute missing data,

    such as using surrogate variables or proximities. But these imputation methods have

    no formal rationale. Furthermore, since CART/RF are nonparametric rather than parametric models, advanced multiple imputation (MI) methods may not apply at all. In

    short, tools for imputing missing data are likely to be inadequate.

    This thesis will address nonparametric approaches to assessing the accuracy of 

    an estimator in a missing data situation. Three main topics are discussed: bootstrap

    methods for missing data, their relationship to the theory of multiple imputation, and

    comparison to the surrogate variables/proximity method. Two main advantages (Efron,

    1994) of nonparametric bootstrap imputation are: 1) it requires no knowledge of the

    missing-data mechanism other than that it is missing at random or conditionally at

    random; 2) the confidence interval turns out to give convenient and accurate answers.

    The thesis is structured as follows: Chapter 1 introduces basic concepts about tree-

    based models and missing data problem, and motivates this thesis. Chapter 2 intro-

    duces Classification and Regression Trees (CART), as well as random forests (RF).

    Standard theories of missing data and imputation methods are elaborated in Chapter 3,

    which also illustrates the limitation of applying multiple imputation (MI) to tree-based

    models. Chapter 4 explains how CART and RF deal with missing data, and their po-

    tential limitations. Chapter 5 formally introduces nonparametric bootstrap methods to

    impute missing data, and proposes corresponding algorithms in detail. Chapter 6 is

    an empirical study, which applies several imputation methods to various data sets. Here,


    the classification errors from 2000 bootstrapped imputations are compared to the surrogate method for CART models, and the 2000 bootstrapped imputations are compared to the proximity method for RF models. Significant improvement can be found by using

    nonparametric bootstrap methods. Chapter 7 discusses the effectiveness of the non-

    parametric bootstrap methods in their ability to classify, as well as possible limitations.

    Further improvement to the algorithm is also suggested.


    CHAPTER 2

    Classification and Regression Trees (CART) and

    Extensions

    2.1 Classification and Regression Trees (CART)

    We begin with a discussion of the general structure of a CART model. A CART

    model describes the conditional distribution of y given X, where y is the response variable and X is a set of predictors (X = (X_1, X_2, ..., X_p)). This model has two main components: a tree T with b terminal nodes, and a parameter Θ = (θ_1, θ_2, ..., θ_b) ⊂ R^k which associates the parameter value θ_m with the m-th terminal node. Thus a treed model is fully specified by the pair (T, Θ). If X lies in the region corresponding to the m-th terminal node, then y|X has the distribution f(y|θ_m), where we use f to represent a conditional distribution indexed by θ_m. The model is called a regression tree or a classification tree according to whether the response y is quantitative or qualitative, respectively.

    2.1.1 Splitting A Tree

    The binary tree T subdivides the predictor space as follows. Each internal node has an associated splitting rule which uses a predictor to assign observations to either its left or right child node. Each internal node is thus partitioned into two child nodes using its splitting rule. For quantitative predictors, the splitting rule is based on


    a split point s, and assigns observations for which {x_i ≤ s} or {x_i > s} to the left or right child node, respectively. For qualitative predictors, the splitting rule is based on a category subset C, and assigns observations for which {x_i ∈ C} or {x_i ∉ C} to the left or right child node, respectively.

    For a regression tree, the conventional algorithm models the response in each region R_m as a constant c_m. Thus the overall tree model can be expressed as (Hastie, Tibshirani and Friedman, 2001):

        f(x) = \sum_{m=1}^{b} c_m I(X \in R_m)    (2.1)

    where R_m, m = 1, 2, ..., b, constitute a partition of the predictor space and therefore represent the b terminal nodes. If we adopt minimizing the sum of squares \sum_i (y_i - f(X_i))^2 as our criterion to characterize the best split, it is easy to see that the best ĉ_m is just the average of y_i in region R_m:

        \hat{c}_m = \mathrm{ave}(y_i \mid X_i \in R_m) = \frac{1}{N_m} \sum_{X_i \in R_m} y_i    (2.2)

    where N_m is the number of observations falling in node m. The residual sum of squares is then

        Q_m(T) = \frac{1}{N_m} \sum_{X_i \in R_m} (y_i - \hat{c}_m)^2    (2.3)

    which will serve as an impurity measure for regression trees.
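    As a concrete illustration of (2.1)-(2.3), the short Python sketch below (scikit-learn and NumPy on synthetic data; this is only an assumed, illustrative setup and not the software used for the dissertation's analyses) fits a small regression tree and checks that each terminal node's fitted value equals the node average ĉ_m and that its within-node mean squared residual equals Q_m(T).

        # Sketch: terminal-node means of a regression tree equal the per-node
        # averages of y (equation 2.2); the within-node mean squared residual
        # is the node impurity Q_m(T) of equation 2.3.
        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(size=(200, 2))                 # two quantitative predictors
        y = 2.0 * (X[:, 0] > 0.5) + rng.normal(scale=0.3, size=200)

        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
        leaf = tree.apply(X)                           # terminal node id of each case

        for m in np.unique(leaf):
            in_node = leaf == m
            c_hat = y[in_node].mean()                  # node average, \hat{c}_m
            q_m = np.mean((y[in_node] - c_hat) ** 2)   # node impurity, Q_m(T)
            pred = tree.predict(X[in_node])[0]         # fitted value in this node
            print(f"node {m}: N_m={in_node.sum():3d}  c_hat={c_hat:.3f}  "
                  f"pred={pred:.3f}  Q_m={q_m:.3f}")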

    If the response is a factor taking outcomes 1, 2, ..., K, the impurity measure Q_m(T) defined in (2.3) is not suitable. Instead, we represent a region R_m with N_m observations by

        \hat{p}_{mk} = \frac{1}{N_m} \sum_{X_i \in R_m} I(y_i = k)    (2.4)

    which is the proportion of class k (k ∈ {1, 2, ..., K}) observations in node m. We classify the observations in node m to the class k(m) = \arg\max_k \hat{p}_{mk}, the majority class in node m.


    Different measures Q_m(T) of node impurity include the following (Hastie, Tibshirani and Friedman, 2001):

        \text{Misclassification error:} \quad \frac{1}{N_m} \sum_{i \in R_m} I(y_i \ne k(m)) = 1 - \hat{p}_{mk(m)}

        \text{Gini index:} \quad \sum_{k \ne k'} \hat{p}_{mk}\,\hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})

        \text{Cross-entropy or deviance:} \quad -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}    (2.5)

    For binary outcomes, if p is the proportion of the second class, these three measures are 1 − max(p, 1 − p), 2p(1 − p), and −p log p − (1 − p) log(1 − p), respectively. All three definitions of impurity are concave, having minima at p = 0 and p = 1 and a maximum at p = 0.5. Entropy and the Gini index are the most common, and generally “give very similar results except when there are two response categories” (Berk, 2005).
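    The binary versions of the three measures in (2.5) are easy to compare numerically. The following short Python sketch (purely illustrative, not part of the original text) evaluates them over a few values of the class proportion p and shows that all three are zero for pure nodes and largest at p = 0.5.

        # Sketch: the three binary node-impurity measures of equation 2.5.
        import numpy as np

        def misclassification(p):
            return 1.0 - np.maximum(p, 1.0 - p)

        def gini(p):
            return 2.0 * p * (1.0 - p)

        def deviance(p):
            # use the convention 0 * log(0) = 0 so pure nodes have zero impurity
            with np.errstate(divide="ignore", invalid="ignore"):
                out = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
            return np.nan_to_num(out)

        for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
            print(f"p={p:4.2f}  error={misclassification(p):.3f}  "
                  f"gini={gini(p):.3f}  deviance={deviance(p):.3f}")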

    2.1.2 Pruning A Tree

    To be consistent with conventional notation, let us define the impurity of a node τ as I(τ) ((2.3) for a regression tree, and any one of the measures in (2.5) for a classification tree). We then choose the split with maximal impurity reduction

        \Delta I = I(\tau) - p(\tau_L)\, I(\tau_L) - p(\tau_R)\, I(\tau_R)    (2.6)

    where τ_L and τ_R are the left and right child nodes of τ.

    How large should we grow the tree then? Clearly a very large tree might overfit the

    data, while a small tree may not be able to capture the important structure. Tree size is

    a tuning parameter governing the model’s complexity, and the optimal tree size should


    be adaptively chosen from the data. One approach would be to split a node only if the decrease in impurity due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

    The preferred strategy is to grow a large tree  T 0, stopping the splitting process

    when some minimum number of observations in a terminal node (say 10) is reached.

    Then this large tree is pruned using cost-complexity pruning.

    We define a subtree T ⊂ T_0 to be any tree that can be obtained by pruning T_0, that is, by collapsing any number of its internal nodes, and we define \tilde{T} to be the set of terminal nodes of T. As before, we index terminal nodes by m, with node m representing region R_m. Let |T| denote the number of terminal nodes in T (|T| = b). We use |T| instead of b in this section, following the “conventional” notation, and define the risk of trees as

        \text{Regression tree:} \quad R(T) = \sum_{m=1}^{|\tilde{T}|} N_m Q_m(T)

        \text{Classification tree:} \quad R(T) = \sum_{\tau \in \tilde{T}} P(\tau)\, r(\tau)    (2.7)

    where r(τ) measures the impurity of node τ in a classification tree (it can be any one of the measures in (2.5)).

    We define the cost complexity criterion (Breiman et al., 1984)

        R_\alpha(T) = R(T) + \alpha |T|    (2.8)

    where α (> 0) is the complexity parameter. The idea is, for each α, to find the subtree T_α ⊆ T_0 that minimizes R_α(T). The tuning parameter α ≥ 0 “governs the tradeoff between tree size and its goodness of fit to the data” (Hastie, Tibshirani and Friedman, 2001). Large values of α result in a smaller tree T_α, and conversely for smaller values of α. As the notation suggests, with α = 0 the solution is the full tree T_0.


    To find T_α we use weakest-link pruning: we successively collapse the internal node that produces the smallest per-node increase in R(T), and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show that this sequence must contain T_α. See Breiman et al. (1984) and Ripley (1996) for details. Estimation of α (i.e., α̂) is achieved by five- or ten-fold cross-validation. Our final tree is then denoted T_α̂.
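    A minimal Python sketch of this grow-then-prune procedure is given below, using scikit-learn's cost-complexity pruning on a stand-in data set; it is only an illustration of the general idea described above, not the implementation behind the dissertation's results. The full tree T_0 is grown, the weakest-link sequence of candidate values of α is extracted, and α̂ is chosen by 5-fold cross-validation.

        # Sketch: grow T_0, obtain the weakest-link alpha sequence, and pick
        # alpha-hat by cross-validation (cost-complexity pruning).
        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)     # stand-in data set

        full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)   # T_0
        path = full_tree.cost_complexity_pruning_path(X, y)
        alphas = path.ccp_alphas[:-1]                  # drop the root-only tree

        cv_scores = [
            cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                            X, y, cv=5).mean()
            for a in alphas
        ]
        alpha_hat = alphas[int(np.argmax(cv_scores))]
        final_tree = DecisionTreeClassifier(random_state=0,
                                            ccp_alpha=alpha_hat).fit(X, y)
        print("alpha-hat:", alpha_hat, " terminal nodes:", final_tree.get_n_leaves())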

    It follows that, in CART and related algorithms, classification and regression trees

    are produced from data in two stages. In the first stage, a large initial tree is produced

    by splitting one node at a time in an iterative, greedy fashion. In the second stage,

    a small subtree of the initial tree is selected, using the same data set. Whereas the

    splitting procedure proceeds in a top-down fashion, the second stage, known as prun-

    ing, proceeds from the bottom-up by successively removing nodes from the initial tree.

    Theorem 2.1 (Breiman et al., 1984, Section 3.3). For any value of the complexity parameter α, there is a unique smallest subtree of T_0 that minimizes the cost-complexity.

    Theorem 2.2 (Zhang and Singer, 1999, Section 4.2). If α_2 > α_1, the optimal subtree corresponding to α_2 is a subtree of the optimal subtree corresponding to α_1.

    More generally, suppose we end up with m thresholds,

        0 < \alpha_1 < \alpha_2 < \cdots < \alpha_m

    and let α_0 = 0. Also, let the corresponding optimal subtrees be {T_{α_0}, T_{α_1}, T_{α_2}, ..., T_{α_m}}; then

        T_{\alpha_0} \supset T_{\alpha_1} \supset T_{\alpha_2} \supset \cdots \supset T_{\alpha_m}    (2.9)

    where T_{α_0} ⊃ T_{α_1} means that T_{α_1} is a subtree of T_{α_0}. These are the so-called nested optimal subtrees.



    2.1.3 Taking Cost into Account

    We talk about classification trees in this section. In many applications, tree-based methods are used for the purpose of prediction. That is, given the characteristics of a subject, we must predict the outcome of this subject before we know the outcome. For example, physicians in emergency rooms must predict whether a patient with chest pain suffers from a serious disease based on the information available within a few hours of admission. For this purpose, we must first classify a node τ to either class 0 (normal) or class 1 (abnormal); then we predict the outcome of an individual based on the membership of the node to which the individual belongs. Unfortunately, we always

    make mistakes in such a classification procedure, because some of the normal subjects

    will be predicted as diseased and vice versa. These two mistakes are called false-

    positive (predicting a normal condition as abnormal) and false-negative (predicting an

    ill-conditioned outcome as normal), respectively. In any case, to weigh these mistakes,

    we need to assign misclassification costs.

    Let c(i, j) denote the misclassification cost incurred when a class j subject is classified as a class i subject. When i = j, we have a correct classification and the cost should

    naturally be zero, i.e., c(i, i) = 0. If the outcome is binary, i and  j  take the values 0 or

    1. Without loss of generality, we can set c(1, 0) = 1. In other words, one false positive

    error counts as one. The clinicians and the statisticians need to work together to gauge

    the relative cost of  c(0, 1). This is a subjective and difficult, but important, decision.

    In the Domestic Violence (DV) analysis, 671 households reported DV incidents during the study period, and about 21% of those households reported a new call within the 3-month follow-up period. In this instance, the two errors are: 1) false negative, failing to predict a new DV incident for a household where one actually


    occurred; and 2) false positive, predicting a new DV incident for a household where none occurred. Thus, a predictor that produced few false positives but many false negatives might

    be discarded if the undesirable consequences from the false negatives were larger than

    the undesirable consequences from the false positives. Therefore, we needed informa-

    tion from the Los Angeles Sheriff’s Department on the relative consequences of false

    positives and false negatives.

    Information from the Los Angeles Sheriff’s Department led to a general conclusion

    that false negatives were substantially more problematic than false positives. In other

    words, they considered not responding to a call when there actually was a need for

    law enforcement assistance more “costly” than responding to a call that turned out to

    be a false alarm. But the precise figures for these “costs” could not be determined.

    Fortunately, all we needed for statistical analysis was the ratio of false negative costs

    to false positive costs. We then proceeded with a reasonable ratio of the costs of false

    negatives to the costs of false positives of 5 to 1. Consistent with the information

    provided by the Sheriff’s Department, the failure to forecast a new call for service was

    5 times more costly than incorrectly forecasting a new call for service.

    We can now better understand the role of costs using the 21% return-call figure in the DV data. If for every household (671 households) we predicted another call

    within three months, we would be correct about 21% of the time. And, we would

    also be wrong about 79% of the time. Conversely, if for every household, we always

    predicted no calls within three months, we would be correct about 79% of the time.

    And we would also be wrong about 21% of the time. Which is a better strategy: always

    predicting a future call or not? The answer depends on the costs of false negatives

    compared to the costs of false positives.

    If both were equally costly, the best strategy would clearly be to never predict

    a subsequent call. But since the failure to predict future calls was very costly (false


    negatives were 5 times more costly than false positives), the best strategy would clearly

    be to predict a subsequent call. In short, the relative costs of false negatives compared

    to the relative costs of false positives can affect how forecasting is done (Berk, 2005).

    And, it also affects which predictors are likely to be important. Hence, in subsequent

    analysis, we take costs into account.
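    One generic way to fold such a cost ratio into a classification tree is through unequal class weights at fitting time. The Python sketch below is only an illustration under assumed synthetic data (it is not the cost mechanism used for the dissertation's CART analyses); it weights a false negative five times as heavily as a false positive, which typically shifts predictions toward the rarer “new call” class.

        # Sketch: a 5:1 false-negative to false-positive cost ratio expressed
        # as class weights changes the predictions of a classification tree.
        import numpy as np
        from sklearn.metrics import confusion_matrix
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)
        n = 671                                        # size borrowed from the DV example
        x = rng.normal(size=(n, 3))
        # synthetic binary outcome with roughly 21% in class 1 ("new call")
        y = (x[:, 0] + rng.normal(scale=1.5, size=n) > 1.45).astype(int)

        for weights in (None, {0: 1.0, 1: 5.0}):       # equal costs vs. 5:1 ratio
            tree = DecisionTreeClassifier(max_depth=3, class_weight=weights,
                                          random_state=0).fit(x, y)
            cm = confusion_matrix(y, tree.predict(x))
            print(f"class_weight={weights}:")
            print(cm)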

    2.2 Random Forest (RF)

    Significant improvement in classification accuracy can be obtained by growing an

    ensemble of trees and letting them vote for the most popular class (namely, majority

    vote). An early example is bagging (Breiman, 1996), where, to grow each tree, a random sample is selected from the training set. Bagging stands for “Bootstrap Aggregation” and

    may be best understood as nothing more than an algorithm.

    The bagging algorithm for a data set having n observations and a binary response variable can be summarized in the following steps (a code sketch of these steps appears after the list):

    1. Take a random sample of size n  with replacement from the data.

    2. Construct a classification tree as usual but do not prune.

    3. Assign a class to each terminal node as in CART. Drop the out-of-bag data down

    the tree, and store the class attached to each case.

    4. Repeat steps 1-3 a large number of times (say, 1000).

    5. For each observation in the data, count the number of times over trees that it is

    classified in one category and the number of times over trees it is classified in

    the other category.

    6. Assign each observation to a final category by a majority vote over the set of 


    trees. Thus, if  51%  of the time over a large number of trees a given observation

    is classified as a “1”, that becomes its final classification.

    7. Construct the confusion table from these class assignments.
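    The following Python sketch is one way to write the seven steps above (illustrative only, with assumed synthetic data; it is not the implementation used for the dissertation's empirical work).

        # Sketch of the bagging steps: bootstrap samples, unpruned trees,
        # out-of-bag class votes, and a final confusion table.
        import numpy as np
        from sklearn.metrics import confusion_matrix
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        n = 300
        X = rng.normal(size=(n, 5))
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

        n_trees = 1000
        votes = np.zeros((n, 2))                       # out-of-bag votes per case

        for _ in range(n_trees):
            boot = rng.integers(0, n, size=n)          # step 1: sample n with replacement
            oob = np.setdiff1d(np.arange(n), boot)     # cases left out of the sample
            tree = DecisionTreeClassifier().fit(X[boot], y[boot])   # steps 2-3: no pruning
            pred = tree.predict(X[oob])                # drop out-of-bag cases down the tree
            votes[oob, pred] += 1                      # step 5: accumulate the votes

        final = (votes[:, 1] > votes[:, 0]).astype(int)   # step 6: majority vote (ties -> 0)
        print(confusion_matrix(y, final))                 # step 7: confusion table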

    2.2.1 The Algorithm

    Random forests extend the ideas of bagging by allowing random selection of both observations and predictors at each splitting step. Here, a large number

    (say, 1000) of classification trees are constructed, each based on a bootstrap sample of 

    the data. In addition, at each split a random subset of predictors is selected. For each

    tree constructed, data not used to grow the tree are dropped down to evaluate how well

    the tree performs. Finally, overall results are produced by majority vote over the trees.

    For example, if there are fifty predictors, choose seven candidates at random (using roughly the square root of the number of predictors is the usual recommendation) for defining the split. Then choose the best split, as usual, by selecting only from the seven randomly chosen predictors. Repeat this process for each node. Therefore, the random forests algorithm is very much like the bagging algorithm. Again let n be the number of observations, and assume for now that the response variable is binary; a code sketch follows the list of steps below.

    1. Take a random sample of size n  with replacement from the data.

    2. Take a random sample of the predictors without replacement.

    3. Construct the first CART partition of the data using the selected predictors.

    4. Repeat step 2 for each subsequent split until the tree is as large as desired, and do not prune.

    5. Drop the out-of-bag data down the tree, and store the class assigned to each

    observation.


    6. Repeat steps 1-5 a large number of times (e.g., 1000).

    7. Using the observations not used to build the tree for evaluation, count the number

    of times over trees that a given observation is classified in one category and the

    number of times over trees it is classified in the other category.

    8. Assign each case to a category by a majority vote over the set of trees. Thus, if 

    51% of the time over a large number of trees a given case is classified as a “1”,

    that becomes its estimated classification.

    9. Construct the confusion table for these assigned classes.
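    In practice, these steps are usually carried out by an existing random forest implementation. The Python sketch below (scikit-learn on synthetic data, offered only as an illustration of the algorithm just described, not the randomForest setup used later in the dissertation) grows 1000 trees, tries a random sqrt(p) subset of predictors at each split, and builds the confusion table from the out-of-bag votes.

        # Sketch: random forest with 1000 bootstrap trees, sqrt(p) predictors
        # tried at each split, and out-of-bag based class assignments.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import confusion_matrix

        rng = np.random.default_rng(0)
        n = 500
        X = rng.normal(size=(n, 10))
        y = (X[:, 0] - X[:, 1] + rng.normal(size=n) > 0).astype(int)

        rf = RandomForestClassifier(n_estimators=1000,      # number of trees
                                    max_features="sqrt",    # random subset of predictors
                                    oob_score=True,         # evaluate on out-of-bag cases
                                    random_state=0).fit(X, y)

        oob_class = np.argmax(rf.oob_decision_function_, axis=1)  # majority vote, OOB only
        print("OOB accuracy:", rf.oob_score_)
        print(confusion_matrix(y, oob_class))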

    2.2.2 The Comparative Advantage of Random Forests

    It is apparent that random forests are more than bagging. By working with a random

    sample of predictors at each possible split, “the fitted values across trees are more

    independent” (Berk, 2005). As a result, the gains from averaging over a large number

    of trees can be larger. A related benefit is that it is possible to work with a very

    large number of predictors, and even more predictors than observations. It is well

    known that in conventional regression modeling, all of the data mining procedures considered so far have required that the number of predictors be less than the number of observations (usually much less). An obvious gain is that more information can be utilized in the fitting process, and more predictors can contribute.

    The use of multiple trees (often as many as 1000) makes the random forests fitting

    function much more complicated than the CART fitting function. However, the data

    not included in each bootstrap sample are used to evaluate the model performance,

    and “the averaging over trees directly compensates for the overfitting problem to which CART is vulnerable” (Berk, 2005). Therefore, the random forest results can be treated

    as true forecasts.


    Some other features of RF are:

    (i) It is an excellent classifier, comparable in accuracy to many other classifiers.

    (ii) It generates an internal unbiased estimate of the generalization error as the forest building progresses.

    (iii) It has an effective method for estimating missing data.

    (iv) It has a method for balancing error in data sets with unbalanced class populations.

    (v) Generated forests can be saved for future use on other data.

    (vi) It gives estimates of which variables are important in the classification.

    (vii) Output is generated that gives information about the relation between the variables and the classification.

    (viii) It computes proximities between pairs of cases that can be used in clustering, in locating outliers, or, by scaling, to give interesting views of the data.

    (ix) The capabilities of (viii) above can be extended to unlabeled data, leading to unsupervised clustering, data views, and outlier detection. The missing-value replacement algorithm can also be extended to unlabeled data.


    CHAPTER 3

    Standard Theory on Missing Data

    3.1 Mechanisms That Lead to Missing Data

    Missing data are a problem for all statistical analyses. Missing-data mechanisms

    are crucial since the properties of missing-data methods depend very strongly on the

    nature of these mechanisms. The crucial role of the mechanism in the analysis of data

    with missing values was largely ignored until the concept was formalized in the theory

    of Rubin (1976), through the simple device of treating the missing-data indicators as

    random variables and assigning them a distribution.

    Define the full data Y = (y_ij) and the missing-data indicator matrix M = (M_ij), with M_ij indicating whether the corresponding Y_ij is missing or not. The missing-data mechanism is characterized by the conditional distribution of M given Y, say f(M|Y, φ), where φ denotes unknown parameters. If missingness does not depend on the values of the data Y, missing or observed, that is, if

        f(M \mid Y, \phi) = f(M \mid \phi) \quad \text{for all } Y, \phi    (3.1)

    the data are called “missing completely at random” (MCAR).

    Let Y_obs and Y_mis denote the observed and missing components of Y, respectively. An assumption less restrictive than MCAR is that missingness depends only on the observed components of Y (Y_obs), and not on the components that are missing (Y_mis).


    That is,

        f(M \mid Y, \phi) = f(M \mid Y_{obs}, \phi) \quad \text{for all } Y_{mis}, \phi    (3.2)

    This missing-data mechanism is then called “missing at random” (MAR). The third mechanism is called “not missing at random” (NMAR) if the distribution of M depends on the missing values in the data matrix Y:

        f(M \mid Y, \phi) = f(M \mid Y_{obs}, Y_{mis}, \phi) \quad \text{for all } \phi    (3.3)

    Some literature also calls it nonignorable missing data.

    The simplest data structure is a univariate random sample for which some units are missing. Let Y = (y_1, ..., y_n)^T, where y_i denotes the value of a random variable for unit i, and let M = (M_1, ..., M_n), where M_i = 0 if unit i is observed and M_i = 1 if unit i is missing. Suppose the joint distribution of (y_i, M_i) is independent across units, so in particular the probability that a unit is observed does not depend on the values of Y or M for other units. Then (Little and Rubin, 2002),

        f(Y, M \mid \theta, \phi) = f(Y \mid \theta)\, f(M \mid Y, \phi) = \prod_{i=1}^{n} f(y_i \mid \theta) \prod_{i=1}^{n} f(M_i \mid y_i, \phi)    (3.4)

    where f(y_i|θ) denotes the density of y_i indexed by unknown parameters θ, and f(M_i|y_i, φ) is usually the density of a Bernoulli distribution for the binary indicator M_i with probability Pr(M_i = 1|y_i, φ) that y_i is missing.

    If missingness is independent of Y, that is, if Pr(M_i = 1|y_i, φ) = φ, a constant that does not depend on y_i, then the missing-data mechanism is MCAR (or, in this case, equivalently MAR). If the mechanism depends on y_i, the mechanism is NMAR, since it depends on values of y_i that are missing, assuming there are some. NMAR is the most general situation, and valid statistical inferences generally require specifying the correct model for the missing-data mechanism, a distributional assumption for the missing y_i, or both. The resulting estimators and tests are typically very sensitive to these assumptions.


    Let r denote the number of responding units (i.e., units with M_i = 0). An obvious consequence of the missing values in this example is that the sample size is reduced from n to r (Little and Rubin, 2002). We might want to carry out the same analyses on the reduced sample as we intended for the size-n sample. For example, if we assume the values are normally distributed and wish to make inferences about the mean, we might estimate the mean by the sample mean of the r responding units, with standard error s/\sqrt{r}, where s is the sample standard deviation of the responding units. This strategy is valid if the mechanism is MCAR or MAR, since then the observed cases are “a random subsample of all the cases” (Little and Rubin, 2002). However, if the data are NMAR, the analysis based on the responding subsample is generally biased for the parameter of the distribution of Y.
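    The bias just described is easy to see in a small simulation. The Python sketch below (synthetic data, purely illustrative and not part of the original text) deletes values either completely at random or with a probability that increases with y itself (NMAR) and compares the complete-case mean of the responding units with the full-sample mean.

        # Sketch: the complete-case mean is roughly unbiased under MCAR but
        # biased under NMAR, where larger y values are more likely to be missing.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000
        y = rng.normal(loc=10.0, scale=2.0, size=n)    # full data, true mean = 10

        # MCAR: every unit missing with the same probability phi = 0.3
        m_mcar = rng.uniform(size=n) < 0.3

        # NMAR: probability of missingness increases with y itself
        p_nmar = 1.0 / (1.0 + np.exp(-(y - 10.0)))     # logistic in y
        m_nmar = rng.uniform(size=n) < p_nmar

        print("full-sample mean    :", y.mean())
        print("complete-case, MCAR :", y[~m_mcar].mean())   # close to 10
        print("complete-case, NMAR :", y[~m_nmar].mean())   # pulled below 10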

    3.2 Treatment of Missing Data

    There are three approaches in dealing with missing data:

    1. Impute the missing data: that is, filling in the missing values.

    2. Model the probability of missingness: this is a good option if imputation is in-

    feasible; in certain cases it can account for much of the bias that would otherwise

    occur.

    3. Ignore the missing data: a poor choice, but by far the most common one.

    This section gives a brief description of alternative approaches to handling the

    problem of missing data.


    3.2.1 Listwise Deletion

    By far the most common approach is to simply omit those cases with missing data

    and to run analyses on what remains.

    If data are missing for the response variable, the only reasonable strategy is “list-

    wise deletion”. That is, observations with missing response are dropped totally from

    the analysis. If the data are missing completely at random, the only loss is statistical

    power. If not, however, bias of unknown size and direction can be introduced.

    When the data are missing for one or more predictors, we have more options. List-

    wise deletion remains a possible choice, especially if there is not a lot of missing data

    (e.g., less than 5% of the total number of observations). Listwise deletion is also easy

    to implement and understand. However, this method ignores the possible systematic

    difference between the complete cases and incomplete cases, and the resulting infer-

    ence may not be applicable to the population of all cases, especially with a smaller

    number of complete cases.
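    In code, listwise deletion is usually a one-line operation. The short pandas sketch below uses a made-up three-column data frame (the column names are hypothetical) to keep only the complete cases and report how many observations are lost.

        # Sketch: listwise (complete-case) deletion with pandas.
        import numpy as np
        import pandas as pd

        df = pd.DataFrame({
            "age":     [34,   51,     np.nan, 45,     29],
            "income":  [40.0, np.nan, 55.0,   62.0,   np.nan],
            "outcome": [1,    0,      1,      np.nan, 0],
        })

        complete = df.dropna()          # drop every row with at least one missing value
        print(f"kept {len(complete)} of {len(df)} cases")
        print(complete)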

    3.2.2 Single Imputation

    Single imputation refers to filling in a missing value with a single replacement

    value. Imputations are means or draws from a predictive distribution of the missing

    values, and require a method of creating a predictive distribution for the imputation

    based on the observed data. There are two generic approaches to generating this dis-

    tribution (For details see Little and Rubin, 2002, Pages 59-60):

    Explicit Modeling: the predictive distribution is based on a formal statistical model

    (e.g. multivariate normal), and hence the assumptions are explicit.

     Implicit Modeling: the focus is on an algorithm, which implies an underlying

    model; assumptions are implicit, but they still need to be carefully assessed to ensure


    that they are reasonable.

    Explicit modeling methods include:

    (a)  Mean imputation, where means from the responding units in the sample are used

    to fill in missing values. Sometimes, the means may be formed by weighting

    within cells or classes.

    (b)   Regression imputation   replaces missing values by predicted values from a re-

    gression of the missing item on items observed for the unit. Mean imputation

    can actually be regarded as a special case of regression imputation. The proper

    regression model depends on the type of the to-be-imputed variable. A probit

    or logit is used for binary variables, Poisson or other count models for integer-

    valued variables, and OLS or related models for continuous variables. For exam-

    ple, suppose for subject properties, there are some missing data for gross living

    areas (GLA). But gross living areas are strongly related to number of bedrooms,

    number of bathrooms, number of total rooms, and lot size. For the observations

    with no missing data, GLA is regressed on number of bedrooms, number of 

    bathrooms, number of total rooms, and lot size. Then, for the observations that

    have missing GLA data, the values for the four predictors are inserted into the

    estimated regression equation. Predicted values are computed, which are used

    to fill in the holes in the GLA data.

    (c)  Stochastic regression imputation goes one step further, replacing missing values

    by a value predicted by regression imputation plus a residual, which is drawn

    to reflect the uncertainty in the predicted value. For example, the residual for

    Gaussian regression is naturally normal with mean zero and variance equal to

    the residual variance in the regression. With a binary outcome, as in logistic

    regression, the predicted value is a probability of 1 versus 0, and the imputed


    value is then a 1 or 0 drawn with that probability (a sketch of regression and stochastic regression imputation appears after this list).
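    A minimal Python sketch of (b) and (c), built around the GLA example in the text (the column names and the synthetic data are hypothetical, and this is not the procedure used in the dissertation): GLA is regressed on the four observed predictors among the complete cases, the fitted values fill the holes (regression imputation), and adding a normal residual draw gives stochastic regression imputation.

        # Sketch: regression imputation and stochastic regression imputation for
        # a continuous variable (gross living area, "gla") with four predictors.
        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        n = 400
        df = pd.DataFrame({
            "bedrooms":  rng.integers(1, 6, size=n),
            "bathrooms": rng.integers(1, 4, size=n),
            "rooms":     rng.integers(3, 12, size=n),
            "lot_size":  rng.uniform(2000, 12000, size=n),
        })
        df["gla"] = (300 * df["bedrooms"] + 150 * df["bathrooms"]
                     + 80 * df["rooms"] + 0.02 * df["lot_size"]
                     + rng.normal(scale=120, size=n))
        df.loc[rng.uniform(size=n) < 0.2, "gla"] = np.nan      # 20% missing GLA

        predictors = ["bedrooms", "bathrooms", "rooms", "lot_size"]
        obs = df["gla"].notna()
        reg = LinearRegression().fit(df.loc[obs, predictors], df.loc[obs, "gla"])

        fitted = reg.predict(df.loc[~obs, predictors])          # predicted GLA
        sigma = np.std(df.loc[obs, "gla"] - reg.predict(df.loc[obs, predictors]))

        df["gla_regression"] = df["gla"].copy()
        df.loc[~obs, "gla_regression"] = fitted                 # (b) regression imputation
        df["gla_stochastic"] = df["gla"].copy()
        df.loc[~obs, "gla_stochastic"] = fitted + rng.normal(scale=sigma,
                                                             size=len(fitted))
        print(df[["gla", "gla_regression", "gla_stochastic"]].describe())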

    Implicit modeling methods include:

    (d)  Hot deck imputation involves substituting individual values imputed from “sim-

    ilar” responding units. Hot deck imputation is common in survey practice and

    can “involve very elaborate schemes for selecting units that are similar for im-

    putation” (Little and Rubin, 2002). To perform hot deck imputation, all obser-

    vations are divided into groups with similar characteristics, for example, “prop-

    erties priced 400K-800K”. To impute a missing value, the researcher randomly draws a value for that variable from the pool of properties having similar characteristics (a hot deck sketch appears after this list). Creating a large number of subgroups yields some improvement in

    accuracy, but it can also lead to very small sample sizes within some subgroups.

    The primary difficulty of this method is the selection of proper subgroups.

    (e)   Substitution replaces nonresponding units with alternative units not selected into

    the sample. For example, in order to estimate a property value using sales com-

    parison method, we need to find similar sales within 0.5 mile of the subject prop-

    erty. If a similar sale cannot be found, then a similar sale beyond 0.5 mile may

    be substituted. The tendency to treat the resulting sample as complete should

    be taken with caution, since the substituted property may differ systematically

    from properties within 0.5 mile. Hence at the analysis stage, substituted proper-

    ties should be regarded as imputed values of a particular type.

    (f)  Cold deck imputation   replaces a missing value of an item by a constant value

    from an external source, such as a value from a previous realization. In the property valuation example, we sometimes use the historical sales price adjusted to the effective date (usually the evaluation date).


(g)   Composite methods  combine ideas from different methods. For example, hot

    deck and regression imputation can be combined by calculating predicted means

    from a regression but then adding a residual randomly chosen from the empirical

    residuals to the predicted value when forming values for imputation. See, for

example, Schieber (1978) and David et al. (1986).
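To make methods (b), (c), and (d) concrete, here is a minimal R sketch that fills in missing gross living area for the property example above. The data frame props and its columns (gla, bedrooms, baths, rooms, lotsize, price) are hypothetical names introduced only for illustration.

    # Hypothetical property data: props has columns gla, bedrooms, baths,
    # rooms, lotsize, price; some gla values are NA.
    obs <- !is.na(props$gla)                      # rows with GLA observed
    fit <- lm(gla ~ bedrooms + baths + rooms + lotsize, data = props[obs, ])

    # (b) Regression imputation: fill the holes with predicted means.
    pred <- predict(fit, newdata = props[!obs, ])
    props$gla_reg <- props$gla
    props$gla_reg[!obs] <- pred

    # (c) Stochastic regression imputation: add a normal residual to each
    #     prediction to reflect the uncertainty in the imputed value.
    sigma.hat <- summary(fit)$sigma
    props$gla_sto <- props$gla
    props$gla_sto[!obs] <- pred + rnorm(sum(!obs), mean = 0, sd = sigma.hat)

    # (d) Hot deck imputation: draw a donor GLA at random from properties
    #     in the same price band.
    band <- cut(props$price, breaks = c(0, 4e5, 8e5, Inf))
    props$gla_hot <- props$gla
    for (i in which(!obs)) {
      donors <- props$gla[obs & band == band[i]]
      if (length(donors) > 0)
        props$gla_hot[i] <- donors[sample.int(length(donors), 1)]
    }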

    An important limitation of the single imputation methods described so far is that

    standard variance formulas applied to the filled-in data systematically underestimate

    the variance of estimates, even if the model used to generate imputations is correct.

Even if reasonably unbiased estimates can be constructed, single imputation methods ignore the reduced variability of the predicted values and treat the imputed values as fixed. One response is to impute several values for each missing observation, drawn at random, say, from the conditional distribution implied by the regression equation. It is

    then possible to get a better handle on the uncertainty associated with the imputed val-

ues. Multiple imputation is one such example, but its obvious disadvantage prevents it from being used in nonparametric situations. Another example is nonparametric bootstrap imputation, which will be treated shortly.

    3.2.3 Multiple Imputations through Data Augmentation

MI refers to the procedure of imputing each missing value D (D ≥ 2) times. When the D sets of imputations are repeated random draws from the predictive distribution of

    the missing values under a particular model, the  D  complete-data inferences can be

    combined to form one inference that “properly reflects uncertainty due to nonresponse

    under the model” (Little and Rubin, 2002).

    As already indicated in Section 3.2.2, the obvious disadvantage of single imputa-

    tion is that imputing a single value treats that value as known, and thus without special

    adjustments, single imputation cannot reflect the sampling variability under the impu-


tation model for nonresponse. MI shares the advantages of single imputation and recti-

    fies its disadvantages. Specifically, when the D   imputations are repetitions under one

    model for missingness, the resulting D complete-data analyses can be easily combined

    to create an inference that validly reflects sampling variability because of the missing

    values.

    We now turn to the problem of creating the multiple imputations. Standard theory

    suggests that we draw the missing values as

Y_{mis}^{(d)} \sim p(Y_{mis} \mid Y_{obs}), \qquad d = 1, \dots, D. \qquad (3.5)

    that is, from their joint posterior predictive distribution. Unfortunately, it is often dif-

ficult to draw from this predictive distribution in complicated problems, because of the implicit requirement in Equation (3.5) to integrate over the unknown parameter θ. Data

    augmentation accomplishes this by iteratively drawing a sequence of values of the

    parameters and missing data until convergence.

    Data augmentation (Tanner and Wong, 1987) is an iterative two-step method of 

    imputing missing values by simulating the posterior distribution of  θ   that combines

    features of the EM algorithm and multiple imputations. These two steps are: the

imputation (or I) step and the posterior (or P) step. Start with an initial draw θ^(0) from an approximation to the posterior distribution of θ. Given a value θ^(t) of θ, draw at iteration t:

•  I step: draw Y_mis^(t+1) with density p(Y_mis | Y_obs, θ^(t));

•  P step: draw θ^(t+1) with density p(θ | Y_obs, Y_mis^(t+1)).

This procedure is motivated by the fact that the distributions in these two steps are often much easier to draw from than either of the posterior distributions p(Y_mis | Y_obs) and p(θ | Y_obs), or the joint posterior distribution p(θ, Y_mis | Y_obs). The iterative procedure


    can be shown to eventually yield a draw from the joint posterior distribution of  Y mis,

    θ  given Y obs, in the sense that as t  tends to infinity, this sequence converges to a draw

    from the joint distribution of  (θ, Y mis) given Y obs.
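As a toy illustration of the I and P steps, the following R sketch runs data augmentation for a single normally distributed variable with some values missing, under the standard noninformative prior p(µ, σ²) ∝ 1/σ². The data y and the number of iterations are assumptions made purely for the example; real applications (e.g., NORM) handle multivariate data.

    set.seed(1)
    y   <- c(rnorm(80, mean = 10, sd = 2), rep(NA, 20))   # toy data, 20% missing
    mis <- is.na(y)

    # starting values from the observed data
    mu <- mean(y[!mis]);  sigma2 <- var(y[!mis])

    for (t in 1:500) {
      # I step: draw Y_mis^(t+1) from p(Y_mis | Y_obs, theta^(t))
      y[mis] <- rnorm(sum(mis), mean = mu, sd = sqrt(sigma2))

      # P step: draw theta^(t+1) from the complete-data posterior
      # p(theta | Y_obs, Y_mis^(t+1)) under the prior 1/sigma^2
      n      <- length(y)
      sigma2 <- (n - 1) * var(y) / rchisq(1, df = n - 1)
      mu     <- rnorm(1, mean = mean(y), sd = sqrt(sigma2 / n))
    }
    # After a burn-in, (mu, sigma2, y[mis]) is approximately a draw from the
    # joint posterior of (theta, Y_mis) given Y_obs.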

    3.2.4 Assessment of Multiple Imputations

Although multiple imputation has desirable features (for instance, it allows one to get good estimates of standard errors), certain requirements must be met for MI to

have these desirable properties. First, the data must be missing at random (MAR), meaning that the probability of missingness on the data Y depends only on what is observed, and not on the components that are missing (see Equation 3.2). Second, the model used to generate the imputed values must be “correct” in some sense. Third, the model used for the analysis must match up, in some sense, with the model used in

    the imputation. All these conditions have been rigorously described by Rubin (1987,

    1996).

The problem is that it is easy to violate these conditions in practice. First, when

    the data are missing for reasons beyond the control of the investigators, one can never

    be sure whether MAR holds. In fact, to speak of a single “missingness mechanism”

is often misleading, because in most studies missing values occur for a variety of

    reasons; some of these may be entirely unrelated to the data in question, but others

    may be closely related.

    Unfortunately, it is not possible to relax the MAR assumption in any meaningful

way without replacing it with other, equally untestable, assumptions. At present, there are no principled nonignorable missing-data methods readily available to most data analysts. Thus, MI methods based on the MAR assumption should be used with an awareness of their limitations.

Second, in order to generate imputations for the missing values, a probability


model on the full data (observed and missing values) must be imposed. Each of the software packages available in R applies to a different class of multivariate models.

     NORM   uses the multivariate normal distribution.   CAT  is based on loglinear models,

    which have been traditionally used by social scientists to describe associations among

    variables in cross-classified data. The MIX  library relies on the general location model,

    which combines a loglinear model for the categorical variables with a multivariate nor-

    mal regression for the continuous ones. Details of these models are given by Schafer

    (1997).

In reality, data rarely conform to convenient models such as the multivariate normal. In most applications of MI, the model used to generate the imputations will at best be only approximately true. An imputation model should therefore be chosen to be (at

    least approximately) compatible with the real analyses to be performed on the imputed

    datasets. In particular, the imputation model should be “rich enough to preserve the

    associations or relationships among variables that will be the focus of later investiga-

    tion” (Schafer and Olsen, 1998). The precision you lose when you include unimportant

    predictors is usually a relatively small price to pay for the general validity of analyses

    of the resultant multiply imputed data set (Rubin, 1996). Therefore, a rich imputation

    model that preserves a large number of associations is desirable because it may be used

    for a variety of post-imputation analyses.

    Existing software packages, however, sometimes fail for imputation models with

    a large number of variables, especially when there are a large number of categorical

variables, since then the problems of the “curse of dimensionality” and “sparse cells” can easily occur. There is also the possibility of a misspecified imputation model, which typically leads to “overestimated variability, and thus, overcoverage of interval

    estimates” (Little and Rubin, 2002).

    Third, the Bayesian nature of MI requires investigators to specify a prior distri-


    bution for the parameter (θ) of the imputation model. In the Bayesian paradigm, this

    prior distribution quantifies one’s belief or state of knowledge about model parameters

    before any data are seen. Because different prior distributions can lead to different

    results, Bayesian models have been regarded by some statisticians as subjective and

    unscientific. “We tend to view the prior distribution as a mathematical convenience

    that allows us to generate the imputations in a principled fashion” (Schafer and Olsen,

    1998).

The nonparametric bootstrap method avoids all these problems implicit in MI, and thus provides a good alternative for imputing missing values in a broader range of situations. The details of the bootstrap method, together with the imputation algorithm, will be discussed in Chapter 5.


    CHAPTER 4

    Missing Data with CART/RF

    4.1 Missing Data with CART

Missing data are a problem for virtually all statistical procedures, and CART and RF are certainly no exceptions. The imputation methods we have discussed so far can be used for tree-based models directly or with some adjustments. For instance, if the percentage of

    missing data is less than 5% of the total number of observations, listwise deletion

    remains a possible choice.

    A second option is to impute the data outside CART. A simple example would be to

    employ conventional regression in which a predictor with the missing data is regressed

    on other predictors with which it is likely to be related. The resulting regression equa-

    tion can then be used to impute what the missing values might be.

    A third option is to address the missing data problems for predictors within CART

    itself. There are a number of ways this might be done. Here, we consider one of the

    better approaches, and the one readily available in the CART software.

    The first place where missing data come up is when a split is chosen. Recall that

    at each step we choose the split that gives the maximal reduction in impurity:

\Delta I = I(\tau) - p(\tau_L)\,I(\tau_L) - p(\tau_R)\,I(\tau_R) \qquad (4.1)

    where I (τ ) is the value of the parent impurity,  p(τ R) is the probability of a case falling

    in the right daughter node, p(τ L) is the probability of a case falling in the left daughter


    node, I (τ R) is the impurity of the right daughter node and  I (τ L) is the impurity of the

    left daughter node. CART tries to find the predictor and the split rule with which  ∆I 

    is as large as possible.

    Consider the first term on the right hand side (I (τ )). We can easily calculate its

value without any predictors and thus do not have to worry about missing values. How-

    ever, to construct the two daughter nodes, predictors are required. Each predictor is

    evaluated as usual, but using only the predictor values that are not missing. That is,

I(τ_R) and I(τ_L) are computed at the optimal split for each predictor using only the data available. The associated probabilities p(τ_R) and p(τ_L) are estimated for each predictor based on the split actually present.

    We are not done yet. Now, observations have to be assigned to one of the two

    daughter nodes. How can this be done if the predictor values needed are missing?

    CART imputes those missing values using “surrogate variables”.

Suppose there are 10 predictors x1–x10 to be included in the CART analysis, and suppose there are missing values for x1 only, which happens to be the “best” predictor

    chosen to define the “optimal” split. The split necessarily defines two categories for

    x1.

    “The predictor x1  now actually becomes a binary response variable with the two

    classes determined by the split” (Berk, 2005). CART is applied with  x1 as the response

variable and x2–x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the propor-

    tion of cases in x1 that are misclassified. Predictors that do no better than the marginal

    distribution of  x1 are dropped from further consideration.

    The variable with the lowest classification error for  x1 is then used in place of  x1 to

    assign cases with missing values on x1 to one of the two daughter nodes. That is, “the

    predicted classes for  x1  are used when the actual classes for  x1  are missing” (Berk,


2005). If x1, the “best” predictor, has missing data, the “best” surrogate variable, say x2, is used instead. If x2 is also missing, the second-“best” surrogate variable, say x3, is used instead. And so on. If each of the variables x2–x10 has missing data, the majority direction of the x1 split is used. For example, if the split is defined so that x1 ≤ c sends observations to the left and x1 > c sends cases to the right, cases with data missing on x1 that have no surrogate to use are placed along with the majority of cases. To be more specific, there are three options in the real implementation (the rpart library in R), illustrated by the short sketch after the list:

    1. 0 = display only; an observation with a missing value for the primary split rule

    is not sent further down the tree.

    2. 1 = use surrogates, in order, to split subjects missing the primary variable; if all

    surrogates are missing the observation is not split.

    3. 2 = if all surrogates are missing, then send the observations in the majority di-

    rection.
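These options correspond to the usesurrogate argument of rpart.control(). A minimal call sketch follows, in which the data frame dat, its binary response y, and its predictors are placeholders rather than the data sets analyzed later.

    library(rpart)

    # usesurrogate = 2 (the rpart default) uses surrogates in order and sends
    # cases missing all surrogates in the majority direction, i.e., option 3.
    fit <- rpart(y ~ ., data = dat, method = "class",
                 control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

    # summary(fit) lists, for each split, the surrogate variables and how well
    # they agree with the primary split.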

This would seem to be a reasonable response to missing data. There may be alternatives that perform better. But the greatest risk is that “if there are lots

    of missing data and the surrogate variables are used, the correspondence between the

    results and the data, had it been complete, can become very tenuous” (Berk, 2005).

    In practice, the data will rarely be missing completely at random (MCAR) or even

    missing at random (MAR). Then, if too much of the data are manufactured, rather than

    collected, a new kind of generalization error will be introduced. The problem is that

    imputation can fail just when you need it the most.

    Furthermore, a number of statistical difficulties can follow when the response vari-

    able is highly skewed. “The danger with missing data is that the skewing can be made

    worse” (Berk, 2005). Perhaps we should avoid using  surrogate variables, and impute


the missing data using alternative imputation methods, such as the nonparametric bootstrap method.

    4.2 Missing Data with RF

There are two ways in which random forests can impute missing data. Of the two implementations in the “randomForest” library, the “na.roughfix” option is quick and easy to implement (a brief usage sketch follows the list below). To be specific:

    1. For numerical variables, NAs are replaced with column medians.

    2. For factor variables, NAs are replaced with the most frequent levels (breaking

    ties at random).

    3. If a data matrix contains no NAs, it is returned unaltered.
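A minimal usage sketch; the data frame dat, whose first column is the response y, is a placeholder introduced only for illustration.

    library(randomForest)

    x.fixed <- na.roughfix(dat[, -1])   # medians/modes replace NAs in the predictors
    rf <- randomForest(x = x.fixed, y = dat$y, ntree = 500)

    # Equivalently, na.roughfix can be supplied as the na.action:
    rf2 <- randomForest(y ~ ., data = dat, na.action = na.roughfix, ntree = 500)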

    A more advanced algorithm capitalizes on the proximity matrix (“rfImpute()” in

    the randomForest library). We now formally introduce a proximity matrix.

A proximity matrix is an n × n symmetric matrix, which gives an intrinsic measure of the similarity between cases; here n is the number of cases in the data set. All cases in the training set are dropped down each tree. If case i and case j both land in

    the same terminal node, increase the proximity between i  and  j  (element (i, j) of the

    matrix) by one. At the end of the run, the proximities are divided by the number of 

    trees in the run and the proximity between a case and itself is set equal to one. This

    is an intrinsic proximity measure, inherent in the data and the RF algorithm. Thus

    each cell in the proximity matrix shows the proportion of trees over which each pair of 

    observations fall in the same terminal node. The higher the proportion, the more alike

those observations are, and the more “proximate” they are.


The proximity between cases i and j is the element prox(i, j) of this matrix. From the definition, it follows that “the values 1 − prox(i, j) are squared distances in a Euclidean space of high dimension” (Breiman, 2003).

The function “rfImpute()” starts by imputing NAs using “na.roughfix”; then ran-

    domForest() is called with the completed data. The proximity matrix from the random

    forests is used to update the imputations of the NAs. For continuous predictors, the

    imputed value is the weighted average of the non-missing observations, where the

    weights are the proximities. So, cases that are more like the cases with the missing

    data are given greater weight. For categorical predictors, the imputed value is the cat-

    egory with the largest average proximity. Again, cases more like the case with the

    missing data are given greater weight.
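A minimal sketch of this second option (dat and y are again placeholders; the iter and ntree values shown are the function's defaults):

    library(randomForest)

    set.seed(2)
    # rfImpute() starts from the rough fix, then refines the NAs with the
    # proximity matrix over several iterations of forest growing.
    dat.imp <- rfImpute(y ~ ., data = dat, iter = 5, ntree = 300)

    # The returned data frame (response first, imputed predictors after) can
    # be passed directly to randomForest().
    rf <- randomForest(y ~ ., data = dat.imp, ntree = 500)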

This process is relatively slow, requiring up to six iterations of forest growing. Moreover, the use of imputed values “tends to make the OOB measures of fit too optimistic”

    (Breiman, 2003). The computational demands are also quite daunting and may be

    impractical for many data sets until more efficient ways to handle the proximities are

    found.


    CHAPTER 5

    Nonparametric Bootstrap Methods to Impute Missing

    Data

In this chapter, we formally introduce one type of resampling method to impute missing data: the nonparametric bootstrap. A primary advantage of the nonparametric

    bootstrap method is that it does not depend on the missing-data mechanism, which

    rectifies disadvantages of all other imputation methods. It also requires no knowledge

    of either the probability distribution or model structure, and successfully incorporates

    the estimates of uncertainty associated with the imputed data.

    5.1 The Simple Bootstrap for Complete Data

    Let  θ̂  be a consistent estimate of a parameter  θ   based on a random sample  Y   =

(y_1, y_2, · · · , y_n)^T. Let Y^(b) be a sample of size n obtained from the original sample Y by simple random sampling with replacement, and θ̂^(b) be the estimate of θ obtained by

    applying the standard estimation method to Y  (b), where b  indexes the drawn samples,

and b = 1, 2, · · · , B. Then the sequence (θ̂^(1), . . . , θ̂^(B)) represents the set of estimates obtained by repeating this procedure B times. The bootstrap estimate of θ is defined

    as the average of the B  bootstrap estimates:

\hat{\theta}_{boot} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{(b)} \qquad (5.1)


    Large-sample inferences can be derived from the bootstrap distribution of  θ̂(b), which

    are based on the histogram formed by the bootstrap estimates (θ̂(1), . . . , θ̂(B)). In par-

    ticular, the bootstrap estimate of the variance of  θ̂ or  θ̂boot is

\hat{V}_{boot} = \frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat{\theta}^{(b)} - \hat{\theta}_{boot}\bigr)^2 \qquad (5.2)

    It can be shown that “under certain conditions, (a) the bootstrap estimator  θ̂boot  is less

    biased than the original estimator  θ̂, and under quite general conditions (b)  V̂ boot   is

    a consistent estimate of the variance of  θ̂  or  θ̂boot  as n  and  B  tend to infinity” (Efron,

    1987). From property (b), we can see that if the bootstrap distribution is approximately

normal, a 100(1 − α)% bootstrap confidence interval for a scalar θ can be computed as

CI_{norm}(\theta) = \hat{\theta} \pm z_{1-\alpha/2}\sqrt{\hat{V}_{boot}} \qquad (5.3)

where z_{1−α/2} is the 100(1 − α/2) percentile of the normal distribution. Alternatively, if the bootstrap distribution is non-normal, a 100(1 − α)% bootstrap confidence interval can be computed empirically as

CI_{emp}(\theta) = \bigl(\hat{\theta}^{(b,l)}, \hat{\theta}^{(b,u)}\bigr) \qquad (5.4)

where θ̂^(b,l) and θ̂^(b,u) are the (α/2) and (1 − α/2) percentiles of the empirical bootstrap distribution of θ. Stable intervals based on Eq. (5.3) require bootstrap samples on the order of B = 200. Intervals based on Eq. (5.4) require much larger samples, for example B = 2000 or more (Efron, 1994).
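The following R sketch illustrates Equations (5.1)–(5.4) for the simplest of cases, the mean of a toy sample; the data and the value of B are assumptions made only for the example.

    set.seed(3)
    y <- rexp(100, rate = 0.5)        # toy complete data
    B <- 2000

    theta.b <- replicate(B, mean(sample(y, replace = TRUE)))   # theta-hat^(b)

    theta.boot <- mean(theta.b)                                # Eq. (5.1)
    V.boot     <- sum((theta.b - theta.boot)^2) / (B - 1)      # Eq. (5.2)

    theta.hat <- mean(y)
    ci.norm <- theta.hat + c(-1, 1) * qnorm(0.975) * sqrt(V.boot)   # Eq. (5.3)
    ci.emp  <- quantile(theta.b, c(0.025, 0.975))                   # Eq. (5.4)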

    5.2 The Simple Bootstrap Applied to Imputed Incomplete Data

    Suppose there is a simple random sample  Y   = (y1, y2, · · · , yn)T , but some obser-vations yi   are missing. A consistent estimate  θ̂  of an unknown parameter  θ   is com-

puted by first filling in the missing values in Y using some imputation method Imp,


yielding imputed data Ŷ = Imp(Y), and then estimating θ from the imputed data Ŷ.

    Bootstrap estimates (θ̂(1), . . . , θ̂(B)) can be computed as follows:

    For b = 1, . . . , B:

    1. Generate a bootstrap sample Y  (b) with replacement from the original incomplete

    sample Y  .

    2. Fill in the missing data in Y  (b) by applying the imputation procedure  Imp  to the

bootstrap sample Y^(b), so that Ŷ^(b) = Imp(Y^(b)).

3. Compute θ̂^(b) from the imputed complete data Ŷ^(b).

Then Equation (5.2) provides a consistent estimate of the variance of θ̂, and Equations

    (5.3) or (5.4) can be used to generate confidence intervals for an unknown scalar pa-

    rameter.
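A small sketch of these three steps for a scalar parameter (the mean of a partially missing variable y1, imputed by regressing it on a fully observed y2); the toy data frame d and the regression-imputation function are assumptions for illustration.

    set.seed(4)
    n <- 200
    d <- data.frame(y2 = rnorm(n))
    d$y1 <- 1 + 2 * d$y2 + rnorm(n)
    d$y1[sample(n, 40)] <- NA            # make 20% of y1 missing

    impute <- function(dd) {             # Imp(): regression imputation of y1 on y2
      fit <- lm(y1 ~ y2, data = dd)
      dd$y1[is.na(dd$y1)] <- predict(fit, newdata = dd[is.na(dd$y1), ])
      dd
    }

    B <- 2000
    theta.b <- replicate(B, {
      db <- d[sample(n, replace = TRUE), ]   # step 1: bootstrap the incomplete data
      mean(impute(db)$y1)                    # steps 2 and 3: impute, then estimate
    })

    V.boot <- var(theta.b)                          # Eq. (5.2)
    ci.emp <- quantile(theta.b, c(0.025, 0.975))    # Eq. (5.4)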

    A key feature of this procedure is that the imputation procedure is applied B times,

    once to each bootstrap sample. Hence the approach is computationally intensive. A

    simpler procedure would be to apply the imputation procedure  Imp  just once to yield

    one imputed data set  Ŷ  , and then bootstrap the estimation method applied to the filled-

    in data. However, this approach clearly “does not propagate the uncertainty in the

    imputations and hence does not provide valid inferences” (Little and Rubin, 2002). A

    second key feature is that the imputation method must yield a consistent estimate  θ̂

    for the true parameter. This is not required for Equation (5.2) to yield a valid estimate

    of sampling error, but it is required for Equations (5.3) and (5.4) to yield appropriate

    confidence coverage, and for tests to have the nominal size – see in particular Rubin’s

    (1994) discussion of Efron (1994).

    This approach should be applied with caution since it assumes large samples. With

    moderate-sized data sets, it is possible that an imputation procedure that works for the

    full sample may need to be modified for one or more bootstrap samples.


    A principal advantage of the nonparametric bootstrap method is that it does not

    depend on the missing-data mechanism. Its main practical disadvantage is the compu-

    tational expense of the 2000 or so bootstrap replications required for reasonable nu-

merical accuracy if the bootstrap distribution is non-normal (Efron, 1994). Fortunately, this is no longer a major concern given the computing power available today.

    5.3 The Imputation Algorithm for Tree-Based Models

    The nonparametric bootstrap method to impute missing data for tree-based models

can be structured as follows; a minimal R sketch of Algorithm 1 appears after its steps.

    Algorithm 1:

1. Draw B (say, 2000) bootstrap samples.

2. For each bootstrap sample b = 1, 2, · · · , B, impute missing values using the following steps:

•   Replace missing values with the median (if the predictor is quantitative) or the mode (if the predictor is qualitative), a.k.a. a “rough fix”;

•   Each categorical predictor is regressed on the other predictors with which it is likely to be related, using logistic regression.

•   Each continuous predictor is regressed on the other predictors with which it is likely to be related, using Gaussian regression.

•   Each count (integer-valued) predictor is regressed on the other predictors with which it is likely to be related, using Poisson regression.

•  For observations that have missing data, predict each missing field using the corresponding regression equation. The missing values are then filled in

    using the predicted values.


•   Apply CART/RF to the imputed bootstrap sample and obtain its confusion table.

    •  Extract and store false positive and false negative errors from confusion

    tables.

    3. Repeat Step #2 a large number of times (e.g., B  = 2000).

    4. Study the empirical distributions of false positive and false negative errors from

     B runs.

    5. Construct confidence intervals.
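A minimal R sketch of Algorithm 1 follows. Purely for illustration, it assumes a data frame dat whose response y is a binary factor with levels “0” and “1”, with one incomplete continuous predictor x1 and one incomplete binary factor predictor x2; a real implementation would loop the regression-imputation step over every incomplete predictor according to its type.

    library(rpart)
    library(randomForest)   # only for na.roughfix()

    impute.once <- function(d) {
      d.fix <- na.roughfix(d)                       # rough fix: medians / modes
      if (any(is.na(d$x1))) {                       # Gaussian regression for x1
        f1 <- lm(x1 ~ . - y, data = d.fix[!is.na(d$x1), ])
        d.fix$x1[is.na(d$x1)] <- predict(f1, newdata = d.fix[is.na(d$x1), ])
      }
      if (any(is.na(d$x2))) {                       # logistic regression for x2;
        f2 <- glm(x2 ~ . - y, data = d.fix[!is.na(d$x2), ],   # counts would use
                  family = binomial)                           # family = poisson
        p2 <- predict(f2, newdata = d.fix[is.na(d$x2), ], type = "response")
        d.fix$x2[is.na(d$x2)] <- levels(d$x2)[1 + (p2 > 0.5)]
      }
      d.fix
    }

    B  <- 2000
    n  <- nrow(dat)
    fp <- fn <- numeric(B)

    for (b in 1:B) {
      db  <- dat[sample(n, replace = TRUE), ]             # bootstrap sample
      dbi <- impute.once(db)                              # fill in its NAs
      fit <- rpart(y ~ ., data = dbi, method = "class")   # or randomForest()
      tab <- table(observed  = dbi$y,
                   predicted = predict(fit, type = "class"))
      fp[b] <- tab["0", "1"] / sum(tab["0", ])            # false positive rate
      fn[b] <- tab["1", "0"] / sum(tab["1", ])            # false negative rate
    }

    quantile(fp, c(0.025, 0.975))    # empirical 95% intervals for the two errors
    quantile(fn, c(0.025, 0.975))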

Algorithm 2 differs from Algorithm 1 only in how the B runs are summarized; the procedures used to impute missing values remain the same. The difference is that we now obtain an overall estimate of the false positive and false negative errors instead of confidence intervals for them. (A sketch of the vote-aggregation steps appears after the algorithm.)

    Algorithm 2:

1. Draw B (say, 2000) bootstrap samples.

2. For each bootstrap sample b = 1, 2, · · · , B, impute missing values using the following steps:

•   Replace missing values with the median (if the predictor is quantitative) or the mode (if the predictor is qualitative), a.k.a. a “rough fix”;

•   Each categorical predictor is regressed on the other predictors with which it is likely to be related, using logistic regression.

•   Each continuous predictor is regressed on the other predictors with which it is likely to be related, using Gaussian regression.


•   Each count (integer-valued) predictor is regressed on the other predictors with which it is likely to be related, using Poisson regression.

    • For observations that have missing data, predict each missing field using

the corresponding regression equation. The missing values are then filled in

    using the predicted values.

•  Apply CART/RF to the imputed bootstrap sample.

•   Drop the cases in the bth bootstrap sample down the tree. Store the class assigned to each observation in-the-sample along with each observation’s

    predictor values.

    3. Repeat Step #2 a large number of times (e.g., B  = 2000).

4. Using only the class assigned to each observation when that observation is in-the-sample, count the number of times over B replications that the observation is classified in one category and the number of times over B replications it is

    classified in the other category.

    5. Assign each case to a category by a majority vote over B  replications. Thus, if 

    51% of the time a given case is classified as a “1”, that becomes its estimated

    classification.

    6. Construct the confusion table using the assigned class.
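Continuing the Algorithm 1 sketch (same illustrative dat, impute.once(), and B), the vote-aggregation steps 4–6 might look as follows.

    library(rpart)

    n      <- nrow(dat)
    votes1 <- integer(n)   # times a case is classified "1" when in-the-sample
    times  <- integer(n)   # times a case appears in a bootstrap sample

    for (b in 1:B) {
      idx <- sample(n, replace = TRUE)
      dbi <- impute.once(dat[idx, ])
      fit <- rpart(y ~ ., data = dbi, method = "class")
      cls <- predict(fit, type = "class")              # in-sample classes
      for (i in seq_along(idx)) {
        times[idx[i]]  <- times[idx[i]] + 1
        votes1[idx[i]] <- votes1[idx[i]] + (cls[i] == "1")
      }
    }

    # Step 5: majority vote over the B replications; Step 6: confusion table.
    assigned <- factor(ifelse(votes1 / times > 0.5, "1", "0"), levels = c("0", "1"))
    table(observed = dat$y, assigned = assigned)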
