
  • Decision trees with missing values
    Journées de Statistique 2019

    Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux

    CMAP, INRIA PARIETAL

    June 4, 2019

  • Motivating data in health
    More data ⇒ more missing data

    Data filled manually / faulty sensors / data aggregation issues

    Traumabase: 15 000 patients / 250 variables / 15 hospitals

    Center    Age  Sex  Weight  Height  BMI    Lactates  Glasgow
    Beaujon   54   m    85      NR      NR     NA        12
    Lille     33   m    80      1.8     24.69  4.8       15
    Pitie     26   m    NR      NR      NR     3.9       3
    Beaujon   63   m    80      1.8     24.69  1.66      15
    Pitie     30   w    NR      NR      NR     NM        15

    Missing features: Not Recorded, Not Made, Not Applicable, etc.
    Aim: predict the Glasgow score

  • Supervised learning
    What it is:

    A feature matrix X and a response vector Y
    A training data set and a test data set
    A loss function to minimize

    Goal:

    f* ∈ argmin_{f : X → Y} E[(f(X) − Y)²].

    Tool:

    f̂_n ∈ argmin_{f : X → Y} (1/n) Σ_{i=1}^{n} (f(X_i) − Y_i)².

    How to learn a prediction function when data is missing?
    How to predict on a test set when data is missing?
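    A minimal sketch of this empirical risk minimisation with a regression tree in scikit-learn; the data-generating model, depth limit and variable names are illustrative choices, not part of the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(500, 2))                               # complete feature matrix
y = (X[:, 0] > 2.8).astype(float) + 0.1 * rng.standard_normal(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
f_hat = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)   # greedy empirical risk minimiser
test_mse = np.mean((f_hat.predict(X_test) - y_test) ** 2)          # quadratic loss on the test set
print(test_mse)
```

    Both questions on the slide arise as soon as X contains NaN: depending on the library and version, a plain CART implementation may simply refuse such inputs at fit or predict time.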

  • Classification And Regression Trees (CART)¹
    Decision trees

    Method: Recursively, find which split, i.e. which feature j* and threshold z*, minimises the (quadratic) loss

    E[(Y − E[Y | X_j ≤ z])² · 1_{X_j ≤ z} + (Y − E[Y | X_j > z])² · 1_{X_j > z}].

    ¹ Leo Breiman et al. Classification and Regression Trees. Wadsworth, 1984.
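    A brute-force sketch of this split search on complete data; the helper name and toy data are mine, and the criterion is the unnormalised within-child sum of squares:

```python
import numpy as np

def best_split(X, y):
    """Return (feature j*, threshold z*, loss) minimising the quadratic loss
    of the split, by exhaustive search over features and observed thresholds."""
    best_j, best_z, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for z in np.unique(X[:, j])[:-1]:           # keep both children non-empty
            left = X[:, j] <= z
            right = ~left
            loss = ((y[left] - y[left].mean()) ** 2).sum() \
                 + ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_z, best_loss = j, z, loss
    return best_j, best_z, best_loss

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 2))
y = (X[:, 0] > 2.8).astype(float) + 0.1 * rng.standard_normal(300)
print(best_split(X, y))    # should select feature 0 with a threshold near 2.8
```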

  • Classification And Regression Trees (CART)
    Decision trees

    Method: Recursively, find which split minimises the (quadratic) loss

    [Figure: scatter plot in (X1, X2); the candidate splits X2 ≤ 2, X2 ≤ 4, X1 ≤ 1, X1 ≤ 2, X1 ≤ 4 and X1 ≤ 2.8 are evaluated, with losses ranging from 3.57 to 4.43; the split with the smallest loss, X1 ≤ 2.8 (loss = 3.57), becomes the root, and the procedure recurses (e.g. X2 ≤ 1.3 at the next level).]

  • CART with missing values
    Splitting criterion discarding missing values

    [Figure: scatter plot in (X1, X2); the split X1 ≤ 2.8 vs X1 > 2.8 is computed on the observed values, but several observations have X1 = NA and cannot be placed on either side.]

    Where to send the incomplete observations?

  • Probabilistic split²
    Completion strategies

    Send incomplete observations to either side, flipping a coin.

    [Figure: the split X1 ≤ 2.8; each observation with X1 = NA is sent at random to the left or right child.]

    ² J. Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
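    A sketch of the routing rule. The slide says “flipping a coin”; here the coin’s bias is the fraction of observed values going left, a common C4.5-style choice (use 0.5 for a fair coin). The function name is illustrative.

```python
import numpy as np

def route_probabilistic(x_j, threshold, rng=None):
    """Observed values follow the threshold; missing values are sent left
    at random, with probability equal to the observed left proportion."""
    if rng is None:
        rng = np.random.default_rng(0)
    observed = ~np.isnan(x_j)
    p_left = np.mean(x_j[observed] <= threshold)
    go_left = np.where(observed,
                       x_j <= threshold,               # NaN comparisons are False but unused here
                       rng.random(x_j.shape) < p_left)
    return go_left
```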

  • Block propagation³
    Completion strategies

    Send all incomplete observations to the side which minimises the loss.

    [Figure: the split X1 ≤ 2.8; sending the whole block of observations with X1 = NA to one side gives loss = 5.07 and to the other loss = 4.93, so the side with loss 4.93 is chosen.]

    ³ Tianqi Chen and Carlos Guestrin. “XGBoost: A scalable tree boosting system”. In: SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, p. 785.
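    A sketch of the choice made at a given split (helper name and toy criterion are mine): both assignments of the NA block are tried and the cheaper one is kept. XGBoost learns such a per-split default direction when it is fed NaN entries directly, and scikit-learn’s HistGradientBoosting estimators do something closely related.

```python
import numpy as np

def na_side(x_j, y, threshold):
    """Send the whole block of missing values left, then right,
    and keep the side with the smaller quadratic loss.
    Assumes both children stay non-empty in each configuration."""
    missing = np.isnan(x_j)
    losses = {}
    for side in ("left", "right"):
        left = np.where(missing, side == "left", x_j <= threshold)
        right = ~left
        losses[side] = ((y[left] - y[left].mean()) ** 2).sum() \
                     + ((y[right] - y[right].mean()) ** 2).sum()
    return min(losses, key=losses.get), losses
```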


  • Surrogate splits⁴
    Completion strategies

    Find a related variable to split on.

    [Figure: the primary split X1 ≤ 2.8 vs X1 > 2.8; for observations with X1 = NA, the surrogate split X3 ≤ 3.2 vs X3 > 3.2 decides the side instead.]

    ⁴ Breiman et al., Classification and Regression Trees.
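    A sketch of how a surrogate can be chosen; this is simplified from CART (agreement is measured only in the “same direction”, and missingness in the candidate surrogate is ignored), and the names are mine:

```python
import numpy as np

def best_surrogate(X, primary_left, j_primary):
    """Among the other features, find the (feature, threshold) whose split
    agrees most often with the primary split on the training observations."""
    best_j, best_z, best_agreement = None, None, 0.0
    for j in range(X.shape[1]):
        if j == j_primary:
            continue
        observed_values = np.unique(X[~np.isnan(X[:, j]), j])
        for z in observed_values[:-1]:
            agreement = np.mean((X[:, j] <= z) == primary_left)
            if agreement > best_agreement:
                best_j, best_z, best_agreement = j, z, agreement
    return best_j, best_z, best_agreement
```

    At prediction time, an observation with X1 = NA is routed by the surrogate test (here X3 ≤ 3.2) instead of the primary one.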


  • Variable selection bias
    Issue with discarding missing values: variables with more observed values are selected more frequently.

    Example: X1 and X2 are i.i.d. and independent of Y

    Figure: X1 selection frequency for rpart (CART)

  • Conditional inference trees⁶
    Alternative splitting strategy

    Separate:

    choice of splitting variable: test “Y ⊥⊥ X_j” based on

    T(X_j) = Σ_{i : X_i^j observed} X_i^j Y_i

    threshold choice: usual impurity.

    In order to test by permutation, consider the distribution of all the Σ_{i : X_i^j observed} X_i^j Y_{σ(i)}, where σ runs uniformly over the permutations.

    This distribution is asymptotically normal⁵, which enables the definition of a p-value.

    ⁵ Helmut Strasser and Christian Weber. “On the asymptotic theory of permutation statistics” (1999).
    ⁶ Torsten Hothorn, Kurt Hornik, and Achim Zeileis. “Unbiased Recursive Partitioning: A Conditional Inference Framework”. In: Journal of Computational and Graphical Statistics 15.3 (2006).
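    A Monte-Carlo sketch of that permutation test for one candidate variable; the real ctree implementation uses a standardised statistic and the Strasser–Weber asymptotic normal approximation rather than explicit resampling, and the names here are mine:

```python
import numpy as np

def permutation_pvalue(x_j, y, n_perm=999, rng=None):
    """Permutation p-value for 'Y independent of X_j', based on the linear
    statistic T = sum over observed i of X_ij * Y_i (two-sided)."""
    if rng is None:
        rng = np.random.default_rng(0)
    observed = ~np.isnan(x_j)
    xo = x_j[observed] - x_j[observed].mean()   # centring, so |T| measures association
    yo = y[observed] - y[observed].mean()
    t_obs = abs(xo @ yo)
    t_perm = np.array([abs(xo @ rng.permutation(yo)) for _ in range(n_perm)])
    return (1 + np.sum(t_perm >= t_obs)) / (n_perm + 1)
```

    The splitting variable is then the one with the smallest p-value (possibly after a multiplicity adjustment), and the threshold is chosen with the usual impurity criterion, as stated on the slide.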

  • Empirical comparison

    Example: X1 and X2 are i.i.d. and independent of Y

    Figure: X1 selection frequency for rpart (CART) vs ctree (conditional inference trees)

  • Missing incorporated in attribute⁷
    CART with missing values

    Method: Recursively, find which partition P minimises

    E[(Y − P(X̃))²],

    where, for each feature j and each threshold z, there are three possible partitions:

    {X̃_j ≤ z or X̃_j = NA} vs {X̃_j > z}
    {X̃_j ≤ z} vs {X̃_j > z or X̃_j = NA}
    {X̃_j ≠ NA} vs {X̃_j = NA}

    → targets E[Y | X̃]

    ⁷ B. E. T. H. Twala, M. C. Jones, and D. J. Hand. “Good Methods for Coping with Missing Data in Decision Trees”. In: Pattern Recognition Letters 29.7 (May 2008).
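    A sketch of the enlarged search space for one feature (names and toy criterion are mine; the loss is the within-child sum of squares). In numpy, “≤ z” on NaN evaluates to False, so NA goes right unless it is explicitly OR-ed into the left side:

```python
import numpy as np

def mia_best_partition(x_j, y, thresholds):
    """Evaluate, for each threshold z, the partitions with NA sent left or
    right, plus the 'NA vs observed' partition, and return the best one."""
    na = np.isnan(x_j)

    def loss(left):
        out = 0.0
        for part in (left, ~left):
            if part.any():
                out += ((y[part] - y[part].mean()) ** 2).sum()
        return out

    candidates = [("NA vs observed", na)]
    for z in thresholds:
        candidates.append((f"X <= {z} or NA", (x_j <= z) | na))
        candidates.append((f"X <= {z} (NA right)", (x_j <= z) & ~na))
    name, left = min(candidates, key=lambda c: loss(c[1]))
    return name, left, loss(left)
```

    XGBoost’s default-direction handling of NaN and scikit-learn’s HistGradientBoosting estimators implement essentially the first two kinds of partitions; MIA adds the explicit “NA vs observed” split.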

  • Missing incorporated in attribute
    CART with missing values

    [Figure: scatter plot in (X1, X2) with several observations where X1 = NA; the candidate MIA partitions (splits on X2, splits on X1 with the NA block sent left or right, and X1 = NA vs observed) are evaluated, with losses ranging from 4.53 to 5.00; the partition with the smallest loss (4.53) is selected.]

  • Simulations
    First experiment

    Models
    1. Quadratic: 3 Gaussian features, Y = X1² + ε.

    Mechanisms (a data-generation sketch for these three mechanisms follows this slide)
    1. Missing completely at random on X1
    2. Missing on large values of X1
    3. Predictive: M ⊥⊥ X, Y = X1² + 3·M1 + ε

    Methods
    1. CART or conditional trees with surrogate splits
    2. MIA
    3. Mean or Gaussian imputation
    4. Adding the mask
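    A sketch of the three missingness mechanisms on X1 in the quadratic model; the 20% rate, the noise level and the “top quantile” MNAR rule are illustrative choices, not necessarily the paper’s exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.standard_normal((n, 3))                        # 3 Gaussian features
eps = 0.1 * rng.standard_normal(n)
y = X[:, 0] ** 2 + eps                                 # quadratic model

# 1. MCAR: entries of X1 are removed independently of everything
M_mcar = rng.random(n) < 0.2

# 2. MNAR: X1 is missing on its large values (here the top 20%)
M_mnar = X[:, 0] > np.quantile(X[:, 0], 0.8)

# 3. Predictive missingness: the mask is independent of X but enters Y
M_pred = rng.random(n) < 0.2
y_pred = X[:, 0] ** 2 + 3 * M_pred + eps

def apply_mask(X, mask, j=0):
    """Blank out feature j where the mask is True."""
    X = X.copy()
    X[mask, j] = np.nan
    return X

X_mcar, X_mnar, X_pred = (apply_mask(X, m) for m in (M_mcar, M_mnar, M_pred))
```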

  • Relative scores, quadratic model, 20% missing values
    First experiment

    [Figure: relative explained variance (three panels: MCAR, MNAR, Predictive) for decision trees and random forests, comparing 1. MIA, 2. impute mean + mask, 3. impute mean, 4. impute Gaussian + mask, 5. impute Gaussian, 6. rpart (surrogates) + mask, 7. rpart (surrogates), 8. ctree (surrogates) + mask, 9. ctree (surrogates).]

  • Simulations
    Third experiment

    Models
    1. Linear: Gaussian features, Y = Xβ + ε
    2. Friedman: Gaussian features, nonlinear regression⁸
    3. Nonlinear: nonlinear features, nonlinear regression

    d = 10

    Mechanisms
    1. Missing at random on all ten variables

    Methods (a sketch of the “mean imputation + mask” baseline follows this slide)
    1. CART or conditional trees with surrogate splits
    2. MIA
    3. Mean or Gaussian imputation
    4. Adding the mask

    ⁸ Jerome H. Friedman. “Multivariate adaptive regression splines”. In: The Annals of Statistics (1991).
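    A sketch of the “mean imputation + mask” baseline with a random forest, using scikit-learn’s add_indicator option to append the missingness mask; the data below is a stand-in for the linear setting, not the paper’s exact protocol:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.5 * rng.standard_normal(n)            # linear model, Gaussian features

X_miss = X.copy()
X_miss[rng.random((n, d)) < 0.4] = np.nan              # 40% missing values, MCAR

model = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),  # mean imputation + mask
    RandomForestRegressor(n_estimators=100, random_state=0),
)
X_train, X_test, y_train, y_test = train_test_split(X_miss, y, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # R², i.e. explained variance on held-out data
```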

  • Consistency, 40% missing values, MCAR
    Third experiment

    [Figure: explained variance as a function of sample size (10³ to 10⁵), for decision trees and random forests, on the linear problem (high noise), the Friedman problem (high noise) and the non-linear problem (low noise). Methods: rpart (surrogates), rpart (surrogates) + mask, mean imputation, mean imputation + mask, Gaussian imputation, Gaussian imputation + mask, MIA, ctree (surrogates), Bayes rate.]

  • Conclusion

    Regarding variable selection, conditional inference trees are unbiased; this does not solve prediction.
    MIA allows empirical risk minimisation in the “incomplete” space.
    Most missing-encoding methods are empirically equivalent in prediction.
    Larger discussion on supervised learning with missing values: talk by Julie Josse tomorrow.

    Thank you!