Introduction to CART® Salford Systems ©2007
Data Mining with Decision Trees: An Introduction to CART®
Dan Steinberg, Mikhail Golovnya
[email protected]
http://www.salford-systems.com
Intro CART Outline
- Historical background
- Sample CART model
- Binary recursive partitioning
- Fundamentals of tree pruning
- Competitors and surrogates
- Missing value handling
- Variable importance
- Penalties and structured trees
- Introduction to splitting rules
- Constrained trees
- Introduction to priors and costs
- Cross-validation
- Stable trees and train-test consistency (TTC)
- Hot-spot analysis
- Model scoring and translation
- Batteries of CART runs

Historical Background
In The Beginning…
In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
- Could separate relevant from irrelevant predictors
- Did not require any kind of variable transforms (logs, square roots)
- Was impervious to outliers and missing values
- Could yield relatively simple and easy-to-comprehend models
- Required little to no supervision by the analyst
- Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools
The Years of Struggle
CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons:
- The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
- The method was not expounded in any textbooks
- It was originally taught only in advanced graduate statistics classes at a handful of leading universities
- The original standalone software came with slender documentation, and output was not self-explanatory
- The method was a radical departure from conventional statistics
The Final Triumph
Rising interest in data mining:
- Availability of huge data sets requiring analysis
- Need to automate, or accelerate and improve, the analysis process
Comparative performance studies showed advantages of CART over other tree methods:
- Handling of missing values
- Assistance in interpretation of results (surrogates)
- Performance advantages: speed, accuracy
New software and documentation made the techniques accessible to end users:
- Word of mouth generated by early adopters
- Easy-to-use professional software available from Salford Systems
Sample CART Model

So What is CART?
Best illustrated with a famous example: the UCSD Heart Disease study.
Given the diagnosis of a heart attack based on:
- Chest pain
- Indicative EKGs
- Elevation of enzymes typically released by damaged heart muscle
Predict who is at risk of a second heart attack and early death within 30 days. The prediction will determine the treatment program (intensive care or not).
For each patient about 100 variables were available, including demographics, medical history, and lab results; 19 noninvasive variables were used in the analysis.
Typical CART Solution
Example of a CLASSIFICATION tree. The dependent variable is categorical (SURVIVE, DIE); we want to predict class membership.

Root: PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Split: Is BP <= 91?
- Yes (BP <= 91): Terminal Node A — SURVIVE 6 (30.0%), DEAD 14 (70.0%); NODE = DEAD
- No (BP > 91): PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%)
  Split: Is AGE <= 62.5?
  - Yes (AGE <= 62.5): Terminal Node B — SURVIVE 102 (98.1%), DEAD 2 (1.9%); NODE = SURVIVE
  - No (AGE > 62.5): PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%)
    Split: Is SINUS <= .5?
    - Yes (SINUS <= .5): Terminal Node D — SURVIVE 56 (88.9%), DEAD 7 (11.1%); NODE = SURVIVE
    - No (SINUS > .5): Terminal Node C — SURVIVE 14 (50.0%), DEAD 14 (50.0%); NODE = DEAD
How to Read It
- The entire tree represents a complete analysis or model
- It has the form of a decision tree
- The root of the inverted tree contains all data
- The root gives rise to child nodes
- Child nodes can in turn give rise to their own children
- At some point a given path ends in a terminal node
- A terminal node classifies the object
- The path through the tree is governed by the answers to QUESTIONS or RULES
Decision Questions
A question is of the form: is the statement TRUE?
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k? (e.g., is geographic region 1, 2, 4, or 7?)
Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right. This is the form of all primary splits.
The question is formulated so that only two answers are possible; this is called binary partitioning. In CART the YES answer always goes left.
The Tree is a Classifier
- Classification is determined by following a case's path down the tree
- Terminal nodes are associated with a single class
- Any case arriving at a terminal node is assigned to that class
- The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
- Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment and are thus less flexible
- In standard classification, tree assignment is not probabilistic
- Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
- With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
The Importance of Binary Splits
Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate). There are strong theoretical and practical reasons why we argue against such approaches:
- Exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
- Choosing the best binary split is less ambiguous than the suggested alternatives
- Most importantly, binary splits result in the smallest rate of data fragmentation, allowing a larger set of predictors to enter the model
- Any multi-way split can be replaced by a sequence of binary splits
There are numerous practical illustrations of the validity of the above claims.
Accuracy of a Tree
For classification trees, all objects reaching a terminal node are classified in the same way; e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS. The classification is the same regardless of other medical history and lab results.
The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. confusion matrix). The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number:
- Overall accuracy
- Average within-class accuracy
- Most generally, the expected cost
Prediction Success Table

TRUE Class               Classified "Survivors"   Classified "Early Deaths"   Total   % Correct
Class 1 "Survivors"             158                        20                  178      88%
Class 2 "Early Deaths"            9                        28                   37      75%
Total                           167                        48                  215      86.51%

A large number of different terms exists to name various simple ratios based on the table above: sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
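The ratios listed above can all be computed directly from the four counts in the table. A minimal sketch in Python, using the slide's heart-attack numbers (the metric names follow common usage and are not CART-specific output):

```python
# Summary ratios from a 2x2 prediction success (confusion) table.
# Counts come from the slide: 158/20 for survivors, 9/28 for early deaths.

def confusion_metrics(tp, fn, fp, tn):
    """tp: class-1 correctly classified, fn: class-1 missed,
    fp: class-2 called class-1, tn: class-2 correctly classified."""
    return {
        "sensitivity": tp / (tp + fn),          # class-1 hit rate (recall)
        "specificity": tn / (tn + fp),          # class-2 hit rate
        "overall_accuracy": (tp + tn) / (tp + fn + fp + tn),
        "average_within_class": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

m = confusion_metrics(tp=158, fn=20, fp=9, tn=28)
print(round(m["overall_accuracy"], 4))  # 186/215 -> 0.8651, the 86.51% in the table
```

Note that overall accuracy (86.51%) and average within-class accuracy differ whenever the classes are unbalanced, which is exactly why the slide stresses the choice of summary number.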
Tree Interpretation and Use
Practical decision tool:
- Trees like this are used for real-world decision making
- Rules for physicians
- Decision tool for a nationwide team of salesmen needing to classify potential customers
Selection of the most important prognostic variables:
- Variable screen for parametric model building
- The example data set had 100 variables
- Use the results to find variables to include in a logistic regression
Insight: in our example AGE is not relevant if BP is not high; this suggests which interactions will be important.
Introducing the CART Interface
Classification with CART®
A real-world study from the early 1990s: a fixed-line service provider offering a new mobile phone service wants to identify the customers most likely to accept the new mobile offer. The data set is based on a limited market trial.
Real World Trial
- 830 households were offered a mobile phone package
- All were offered an identical package, but pricing was varied at random
- Handset prices ranged from low to high values
- Per-minute prices ranged separately from low to high rates
- Each household was asked to make a yes or no decision on the offer
- 15.2% of the households accepted the offer
Our goal was to answer two key questions: who to make offers to, and how to price?
Setting Up the Model
The only requirement is to select the TARGET (dependent) variable. CART will do everything else automatically.
Model Results
CART completes the analysis and gives access to all results from the NAVIGATOR, shown on the next slide:
- The upper section displays a tree of the selected size
- The lower section displays the error rate for trees of all possible sizes
- A green bar marks the most accurate tree
- We display a compact 10-node tree for further scrutiny
Navigator Display
The most accurate tree is marked with the green bar. We select the 10-node tree for the convenience of a more compact display.
Root Node
- The root node displays details of the TARGET variable in the overall training data
- We see that 15.2% of the 830 households accepted the offer
- The goal of the analysis is now to extract patterns characteristic of responders
- If we could use only a single piece of information to separate responders from non-responders, CART chooses HANDSET PRICE
- Those offered the phone at a price > 130 contain only 9.9% responders
- Those offered a lower price respond at 21.9%
Maximal Tree
- The maximal tree is the raw material for the best model
- The goal is to find the optimal tree embedded inside the maximal tree
- We will find the optimal tree via "pruning", much like backwards stepwise regression
Main Drivers
(Red = good response, blue = poor response.) High values of a split variable always go to the right; low values go left.
Examine Individual Node
- Hover the mouse over a node to see inside
- Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal node segments (43.5% response)
- Here we select rules expressed in C
- The entire tree can also be rendered in Java, XML/PMML, or SAS
Configure Print Image
A variety of fine controls to position and scale the image nicely.
Variable Importance Ranking
CART offers multiple ways to compute variable importance.
Predictive Accuracy
This model is not very accurate overall, but it ranks responders well.
Gains and ROC Curves
In the top decile the model captures about 40% of responders.
The Big Picture
Binary Recursive Partitioning
- Data is split into two partitions: thus "binary" partitioning
- Partitions can in turn be split into sub-partitions: hence the procedure is recursive
- A CART tree is generated by repeated dynamic partitioning of the data set
- CART automatically determines which variables to use and which regions to focus on
- The whole process is entirely data driven
Three Major Stages
1. Tree growing: consider all possible splits of a node; find the best split according to the given splitting rule (best improvement); continue sequentially until the largest tree is grown
2. Tree pruning: create a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches
3. Optimal tree selection: use a test sample or cross-validation to find the best tree in the pruning sequence
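The pruning stage can be sketched numerically. CART's pruning is cost-complexity ("weakest link") pruning: for each internal node, compute how much training cost rises per leaf removed if the branch below it is collapsed, and prune the branch with the smallest such value first. The node layout below is a hypothetical minimal structure for illustration, not CART's internals:

```python
# Weakest-link pruning sketch:
# alpha(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1): the per-leaf increase
# in training misclassification cost if the branch below node t is collapsed.
# Each node is a dict with training cost "R" and children "kids".

def leaves(node):
    return [node] if not node["kids"] else [l for k in node["kids"] for l in leaves(k)]

def weakest_link_alpha(node):
    branch_cost = sum(l["R"] for l in leaves(node))   # R(T_t): cost of the branch's leaves
    return (node["R"] - branch_cost) / (len(leaves(node)) - 1)

# Tiny branch: collapsing it raises cost from 0.05 + 0.02 to 0.10 over 1 extra leaf
t = {"R": 0.10, "kids": [{"R": 0.05, "kids": []}, {"R": 0.02, "kids": []}]}
print(round(weakest_link_alpha(t), 6))  # 0.03 per pruned leaf
```

Repeating this (prune the node with the smallest alpha, recompute, prune again) generates exactly the nested pruning sequence described in stage 2.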
General Workflow
[Diagram] Historical data is divided into Learn, Test, and Validate samples. Stage 1: build a sequence of nested trees on the Learn sample. Stage 2: monitor performance on the Test sample to pick the best tree. Stage 3: confirm the findings on the Validate sample.
Searching All Possible Splits
- For any node CART will examine ALL possible splits; this is computationally intensive, but there are only a finite number of splits
- Consider first the variable AGE, which in our data set has minimum value 40
- The rule "Is AGE ≤ 40?" will separate out these cases to the left: the youngest persons
- Next, increase the AGE threshold to the next youngest person: "Is AGE ≤ 43?" This will direct six cases to the left
- Continue increasing the splitting threshold value by value
Split Tables
Split tables are used to locate continuous splits. There will be as many splits as there are unique values minus one.

Sorted by AGE:
AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE
40   110  0       SURVIVE
40   83   1       DEAD
43   99   0       SURVIVE
43   78   1       DEAD
43   135  0       SURVIVE
45   120  0       SURVIVE
48   119  1       DEAD
48   122  0       SURVIVE
49   150  0       DEAD
49   110  1       SURVIVE

Sorted by Blood Pressure:
AGE  BP   SINUST  SURVIVE
43   78   1       DEAD
40   83   1       DEAD
40   91   0       SURVIVE
43   99   0       SURVIVE
40   110  0       SURVIVE
49   110  1       SURVIVE
48   119  1       DEAD
45   120  0       SURVIVE
48   122  0       SURVIVE
43   135  0       SURVIVE
49   150  0       DEAD
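The "unique values minus one" rule is easy to verify in code. A small sketch that enumerates candidate thresholds for the AGE column above (placing thresholds at midpoints is one common convention; a "≤ value" form works equally well):

```python
# Enumerate candidate split thresholds for one continuous predictor.
# There are (number of unique values - 1) candidates, as the slide states.

ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]  # AGE column from the split table

def candidate_thresholds(values):
    u = sorted(set(values))
    # one threshold between each pair of adjacent unique values
    return [(a + b) / 2 for a, b in zip(u, u[1:])]

print(candidate_thresholds(ages))  # [41.5, 44.0, 46.5, 48.5]: 5 unique values, 4 splits
```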
Categorical Predictors
Categorical predictors spawn many more splits. For example, just four levels A, B, C, D result in 7 splits; in general, K levels yield 2^(K-1) - 1 splits. CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15.

     Left    Right
1    A       B, C, D
2    B       A, C, D
3    C       A, B, D
4    D       A, B, C
5    A, B    C, D
6    A, C    B, D
7    A, D    B, C
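The 2^(K-1) - 1 count comes from the fact that swapping the left and right sides describes the same split. A short sketch that enumerates the distinct subsets, reproducing the 7-row table above for K = 4:

```python
# All distinct left/right assignments of K categorical levels.
# Complement pairs describe the same split, so there are 2**(K-1) - 1 of them.

from itertools import combinations

def categorical_splits(levels):
    levels = sorted(levels)
    anchor = levels[0]                 # fix one level on the left to avoid mirror duplicates
    rest = levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:                  # a split must send something to each side
                splits.append((left, right))
    return splits

s = categorical_splits(["A", "B", "C", "D"])
print(len(s))  # 7 == 2**(4-1) - 1
```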
Binary Target Shortcut
When the target is binary, special algorithms reduce compute time to linear in the number of levels (K-1 candidate splits instead of 2^(K-1) - 1): 30 levels takes twice as long as 15, not 10,000+ times as long.
When you have high-level categorical variables in a multi-class problem:
- Create a binary classification problem
- Try different definitions of DPV (which groups to combine)
- Explore the predictor groupings produced by CART
- From a study of all results, decide which are most informative
- Create a new grouped predictor variable for the multi-class problem
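The shortcut rests on a classic result: with a binary target, ordering the levels by their class-1 rate means the best subset split is always one of the K-1 splits along that ordering. A sketch, with a hypothetical input layout of per-level counts:

```python
# Binary-target shortcut: order the K levels by their class-1 rate,
# then only the K-1 splits along that ordering need to be scored
# instead of all 2**(K-1) - 1 subsets.

def ordered_candidate_splits(level_stats):
    """level_stats: {level: (n_class1, n_total)} -- hypothetical input layout."""
    ranked = sorted(level_stats, key=lambda lv: level_stats[lv][0] / level_stats[lv][1])
    # candidate i sends the i lowest-rate levels left, the rest right
    return [(set(ranked[:i]), set(ranked[i:])) for i in range(1, len(ranked))]

# Hypothetical counts: class-1 rates are C=0.025, A=0.10, B=0.50, D=0.80
stats = {"A": (5, 50), "B": (30, 60), "C": (1, 40), "D": (20, 25)}
for left, right in ordered_candidate_splits(stats):
    print(sorted(left), "|", sorted(right))  # 3 candidates for K=4, not 7
```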
Pruning and Variable Importance
GYM Model Example
- The original data comes from a financial institution and is disguised as a health club
- Problem: we need to understand a market research clustering scheme
- Clusters were created externally using 18 variables and conventional clustering software
- We need to find simple rules to describe cluster membership
- A CART tree provides a nice way to arrive at an intuitive story
Variable Definitions
ID        Identification # of member (not used)
CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
Predictors:
ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONRCT     Number of on-peak racquet ball uses
ANYPOOL   Pool usage (binary indicator coded 0, 1)
ONPOOL    Number of on-peak pool uses
PLRQTPCT  Percent of pool and racquet ball usage
TPLRCT    Total number of pool and racquet ball uses
ONAER     Number of on-peak aerobics classes attended
OFFAER    Number of off-peak aerobics classes attended
SAERDIF   Difference between number of on- and off-peak aerobics visits
TANNING   Number of visits to tanning salon
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
NSUPPS    Number of supplements/vitamins/frozen dinners purchased
SMALLBUS  Small business discount (binary indicator coded 0, 1)
OFFER     Terms of offer
IPAKPRIC  Index variable for package price
MONFEE    Monthly fee paid
FIT       Fitness score
NFAMMEN   Number of family members
HOME      Home ownership (binary indicator coded 0, 1)
Things to Do
- Dataset: gymsmall.syd
- Set the target and predictors
- Set a 50% test sample
- Navigate the largest, smallest, and optimal trees
- Check [Splitters] and [Tree Details]
- Observe the logic of pruning
- Check [Summary Reports]
- Understand competitors and surrogates
- Understand variable importance
Pruning Multiple Nodes
- Note that Terminal Node 5 has class assignment 1
- All the remaining nodes have class assignment 2
- Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal
Pruning Weaker Split First
- Eliminating nodes 3 and 4 results in the "lesser" damage: losing one correct dominant-class assignment
- Eliminating nodes 1 and 2, or nodes 5 and 6, would result in losing one correct minority-class assignment
Understanding Variable Importance
- Both FIT and CLASSES are reported as equally important
- Yet the tree only includes CLASSES
- ONPOOL is one of the main splitters, yet it has very low importance
- We want to understand why
Competitors and Surrogates
- Node details for the root node split are shown above
- Improvement measures the class separation between the two sides
- Competitors are the next best splits in the given node, ordered by improvement values
- Surrogates are the splits most resembling (case by case) the main split
- Association measures the degree of resemblance
- Looking for competitors is free; finding surrogates is expensive
- It is possible to have an empty list of surrogates
- FIT is a perfect surrogate (and competitor) for CLASSES
The Utility of Surrogates
- Suppose the main splitter is INCOME and comes from a survey form
- People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution
- The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case)
- Using a surrogate helps resolve the ambiguity and redistribute the missing-income records between the two sides
- For example, all low-education subjects will join the low-income side while all high-education subjects will join the high-income side
Competitors and Surrogates Are Different
Consider three nominal classes A, B, C, equally present, with equal priors, unit costs, etc. Split X sends classes A and B left and C right; Split Y sends A and C left and B right.
- By symmetry, both splits result in identical improvements
- Yet one split cannot be substituted for the other
- Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
- The reverse does not hold: a perfect surrogate is always a perfect competitor
Calculating Association
- Ten observations in a node; the main splitter is X
- Split Y mismatches the main splitter on 2 observations
- The default rule sends all cases to the dominant side of the main split, mismatching 3 observations

Association = (Default - Split Y) / Default = (3 - 2) / 3 = 1/3
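The association formula above is straightforward to implement. A sketch that reproduces the slide's 10-observation example (splits are represented simply as which side each case goes to; split reversal, mentioned on the next slide, is omitted for clarity):

```python
# Association of a surrogate with the main splitter:
# association = (default_mismatches - surrogate_mismatches) / default_mismatches

def association(main_goes_left, surrogate_goes_left):
    mismatch = sum(m != s for m, s in zip(main_goes_left, surrogate_goes_left))
    # default rule: send every case to the main splitter's dominant side
    dominant = sum(main_goes_left) >= len(main_goes_left) / 2
    default_mismatch = sum(m != dominant for m in main_goes_left)
    return (default_mismatch - mismatch) / default_mismatch

# Slide's example: 10 cases, 7 go left under main split X;
# surrogate Y disagrees on 2 of them, the default rule misses 3.
X = [True] * 7 + [False] * 3
Y = [True] * 5 + [False] * 2 + [False] * 3
print(round(association(X, Y), 4))  # (3 - 2) / 3 -> 0.3333
```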
Notes on Association
- The measure is local to a node and non-parametric in nature
- Association is negative for splits with very poor resemblance; it is not limited by -1 from below
- Any positive association is useful (a better strategy than the default rule)
- Hence surrogates are defined as the splits with the highest positive association
- In calculating case mismatch, CART checks the possibility of split reversal (if the rule is met, send cases to the right)
- CART marks straight splits as "s" and reverse splits as "r"
Calculating Variable Importance
- List all main splitters as well as all surrogates, along with the resulting improvements
- Aggregate the list by variable, accumulating improvements
- Sort the result in descending order
- Divide each entry by the largest value and scale to 100%
- The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change
- Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters
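The recipe above can be sketched in a few lines. The split list here is hypothetical, constructed to mirror the GYM example where FIT never wins a node outright but is a strong surrogate everywhere:

```python
# Variable importance per the recipe above: sum improvements over every
# node where a variable acts as primary splitter or surrogate, then
# rescale so the top variable reads 100%.

def variable_importance(splits):
    """splits: list of (variable, improvement) covering primaries and surrogates."""
    totals = {}
    for var, imp in splits:
        totals[var] = totals.get(var, 0.0) + imp
    top = max(totals.values())
    return {v: round(100.0 * t / top, 1)
            for v, t in sorted(totals.items(), key=lambda kv: -kv[1])}

# Hypothetical tree: FIT is a perfect surrogate in the root, ONPOOL splits
# a small deep node -- so FIT ranks high while ONPOOL ranks low.
splits = [("CLASSES", 0.40), ("FIT", 0.40),     # root: primary + surrogate
          ("ONPOOL", 0.05), ("CLASSES", 0.10)]  # deeper nodes
print(variable_importance(splits))
```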
Mystery Solved
- In our example, improvements in the root node dominate the whole tree
- Consequently, the variable importance reflects the surrogates table in the root node

Variable Importance Caution
- Importance is a function of the OVERALL tree, including the deepest nodes
- Suppose you grow a large exploratory tree and review the importances
- Then you find an optimal tree via test set or CV, yielding a smaller tree
- The optimal tree is the SAME as the exploratory tree in the top nodes
- YET the importances might be quite different. WHY? Because the larger tree uses more nodes to compute the importance
- When comparing results, be sure to compare similar or same-sized trees
Splitting Rules and Friends
Splitting Rules - Introduction
- The splitting rule controls the tree-growing stage
- At each node, a finite pool of all possible splits exists
- The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node
- The task of the splitting rule is to rank all splits in the pool according to their improvements
- The winner becomes the main splitter; the top runners-up become competitors
- It is possible to construct different splitting rules emphasizing various approaches to the definition of the "best split"
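As a concrete instance of an improvement score, here is the textbook Gini criterion evaluated on the heart-attack root split: the drop in impurity, weighting each child by its share of cases. This is a simplified sketch; CART additionally folds priors and costs into the formula, as later slides discuss:

```python
# Improvement of a candidate split under the (unweighted) Gini rule.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_improvement(parent, left, right):
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# Root split "Is BP <= 91?" from the heart-attack tree:
# 215 cases [SURVIVE, DEAD] -> 20 on the left, 195 on the right
print(round(gini_improvement([178, 37], [6, 14], [172, 23]), 4))
```

Ranking every split in the node's pool by this number and taking the winner is exactly the "main splitter" selection described above.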
Splitting Rules Control
- Selected in the "Method" tab of the Model Setup dialog
- There are 6 alternative rules for classification and 2 rules for regression
- Since there is no theoretical justification for a single best splitting rule, we recommend trying all of them and picking the best resulting model
Further Insights
- On binary classification problems, GINI and TWOING are equivalent; therefore, we recommend setting the "Favor Even Splits" control to 1 when TWOING is used
- GINI and Symmetric GINI differ only when an asymmetric cost matrix is used, and are identical otherwise
- Ordered TWOING is a restricted version of TWOING for multinomial targets coded as a set of contiguous integers
- Class Probability is identical to GINI during tree growing but differs during pruning
Battery RULES
- This battery was designed to automatically try all six alternative splitting rules
- According to the output above, the Entropy splitting rule resulted in the most accurate model on the given data
- Individual model navigators are also available
Controlled Tree Growth
CART was originally conceived to be fully automated, protecting the analyst from making erroneous split decisions. In some cases, however, it can be desirable to influence tree evolution to address specific modeling needs. The modern CART engine allows multiple direct and indirect ways of accomplishing this:
- Penalties on variables, missing values, and categories
- Controlling which variables are allowed or disallowed at different levels in the tree, and node sizes
- Direct forcing of splits at the top two levels of the tree
Penalizing Individual Variables
- We use the variable penalty control to penalize CLASSES a little bit
- This is enough to secure FIT the main splitter position
- The improvement associated with CLASSES is now downgraded by a factor of one minus the penalty
Penalizing Missing and Categorical Predictors
- Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
- Introducing a systematic penalty set to the proportion of the variable missing effectively removes the possible bias
- The same holds for categorical variables with a large number of distinct levels
- We recommend that the user always set both controls to 1
Forcing Splits
CART allows direct forcing of the split variable (and possibly the split value) in the top three nodes of the tree.
Notes on Forcing Splits
- It is not guaranteed that a forced split below the root node level will be preserved (because of pruning)
- Do not simply overturn "illogical" automatic splits by forcing something else; look for an explanation of why CART makes a "wrong" choice, as it could indicate a problem with your data
Constrained Trees
- Many predictive models can benefit from Salford's patent-pending "structured trees"
- These are trees constrained in how they are grown, to reflect decision support requirements
- In the mobile phone example: we want the tree to first segment on customer characteristics and then complete using price variables
- Price variables are under the control of the company; customer characteristics are not
Setting Up Constraints
Green indicates where in the tree the variables of each group are allowed to appear.
Constrained Tree
- Demographic and spend information at the top of the tree
- Handset (HANDPRIC) and per-minute pricing (USEPRICE) at the bottom

Prior Probabilities
- An essential component of any CART analysis
- A prior probability distribution for the classes is needed to do proper splitting, since prior probabilities are used to calculate the improvements in the Gini formula
- Example: suppose the data contains 99% type A (nonresponders) and 1% type B (responders)
- CART might focus on not missing any class A objects; one "solution" is to classify all objects as nonresponders
- Advantages: only a 1% error rate, and very simple predictive rules, although not informative
- But realize this could be an optimal rule
Priors EQUAL
- The most common priors: PRIORS EQUAL
- This gives each class equal weight regardless of its frequency in the data
- It prevents CART from favoring the more prevalent classes
- With equal priors, prevalence is measured as a percent of the class's OWN size: measurements are relative to the class, not the entire learning set

Example: suppose we have 900 class A and 100 class B, and the root splits into two children:
- Left child: A: 600, B: 90 (2/3 of all A, 9/10 of all B), so class as B
- Right child: A: 300, B: 10 (1/3 of all A, 1/10 of all B), so class as A
With PRIORS DATA both child nodes would be class A; equal priors puts both classes on an equal footing.
Class Assignment Formula I
For equal priors, a node t is classified as class 1 if

    N1(t) / N2(t)  >  N1 / N2

where Ni = number of cases in class i at the root node, and Ni(t) = number of cases in class i at node t.
Classify any node by the count ratio in the node relative to the count ratio in the root: classify by which class has the greatest relative richness.
Class Assignment Formula II
When priors are not equal, classify a node as class 1 if

    π1 N1(t) / (π2 N2(t))  >  N1 / N2

where πi is the prior for class i. Since π1/π2 always appears as a ratio, just think in terms of boosting the amount Ni(t) by a factor, e.g. 9:1, 100:1, 2:1.
- If priors are EQUAL, the πi terms cancel out
- If priors are DATA, then π1/π2 = N1/N2, which simplifies the formula to the plurality rule (simple counting): classify as class 1 if N1(t) > N2(t)
- If priors are neither EQUAL nor DATA, the formula provides a boost to the up-weighted classes
Example
Root node: A: 900, B: 100. The root splits into:
- Left child: A: 600, B: 90. Test: 600/90 < 900/100, so class as B
- Right child: A: 300, B: 10. Test: 300/10 > 900/100, so class as A
In general the tests would be weighted by the appropriate priors; the tests can be written with πA/πB or πB/πA as convenient.
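The class assignment formula and the example can be checked in a few lines of code (cross-multiplying avoids division; class labels 1 = A, 2 = B):

```python
# Node class assignment with priors, per Formula II:
# a node t is labeled class 1 when pi1*N1(t) / (pi2*N2(t)) > N1/N2,
# where N1, N2 are the root-node class counts.

def assign(priors, node_counts, root_counts):
    (p1, p2), (n1t, n2t), (n1, n2) = priors, node_counts, root_counts
    return 1 if p1 * n1t * n2 > p2 * n2t * n1 else 2  # cross-multiplied form

root = (900, 100)                            # class A = 1, class B = 2
equal = (0.5, 0.5)
print(assign(equal, (600, 90), root))        # 600/90 < 900/100 -> class 2 (B)
print(assign(equal, (300, 10), root))        # 300/10 > 900/100 -> class 1 (A)
data_priors = (0.9, 0.1)                     # priors DATA reduce to plurality
print(assign(data_priors, (600, 90), root))  # 600 > 90 -> class 1 (A)
```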
Cross-Validation
- Cross-validation is a relatively recent, computer-intensive development in statistics
- Its purpose is to protect against overfitting errors
- We don't want to capitalize on chance by tracking idiosyncrasies of this data: idiosyncrasies which will NOT be observed on fresh data
- Ideally we would like to use large test data sets to evaluate trees, N > 5000
- Practically, some studies don't have sufficient data to spare for testing
- Cross-validation uses the SAME data for learning and for testing
10-fold CV: Industry Standard
- Begin by growing the maximal tree on ALL the data
- Divide the data into 10 portions, stratified on the dependent variable's levels
- Reserve the first portion for test; grow a new tree on the remaining 9 portions
- Use the 1/10 test set to measure the error rate for this 9/10-data tree; the error rate is measured for the maximal tree and for all subtrees
- Now rotate out a new 1/10 test data set and grow a new tree on the remaining 9 portions
- Repeat until all 10 portions of the data have been used as test sets
The CV Procedure
[Diagram: portions 1-10 of the data; in each of the 10 rotations a different portion serves as Test while the remaining 9 serve as Learn, and so on.]
CV Details
- Every observation is used as a test case exactly once
- In 10-fold CV each observation is used as a learning case 9 times
- In 10-fold CV, 10 auxiliary CART trees are grown in addition to the initial tree
- When all 10 CV trees are done, the error rates are cumulated (summed)
- Summing of error rates is done by TREE complexity
- Summed error (cost) rates are then attributed to the INITIAL tree
- Observe that no two of these trees need be the same
- Results are subject to random fluctuations, sometimes severe
CV Replication and Stability
- Look at the separate CV tree results of a single CV analysis
- CART produces a summary of the results for each tree at the end of the output
- The table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree: root (TOP), left child of root (LEFT), and right child of root (RIGHT)
- We want to know how stable the results are: do different variables split the root node in different trees?
- The only difference between CV trees is the random elimination of 1/10 of the data
- A special battery called CVR exists to study the effects of a repeated cross-validation process
Battery CVR
Regression Trees
General Comments
- Regression trees produce piecewise-constant models (a multi-dimensional staircase) on an orthogonal partition of the data space
- They are thus usually not the best possible performer in terms of conventional regression loss functions
- Only a very limited number of controls is available to influence the modeling process
- Priors and costs are no longer applicable
- There are two splitting rules: LS (least squares) and LAD (least absolute deviation)
- Regression trees are very powerful at capturing high-order interactions but somewhat weak at explaining simple main effects
Boston Housing Data Set
Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand for Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978.
506 census tracts in the City of Boston for the year 1970. Goal: study the relationship between quality-of-life variables and property values.
MV     median value of owner-occupied homes in tract ('000s)
CRIM   per capita crime rate
NOX    concentration of nitrogen oxides (pphm)
AGE    percent built before 1940
DIS    weighted distance to centers of employment
RM     average number of rooms per house
LSTAT  percent neighborhood 'lower SES'
RAD    accessibility to radial highways
CHAS   borders Charles River (0/1)
INDUS  percent non-retail business
TAX    tax rate
PT     pupil-teacher ratio
Splitting the Root Node
Improvement is defined as the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants, one on each side of the split.
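This least-squares improvement is easy to state in code. A minimal sketch with toy target values (not the Boston data):

```python
# Regression-tree (LS) improvement: reduction in the sum of squared
# errors when one constant (the node mean) is replaced by separate
# means on each side of a candidate split.

def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def ls_improvement(left, right):
    return sse(left + right) - sse(left) - sse(right)

# Toy target values: the split cleanly separates low from high responses
left, right = [10.0, 12.0, 11.0], [30.0, 29.0, 31.0]
print(round(ls_improvement(left, right), 2))  # 545.5 - 2 - 2 = 541.5
```

The split with the largest such reduction wins the node, exactly as the Gini improvement does for classification.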
Regression Tree Model
- All cases in a given node are assigned the same predicted response: the node average of the original target
- Nodes are color-coded according to the predicted response
- We obtain a convenient segmentation of the population according to average response levels
The Best and the Worst Segments
Scoring and Deployment
Notes on Scoring
A tree cannot be expressed as a simple analytical formula; instead, a lengthy collection of rules is needed to express the tree logic
Nonetheless, the scoring operation is very fast once the rules are compiled into machine code
Scoring can be done in two ways:
Internally – by using the SCORE command
Externally – by using the TRANSLATE command to generate C, SAS, Java, or PMML code
A model of any size can be scored or translated
Internal Scoring
Output File Format
CART creates the following output variables:
RESPONSE – predicted class assignment
NODE – terminal node the case falls into
CORRECT – indicates whether the prediction is correct (assuming the actual target is known)
PROB_x – class probabilities (constant within each node)
PATH_x – path traveled by the case through the tree
Case 1 starts in the root node, follows through internal nodes 2, 3, 4, 5, and finally lands in terminal node 1
Case 2 starts in the root node and lands directly in terminal node 7
Finally CART adds all of the original model variables including the target and predictors
CASEID  RESPONSE  NODE  CORRECT  DV  PROB_1    PROB_2    PATH_1  PATH_2  PATH_3  PATH_4  PATH_5  PATH_6  XA  XB  XC
1       0         1     1        0   0.998795  0.001205  1       2       3       4       5       -1      0   0   0
2       1         7     0        0   0.738636  0.261364  1       -7      0       0       0       0       1   0   0
3       0         1     1        0   0.998795  0.001205  1       2       3       4       5       -1      0   0   0
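The PATH_x convention visible in this output (positive entries are visited internal nodes, a negative entry marks the terminal node reached, and trailing zeros are unused slots) can be decoded with a small helper; `decode_path` is a hypothetical name, not a CART facility:

```python
def decode_path(path):
    """Translate a row of PATH_x values into the list of visited
    internal nodes plus the final terminal node.  Positive entries
    are internal nodes, a negative entry is the terminal node, and
    zeros are unused slots after the path has ended."""
    internal, terminal = [], None
    for p in path:
        if p == 0:
            break          # unused slot: the path already ended
        if p < 0:
            terminal = -p  # terminal node reached; path stops here
            break
        internal.append(p)
    return internal, terminal
```

Applied to the rows above, case 1 decodes to internal nodes 1, 2, 3, 4, 5 with terminal node 1, and case 2 decodes to internal node 1 with terminal node 7, matching the narrative description.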
Model Translation
Output Code
CART Automation
Automated CART Runs
Starting with CART 6.0, we have added a new feature called batteries
A battery is a collection of runs obtained via a predefined procedure
The simplest batteries vary one of the key CART parameters:
RULES – run every major splitting rule
ATOM – vary the minimum parent node size
MINCHILD – vary the minimum child node size
DEPTH – vary the tree depth
NODES – vary the maximum number of nodes allowed
FLIP – swap the learn and test samples (two runs only)
More Batteries
DRAW – each run uses a learn sample drawn at random from the original data
CV – vary the number of folds used in cross-validation
CVR – repeat cross-validation, each time with a different random number seed
KEEP – select a given number of predictors at random from the initial larger set; supports a core set of predictors that must always be present
ONEOFF – build a one-predictor model for each predictor in the given list
The remaining batteries are more advanced and will be discussed individually
Battery MCT
The primary purpose is to test for possible model overfitting by running a Monte Carlo simulation:
Randomly permute the target variable
Build a CART tree
Repeat a number of times
The resulting family of profiles measures the intrinsic overfitting due to the learning process
Compare the actual run profile to the family of random profiles to check the significance of the results
This battery is especially useful when the relative error of the original run exceeds 90%
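The steps above amount to a permutation test. A minimal sketch (assuming a caller-supplied `fit_and_error` function in place of an actual CART run; `mct` and `stump_error` are illustrative names):

```python
import random

def mct(x_col, y, fit_and_error, n_perm=20, seed=17):
    """Monte Carlo test in the spirit of BATTERY MCT: permute the
    target, refit, and compare the real run's error against the
    null distribution of errors obtained on permuted targets."""
    rng = random.Random(seed)
    real_err = fit_and_error(x_col, y)
    null_errs = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # destroys any real x-y relationship
        null_errs.append(fit_and_error(x_col, y_perm))
    # fraction of permuted runs scoring at least as well as the real one
    p_value = sum(e <= real_err for e in null_errs) / n_perm
    return real_err, null_errs, p_value

def stump_error(x_col, y):
    """Toy stand-in for a model: predict y directly from a 0/1 x."""
    return sum(a != b for a, b in zip(x_col, y)) / len(y)
```

If the real error sits inside the family of permuted-target errors, the apparent model quality is indistinguishable from overfitting noise.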
Battery MVI
Designed to check the impact of missing-value indicators (MVIs) on model performance:
Build a standard model – missing values are handled via the mechanism of surrogates
Build a model using MVIs only – measures the amount of predictive information contained in the missing-value indicators alone
Combine the two
Same as above, with missing-value penalties turned on
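Constructing the MVIs themselves is straightforward: one 0/1 flag per predictor. A minimal sketch (`add_mvis` and the `_mvi` suffix are illustrative conventions, not CART's):

```python
def add_mvis(records, predictors):
    """Augment each record (a dict) with 0/1 missing-value
    indicators, one per predictor, so a model can be built on the
    indicators alone or together with the original variables."""
    out = []
    for rec in records:
        new = dict(rec)
        for p in predictors:
            # None stands in for a missing value in this sketch
            new[p + "_mvi"] = 1 if rec.get(p) is None else 0
        out.append(new)
    return out
```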
Battery PRIOR
Vary the prior for the specified class within a given range
For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
Results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
Can be used for hot-spot detection
Can also be used to construct a very efficient voting ensemble of trees
This battery is covered at greater length in our Advanced CART class
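One common way to emulate a prior is through case weights that force each class's total weight to match its prior; the sketch below illustrates the idea and the 0.2-to-0.8 grid, but is not necessarily how CART implements priors internally (`prior_weights` and `prior_battery` are hypothetical names):

```python
def prior_weights(labels, prior1):
    """Convert a prior for class 1 (labels are 0/1) into per-case
    weights so that each class's total weight matches its prior."""
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    w0 = (1.0 - prior1) * n / n0  # weight for each class-0 case
    w1 = prior1 * n / n1          # weight for each class-1 case
    return [w1 if y == 1 else w0 for y in labels]

def prior_battery(labels, start=0.2, stop=0.8, step=0.02):
    """Grid of priors, as in the example: 0.2 to 0.8 by 0.02."""
    grid, p = [], start
    while p <= stop + 1e-9:
        grid.append(round(p, 2))
        p += step
    return {p: prior_weights(labels, p) for p in grid}
```

Raising the prior on class 1 up-weights its cases, pushing the tree toward nodes rich in that class at some cost in overall accuracy, which is what makes the battery useful for hot-spot detection.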
Battery SHAVING
Automates the variable selection process
Shaving from the top – at each step the most important variable is eliminated
Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
Shaving from the bottom – at each step the least important variable is eliminated
May drastically reduce the number of variables used by the model
SHAVING ERROR – at each iteration, all current variables are tried for elimination one at a time to determine which predictor contributes the least
May result in a very long sequence of runs (quadratic complexity)
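The shaving-from-the-top loop can be sketched as follows. In the real battery a tree is refit and importances are recomputed at every step; here a fixed importance dict stands in for that refit, and `shave_from_top` is an illustrative name:

```python
def shave_from_top(importances, keep=1):
    """Shaving from the top: repeatedly drop the currently most
    important variable and record the surviving variable list at
    each step.  In BATTERY SHAVING a model would be refit after
    every elimination; a static dict stands in for that here."""
    current = dict(importances)
    steps = [sorted(current)]  # variable list before any shaving
    while len(current) > keep:
        top = max(current, key=current.get)  # most important variable
        del current[top]
        steps.append(sorted(current))
    return steps
```

Comparing test error across the steps reveals whether an apparently dominant variable was in fact a model hijacker.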
Battery SAMPLE
Build a series of models in which the learn sample is systematically reduced
Used to determine the impact of learn sample size on model performance
Can be combined with BATTERY DRAW to get additional spread
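In outline, this is a learning-curve loop over shrinking random subsamples; `sample_battery` and the caller-supplied `fit_and_error` are illustrative names, not CART commands:

```python
import random

def sample_battery(data, fractions, fit_and_error, seed=7):
    """BATTERY SAMPLE in miniature: refit on progressively smaller
    random learn samples and record the result at each size."""
    rng = random.Random(seed)
    results = []
    for frac in fractions:
        k = max(1, int(len(data) * frac))
        learn = rng.sample(data, k)  # random learn subsample
        results.append((frac, k, fit_and_error(learn)))
    return results
```

Plotting error against sample size shows whether the model would still benefit from more data.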
Battery TARGET
Attempt to build a model for each variable in the given list as the target, using the remaining variables as predictors
Very effective in capturing inter-variable dependency patterns
Can be used as an input to variable clustering and multi-dimensional scaling approaches
When run on MVIs only, can reveal missing-value patterns in the data
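The rotation of targets is a simple loop; the sketch below assumes a caller-supplied `fit_and_score` function in place of an actual CART run (`target_battery` is a hypothetical name):

```python
def target_battery(records, variables, fit_and_score):
    """BATTERY TARGET in outline: take each variable in turn as
    the target and the rest as predictors, collecting a score that
    reflects how predictable the variable is from the others."""
    scores = {}
    for target in variables:
        predictors = [v for v in variables if v != target]
        scores[target] = fit_and_score(records, target, predictors)
    return scores
```

Variables that are highly predictable from the rest cluster together, which is what makes the battery useful as input to variable clustering.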
Stable Trees and Consistency
Notes on Tree Stability
Trees are usually stable near the top and become progressively more unstable as the nodes become smaller
Therefore, smaller trees in the pruning sequence tend to be more stable
Node instability usually manifests itself when comparing node class distributions between the learn and test samples
Ideally, we want to identify the largest stable tree in the pruning sequence
The optimal tree suggested by CART may contain a few unstable nodes that are structurally needed to optimize overall accuracy
The Train-Test Consistency (TTC) feature was designed to address this topic at length
MJ2000 Run
The dataset comes from an undisclosed financial institution
Set up a default classification run with a 50% random test sample
CART determines a 36-node optimal tree
Node Stability
A quick look at the learn and test node distributions reveals a large number of unstable nodes
Let’s use the TTC feature by pressing the [T/T Consist] button to identify the largest stable tree
Terminal Nodes TTC Report
The Largest Stable Tree
A tree with 11 terminal nodes is identified as stable (all nodes match on both direction and rank)
While within the specified tolerance, nodes 2, 6, and 8 are somewhat unsatisfactory in terms of rank order
These nodes would emerge as rank-order unstable if we lowered the tolerance level to 0.5
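The rank-order part of the TTC check can be illustrated in miniature: order the terminal nodes by their learn-sample event rate and ask whether the test-sample rates preserve that order. This is a sketch of the idea, not the actual TTC computation, and `rank_consistent` is a hypothetical name:

```python
def rank_consistent(learn_rates, test_rates):
    """Rank-order stability in miniature: terminal nodes are
    ordered by their learn-sample event rate, and the tree is
    called rank-stable if the test-sample rates of the same nodes
    are non-decreasing in that order."""
    order = sorted(range(len(learn_rates)), key=lambda i: learn_rates[i])
    test_in_order = [test_rates[i] for i in order]
    return all(a <= b for a, b in zip(test_in_order, test_in_order[1:]))
```

A tolerance parameter, as in the TTC report, would relax the pairwise comparison so that small rank swaps between nearly identical nodes are not flagged.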
TTC Report
Improving Predictor List
Run BATTERY SHAVING to systematically eliminate the least important predictors
The resulting optimal tree has even better accuracy than before
It also uses a much smaller set of only 6 predictors
Let’s study the tree stability of the new model
Predictor Elimination by Shaving
Comparing the Results
The tree on the left is the original (unshaved) one; the tree on the right is the result of shaving
The new tree is far more stable than the original, except for node 8
Original Tree New Tree
Justification
Node 8 is irrelevant for classification, but its presence is required in order to allow a very strong split below it
Learn Sample Test Sample
Recommended Reading
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140
Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64
Friedman, J., T. Hastie and R. Tibshirani (1998), Additive Logistic Regression: A Statistical View of Boosting, Stanford University technical report