Introduction to CART® Salford Systems ©2007
Data Mining with Decision Trees: An Introduction to CART®
Dan Steinberg, Mikhail Golovnya
[email protected]
http://www.salford-systems.com
Intro CART Outline
- Historical background
- Sample CART model
- Binary recursive partitioning
- Fundamentals of tree pruning
- Competitors and surrogates
- Missing value handling
- Variable importance
- Penalties and structured trees
- Introduction to splitting rules
- Constrained trees
- Introduction to priors and costs
- Cross-validation
- Stable trees and train-test consistency (TTC)
- Hot-spot analysis
- Model scoring and translation
- Batteries of CART runs

Historical Background
In The Beginning…
In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
- Could separate relevant from irrelevant predictors
- Did not require any kind of variable transforms (logs, square roots)
- Was impervious to outliers and missing values
- Could yield relatively simple and easy-to-comprehend models
- Required little to no supervision by the analyst
- Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools
The Years of Struggle
CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons:
- The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
- The method was not expounded in any textbooks
- It was originally taught only in advanced graduate statistics classes at a handful of leading universities
- The original standalone software came with slender documentation, and output was not self-explanatory
- The method was a radical departure from conventional statistics
The Final Triumph
Rising interest in data mining:
- Availability of huge data sets requiring analysis
- Need to automate, or accelerate and improve, the analysis process
Comparative performance studies showed advantages of CART over other tree methods:
- Handling of missing values
- Assistance in interpretation of results (surrogates)
- Performance advantages: speed, accuracy
New software and documentation made the techniques accessible to end users:
- Word of mouth generated by early adopters
- Easy-to-use professional software available from Salford Systems
Sample CART Model

So What is CART?
Best illustrated with a famous example: the UCSD Heart Disease study.
Given the diagnosis of a heart attack based on:
- Chest pain
- Indicative EKGs
- Elevation of enzymes typically released by damaged heart muscle
Predict who is at risk of a second heart attack and early death within 30 days. The prediction will determine the treatment program (intensive care or not).
For each patient about 100 variables were available, including demographics, medical history, and lab results; 19 noninvasive variables were used in the analysis.
Typical CART Solution
Example of a CLASSIFICATION tree. The dependent variable is categorical (SURVIVE, DIE); we want to predict class membership.

Root: PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Split: Is BP <= 91?
- Yes (BP <= 91): Terminal Node A — SURVIVE 6 (30.0%), DEAD 14 (70.0%); NODE = DEAD
- No (BP > 91): PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%)
  Split: Is AGE <= 62.5?
  - Yes (AGE <= 62.5): Terminal Node B — SURVIVE 102 (98.1%), DEAD 2 (1.9%); NODE = SURVIVE
  - No (AGE > 62.5): PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%)
    Split: Is SINUS <= .5?
    - Yes (SINUS <= .5): Terminal Node D — SURVIVE 56 (88.9%), DEAD 7 (11.1%); NODE = SURVIVE
    - No (SINUS > .5): Terminal Node C — SURVIVE 14 (50.0%), DEAD 14 (50.0%); NODE = DEAD
How to Read It
- The entire tree represents a complete analysis or model
- It has the form of a decision tree
- The root of the inverted tree contains all data
- The root gives rise to child nodes
- Child nodes can in turn give rise to their own children
- At some point a given path ends in a terminal node
- A terminal node classifies the object
- The path through the tree is governed by the answers to QUESTIONS or RULES
Decision Questions
A question is of the form: is the statement TRUE?
- Is continuous variable X ≤ c?
- Does categorical variable D take on levels i, j, or k? (e.g., is geographic region 1, 2, 4, or 7?)
Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right. This is the form of all primary splits.
The question is formulated so that only two answers are possible; this is called binary partitioning. In CART the YES answer always goes left.
The Tree is a Classifier
- Classification is determined by following a case's path down the tree
- Terminal nodes are associated with a single class
- Any case arriving at a terminal node is assigned to that class
- The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
- Other algorithms often use only the majority rule, or a limited set of fixed rules, for node class assignment and are thus less flexible
- In standard classification, tree assignment is not probabilistic
- Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
- With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
The Importance of Binary Splits
Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate). There are strong theoretical and practical reasons why we argue against such approaches:
- Exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
- Choosing the best binary split is less ambiguous than the suggested alternatives
- Most importantly, binary splits result in the smallest rate of data fragmentation, allowing a larger set of predictors to enter the model
- Any multi-way split can be replaced by a sequence of binary splits
There are numerous practical illustrations of the validity of the above claims.
Accuracy of a Tree
For classification trees, all objects reaching a terminal node are classified in the same way; e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS. The classification is the same regardless of other medical history and lab results.
The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. confusion matrix). The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number:
- Overall accuracy
- Average within-class accuracy
- Most generally, the expected cost
Prediction Success Table

TRUE Class               Classified "Survivors"   Classified "Early Deaths"   Total   % Correct
Class 1 "Survivors"             158                        20                  178      88%
Class 2 "Early Deaths"            9                        28                   37      75%
Total                           167                        48                  215      86.51%

A large number of different terms exists to name various simple ratios based on the table above: sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
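The ratios listed above can all be computed directly from the four counts in the table. A minimal sketch in Python, using the slide's heart-attack numbers (the metric names follow common usage and are not CART-specific output):

```python
# Summary ratios from a 2x2 prediction success (confusion) table.
# Counts come from the slide: 158/20 for survivors, 9/28 for early deaths.

def confusion_metrics(tp, fn, fp, tn):
    """tp: class-1 correctly classified, fn: class-1 missed,
    fp: class-2 called class-1, tn: class-2 correctly classified."""
    return {
        "sensitivity": tp / (tp + fn),          # class-1 hit rate (recall)
        "specificity": tn / (tn + fp),          # class-2 hit rate
        "overall_accuracy": (tp + tn) / (tp + fn + fp + tn),
        "average_within_class": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

m = confusion_metrics(tp=158, fn=20, fp=9, tn=28)
print(round(m["overall_accuracy"], 4))  # 186/215 -> 0.8651, the 86.51% in the table
```

Note that overall accuracy (86.51%) and average within-class accuracy differ whenever the classes are unbalanced, which is exactly why the slide stresses the choice of summary number.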
Tree Interpretation and Use
Practical decision tool:
- Trees like this are used for real-world decision making
- Rules for physicians
- Decision tool for a nationwide team of salesmen needing to classify potential customers
Selection of the most important prognostic variables:
- Variable screen for parametric model building
- The example data set had 100 variables
- Use the results to find variables to include in a logistic regression
Insight: in our example AGE is not relevant if BP is not high; this suggests which interactions will be important.
Introducing the CART Interface
Classification with CART®
A real-world study from the early 1990s: a fixed-line service provider offering a new mobile phone service wants to identify the customers most likely to accept the new mobile offer. The data set is based on a limited market trial.
Real World Trial
- 830 households were offered a mobile phone package
- All were offered an identical package, but pricing was varied at random
- Handset prices ranged from low to high values
- Per-minute prices ranged separately from low to high rates
- Each household was asked to make a yes or no decision on the offer
- 15.2% of the households accepted the offer
Our goal was to answer two key questions: who to make offers to, and how to price?
Setting Up the Model
The only requirement is to select the TARGET (dependent) variable. CART will do everything else automatically.
Model Results
CART completes the analysis and gives access to all results from the NAVIGATOR, shown on the next slide:
- The upper section displays a tree of the selected size
- The lower section displays the error rate for trees of all possible sizes
- A green bar marks the most accurate tree
- We display a compact 10-node tree for further scrutiny
Navigator Display
The most accurate tree is marked with the green bar. We select the 10-node tree for the convenience of a more compact display.
Root Node
- The root node displays details of the TARGET variable in the overall training data
- We see that 15.2% of the 830 households accepted the offer
- The goal of the analysis is now to extract patterns characteristic of responders
- If we could use only a single piece of information to separate responders from non-responders, CART chooses HANDSET PRICE
- Those offered the phone at a price > 130 contain only 9.9% responders
- Those offered a lower price respond at 21.9%
Maximal Tree
- The maximal tree is the raw material for the best model
- The goal is to find the optimal tree embedded inside the maximal tree
- We will find the optimal tree via "pruning", much like backwards stepwise regression
Main Drivers
(Red = good response, blue = poor response.) High values of a split variable always go to the right; low values go left.
Examine Individual Node
- Hover the mouse over a node to see inside
- Even though this node is on the "high price" side of the tree, it still exhibits the strongest response across all terminal node segments (43.5% response)
- Here we select rules expressed in C
- The entire tree can also be rendered in Java, XML/PMML, or SAS
Configure Print Image
A variety of fine controls to position and scale the image nicely.
Variable Importance Ranking
CART offers multiple ways to compute variable importance.
Predictive Accuracy
This model is not very accurate overall, but it ranks responders well.
Gains and ROC Curves
In the top decile the model captures about 40% of responders.
The Big Picture
Binary Recursive Partitioning
- Data is split into two partitions: thus "binary" partitioning
- Partitions can in turn be split into sub-partitions: hence the procedure is recursive
- A CART tree is generated by repeated dynamic partitioning of the data set
- CART automatically determines which variables to use and which regions to focus on
- The whole process is entirely data driven
Three Major Stages
1. Tree growing: consider all possible splits of a node; find the best split according to the given splitting rule (best improvement); continue sequentially until the largest tree is grown
2. Tree pruning: create a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches
3. Optimal tree selection: use a test sample or cross-validation to find the best tree in the pruning sequence
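The pruning stage can be sketched numerically. CART's pruning is cost-complexity ("weakest link") pruning: for each internal node, compute how much training cost rises per leaf removed if the branch below it is collapsed, and prune the branch with the smallest such value first. The node layout below is a hypothetical minimal structure for illustration, not CART's internals:

```python
# Weakest-link pruning sketch:
# alpha(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1): the per-leaf increase
# in training misclassification cost if the branch below node t is collapsed.
# Each node is a dict with training cost "R" and children "kids".

def leaves(node):
    return [node] if not node["kids"] else [l for k in node["kids"] for l in leaves(k)]

def weakest_link_alpha(node):
    branch_cost = sum(l["R"] for l in leaves(node))   # R(T_t): cost of the branch's leaves
    return (node["R"] - branch_cost) / (len(leaves(node)) - 1)

# Tiny branch: collapsing it raises cost from 0.05 + 0.02 to 0.10 over 1 extra leaf
t = {"R": 0.10, "kids": [{"R": 0.05, "kids": []}, {"R": 0.02, "kids": []}]}
print(round(weakest_link_alpha(t), 6))  # 0.03 per pruned leaf
```

Repeating this (prune the node with the smallest alpha, recompute, prune again) generates exactly the nested pruning sequence described in stage 2.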
General Workflow
[Diagram] Historical data is divided into Learn, Test, and Validate samples. Stage 1: build a sequence of nested trees on the Learn sample. Stage 2: monitor performance on the Test sample to pick the best tree. Stage 3: confirm the findings on the Validate sample.
Searching All Possible Splits
- For any node CART will examine ALL possible splits; this is computationally intensive, but there are only a finite number of splits
- Consider first the variable AGE, which in our data set has minimum value 40
- The rule "Is AGE ≤ 40?" will separate out these cases to the left: the youngest persons
- Next, increase the AGE threshold to the next youngest person: "Is AGE ≤ 43?" This will direct six cases to the left
- Continue increasing the splitting threshold value by value
Split Tables
Split tables are used to locate continuous splits. There will be as many splits as there are unique values minus one.

Sorted by AGE:
AGE  BP   SINUST  SURVIVE
40   91   0       SURVIVE
40   110  0       SURVIVE
40   83   1       DEAD
43   99   0       SURVIVE
43   78   1       DEAD
43   135  0       SURVIVE
45   120  0       SURVIVE
48   119  1       DEAD
48   122  0       SURVIVE
49   150  0       DEAD
49   110  1       SURVIVE

Sorted by Blood Pressure:
AGE  BP   SINUST  SURVIVE
43   78   1       DEAD
40   83   1       DEAD
40   91   0       SURVIVE
43   99   0       SURVIVE
40   110  0       SURVIVE
49   110  1       SURVIVE
48   119  1       DEAD
45   120  0       SURVIVE
48   122  0       SURVIVE
43   135  0       SURVIVE
49   150  0       DEAD
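The "unique values minus one" rule is easy to verify in code. A small sketch that enumerates candidate thresholds for the AGE column above (placing thresholds at midpoints is one common convention; a "≤ value" form works equally well):

```python
# Enumerate candidate split thresholds for one continuous predictor.
# There are (number of unique values - 1) candidates, as the slide states.

ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]  # AGE column from the split table

def candidate_thresholds(values):
    u = sorted(set(values))
    # one threshold between each pair of adjacent unique values
    return [(a + b) / 2 for a, b in zip(u, u[1:])]

print(candidate_thresholds(ages))  # [41.5, 44.0, 46.5, 48.5]: 5 unique values, 4 splits
```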
Categorical Predictors
Categorical predictors spawn many more splits. For example, just four levels A, B, C, D result in 7 splits; in general, K levels yield 2^(K-1) - 1 splits. CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15.

     Left    Right
1    A       B, C, D
2    B       A, C, D
3    C       A, B, D
4    D       A, B, C
5    A, B    C, D
6    A, C    B, D
7    A, D    B, C
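The 2^(K-1) - 1 count comes from the fact that swapping the left and right sides describes the same split. A short sketch that enumerates the distinct subsets, reproducing the 7-row table above for K = 4:

```python
# All distinct left/right assignments of K categorical levels.
# Complement pairs describe the same split, so there are 2**(K-1) - 1 of them.

from itertools import combinations

def categorical_splits(levels):
    levels = sorted(levels)
    anchor = levels[0]                 # fix one level on the left to avoid mirror duplicates
    rest = levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(levels) - left
            if right:                  # a split must send something to each side
                splits.append((left, right))
    return splits

s = categorical_splits(["A", "B", "C", "D"])
print(len(s))  # 7 == 2**(4-1) - 1
```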
Binary Target Shortcut
When the target is binary, special algorithms reduce compute time to linear in the number of levels (K-1 candidate splits instead of 2^(K-1) - 1): 30 levels takes twice as long as 15, not 10,000+ times as long.
When you have high-level categorical variables in a multi-class problem:
- Create a binary classification problem
- Try different definitions of DPV (which groups to combine)
- Explore the predictor groupings produced by CART
- From a study of all results, decide which are most informative
- Create a new grouped predictor variable for the multi-class problem
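The shortcut rests on a classic result: with a binary target, ordering the levels by their class-1 rate means the best subset split is always one of the K-1 splits along that ordering. A sketch, with a hypothetical input layout of per-level counts:

```python
# Binary-target shortcut: order the K levels by their class-1 rate,
# then only the K-1 splits along that ordering need to be scored
# instead of all 2**(K-1) - 1 subsets.

def ordered_candidate_splits(level_stats):
    """level_stats: {level: (n_class1, n_total)} -- hypothetical input layout."""
    ranked = sorted(level_stats, key=lambda lv: level_stats[lv][0] / level_stats[lv][1])
    # candidate i sends the i lowest-rate levels left, the rest right
    return [(set(ranked[:i]), set(ranked[i:])) for i in range(1, len(ranked))]

# Hypothetical counts: class-1 rates are C=0.025, A=0.10, B=0.50, D=0.80
stats = {"A": (5, 50), "B": (30, 60), "C": (1, 40), "D": (20, 25)}
for left, right in ordered_candidate_splits(stats):
    print(sorted(left), "|", sorted(right))  # 3 candidates for K=4, not 7
```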
Pruning and Variable Importance
GYM Model Example
- The original data comes from a financial institution and is disguised as a health club
- Problem: we need to understand a market research clustering scheme
- Clusters were created externally using 18 variables and conventional clustering software
- We need to find simple rules to describe cluster membership
- A CART tree provides a nice way to arrive at an intuitive story
Variable Definitions
ID        Identification # of member (not used)
CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
Predictors:
ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONRCT     Number of on-peak racquet ball uses
ANYPOOL   Pool usage (binary indicator coded 0, 1)
ONPOOL    Number of on-peak pool uses
PLRQTPCT  Percent of pool and racquet ball usage
TPLRCT    Total number of pool and racquet ball uses
ONAER     Number of on-peak aerobics classes attended
OFFAER    Number of off-peak aerobics classes attended
SAERDIF   Difference between number of on- and off-peak aerobics visits
TANNING   Number of visits to tanning salon
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
NSUPPS    Number of supplements/vitamins/frozen dinners purchased
SMALLBUS  Small business discount (binary indicator coded 0, 1)
OFFER     Terms of offer
IPAKPRIC  Index variable for package price
MONFEE    Monthly fee paid
FIT       Fitness score
NFAMMEN   Number of family members
HOME      Home ownership (binary indicator coded 0, 1)
Things to Do
- Dataset: gymsmall.syd
- Set the target and predictors
- Set a 50% test sample
- Navigate the largest, smallest, and optimal trees
- Check [Splitters] and [Tree Details]
- Observe the logic of pruning
- Check [Summary Reports]
- Understand competitors and surrogates
- Understand variable importance
Pruning Multiple Nodes
- Note that Terminal Node 5 has class assignment 1
- All the remaining nodes have class assignment 2
- Pruning merges Terminal Nodes 5 and 6, creating a "chain reaction" of redundant split removal
Pruning Weaker Split First
- Eliminating nodes 3 and 4 results in the "lesser" damage: losing one correct dominant-class assignment
- Eliminating nodes 1 and 2, or nodes 5 and 6, would result in losing one correct minority-class assignment
Understanding Variable Importance
- Both FIT and CLASSES are reported as equally important
- Yet the tree only includes CLASSES
- ONPOOL is one of the main splitters, yet it has very low importance
- We want to understand why
Competitors and Surrogates
- Node details for the root node split are shown above
- Improvement measures the class separation between the two sides
- Competitors are the next best splits in the given node, ordered by improvement values
- Surrogates are the splits most resembling (case by case) the main split
- Association measures the degree of resemblance
- Looking for competitors is free; finding surrogates is expensive
- It is possible to have an empty list of surrogates
- FIT is a perfect surrogate (and competitor) for CLASSES
The Utility of Surrogates
- Suppose the main splitter is INCOME and comes from a survey form
- People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution
- The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case)
- Using a surrogate helps resolve the ambiguity and redistribute the missing-income records between the two sides
- For example, all low-education subjects will join the low-income side while all high-education subjects will join the high-income side
Competitors and Surrogates Are Different
Consider three nominal classes A, B, C, equally present, with equal priors, unit costs, etc. Split X sends classes A and B left and C right; Split Y sends A and C left and B right.
- By symmetry, both splits result in identical improvements
- Yet one split cannot be substituted for the other
- Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
- The reverse does not hold: a perfect surrogate is always a perfect competitor
Calculating Association
- Ten observations in a node; the main splitter is X
- Split Y mismatches the main splitter on 2 observations
- The default rule sends all cases to the dominant side of the main split, mismatching 3 observations

Association = (Default - Split Y) / Default = (3 - 2) / 3 = 1/3
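The association formula above is straightforward to implement. A sketch that reproduces the slide's 10-observation example (splits are represented simply as which side each case goes to; split reversal, mentioned on the next slide, is omitted for clarity):

```python
# Association of a surrogate with the main splitter:
# association = (default_mismatches - surrogate_mismatches) / default_mismatches

def association(main_goes_left, surrogate_goes_left):
    mismatch = sum(m != s for m, s in zip(main_goes_left, surrogate_goes_left))
    # default rule: send every case to the main splitter's dominant side
    dominant = sum(main_goes_left) >= len(main_goes_left) / 2
    default_mismatch = sum(m != dominant for m in main_goes_left)
    return (default_mismatch - mismatch) / default_mismatch

# Slide's example: 10 cases, 7 go left under main split X;
# surrogate Y disagrees on 2 of them, the default rule misses 3.
X = [True] * 7 + [False] * 3
Y = [True] * 5 + [False] * 2 + [False] * 3
print(round(association(X, Y), 4))  # (3 - 2) / 3 -> 0.3333
```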
Notes on Association
- The measure is local to a node and non-parametric in nature
- Association is negative for splits with very poor resemblance; it is not limited by -1 from below
- Any positive association is useful (a better strategy than the default rule)
- Hence surrogates are defined as the splits with the highest positive association
- In calculating case mismatch, CART checks the possibility of split reversal (if the rule is met, send cases to the right)
- CART marks straight splits as "s" and reverse splits as "r"
Calculating Variable Importance
- List all main splitters as well as all surrogates, along with the resulting improvements
- Aggregate the list by variable, accumulating improvements
- Sort the result in descending order
- Divide each entry by the largest value and scale to 100%
- The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change
- Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters
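The recipe above can be sketched in a few lines. The split list here is hypothetical, constructed to mirror the GYM example where FIT never wins a node outright but is a strong surrogate everywhere:

```python
# Variable importance per the recipe above: sum improvements over every
# node where a variable acts as primary splitter or surrogate, then
# rescale so the top variable reads 100%.

def variable_importance(splits):
    """splits: list of (variable, improvement) covering primaries and surrogates."""
    totals = {}
    for var, imp in splits:
        totals[var] = totals.get(var, 0.0) + imp
    top = max(totals.values())
    return {v: round(100.0 * t / top, 1)
            for v, t in sorted(totals.items(), key=lambda kv: -kv[1])}

# Hypothetical tree: FIT is a perfect surrogate in the root, ONPOOL splits
# a small deep node -- so FIT ranks high while ONPOOL ranks low.
splits = [("CLASSES", 0.40), ("FIT", 0.40),     # root: primary + surrogate
          ("ONPOOL", 0.05), ("CLASSES", 0.10)]  # deeper nodes
print(variable_importance(splits))
```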
Mystery Solved
- In our example, improvements in the root node dominate the whole tree
- Consequently, the variable importance reflects the surrogates table in the root node

Variable Importance Caution
- Importance is a function of the OVERALL tree, including the deepest nodes
- Suppose you grow a large exploratory tree and review the importances
- Then you find an optimal tree via test set or CV, yielding a smaller tree
- The optimal tree is the SAME as the exploratory tree in the top nodes
- YET the importances might be quite different. WHY? Because the larger tree uses more nodes to compute the importance
- When comparing results, be sure to compare similar or same-sized trees
Splitting Rules and Friends
Splitting Rules - Introduction
- The splitting rule controls the tree-growing stage
- At each node, a finite pool of all possible splits exists
- The pool is independent of the target variable; it is based solely on the observations (and corresponding values of predictors) in the node
- The task of the splitting rule is to rank all splits in the pool according to their improvements
- The winner becomes the main splitter; the top runners-up become competitors
- It is possible to construct different splitting rules emphasizing various approaches to the definition of the "best split"
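As a concrete instance of an improvement score, here is the textbook Gini criterion evaluated on the heart-attack root split: the drop in impurity, weighting each child by its share of cases. This is a simplified sketch; CART additionally folds priors and costs into the formula, as later slides discuss:

```python
# Improvement of a candidate split under the (unweighted) Gini rule.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_improvement(parent, left, right):
    n = sum(parent)
    p_l, p_r = sum(left) / n, sum(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# Root split "Is BP <= 91?" from the heart-attack tree:
# 215 cases [SURVIVE, DEAD] -> 20 on the left, 195 on the right
print(round(gini_improvement([178, 37], [6, 14], [172, 23]), 4))
```

Ranking every split in the node's pool by this number and taking the winner is exactly the "main splitter" selection described above.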
Splitting Rules Control
- Selected in the "Method" tab of the Model Setup dialog
- There are 6 alternative rules for classification and 2 rules for regression
- Since there is no theoretical justification for a single best splitting rule, we recommend trying all of them and picking the best resulting model
Further Insights
- On binary classification problems, GINI and TWOING are equivalent; therefore, we recommend setting the "Favor Even Splits" control to 1 when TWOING is used
- GINI and Symmetric GINI differ only when an asymmetric cost matrix is used, and are identical otherwise
- Ordered TWOING is a restricted version of TWOING for multinomial targets coded as a set of contiguous integers
- Class Probability is identical to GINI during tree growing but differs during pruning
Battery RULES
- This battery was designed to automatically try all six alternative splitting rules
- According to the output above, the Entropy splitting rule resulted in the most accurate model on the given data
- Individual model navigators are also available
Controlled Tree Growth
CART was originally conceived to be fully automated, protecting the analyst from making erroneous split decisions. In some cases, however, it can be desirable to influence tree evolution to address specific modeling needs. The modern CART engine allows multiple direct and indirect ways of accomplishing this:
- Penalties on variables, missing values, and categories
- Controlling which variables are allowed or disallowed at different levels in the tree, and node sizes
- Direct forcing of splits at the top two levels of the tree
Penalizing Individual Variables
- We use the variable penalty control to penalize CLASSES a little bit
- This is enough to secure FIT the main splitter position
- The improvement associated with CLASSES is now downgraded by a factor of one minus the penalty
Penalizing Missing and Categorical Predictors
- Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage
- Introducing a systematic penalty set to the proportion of the variable missing effectively removes the possible bias
- The same holds for categorical variables with a large number of distinct levels
- We recommend that the user always set both controls to 1
Forcing Splits
CART allows direct forcing of the split variable (and possibly the split value) in the top three nodes of the tree.
Notes on Forcing Splits
- It is not guaranteed that a forced split below the root node level will be preserved (because of pruning)
- Do not simply overturn "illogical" automatic splits by forcing something else; look for an explanation of why CART makes a "wrong" choice, as it could indicate a problem with your data
Constrained Trees
- Many predictive models can benefit from Salford's patent-pending "structured trees"
- These are trees constrained in how they are grown, to reflect decision support requirements
- In the mobile phone example: we want the tree to first segment on customer characteristics and then complete using price variables
- Price variables are under the control of the company; customer characteristics are not
Setting Up Constraints
Green indicates where in the tree the variables of each group are allowed to appear.
Constrained Tree
- Demographic and spend information at the top of the tree
- Handset (HANDPRIC) and per-minute pricing (USEPRICE) at the bottom

Prior Probabilities
- An essential component of any CART analysis
- A prior probability distribution for the classes is needed to do proper splitting, since prior probabilities are used to calculate the improvements in the Gini formula
- Example: suppose the data contains 99% type A (nonresponders) and 1% type B (responders)
- CART might focus on not missing any class A objects; one "solution" is to classify all objects as nonresponders
- Advantages: only a 1% error rate, and very simple predictive rules, although not informative
- But realize this could be an optimal rule
Priors EQUAL
- The most common priors: PRIORS EQUAL
- This gives each class equal weight regardless of its frequency in the data
- It prevents CART from favoring the more prevalent classes
- With equal priors, prevalence is measured as a percent of the class's OWN size: measurements are relative to the class, not the entire learning set

Example: suppose we have 900 class A and 100 class B, and the root splits into two children:
- Left child: A: 600, B: 90 (2/3 of all A, 9/10 of all B), so class as B
- Right child: A: 300, B: 10 (1/3 of all A, 1/10 of all B), so class as A
With PRIORS DATA both child nodes would be class A; equal priors puts both classes on an equal footing.
Class Assignment Formula I
For equal priors, a node t is classified as class 1 if

    N1(t) / N2(t)  >  N1 / N2

where Ni = number of cases in class i at the root node, and Ni(t) = number of cases in class i at node t.
Classify any node by the count ratio in the node relative to the count ratio in the root: classify by which class has the greatest relative richness.
Class Assignment Formula II
When priors are not equal, classify a node as class 1 if

    π1 N1(t) / (π2 N2(t))  >  N1 / N2

where πi is the prior for class i. Since π1/π2 always appears as a ratio, just think in terms of boosting the amount Ni(t) by a factor, e.g. 9:1, 100:1, 2:1.
- If priors are EQUAL, the πi terms cancel out
- If priors are DATA, then π1/π2 = N1/N2, which simplifies the formula to the plurality rule (simple counting): classify as class 1 if N1(t) > N2(t)
- If priors are neither EQUAL nor DATA, the formula provides a boost to the up-weighted classes
Example
Root node: A: 900, B: 100. The root splits into:
- Left child: A: 600, B: 90. Test: 600/90 < 900/100, so class as B
- Right child: A: 300, B: 10. Test: 300/10 > 900/100, so class as A
In general the tests would be weighted by the appropriate priors; the tests can be written with πA/πB or πB/πA as convenient.
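The class assignment formula and the example can be checked in a few lines of code (cross-multiplying avoids division; class labels 1 = A, 2 = B):

```python
# Node class assignment with priors, per Formula II:
# a node t is labeled class 1 when pi1*N1(t) / (pi2*N2(t)) > N1/N2,
# where N1, N2 are the root-node class counts.

def assign(priors, node_counts, root_counts):
    (p1, p2), (n1t, n2t), (n1, n2) = priors, node_counts, root_counts
    return 1 if p1 * n1t * n2 > p2 * n2t * n1 else 2  # cross-multiplied form

root = (900, 100)                            # class A = 1, class B = 2
equal = (0.5, 0.5)
print(assign(equal, (600, 90), root))        # 600/90 < 900/100 -> class 2 (B)
print(assign(equal, (300, 10), root))        # 300/10 > 900/100 -> class 1 (A)
data_priors = (0.9, 0.1)                     # priors DATA reduce to plurality
print(assign(data_priors, (600, 90), root))  # 600 > 90 -> class 1 (A)
```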
Cross-Validation
- Cross-validation is a relatively recent, computer-intensive development in statistics
- Its purpose is to protect against overfitting errors
- We don't want to capitalize on chance by tracking idiosyncrasies of this data: idiosyncrasies which will NOT be observed on fresh data
- Ideally we would like to use large test data sets to evaluate trees, N > 5000
- Practically, some studies don't have sufficient data to spare for testing
- Cross-validation uses the SAME data for learning and for testing
10-fold CV: Industry Standard
- Begin by growing the maximal tree on ALL the data
- Divide the data into 10 portions, stratified on the dependent variable's levels
- Reserve the first portion for test; grow a new tree on the remaining 9 portions
- Use the 1/10 test set to measure the error rate for this 9/10-data tree; the error rate is measured for the maximal tree and for all subtrees
- Now rotate out a new 1/10 test data set and grow a new tree on the remaining 9 portions
- Repeat until all 10 portions of the data have been used as test sets
The CV Procedure
[Diagram: portions 1-10 of the data; in each of the 10 rotations a different portion serves as Test while the remaining 9 serve as Learn, and so on.]
CV Details
- Every observation is used as a test case exactly once
- In 10-fold CV each observation is used as a learning case 9 times
- In 10-fold CV, 10 auxiliary CART trees are grown in addition to the initial tree
- When all 10 CV trees are done, the error rates are cumulated (summed)
- Summing of error rates is done by TREE complexity
- Summed error (cost) rates are then attributed to the INITIAL tree
- Observe that no two of these trees need be the same
- Results are subject to random fluctuations, sometimes severe
CV Replication and Stability
- Look at the separate CV tree results of a single CV analysis
- CART produces a summary of the results for each tree at the end of the output
- The table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree: root (TOP), left child of root (LEFT), and right child of root (RIGHT)
- We want to know how stable the results are: do different variables split the root node in different trees?
- The only difference between CV trees is the random elimination of 1/10 of the data
- A special battery called CVR exists to study the effects of a repeated cross-validation process
Battery CVR
Regression Trees
General Comments
- Regression trees produce piecewise-constant models (a multi-dimensional staircase) on an orthogonal partition of the data space
- They are thus usually not the best possible performer in terms of conventional regression loss functions
- Only a very limited number of controls is available to influence the modeling process
- Priors and costs are no longer applicable
- There are two splitting rules: LS (least squares) and LAD (least absolute deviation)
- Regression trees are very powerful at capturing high-order interactions but somewhat weak at explaining simple main effects
Boston Housing Data Set
Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand for Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978.
506 census tracts in the City of Boston for the year 1970. Goal: study the relationship between quality-of-life variables and property values.
MV     median value of owner-occupied homes in tract ('000s)
CRIM   per capita crime rate
NOX    concentration of nitrogen oxides (pphm)
AGE    percent built before 1940
DIS    weighted distance to centers of employment
RM     average number of rooms per house
LSTAT  percent neighborhood 'lower SES'
RAD    accessibility to radial highways
CHAS   borders Charles River (0/1)
INDUS  percent non-retail business
TAX    tax rate
PT     pupil-teacher ratio
Splitting the Root Node
Improvement is defined as the greatest reduction in the sum of squared errors when a single constant prediction is replaced by two separate constants, one on each side of the split.
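This least-squares improvement is easy to state in code. A minimal sketch with toy target values (not the Boston data):

```python
# Regression-tree (LS) improvement: reduction in the sum of squared
# errors when one constant (the node mean) is replaced by separate
# means on each side of a candidate split.

def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def ls_improvement(left, right):
    return sse(left + right) - sse(left) - sse(right)

# Toy target values: the split cleanly separates low from high responses
left, right = [10.0, 12.0, 11.0], [30.0, 29.0, 31.0]
print(round(ls_improvement(left, right), 2))  # 545.5 - 2 - 2 = 541.5
```

The split with the largest such reduction wins the node, exactly as the Gini improvement does for classification.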
Regression Tree Model
- All cases in a given node are assigned the same predicted response: the node average of the original target
- Nodes are color-coded according to the predicted response
- We obtain a convenient segmentation of the population according to average response levels
The Best and the Worst Segments
Scoring and Deployment
Notes on Scoring
A tree cannot be expressed as a simple analytical formula; instead, a lengthy collection of rules is needed to express the tree logic
Nonetheless, the scoring operation is very fast once the rules are compiled into machine code
Scoring can be done in two ways:
Internally – by using the SCORE command
Externally – by using the TRANSLATE command to generate C, SAS, Java, or PMML code
A model of any size can be scored or translated
Internal Scoring
Output File Format
CART creates the following output variables:
RESPONSE – predicted class assignment
NODE – terminal node the case falls into
CORRECT – indicates whether the prediction is correct (assuming the actual target is known)
PROB_x – class probabilities (constant within each node)
PATH_x – path traveled by the case through the tree
Case 1 starts in the root node, follows through internal nodes 2, 3, 4, 5, and finally lands in terminal node 1
Case 2 starts in the root node and lands directly in terminal node 7
Finally CART adds all of the original model variables including the target and predictors
CASEID  RESPONSE  NODE  CORRECT  DV  PROB_1    PROB_2    PATH_1  PATH_2  PATH_3  PATH_4  PATH_5  PATH_6  XA  XB  XC
1       0         1     1        0   0.998795  0.001205  1       2       3       4       5       -1      0   0   0
2       1         7     0        0   0.738636  0.261364  1       -7      0       0       0       0       1   0   0
3       0         1     1        0   0.998795  0.001205  1       2       3       4       5       -1      0   0   0
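The PATH_x convention visible in this output (positive entries are visited internal nodes, a negative entry marks the terminal node reached, and trailing zeros are unused slots) can be decoded with a small helper; `decode_path` is a hypothetical name, not a CART facility:

```python
def decode_path(path):
    """Translate a row of PATH_x values into the list of visited
    internal nodes plus the final terminal node.  Positive entries
    are internal nodes, a negative entry is the terminal node, and
    zeros are unused slots after the path has ended."""
    internal, terminal = [], None
    for p in path:
        if p == 0:
            break          # unused slot: the path already ended
        if p < 0:
            terminal = -p  # terminal node reached; path stops here
            break
        internal.append(p)
    return internal, terminal
```

Applied to the rows above, case 1 decodes to internal nodes 1, 2, 3, 4, 5 with terminal node 1, and case 2 decodes to internal node 1 with terminal node 7, matching the narrative description.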
Model Translation
Output Code
CART Automation
Automated CART Runs
Starting with CART 6.0, we have added a new feature called batteries
A battery is a collection of runs obtained via a predefined procedure
The simplest batteries vary one of the key CART parameters:
RULES – run every major splitting rule
ATOM – vary the minimum parent node size
MINCHILD – vary the minimum child node size
DEPTH – vary the tree depth
NODES – vary the maximum number of nodes allowed
FLIP – swap the learn and test samples (two runs only)
More Batteries
DRAW – each run uses a learn sample drawn at random from the original data
CV – vary the number of folds used in cross-validation
CVR – repeat cross-validation, each time with a different random number seed
KEEP – select a given number of predictors at random from the initial larger set; supports a core set of predictors that must always be present
ONEOFF – build a one-predictor model for each predictor in the given list
The remaining batteries are more advanced and will be discussed individually
Battery MCT
The primary purpose is to test for possible model overfitting by running a Monte Carlo simulation:
Randomly permute the target variable
Build a CART tree
Repeat a number of times
The resulting family of profiles measures the intrinsic overfitting due to the learning process
Compare the actual run profile to the family of random profiles to check the significance of the results
This battery is especially useful when the relative error of the original run exceeds 90%
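The steps above amount to a permutation test. A minimal sketch (assuming a caller-supplied `fit_and_error` function in place of an actual CART run; `mct` and `stump_error` are illustrative names):

```python
import random

def mct(x_col, y, fit_and_error, n_perm=20, seed=17):
    """Monte Carlo test in the spirit of BATTERY MCT: permute the
    target, refit, and compare the real run's error against the
    null distribution of errors obtained on permuted targets."""
    rng = random.Random(seed)
    real_err = fit_and_error(x_col, y)
    null_errs = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # destroys any real x-y relationship
        null_errs.append(fit_and_error(x_col, y_perm))
    # fraction of permuted runs scoring at least as well as the real one
    p_value = sum(e <= real_err for e in null_errs) / n_perm
    return real_err, null_errs, p_value

def stump_error(x_col, y):
    """Toy stand-in for a model: predict y directly from a 0/1 x."""
    return sum(a != b for a, b in zip(x_col, y)) / len(y)
```

If the real error sits inside the family of permuted-target errors, the apparent model quality is indistinguishable from overfitting noise.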
Battery MVI
Designed to check the impact of missing-value indicators (MVIs) on model performance:
Build a standard model – missing values are handled via the mechanism of surrogates
Build a model using MVIs only – measures the amount of predictive information contained in the missing-value indicators alone
Combine the two
Same as above, with missing-value penalties turned on
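Constructing the MVIs themselves is straightforward: one 0/1 flag per predictor. A minimal sketch (`add_mvis` and the `_mvi` suffix are illustrative conventions, not CART's):

```python
def add_mvis(records, predictors):
    """Augment each record (a dict) with 0/1 missing-value
    indicators, one per predictor, so a model can be built on the
    indicators alone or together with the original variables."""
    out = []
    for rec in records:
        new = dict(rec)
        for p in predictors:
            # None stands in for a missing value in this sketch
            new[p + "_mvi"] = 1 if rec.get(p) is None else 0
        out.append(new)
    return out
```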
Battery PRIOR
Vary the prior for the specified class within a given range
For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02
Results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)
Can be used for hot-spot detection
Can also be used to construct a very efficient voting ensemble of trees
This battery is covered at greater length in our Advanced CART class
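One common way to emulate a prior is through case weights that force each class's total weight to match its prior; the sketch below illustrates the idea and the 0.2-to-0.8 grid, but is not necessarily how CART implements priors internally (`prior_weights` and `prior_battery` are hypothetical names):

```python
def prior_weights(labels, prior1):
    """Convert a prior for class 1 (labels are 0/1) into per-case
    weights so that each class's total weight matches its prior."""
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    w0 = (1.0 - prior1) * n / n0  # weight for each class-0 case
    w1 = prior1 * n / n1          # weight for each class-1 case
    return [w1 if y == 1 else w0 for y in labels]

def prior_battery(labels, start=0.2, stop=0.8, step=0.02):
    """Grid of priors, as in the example: 0.2 to 0.8 by 0.02."""
    grid, p = [], start
    while p <= stop + 1e-9:
        grid.append(round(p, 2))
        p += step
    return {p: prior_weights(labels, p) for p in grid}
```

Raising the prior on class 1 up-weights its cases, pushing the tree toward nodes rich in that class at some cost in overall accuracy, which is what makes the battery useful for hot-spot detection.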
Battery SHAVING
Automates the variable selection process
Shaving from the top – at each step the most important variable is eliminated
Capable of detecting and eliminating “model hijackers” – variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)
Shaving from the bottom – at each step the least important variable is eliminated
May drastically reduce the number of variables used by the model
SHAVING ERROR – at each iteration, all current variables are tried for elimination one at a time to determine which predictor contributes the least
May result in a very long sequence of runs (quadratic complexity)
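The shaving-from-the-top loop can be sketched as follows. In the real battery a tree is refit and importances are recomputed at every step; here a fixed importance dict stands in for that refit, and `shave_from_top` is an illustrative name:

```python
def shave_from_top(importances, keep=1):
    """Shaving from the top: repeatedly drop the currently most
    important variable and record the surviving variable list at
    each step.  In BATTERY SHAVING a model would be refit after
    every elimination; a static dict stands in for that here."""
    current = dict(importances)
    steps = [sorted(current)]  # variable list before any shaving
    while len(current) > keep:
        top = max(current, key=current.get)  # most important variable
        del current[top]
        steps.append(sorted(current))
    return steps
```

Comparing test error across the steps reveals whether an apparently dominant variable was in fact a model hijacker.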
Battery SAMPLE
Build a series of models in which the learn sample is systematically reduced
Used to determine the impact of learn sample size on model performance
Can be combined with BATTERY DRAW to get additional spread
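In outline, this is a learning-curve loop over shrinking random subsamples; `sample_battery` and the caller-supplied `fit_and_error` are illustrative names, not CART commands:

```python
import random

def sample_battery(data, fractions, fit_and_error, seed=7):
    """BATTERY SAMPLE in miniature: refit on progressively smaller
    random learn samples and record the result at each size."""
    rng = random.Random(seed)
    results = []
    for frac in fractions:
        k = max(1, int(len(data) * frac))
        learn = rng.sample(data, k)  # random learn subsample
        results.append((frac, k, fit_and_error(learn)))
    return results
```

Plotting error against sample size shows whether the model would still benefit from more data.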
Battery TARGET
Attempt to build a model for each variable in the given list as the target, using the remaining variables as predictors
Very effective in capturing inter-variable dependency patterns
Can be used as an input to variable clustering and multi-dimensional scaling approaches
When run on MVIs only, can reveal missing-value patterns in the data
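The rotation of targets is a simple loop; the sketch below assumes a caller-supplied `fit_and_score` function in place of an actual CART run (`target_battery` is a hypothetical name):

```python
def target_battery(records, variables, fit_and_score):
    """BATTERY TARGET in outline: take each variable in turn as
    the target and the rest as predictors, collecting a score that
    reflects how predictable the variable is from the others."""
    scores = {}
    for target in variables:
        predictors = [v for v in variables if v != target]
        scores[target] = fit_and_score(records, target, predictors)
    return scores
```

Variables that are highly predictable from the rest cluster together, which is what makes the battery useful as input to variable clustering.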
Stable Trees and Consistency
Notes on Tree Stability
Trees are usually stable near the top and become progressively more unstable as the nodes become smaller
Therefore, smaller trees in the pruning sequence tend to be more stable
Node instability usually manifests itself when comparing node class distributions between the learn and test samples
Ideally, we want to identify the largest stable tree in the pruning sequence
The optimal tree suggested by CART may contain a few unstable nodes that are structurally needed to optimize overall accuracy
The Train-Test Consistency (TTC) feature was designed to address this topic at length
MJ2000 Run
The dataset comes from an undisclosed financial institution
Set up a default classification run with a 50% random test sample
CART determines a 36-node optimal tree
Node Stability
A quick look at the learn and test node distributions reveals a large number of unstable nodes
Let’s use the TTC feature by pressing the [T/T Consist] button to identify the largest stable tree
Terminal Nodes TTC Report
The Largest Stable Tree
A tree with 11 terminal nodes is identified as stable (all nodes match on both direction and rank)
While within the specified tolerance, nodes 2, 6, and 8 are somewhat unsatisfactory in terms of rank order
These nodes would emerge as rank-order unstable if we lowered the tolerance level to 0.5
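The rank-order part of the TTC check can be illustrated in miniature: order the terminal nodes by their learn-sample event rate and ask whether the test-sample rates preserve that order. This is a sketch of the idea, not the actual TTC computation, and `rank_consistent` is a hypothetical name:

```python
def rank_consistent(learn_rates, test_rates):
    """Rank-order stability in miniature: terminal nodes are
    ordered by their learn-sample event rate, and the tree is
    called rank-stable if the test-sample rates of the same nodes
    are non-decreasing in that order."""
    order = sorted(range(len(learn_rates)), key=lambda i: learn_rates[i])
    test_in_order = [test_rates[i] for i in order]
    return all(a <= b for a, b in zip(test_in_order, test_in_order[1:]))
```

A tolerance parameter, as in the TTC report, would relax the pairwise comparison so that small rank swaps between nearly identical nodes are not flagged.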
TTC Report
Improving Predictor List
Run BATTERY SHAVING to systematically eliminate the least important predictors
The resulting optimal tree has even better accuracy than before
It also uses a much smaller set of only 6 predictors
Let’s study the tree stability of the new model
Predictor Elimination by Shaving
Comparing the Results
The tree on the left is the original (unshaved) one; the tree on the right is the result of shaving
The new tree is far more stable than the original, except for node 8
Original Tree New Tree
Justification
Node 8 is irrelevant for classification, but its presence is required in order to allow a very strong split below it
Learn Sample Test Sample
Recommended Reading
Breiman, L., J. Friedman, R. Olshen and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth
Breiman, L. (1996), Bagging Predictors, Machine Learning, 24, 123-140
Breiman, L. (1996), Stacked Regressions, Machine Learning, 24, 49-64
Friedman, J., T. Hastie and R. Tibshirani (1998), Additive Logistic Regression: A Statistical View of Boosting, Stanford University technical report