
Data Mining with Decision Trees

Ingo Bentrott

Salford Systems

August 15, 2005

Slide 2

Goals of Class

! Provide Overview of Decision Trees
– Motivation for research and areas of research
– Historical background
! Key Decision Tree Concepts
– Tree growing
– Testing & pruning
– Selection of “optimal” tree
– Refinement of analysis
– Troubleshooting
! Guide to interpreting Decision Trees using CART
– How to interpret statistics and reports
– What to ask for in the way of diagnostic reports
! Presentation and CART software available at the following FTP site (about 30 MB):
– ftp://audrey.salford-systems.com/private/datamine/pres_and_data.zip
– ftp://audrey.salford-systems.com/private/datamine/CARTsetup.exe

Slide 3

A Brief History of Decision Trees

! First developed in the 1960s, growing out of a proposed segmentation approach in marketing.
! As data sets get larger, how do we deal with competing, and possibly mis-specified, models?
! Decision trees are a supervised learning tool.
! Decision trees divide the descriptive space into regions, each associated with a class.
! Compact trees are built by recursive partitioning.
! Trees are a series of nodes.
– Look for purity/homogeneity in terminal nodes.

Slide 4

Types of Commercial Decision Trees

! AID
– first introduced in the 1960s
! CHAID – Kass 1980
– Grows until some goodness-of-split criterion is met.
– Uses Chi-Square methodology
! Enterprise Miner (SAS)
– Has a C&RT, a CHAID, and a mix of the two in it.
! Answer Tree (SPSS)
– Has a C&RT and a CHAID in it
! CART
– Overfits and then prunes back
– Handling of missing values
! C4.5
– Ross Quinlan
– Uses information gain to select the most discriminatory feature (for tree and sub-trees)

Slide 5

Application of Decision Trees

! Manufacturing
– PC board failures.
– Which components are problematic – use measurement data
! Medical
– Drug trials.
– If a patient has a reaction, what are the traits of those patients?
! Financial
– Credit card fraud detection
! Marketing
– Up-selling customers
– Churn prediction models
! Temporal data mining
– See trends over time

Slide 6

CART Decision Tree – Pros

! Automatic separation of relevant from irrelevant predictors (variable selection)
! Does not require a transform such as log, square root (model specification)
! Automatic interaction detection (model specification)
! Impervious to outliers (can handle dirty data)
! Unaffected by missing values (does not require list-wise deletion or missing value imputation)
! Requires only moderate supervision by the analyst

Slide 7

CART Decision Tree – Possible Drawbacks

! CART is notoriously weak at capturing strong linear structure
! CART recognizes the structure but cannot represent it effectively
! With many variables, several of which enter a model linearly, structure will not be obvious from CART output
! CART can produce a very large tree in an attempt to represent very simple relationships
! LOGIT easily captures and represents linear structure
! Many non-linear structures can still be reasonably approximated with a linear structure, hence even incorrectly specified LOGIT can perform well
! Discontinuous response (unlike LOGIT)
– Small change in x could lead to a large change in y

Slide 8

In 1984 Berkeley and Stanford statisticians announced a new classification tool

! A computer intensive technique that could automatically analyze data
! Method could sift through any number of variables
– Could separate relevant from irrelevant predictors
– Did not require any kind of variable transforms (logs, square roots)
– Impervious to outliers and missing values
– Could yield relatively simple and easy to comprehend models
– Required little to no supervision by the analyst
– Was frequently more accurate than traditional logistic regression or discriminant analysis, and other parametric tools

Slide 9

Why didn't we learn about CART in school?

! CART was slow to gain widespread recognition for several reasons
– Monograph introducing CART is challenging to read
» brilliant book overflowing with insights into tree growing methodology but fairly technical and brief
– Method was not expounded in any textbooks
– Originally taught only in advanced graduate statistics classes at a handful of universities
– Original standalone software came with slender documentation, and output was not self-explanatory
– Method was such a radical departure from conventional statistics

Slide 10

Why is CART finally receiving more attention?

! Rising interest in data mining
– Availability of huge data sets requiring analysis
– Need to automate or accelerate and improve the analysis process; comparative performance studies
! Advantages of CART over other tree methods
– handling of missing values
– assistance in interpretation of results (surrogates)
– performance advantages: speed, accuracy
! New software and documentation make techniques accessible to end users
! Word of mouth generated by early adopters

Slide 11

So what is CART?

! Best illustrated with an example: the Pima Indians Diabetes Study.
! Given the diagnosis of diabetes based on
– Number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), BMI, and age.
! Predict who is at risk of developing diabetes
– Prediction will determine treatment program (medications or not)
! For each patient 7 variables were available, including:
– Age and weight demographics, medical history, lab results
! Both noninvasive and invasive variables were used in the analysis

Slide 12

Diabetes Risk Tree

! Example of a CLASSIFICATION tree
! Dependent variable is categorical (Negative, Positive)
! Want to predict class membership

Slide 13

What's in this report?

! Entire tree represents a complete analysis or model
! Has the form of a decision tree
! Root of inverted tree contains all data
! Root gives rise to child nodes
! Child nodes can in turn give rise to their own children
! At some point a given path ends in a terminal node
! Terminal node classifies object
! Path through the tree governed by the answers to QUESTIONS or RULES

Slide 14

Key Components of Tree Structured Data Analysis

! Tree growing
– Consider all possible splits of a node
– Find the best split according to the given splitting rule (best improvement)
– Continue sequentially until the largest tree is grown
! Tree Pruning – creating a sequence of nested trees (pruning sequence) by systematic removal of the weakest branches
! Optimal Tree Selection – using a test sample or cross-validation to find the best tree in the pruning sequence

Slide 15

Searching all Possible Splits

! For any node CART will examine ALL possible splits
– Computationally intensive, but there are only a finite number of splits
! Consider first the variable BMI – in our data set it has a minimum value of 18.2
– The rule
» Is BMI ≤ 18.2? will separate out these three cases to the left – the slender people
! Next increase the BMI threshold
» Is BMI ≤ 19.4?
» This will direct seven cases to the left
! Continue increasing the splitting threshold value by value (a sketch of this enumeration appears below)
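A minimal sketch (illustrative only, not Salford's implementation) of this exhaustive threshold search for a single continuous predictor; the BMI values are the ones from the split table on the next slide:

# Each distinct observed value of a continuous predictor defines one candidate
# question "Is x <= c?", so the search is exhaustive but finite.
def candidate_thresholds(values):
    """Return the candidate split points for a continuous variable, in order."""
    distinct = sorted(set(values))
    return distinct[:-1]          # splitting at the largest value sends every case left

bmi = [18.2, 18.2, 18.2, 18.4, 19.1, 19.3, 19.4, 19.5, 19.5, 19.6, 19.6, 19.6, 19.9, 20.0]
for c in candidate_thresholds(bmi):
    n_left = sum(v <= c for v in bmi)
    print(f"Is BMI <= {c}?  sends {n_left} cases left")   # 18.2 -> 3 cases, 19.4 -> 7 cases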

Slide 16

Split Tables

! Sorted by BMI

BMI     Plasma Glucose Concentration   Age   Disease
18.20    97                             21   Negative
18.20    97                             21   Negative
18.20    83                             27   Negative
18.40   104                             27   Negative
19.10    80                             21   Negative
19.30    99                             30   Negative
19.40   103                             22   Negative
19.50   100                             28   Negative
19.50    92                             25   Negative
19.60    95                             25   Negative
19.60   119                             72   Negative
19.60   129                             60   Negative
19.90    92                             28   Negative
20.00   105                             22   Negative

! Sorted by Plasma Glucose Concentration

BMI     Plasma Glucose Concentration   Age   Disease
19.10    80                             21   Negative
18.20    83                             27   Negative
19.50    92                             25   Negative
19.90    92                             28   Negative
19.60    95                             25   Negative
18.20    97                             21   Negative
18.20    97                             21   Negative
19.30    99                             30   Negative
19.50   100                             28   Negative
19.40   103                             22   Negative
18.40   104                             27   Negative
20.00   105                             22   Negative
19.60   119                             72   Negative
19.60   129                             60   Negative

Slide 17

Question is of the form: Is statement TRUE?

! Is continuous variable X ≤ c?
! Does categorical variable D take on levels i, j, or k?
– e.g. Is geographic region 1, 2, 4, or 7?
! Standard split:
– If answer to question is YES a case goes left; otherwise it goes right
– This is the form of all primary splits
! Question is formulated so that only two answers are possible
– Called binary partitioning
– In CART the YES answer always goes left

Slide 18

Classification is determined by following a case's path down the tree

! Terminal nodes are associated with a single class
– Any case arriving at a terminal node is assigned to that class
! In standard classification, tree assignment is not probabilistic
– Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (may be discussed later)
! With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes
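A minimal sketch of this routing logic with a simple node structure. The field names and the tree below are illustrative; the 127.5 and 28.5 thresholds echo the example on the next slide, and only the NEGATIVE leaf is taken from it.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None        # set only for terminal nodes
    split_var: Optional[str] = None    # question: is case[split_var] <= threshold?
    threshold: Optional[float] = None
    left: Optional["Node"] = None      # YES branch
    right: Optional["Node"] = None     # NO branch

def classify(node: Node, case: dict) -> str:
    """Follow a case down the tree until it reaches a terminal node."""
    while node.label is None:
        yes = case[node.split_var] <= node.threshold
        node = node.left if yes else node.right    # in CART the YES answer always goes left
    return node.label

# Illustrative tree in the spirit of the diabetes example; non-NEGATIVE leaves are assumed.
tree = Node(split_var="PLASMA", threshold=127.5,
            left=Node(split_var="AGE", threshold=28.5,
                      left=Node(label="NEGATIVE"), right=Node(label="POSITIVE")),
            right=Node(label="POSITIVE"))
print(classify(tree, {"PLASMA": 110, "AGE": 25}))   # -> NEGATIVE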

Slide 19

Accuracy of a Tree

! For classification trees
– All objects reaching a terminal node are classified in the same way
» e.g. All Pima Indians with Plasma Glucose <= 127.5 and AGE less than or equal to 28.5 are classified as NEGATIVE
» Classification is the same regardless of other medical history and lab results
! Some cases may be misclassified:
– Simplest measure:
» R(T) = percent of learning sample misclassified in tree T
» R(t) = percent of learning sample in node t misclassified
! T identifies a TREE
! t identifies a NODE
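A minimal illustrative sketch of these two resubstitution measures, assuming we already have the tree's class assignment for every learning-sample case:

def tree_error(true_classes, assigned_classes):
    """R(T): fraction of the learning sample misclassified by tree T."""
    wrong = sum(t != a for t, a in zip(true_classes, assigned_classes))
    return wrong / len(true_classes)

def node_error(true_classes_in_node, node_class):
    """R(t): fraction of the learning-sample cases falling in node t that are misclassified."""
    wrong = sum(t != node_class for t in true_classes_in_node)
    return wrong / len(true_classes_in_node)

# e.g. a terminal node assigned NEGATIVE that receives 9 NEGATIVE cases and 1 POSITIVE case:
print(node_error(["NEG"] * 9 + ["POS"], "NEG"))   # 0.1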

Slide 20

Prediction Success Table

! In CART terminology the performance is described by the error rate R(T), where T indexes a specific tree
– In the example R(T) = 1 - .2643 = .7357

Slide 21

Interpretation and use of the CART Tree

! Practical decision tool
– Trees like this are used for real world decision making
» Rules for physicians
» Decision tool for a nationwide team of salesmen needing to classify potential customers
! Selection of the most important prognostic variables
– Variable screen for parametric model building
– Example data set had 7 variables
– Use results to find variables to include in a logistic regression
! In our example AGE is not relevant if Plasma Glucose is not low
– Suggests which interactions will be important

Slide 22

CART is a form of Binary Recursive Partitioning

! Data is split into two partitions
– thus “binary” partition
! Partitions can also be split into sub-partitions
– hence procedure is recursive
! CART tree is generated by repeated partitioning of data set

Slide 23

CART is also Non-parametric – No Predetermined Functional Form

! Approximates the data pattern via local summaries
! BUT
– Determines which variables to use dynamically
– Determines which regions to focus on dynamically
– Focuses on only a moderate and possibly very small number of variables
– All automatically determined by the data
! We saw in the Pima Indian study CART focused on just 3 of 7 variables

Slide 24

Tracing a CART analysis – Three types of flower

Iris Versicolor
Iris Setosa
Iris Virginica

Slide 25

Tracing a CART Analysis

! IRIS data set
! 3 classes of species
! We can use the PETALLEN & PETALWID variables to produce a tree
! Key (scatter plot symbols)
– Ο = Species 1
– ∆ = Species 2
– ! = Species 3

Slide 26

First Split: Partition Data Into Two Segments

! Partitioning line parallel to an axis
! Root node split first
– ≤ 2.450
– Isolates all the type 1 species from the rest of the sample
! This gives us two child nodes
– One is a Terminal Node with only type 1 species
– The other contains only type 2 and 3
! Note: entire data set divided into two separate parts

Slide 27

Second Split: Partitions Only Portion of the Data

! Again, partition with a line parallel to one of the two axes
! CART selects PETALWID to split this NODE
– Split it at ≤ 1.75
– Gives a tree with a misclassification rate of 4%
! Split applies only to a single partition of the data
! Each partition is analyzed separately

Slide 28

Setting up a CART Run

Slide 29

Model Setup – Variable Selection

Slide 30

Model Setup – Categoricals

Slide 31

Categorical Predictors

! CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15
! Example: four regions – A, B, C, D
! Each decision is a possible split of the node and each is evaluated for improvement of impurity (the enumeration is sketched below)

    Left     Right
1   A        B, C, D
2   B        A, C, D
3   C        A, B, D
4   D        A, B, C
5   A, B     C, D
6   A, C     B, D
7   A, D     B, C
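A minimal sketch (illustrative, not Salford's code) that generates the 2^(k-1) - 1 distinct left/right partitions of a k-level categorical predictor, i.e. the 7 rows of the table above for four regions:

from itertools import combinations

def categorical_partitions(levels):
    """Yield every distinct (left, right) partition of a set of category levels."""
    levels = list(levels)
    anchor, rest = levels[0], levels[1:]     # pin one level on the left to avoid mirror duplicates
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {anchor, *extra}
            right = set(levels) - left
            if right:                        # skip the "everything goes left" non-split
                yield sorted(left), sorted(right)

for left, right in categorical_partitions("ABCD"):
    print(left, "|", right)                  # 2**(4-1) - 1 = 7 partitions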

Slide 32

Shortcut when the Target is Binary

! Here special algorithms reduce compute time to linear in the number of levels
! 30 levels takes twice as long as 15 – not 10,000+ times as long
! When you have high-level categorical variables in a multi-class problem
– Create a binary classification problem
– Try different definitions of the DPV (which groups to combine)
– Explore predictor groupings produced by CART
– From a study of all results decide which are most informative
– Create a new grouped predictor variable for the multi-class problem
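The shortcut rests on a result in the CART monograph: with a binary target, ordering the categories by their proportion of the target class and scanning only the splits along that ordering is guaranteed to contain the best partition, so the search is linear in the number of levels. A minimal sketch with illustrative counts:

def gini(n0, n1):
    """Two-class Gini impurity of a node with n0 cases of class 0 and n1 of class 1."""
    n = n0 + n1
    p = n1 / n
    return 2 * p * (1 - p)

def best_binary_target_split(counts):
    """counts: {level: (n_class0, n_class1)}. Scan the k-1 splits of the ordered levels."""
    levels = sorted(counts, key=lambda lv: counts[lv][1] / sum(counts[lv]))
    tot0 = sum(c[0] for c in counts.values())
    tot1 = sum(c[1] for c in counts.values())
    n, parent = tot0 + tot1, gini(tot0, tot1)
    best, l0, l1 = None, 0, 0
    for i, lv in enumerate(levels[:-1]):
        l0, l1 = l0 + counts[lv][0], l1 + counts[lv][1]
        r0, r1 = tot0 - l0, tot1 - l1
        gain = parent - (l0 + l1) / n * gini(l0, l1) - (r0 + r1) / n * gini(r0, r1)
        if best is None or gain > best[0]:
            best = (gain, levels[:i + 1])
    return best   # (improvement, levels sent left)

# Illustrative (class-0, class-1) counts by region:
print(best_binary_target_split({"A": (50, 10), "B": (20, 40), "C": (30, 30), "D": (10, 5)}))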

Slide 33

Model Setup – Testing

Slide 34

Tree Growing – Stopping Criteria

! CART differs from earlier methods such as AID or CHAID on stopping
! STOPPING not an essential component of CART
! IF time and data were available CART would NOT stop!
! Grow tree until further growth is not possible
– Terminal nodes only have one case
– Terminal nodes with more than one case are identical on predictor variables
! Result is called the MAXIMAL tree
! In practice certain limits can be imposed
– Do not attempt to split smaller nodes
» 100 cases in binary dependent variable market research problems

Slide 35

Tree Pruning

! Take some large tree such as the maximal tree
! Tree may be radically overfit
– Tracks all the idiosyncrasies of THIS data set
– Tracks patterns that may not be found in other data sets
– Analogous to a regression with very large number of variables
! PRUNE branches from the large tree
! CHALLENGE is HOW TO PRUNE
– Which branch to cut first?
– What sequence – there are hundreds, if not thousands, of pruning sequences
! If you have 200 terminal nodes, then…
– 200 ways to prune away 1st deleted node
– 199 ways to prune away 2nd deleted node, etc.

Slide 36

Order of Pruning

! Prune away the "weakest link" – the nodes that add least to overall accuracy
! If several nodes add the same overall accuracy they all prune away simultaneously
! Hence more than two terminal nodes could be cut off in one pruning
! Often happens in the larger trees; less likely as tree gets smaller
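For reference, the "weakest link" is identified in the CART monograph through cost-complexity pruning, which is also where the "Complexity Parameter" column in the tree sequence on the next slide comes from. A brief sketch in LaTeX notation (R(T) and R(t) as defined earlier; |\tilde{T}| is the number of terminal nodes of T):

    R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|

    g(t) = \frac{R(t) - R(T_t)}{|\tilde{T}_t| - 1}

where T_t is the branch rooted at internal node t. The weakest link is the internal node with the smallest g(t); pruning it and stepping \alpha upward generates the nested pruning sequence.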

Slide 37

Pruning Sequence: Classic Output

! Let's examine the tree sequence below:

=========================
      TREE SEQUENCE
=========================
Dependent variable: CLUSTER

        Terminal   Test Set                Resubstitution   Complexity
 Tree   Nodes      Relative Cost           Relative Cost    Parameter
 ----------------------------------------------------------------------
   1     2937      0.679 +/- 0.002              0.577        0.000000
  10**    115      0.659 +/- 0.002              0.646        0.000096
  23       11      0.667 +/- 0.002              0.665        0.000340
  24        9      0.667 +/- 0.002              0.666        0.000351
  25        8      0.668 +/- 0.002              0.667        0.000704
  26        7      0.672 +/- 0.002              0.670        0.003301
  27        6      0.676 +/- 0.002              0.675        0.004508
  28        5      0.684 +/- 0.002              0.684        0.008158
  29        4      0.732 +/- 0.002              0.730        0.041258
  30        3      0.806 +/- 0.001              0.805        0.066986
  31        2      0.889 +/- .309661E-04        0.889        0.075907
  32        1      1.000 +/- .602225E-04        1.000        0.099990

Initial misclassification cost = 0.900
Initial class assignment = 3

Slide 38

Resubstitution Cost vs. True Cost

! Compare error rates measured by
– learn data
– large test set
! Learn R(T) always decreases as tree grows
! Test R(T) first declines then increases
! Much previous disenchantment with tree methods due to reliance on learn R(T)
! Can lead to disasters when applied to new data

No. Terminal Nodes   R(T)   Rts(T)
 71                  .00    .42
 63                  .00    .40
 58                  .03    .39
 40                  .10    .32
 34                  .12    .32
 19                  .20    .31
*10                  .29    .30
  9                  .32    .34
  7                  .41    .47
  6                  .46    .54
  5                  .53    .61
  2                  .75    .82
  1                  .86    .91

Slide 39

Why look at resubstitution error rates (or cost) at all?

! First, provides a rough guide of how you are doing
– Truth will typically be WORSE than the resubstitution measure
– If the tree is performing poorly on resubstitution error, may not want to pursue further
– Resubstitution error more accurate for smaller trees
» So better guide for small nodes
» Should be a poor guide for many nodes
– Resubstitution rate useful for comparing SIMILAR SIZED trees
– Even though the true error rate is not measured correctly, the relative ranking of different trees probably is correct
» different trees because of different splitting rules
» different trees because of different variables allowed in search

Slide 40

The Optimal Tree

! Within a single CART run which tree is best?
! The process of pruning the maximal tree can yield many sub-trees
! The test data set or cross-validation measures the error rate of each tree
! Current wisdom – select the tree with the smallest error rate
! Running CV with a new random seed could yield a different sized tree
! Typical error rate as a function of tree size has a flat region
! Minimum could be anywhere in this region

[Chart: "The Best Pruned Subtree: An Estimation Problem" – R(Tk) plotted against tree size Tk (0 to 50 terminal nodes), showing a broad flat region around the minimum.]

Slide 41

In what sense is the optimal tree best?

! Tree has lowest or near lowest cost as determined by a test procedure
! Tree should exhibit very similar accuracy when applied to new data
! This is key to the CART selection procedure – applicability to external data
! BUT the tree is NOT necessarily unique – other tree structures may be as good
! Other tree structures might be found by
– Using other learning data
– Using other growing rules
– Running CV with a different random number seed
! Some variability of results needs to be expected
! Overall story should be similar for good size data sets

Slide 42

Cross-Validation

! Cross-validation is a recent computer intensive development in statistics
! Purpose is to protect oneself from overfitting errors
– Don't want to capitalize on chance – track idiosyncrasies of this data
– Idiosyncrasies which will NOT be observed on fresh data
! Ideally would like to use large test data sets to evaluate trees, N>5000
! Practically, some studies don't have sufficient data to spare for testing
! Cross-validation will use the SAME data for learning and for testing

Slide 43

10-Fold Cross-Validation: The Industry Standard

! Begin by growing maximal tree on ALL data; put results aside
! Divide data into 10 portions stratified on dependent variable levels
! Reserve first portion for test
! Grow new tree on remaining 9 portions
! Use the 1/10 test set to measure error rate for this 9/10 data tree
– Error rate is measured for the maximal tree and for all subtrees
– Error rate available for 2, 3, 4, 5, ... etc. nodes of the 9/10-data sub-trees
! Now rotate out a new 1/10 test data set
! Grow new tree on remaining 9 portions
– Compute error rates as before
! Repeat until all 10 portions of the data have been used as test sets
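A minimal sketch of the same idea with scikit-learn, using stratified folds and the number of terminal nodes as the complexity index; this only approximates the CART workflow (it grows a fresh tree per size rather than pruning one maximal tree), and the data and variable names are assumed:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cv_error_by_tree_size(X, y, sizes, n_folds=10, seed=0):
    """Return {number of terminal nodes: 10-fold cross-validated error rate}."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    errors = {k: 0.0 for k in sizes}
    for learn_idx, test_idx in folds.split(X, y):       # each 1/10 portion serves as test once
        for k in sizes:                                  # one tree per complexity level
            tree = DecisionTreeClassifier(max_leaf_nodes=k, random_state=seed)
            tree.fit(X[learn_idx], y[learn_idx])
            errors[k] += np.mean(tree.predict(X[test_idx]) != y[test_idx]) / n_folds
    return errors

# Usage sketch: pick the tree size with the smallest cross-validated error.
# errors = cv_error_by_tree_size(X, y, sizes=range(2, 31))
# best_size = min(errors, key=errors.get)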

Slide 44

Cross-Validation Procedure

[Diagram: the data divided into 10 numbered portions; in each of the 10 runs a different single portion is marked "Test" and the remaining nine are marked "Learn", rotating until every portion has served as the test set.]

Slide 45

Cross-Validation Details

! Every observation is used as a test case exactly once
! In 10-fold CV each observation is used as a learning case 9 times
! In 10-fold CV, 10 auxiliary CART trees are grown in addition to the initial tree
! When all 10 CV trees are done the error rates are cumulated (summed)
! Summing of error rates is done by TREE complexity
! Summed error (cost) rates are then attributed to the INITIAL tree
! Observe that no two of these trees need be the same
! Even the primary splitter of the root node may be different across CV trees
! Results are subject to random fluctuations – sometimes severe

Slide 46

Cross-Validation Replication & Tree Instability

! Look at the separate CV tree results of a single CV analysis
! CART produces a summary of the results for each tree at the end of output
! Table titled "CV TREE COMPETITOR LISTINGS" reports results for each CV tree Root (TOP), Left Child of Root (LEFT), and Right Child of Root (RIGHT)
! Want to know how stable results are – do different variables split the root node in different trees?
! Only difference between different CV trees is random elimination of 1/10 of the data

Slide 47

CV-TREE Variable Scaled Importance Measures

! Note: Initial tree might be huge and variable importance measures based on it are not useful
! V05 is important in many CV trees. V01 is never important.

Cross-Validation
Tree Number     V01    V02    V03    V04    V05
INIT   |       0.001  0.000  0.002  0.000  0.001
CV 1   |       0.044  0.072  0.203  0.199  0.574
CV 2   |       0.039  0.056  0.074  0.149  0.439
CV 3   |       0.050  0.051  0.114  0.203  0.286
CV 4   |       0.038  0.082  0.130  0.065  0.682
CV 5   |       0.068  0.122  0.211  0.177  0.816
CV 6   |       0.123  0.048  0.269  0.116  0.370
CV 7   |       0.114  0.123  0.193  0.190  0.346
CV 8   |       0.045  0.000  0.202  0.195  0.685
CV 9   |       0.137  0.067  0.127  0.256  0.405
CV 10  |       0.048  0.044  0.135  0.223  1.000
FINAL  |       0.000  0.000  0.079  0.000  0.328

(Columns are variables; rows are the initial tree, each cross-validation tree, and the final tree.)

Slide 48

CV-TREE Competitor Listings – Root Node Splitters

! CV-TREE Competitor Listings (Split type, Variable, Split Value, Improvement)
! Note V15 strong in all trees

TOP SPLIT COMPETITORS -- Rank: MAIN, 1, 2, 3, 4
CV 1  | N N N N N | V15 V07 V09 V13 V11 | 2.396 2.450 2.984 3.057 3.454 | 0.134 0.131 0.127 0.124 0.124
CV 2  | N N N N N | V07 V11 V15 V09 V13 | 2.450 3.454 2.396 3.045 2.797 | 0.140 0.136 0.132 0.125 0.123
[etc...]
CV 10 | N N N N N | V09 V07 V15 V14 V13 | 3.035 2.450 2.284 2.106 3.057 | 0.152 0.145 0.140 0.123 0.123
FINAL | N N N N N | V15 V07 V09 V11 V13 | 2.396 2.450 3.035 3.454 3.057 | 0.138 0.137 0.131 0.129 0.122

Slide 49

Node Specific Error Rates

! Although pruning is done node by node, CART trees are evaluated as trees
! Error rates reported based on the test set or CV refer to the tree overall
! Why? Tree is a statistical object and error rate is expected for the structure
! In practice you WILL want to look at node error rates AND sample size
! May want to reduce confidence in nodes that look inaccurate or small in size

Slide 50

Model Setup – Select Cases

Slide 51

Select Cases

! Can have up to 10 selection criteria
! Selection criteria valid for any variable in the data set
– Even if the variable is not specified as a predictor
! For instance, you may want to run one CART analysis on accidents caused by persons under age 21 and another for people over 65 years of age.

Slide 52

Model Setup – Best Tree

Slide 53

One SE Rule – The One Standard Error Rule

! Original CART monograph recommends NOT choosing the minimum error tree because of possible instability of results from run to run
! Instead suggests the SMALLEST TREE within 1 SE of the minimum error tree
! 1 SE tree is smaller than the minimum error tree
! Lies one standard error away from the minimum error tree
! Tends to provide very stable results from run to run
! Is possibly as accurate as the minimum cost tree yet simpler
– BOPTIONS SERULE=0
– BOPTIONS SERULE=1
– BOPTIONS SERULE=.5
– these are all options for the analyst
! Current learning – the one SE rule is generally too conservative, will prune back to too small a tree
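A minimal sketch of the rule applied to a pruning sequence, assuming we already have each candidate tree's size, cross-validated error, and standard error (the numbers are illustrative):

def one_se_tree(sequence, se_rule=1.0):
    """sequence: list of (n_terminal_nodes, cv_error, se) over the pruning sequence.
    Return the smallest tree whose error is within se_rule * SE of the minimum error."""
    _, min_err, min_se = min(sequence, key=lambda row: row[1])
    threshold = min_err + se_rule * min_se
    eligible = [row for row in sequence if row[1] <= threshold]
    return min(eligible, key=lambda row: row[0])     # smallest eligible tree

seq = [(40, 0.300, 0.015), (19, 0.310, 0.015), (10, 0.305, 0.015), (7, 0.340, 0.016)]
print(one_se_tree(seq, se_rule=1.0))   # picks the 10-node tree; se_rule=0 keeps the minimum-error tree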

Slide 54

Variable Importance

! How can we assess the relative importance of the variables in our data?
! Issue is a measure of the splitting potential of all variables
! CART includes a general measure with very interesting characteristics
– The most important variables might not ever be used to split a node!
– Importance is related to both potential and actual splitting behavior
– If one variable, say EDUCATION, is masked by another, say INCOME, then...
» Both variables have SIMILAR splitting information
» In any one node only one variable will be best, but the other may be a close 2nd best
– If INCOME appears just once as primary splitter and EDUCATION is never a primary splitter
» But EDUCATION appears as a strong surrogate in several different nodes
» Then EDUCATION may have more potential splitting power
– Means that if you had to live with just one of the variables
» EDUCATION would most likely be the better performer
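In the CART monograph this score is the sum, over all nodes, of the improvement a variable achieves as the primary splitter or as a surrogate, usually rescaled so the top variable reads 100; a minimal illustrative sketch of that accumulation (the node records are invented for the EDUCATION/INCOME story above):

from collections import defaultdict

def variable_importance(nodes):
    """nodes: one dict per internal node listing the primary splitter and surrogates
    with their improvement scores. Sum per variable, then rescale the best to 100."""
    raw = defaultdict(float)
    for node in nodes:
        raw[node["primary"]] += node["improvement"]
        for var, improvement in node["surrogates"]:
            raw[var] += improvement
    top = max(raw.values())
    return {var: round(100 * score / top, 1) for var, score in raw.items()}

nodes = [
    {"primary": "INCOME", "improvement": 0.16, "surrogates": [("EDUCATION", 0.15)]},
    {"primary": "AGE", "improvement": 0.05, "surrogates": [("EDUCATION", 0.04)]},
]
print(variable_importance(nodes))   # EDUCATION outranks AGE without ever being a primary splitter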

Slide 55

Model Setup – Method

Slide 56

CART Splitting Rules for Binary and Multi-class Problems

! Primary splitting rules for classification: Gini, Twoing, Entropy, Power Modified Twoing
! Other rules available: Ordered Twoing, Symmetric Gini

Slide 57

What splitting method is best for me?

! Entropy tends to be similar to TWOING when there are rare classes that can be separated perfectly or nearly perfectly (some going all or almost all left and others going all or almost all right). GINI tries to get the most numerous class right, or failing that to get one daughter to be as pure as possible. Entropy is also happy with a pure daughter, but not happy with a nearly pure daughter if the odd cases are from a number of different classes.
! Gini: Well-known standard splitting rule; tends to create end-cut splits (small nodes with only one target class prevailing) on multilevel targets. Costs are incorporated by adjusting prior probabilities.
! Twoing: An important splitting rule with properties very different from GINI on multilevel targets. Twoing tends to generate more even splits, with whole groups of classes being separated.
! Entropy: Another well-known splitting rule that is related to the likelihood function. With multilevel targets it tends to look for splits where as many levels as possible are divided perfectly or near perfectly. As a result Entropy puts more emphasis on getting rare levels right relative to common levels than either GINI or TWOING. In different circumstances its properties may be similar to GINI or TWOING or somewhere in between them.
! Ordered Twoing: A modification of twoing designed to handle ordered targets. This splitting rule only considers grouping together target classes adjacent to each other. For example, standard twoing may consider the partition of levels (1,3,7,8) (2,4,5,6), whereas ordered twoing will consider only partitions such as (1,2,3,4) (5,6,7,8).

Slide 58

Which split is better?

! Easier to assess if we use proportions

Split 1: Is AGE <= 55?
Parent node: PATIENTS = 215 – SURVIVE 178 (82.8%), DEAD 37 (17.2%)
Yes (40% of cases): .27 DEAD, .73 SURVIVE
No (60% of cases): .20 DEAD, .80 SURVIVE

! Consider another split

Split 2: Is BP <= 102?
Parent node: PATIENTS = 215 – SURVIVE 178 (82.8%), DEAD 37 (17.2%)
Yes (32% of cases): .25 DEAD, .75 SURVIVE
No (68% of cases): .13 DEAD, .87 SURVIVE

! This latter split seems to be a little better
! Relatively less of class DEAD in the left node
! More of class SURVIVE in the right node

Slide 59

Evaluation of Splits with the Impurity Function

! A node which contains members of only one class is perfectly pure
! A node which contains an equal proportion of every class is least pure
! CART evaluates the goodness of any candidate split using an impurity function:
– Must be 0 for a perfectly pure node
– Must attain its maximum for the least pure node
– Must be concave (impurity needs to accelerate towards 0 as the node becomes pure)
! i(t) – impurity measure of the node t

Slide 60

A Number of Good Impurity Functions are Available

! You can experiment with several in CART
! GINI impurity
– i(t) = 4p(1-p)   (two-class form, scaled to peak at 1)
! ENTROPY impurity
– i(t) = -p log(p) - (1-p) log(1-p)   (two-class form)

[Plot: impurity as a function of the class proportion p, rising from 0 at p = 0 to a maximum at p = 0.5 and back to 0 at p = 1.]
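A minimal sketch of these two functions for a two-class node with class-1 proportion p, keeping the slide's scaling (and base-2 logs for entropy) so that both peak at 1 when p = 0.5:

import math

def gini_impurity(p):
    """Scaled two-class Gini impurity: 0 for a pure node, 1 at p = 0.5."""
    return 4 * p * (1 - p)

def entropy_impurity(p):
    """Two-class entropy impurity with base-2 logs: 0 for a pure node, 1 at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(gini_impurity(p), 3), round(entropy_impurity(p), 3))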

Slide 61

Why impurity? Why not predictive accuracy?

! Can almost always improve purity
– Often unable to improve accuracy for a specific node
– we will see later: if parent and both children are all assigned to the same class then the split does nothing for accuracy
! Predictive accuracy is the long run objective of the tree
– This goal is not served well by maximizing it at every node!
! Instead we need splitting rules that encourage good tree evolution
– Since splitting is myopic (looks only at the current node – the next step)
– Need rules that are in some sense geared towards the long run outcome

Slide 62

A Simple Example

! Both splits below result in the same accuracy – 200 cases remain misclassified
! However, split #2 is unambiguously better – the right node is pure, no more work is needed on this side
! A concave impurity function will favor the 2nd split (the calculation is sketched after the next slide)

Split #1: Left – Buyer 300, No Buyer 100; Right – Buyer 100, No Buyer 300
Split #2: Left – Buyer 200, No Buyer 400; Right – Buyer 200, No Buyer 0

Slide 63

The Improvement Measure is Decrease in Impurity

Δ(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)

Parent node impurity minus the weighted average of the impurities in each child node

• p_L = probability of a case going left (fraction of node going left)
• p_R = probability of a case going right (fraction of node going right)
• t = node
• s = splitting rule
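Applying this measure with the unscaled two-class Gini impurity, i(t) = 2p(1-p), to the two splits from the previous slide; a minimal sketch confirming that split #2 yields the larger decrease in impurity even though both leave 200 cases misclassified:

def gini(buyers, non_buyers):
    """Two-class Gini impurity, i(t) = 2p(1 - p)."""
    n = buyers + non_buyers
    p = buyers / n
    return 2 * p * (1 - p)

def improvement(parent, left, right):
    """Delta(s, t) = i(t) - pL * i(tL) - pR * i(tR); each node is (buyers, non_buyers)."""
    n = sum(parent)
    p_left, p_right = sum(left) / n, sum(right) / n
    return gini(*parent) - p_left * gini(*left) - p_right * gini(*right)

parent = (400, 400)
print(improvement(parent, (300, 100), (100, 300)))   # split #1: 0.125
print(improvement(parent, (200, 400), (200, 0)))     # split #2: ~0.167, the larger decrease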

Slide 64

Example: Six Segments in a Market, all Equally Likely

! Here we have a very satisfactory split
! 3 segments go left in their entirety
! Remainder go right
! What has happened to impurity?

Parent node:  (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
Left child:   (1/3, 1/3, 1/3, 0, 0, 0)
Right child:  (0, 0, 0, 1/3, 1/3, 1/3)

Slide 65

Gini Measure: i(t) = 1 - Σ_i p_i²

! Parent impurity is:
i(t) = 1 - (1/6)² - (1/6)² - (1/6)² - (1/6)² - (1/6)² - (1/6)² = 1 - 6(1/36) = 5/6
! Left child node impurity:
i(t) = 1 - (1/3)² - (1/3)² - (1/3)² - 0 - 0 - 0 = 1 - 3(1/9) = 2/3 = 4/6
! Right child node has the same impurity of 4/6
! Weighted average of the two is 4/6
! The improvement is 5/6 - 4/6 = 1/6 ≈ .167
! Note the numeric scale of improvements – often rather small numbers

Slide 66

Twoing Criterion for Multiclass Problem

! Classes are numbered {1, 2, ..., J}
! At each node group the classes into two subsets
! e.g. In the 10-class gym example the two subsets identified by the twoing root node splitter were
– C1 = {1,4,6,7,9,10}   C2 = {2,3,5,8}
! Then the best split for separating these two groups is found
! The process can be repeated for all possible groupings
– best overall is the selected splitter
! Same as GINI for a binary dependent variable

August 15, 2005Slide 67

The Twoing Measure

! The twoing criterion function Φ(s, t) is:

  Φ(s, t) = (p_L p_R / 4) [ Σ_j | p(j|t_L) − p(j|t_R) | ]²

! The best twoing split maximizes Φ(s, t)
! C* is given by:

  C1* = { j : p(j|t_L) ≥ p(j|t_R) }

! C* is the set of classes with higher probabilities of going left
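A sketch of the same criterion in Python (illustrative only; helper names are hypothetical), which can be compared against the Gini improvement sketch above:

    # p_left / p_right: fractions of the node's cases sent left / right.
    # dist_left / dist_right: within-child class distributions p(j|tL), p(j|tR).
    def twoing(p_left, p_right, dist_left, dist_right):
        classes = set(dist_left) | set(dist_right)
        total = sum(abs(dist_left.get(j, 0.0) - dist_right.get(j, 0.0)) for j in classes)
        return (p_left * p_right / 4.0) * total ** 2

    def superclass_left(dist_left, dist_right):
        """C1*: the classes more likely to go left than right."""
        classes = set(dist_left) | set(dist_right)
        return {j for j in classes if dist_left.get(j, 0.0) >= dist_right.get(j, 0.0)}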

August 15, 2005Slide 68

Twoing Interpreted

! Maximizes the sum of differences between the fraction of a class going left and the fraction going right
! Note the leading multiplying factor p_L p_R
! p_L p_R is greatest when p_L = p_R = 1/2
! Sum is scaled downwards for very uneven splits
! So first find the split which maximizes this sum of probability differences:

  p_L p_R [ Σ_j | p(j|t_L) − p(j|t_R) | ]²

– Group C1 is defined as those classes more likely to go left
– Group C2 is defined as those classes more likely to go right
– The CART authors think of these two groups as containing strategic information
» Exhibit class similarities

August 15, 2005Slide 69

GINI or Twoing: Which splitting criterion to use?

! The monograph suggests Gini is usually better
! Will be problem dependent
! When end-cut splits (uneven sizes) need to be avoided, use twoing
– but can also just set POWER>0
! If the target variable has many levels, consider twoing
! Always experiment - try both

August 15, 2005Slide 70

Gini vs. Twoing Example

Parent node (both panels): Class A 40, B 30, C 20, D 10

Gini best split (isolates the largest class):
  Left:  A 40, B 0,  C 0,  D 0       Right: A 0, B 30, C 20, D 10

Twoing best split (two equally sized superclasses):
  Left:  A 40, B 0,  C 0,  D 10      Right: A 0, B 30, C 20, D 0
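Scoring these two candidate splits with the earlier sketches shows the disagreement numerically (illustrative only):

    parent = {"A": 40, "B": 30, "C": 20, "D": 10}

    # Candidate 1: isolate the largest class ({A} vs. {B, C, D}).
    g1 = improvement(parent, {"A": 40}, {"B": 30, "C": 20, "D": 10})   # about 0.333
    t1 = twoing(0.4, 0.6, {"A": 1.0}, {"B": 0.5, "C": 1/3, "D": 1/6})  # about 0.24

    # Candidate 2: two equal-sized superclasses ({A, D} vs. {B, C}).
    g2 = improvement(parent, {"A": 40, "D": 10}, {"B": 30, "C": 20})   # about 0.30
    t2 = twoing(0.5, 0.5, {"A": 0.8, "D": 0.2}, {"B": 0.6, "C": 0.4})  # 0.25

    # Gini ranks candidate 1 higher; twoing ranks candidate 2 higher.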

August 15, 2005Slide 71

Competitor Splits

! CART searches for the BEST splitter at every node
! To accomplish this it first finds the best split for a specific variable
! Then it repeats this search over all other variables
! Results in a split quality measure for EVERY variable
! The variable with the highest score is the PRIMARY SPLITTER
! The other variables are the COMPETITORS
! Can see as many COMPETITORS as you want — they have all been computed
! BOPTIONS COMPETITORS=5 is the default for the number to PRINT
! To see scores for every variable try BOPTIONS COMPETITORS=250

August 15, 2005Slide 72

Linear Combination Splits

! Decision tree methods are notoriously awkward at tracking linear structure
! If the functional relationship is y = Xβ + error, CART will generate a sequence of splits of the form
– Is X < c1
– Is X < c2
– Is X < c3, etc.
! Awkward, crude, and not always easy to recognize when multiple variables are involved
! Can instead allow CART to search for linear combinations of predictors
– So a decision rule could be: If .85*X1 + .52*X2 < -5 then go left (see the sketch below)
! If the best linear combination found beats all standard splits it becomes the primary splitter
! The linear combination will be reported as a splitter only, never as a surrogate or a competitor
! A good example comes from gene research: a joint combination of many genes (say, up to a hundred) may be responsible for a certain feature (target level) showing up. Splitting on individual genes (a standard CART tree) is not an efficient way to represent this structure; an LC tree will capture it better
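How such a rule is applied to a case can be sketched as follows (illustrative only; the coefficients simply mirror the example rule above):

    def lc_goes_left(case, coefficients, threshold):
        """True if sum_k w_k * x_k < threshold, i.e. the case goes left."""
        score = sum(w * case[name] for name, w in coefficients.items())
        return score < threshold

    rule = {"X1": 0.85, "X2": 0.52}    # normalized: 0.85**2 + 0.52**2 is about 1
    print(lc_goes_left({"X1": -4.0, "X2": -5.0}, rule, -5.0))   # True -> go left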

August 15, 2005Slide 73

Linear Combination Caveats

! Linear combinations involve only numeric, not categorical, variables
! Scale is not unique; coefficients are normalized so the sum of squared coefficients = 1
! Linear combinations are searched using a stepwise algorithm — the global best may not be found
! Backward deletion is used; start with all numerics and back down
! May want to limit the search to larger nodes
! LINEAR N=500 prevents the search when the number of cases in a node < 500
! Benefit to favoring a small number of variables in combinations
– Easier to interpret and assess
– Does the combination make any empirical sense?

August 15, 2005Slide 74

Linear Combination Caveats, Cont'd.

! LINEAR N=500 DELETE=.40 permits deletion if a variable's γ-value < .4
– Would generate linear combinations with few variables
! Linear combinations are not invariant to variable transformations
– Quality of the result is influenced by log or square root transforms
! Default DELETE = .20
– Linear combination — get best improvement ∆I
– Drop one variable and re-optimize coefficients
– Repeat for each variable
– Rank the loss in ∆I

August 15, 2005Slide 75

Model Setup - Advanced

August 15, 2005Slide 76

Cost-Complexity Pruning

! Begin with a large enough tree — one that is larger than the truth
– With no processing constraints, grow until 1 case remains in each terminal node
! PRUNING AT A NODE means making that node terminal by deleting its descendants
! Idea
– Suppose we have a 100-node tree and we want to prune to a 90-node tree
– Should prune to the 90-node sub-tree that has the smallest error
– Similarly for all other sub-trees with 89, 88, ..., 5, 4, 3, 2 nodes
– Prune to the sub-tree with the smallest misclassification cost for that tree size
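The trade-off is usually written as a cost-complexity measure R_alpha(T) = R(T) + alpha*|T|, where R(T) is the tree's misclassification cost and |T| its number of terminal nodes. A minimal sketch, with invented candidate sub-trees for illustration:

    def cost_complexity(misclassification_cost, n_terminal_nodes, alpha):
        # R_alpha(T) = R(T) + alpha * |T|
        return misclassification_cost + alpha * n_terminal_nodes

    # Candidate sub-trees as (cost, terminal-node count); numbers are made up.
    candidates = [(0.10, 100), (0.12, 50), (0.20, 10), (0.35, 2)]
    alpha = 0.003
    best = min(candidates, key=lambda c: cost_complexity(c[0], c[1], alpha))
    # best == (0.20, 10): a larger alpha favors smaller trees.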

August 15, 2005Slide 77

Model Setup - Costs

August 15, 2005Slide 78

Costs of Misclassification

! For most classification schemes...
– "AN ERROR IS AN ERROR IS AN ERROR"
! No distinction is made between different types of error
– All count equally - all are equally bad
! In practical applications different errors are quite different in importance
– In a medical application: classifying a breast tumor as malignant or benign
– Misclassify malignant as benign — possible death of the patient
– Misclassify benign as malignant — unnecessary open breast biopsy
! Want both classes classified correctly — either mistake is serious
! May want to up-weight the error of misclassifying malignant tumors

August 15, 2005Slide 79

Market Segmentation Examples

! Martin & Wright (1974): segmenting a population into buyers vs. non-buyers
! Purpose: allocate marketing effort to buyers
– Marketing to a non-buyer loses the company $1
– Not marketing to a buyer forgoes $3 of profit
! Want to include this in the analysis
! Consider a group of customers to be classified on demographics
– Actually contains 40% buyers, 60% non-buyers
– Would want to classify this group as ALL buyers!
– Even though we would be misclassifying MOST people in the group
– .40 · $3 + .60 · (-$1) = $1.20 - $0.60 = $0.60 per person profit

August 15, 2005Slide 80

Explicit Cost Matrix

! Misclassifying buyers is three times as bad as misclassifying non-buyers

Default matrix:                          Classified as
                                         non-buyer   buyer
                   Truth   non-buyer         0         1
                           buyer             1         0

Want to specify explicit costs as:       Classified as
                                         non-buyer   buyer
                   Truth   non-buyer         0         1
                           buyer             3         0
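With such a cost matrix, a node is labeled with the class that minimizes the expected misclassification cost. A sketch using the 40% buyer group from the previous slide (illustrative only):

    # costs[truth][assigned]: cost of calling a case of class `truth` class `assigned`.
    costs = {"nonbuyer": {"nonbuyer": 0, "buyer": 1},
             "buyer":    {"nonbuyer": 3, "buyer": 0}}
    shares = {"nonbuyer": 0.60, "buyer": 0.40}

    def expected_cost(assigned):
        return sum(shares[truth] * costs[truth][assigned] for truth in shares)

    label = min(costs, key=expected_cost)
    # expected_cost("buyer") = 0.60, expected_cost("nonbuyer") = 1.20 -> label == "buyer"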

August 15, 2005Slide 81

Quick Formula for Binary Dependent Variable

! C(2|1) = cost of classifying as 2 when it is really class 1
! Classify a node as class 1 rather than class 2 if:

  C(2|1) π(1) N1(t)/N1  >  C(1|2) π(2) N2(t)/N2

! Note the adjustments are (a) reweight by priors and (b) reweight by costs
! Generalize by comparing any two classes i, j with:

  C(j|i) π(i) Ni(t)/Ni  >  C(i|j) π(j) Nj(t)/Nj

! A node is classified as class i if the inequality holds for all j (j ≠ i)
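A direct translation of this rule into code (illustrative only; names are hypothetical):

    # n_node[i], n_root[i]: cases of class i in node t and in the root;
    # prior[i] = pi(i); cost[i][j] = C(j|i), cost of calling a class-i case class j.
    def assign_class(n_node, n_root, prior, cost):
        def score(i, j):
            return cost[i][j] * prior[i] * n_node[i] / n_root[i]
        classes = list(n_node)
        for i in classes:
            if all(score(i, j) > score(j, i) for j in classes if j != i):
                return i
        return None   # ties are not handled in this sketch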

August 15, 2005Slide 82

Then Costs Incorporated Into GINI Splitting Rule

! Formula simplifies only when UNIT costs are used
! This is an entirely new splitting rule
! Note:

  GINI:             i(t) = 1 − Σ_i p(i|t)²
  GINI with costs:  i(t) = Σ_i Σ_j C(i|j) p(i|t) p(j|t)

! Rule uses symmetrized costs:

  C(i|j) p(i|t) p(j|t) + C(j|i) p(j|t) p(i|t) = [ C(i|j) + C(j|i) ] p(i|t) p(j|t)

August 15, 2005Slide 83

Model Setup - Priors

August 15, 2005Slide 84

Prior Probabilities

! An essential component of any CART analysis
! A prior probability distribution for the classes is needed to do proper splitting, since prior probabilities are used to calculate the p_i in the Gini formula
! Example: Suppose the data contain 99% type A (nonresponders), 1% type B (responders)
! CART might focus on not missing any class A objects
– one "solution": classify all objects as nonresponders
! Advantages: only a 1% error rate
! Very simple predictive rules, although not informative
! But realize this could be an optimal rule

August 15, 2005Slide 85

Terminology: Re-weight Classes with "PRIOR PROBABILITIES"

! Most common priors: PRIORS EQUAL
! This gives each class equal weight regardless of its frequency in the data
! Prevents CART from favoring more prevalent classes
! If we have 900 class A and 100 class B and equal priors:

  Root node:    A: 900   B: 100
  Left child:   A: 600   B: 90    (2/3 of all A, 9/10 of all B)   classed as B
  Right child:  A: 300   B: 10    (1/3 of all A, 1/10 of all B)   classed as A

! With equal priors prevalence is measured as a percent of the class's OWN size
! Measurements are relative to the own class, not the entire learning set
! If PRIORS DATA both child nodes would be class A
! Equal priors puts both classes on an equal footing

August 15, 2005Slide 86

Quick formula for Binary Dependent Variable (Version 1)

! For equal priors a node is classed as class 1 if

  N1(t)/N1 > N2(t)/N2

  where Ni = number of cases in class i at the root node
        Ni(t) = number of cases in class i at node t

! Classify any node by the "count ratio in node" relative to the "count ratio in root"
! Classify by which level has the greatest relative richness

August 15, 2005Slide 87

Quick formula for Binary Dependent Variable (Version 2)

! When priors are not equal, classify a node as class 1 if:

  π(1) N1(t)/N1 > π(2) N2(t)/N2     where π(i) is the prior for class i

! Since π(1)/π(2) always appears as a ratio, just think in terms of boosting the amount Ni(t) by a factor, e.g. 9:1, 100:1, 2:1
! If priors are EQUAL, the π terms cancel out
! If priors are DATA, then π(1)/π(2) = N1/N2, which simplifies the formula to the plurality rule (simple counting): classify as class 1 if N1(t) > N2(t)
! If priors are neither EQUAL nor DATA then the formula will provide a boost to up-weighted classes
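A two-class sketch of this rule (illustrative only), using the 900 class A / 100 class B example developed on the next slide:

    def node_label(n1_t, n2_t, n1_root, n2_root, prior1=0.5, prior2=0.5):
        # class 1 if pi(1)*N1(t)/N1 > pi(2)*N2(t)/N2, else class 2
        return 1 if prior1 * n1_t / n1_root > prior2 * n2_t / n2_root else 2

    # Root: 900 class A (=1), 100 class B (=2).  Node with A=600, B=90:
    print(node_label(600, 90, 900, 100))                           # equal priors -> 2 (B)
    print(node_label(600, 90, 900, 100, prior1=0.9, prior2=0.1))   # data priors  -> 1 (A)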

August 15, 2005Slide 88

Recalling our Example: 900 Class A and 100 Class B in the Root Node

  Root node:  A: 900   B: 100

  Node with A: 600, B: 90     90/100 > 600/900     so class as B
  Node with A: 300, B: 10     300/900 > 10/100     so class as A

In general the tests would be weighted by the appropriate priors π(A) and π(B); the tests can be written as A vs. B or B vs. A, as convenient.

August 15, 2005Slide 89

Equal Priors Tend to Equalize Misclassification Rates

! EQUAL PRIORS is the default setting for all CART analyses
! Gives each class an equal chance of being correctly classified
! Examples of PRIORS:
– PRIORS EQUAL         Default - should use this to start with
– PRIORS DATA          Empirical frequency, whatever is found in the data
– PRIORS MIX           Average of EQUAL and DATA - shades toward empirical
– PRIORS = n1,n2       Explicit priors
– PRIORS = 2,1         Makes the class 1 prior twice the class 2 prior
– PRIORS = .67, .33    Same as PRIORS 2,1

August 15, 2005Slide 90

Priors Incorporated Into Splitting Criterion

! p_i(t) = within-node probability of class i in node t, used in Gini = 1 − Σ_i p_i²
! If priors are DATA then π(i) = Ni/N and

  p_i(t) = Ni(t) / N(t)

– proportions of class i in node t with data priors
! Otherwise proportions are always calculated as weighted shares using the priors-adjusted p_i:

  p_i(t) = [ π(i) Ni(t)/Ni ] / Σ_j [ π(j) Nj(t)/Nj ]
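A sketch of the priors-adjusted proportions (illustrative only; the helper name is hypothetical):

    def node_probabilities(n_node, n_root, prior):
        # p_i(t) = pi(i)*Ni(t)/Ni / sum_j pi(j)*Nj(t)/Nj
        weights = {i: prior[i] * n_node[i] / n_root[i] for i in n_node}
        total = sum(weights.values())
        return {i: w / total for i, w in weights.items()}

    # Node A=600, B=90 with root A=900, B=100 and equal priors:
    print(node_probabilities({"A": 600, "B": 90}, {"A": 900, "B": 100},
                             {"A": 0.5, "B": 0.5}))
    # roughly {'A': 0.426, 'B': 0.574} -- these p_i feed the Gini formula above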

August 15, 2005Slide 91

Using Priors to Reflect Importance

! Putting a larger prior on a class will tend to decrease its misclassification rate
! Hence can use priors to shape an analysis
! Example: misclassification rates by class for alternative priors, number misclassified (N) and probability of misclassification (%):

                         Relative prior on class 2
  Class   Size      1.0          1.2          1.4          1.6
                    N     %      N     %      N     %      N     %
  1         12      3   25.0     5   41.7     5   41.7     6   50.0
  2         49     47   95.9    30   61.2    26   53.1    15   30.6
  3        139     46   33.1    74   53.2    77   55.4    94   67.6
  Total    200     96           109          108          115

August 15, 2005Slide 92

Misclassification Costs vs. Priors

! Can increase the "importance" of a class by increasing its prior
! Prior weights are equivalent to raising all costs for a given class
! Misclassification costs can vary by specific error
» Misclassify a class B as a class C    high cost
» Misclassify a class B as a class A    low cost
! This level of control is not available by manipulating priors

  Sample cost matrix         A    B    C
                        A    •    3    1
                        B    1    •    4
                        C    2    2    •

August 15, 2005Slide 93

Model Setup - Penalty

August 15, 2005Slide 94

Handling Missing Values in Tree Growing

! Allow cases with a missing split variable to follow the majority
! Assign cases with a missing split variable to go left or right probabilistically, using PL and PR as probabilities
! Allow missing to be a value of the variable

August 15, 2005Slide 95

Missing Values on Primary Splitter

! One option: send all cases with missings with the majority of that node
! Some machine learning programs use this procedure
! CHAID treats missing as a categorical value; all missings go the same way
! CART uses a more refined method — a surrogate is used as a stand-in for a missing primary field
! Consider a variable like INCOME — it could often be missing
! Other variables like Father's or Mother's Education or Mean Income of Occup. might work as good surrogates
! Using surrogates means that cases missing on the primary are not all treated the same way
! Whether a case goes left or right depends on its surrogate value

August 15, 2005Slide 96

Surrogates — Mimicking Alternatives to Primary Splitters

! A primary splitter is the best splitter of a node
! A surrogate is a splitter that splits in a fashion similar to the primary
! Surrogate — a variable with possibly equivalent information
! Why useful
– Reveals structure of the information in the variables
– If the primary is expensive or difficult to gather and the surrogate is not
» Then consider using the surrogate instead
» Loss in predictive accuracy might be slight
– If the primary splitter is MISSING then CART will use a surrogate

August 15, 2005Slide 97

Association

! How well can another variable mimic the primary splitter?
! Predictive association between primary and surrogate splitter
! Consider another variable and allow reverse splits
– A reverse split sends cases to the RIGHT if the condition is met
– The standard split (primary) always sends cases to the LEFT
! Consider a default splitter to mimic the primary
– Send ALL cases to the node that the primary splitter favors
– Default mismatch rate is min(PL, PR)
» e.g. if the primary sends 400 cases left and 100 cases right with PRIORS DATA
» PL = .8   PR = .2
» If the default sends all cases left then 400 cases are matched and 100 mismatched
» Default mismatch rate is .2
» A credible surrogate must yield a mismatch rate LESS THAN the default

August 15, 2005Slide 98

Evaluating Surrogates

! Matches are evaluated on a case-by-case basis
! If the surrogate matches on 85% of cases then its mismatch rate is 15%

  Association = (default mismatch − surrogate mismatch) / default mismatch

! In this example: Association = (.2 − .15)/.2 = .05/.2 = .25
! Note that the surrogate is quite good by a priori standards yet the measure seems low
! Also note that association can be negative, and will be for most variables
! If the match is perfect then surrogate mismatch = 0 and association = 1
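In code form (illustrative only):

    def association(p_left, p_right, surrogate_mismatch):
        default_mismatch = min(p_left, p_right)
        return (default_mismatch - surrogate_mismatch) / default_mismatch

    print(association(0.8, 0.2, 0.15))   # 0.25, as in the example above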

August 15, 2005Slide 99

Competitors vs. Surrogates

  Parent node: Class A 100, Class B 100, Class C 100

                 Primary split      Competitor split      Surrogate split
                 Left     Right     Left     Right        Left     Right
  Class A         90       10        80       20           78       22
  Class B         80       20        25       75           74       26
  Class C         15       85        14       86           21       79

August 15, 2005Slide 100

Classification Is Accomplished by the Entire Tree, Not Just One Node

! Even if a case goes the wrong way at a node (say the surrogate is imperfect) it is not necessarily a problem
! The case may get correctly classified at the next node or further down the tree
! CART trees will frequently contain self-correcting information

August 15, 2005Slide 101

Surrogates — How Many Can You Get

! Depends on the data — you might not get any!
! Often you will see fewer than the top 5 that would be printed by default
! A splitter qualifies as a surrogate ONLY if its association value > 0
! Note surrogates are ranked in order of association
! A relatively weak surrogate (by association) might have better IMPROVEMENT than the best surrogate
! A variable can be both a good competitor and a good surrogate
– Competitor split values might be different than surrogate split values

August 15, 2005Slide 102

Interpretation of CART Results

August 15, 2005Slide 103

Detailed Example: GYM Cluster Model

! Problem: needed to understand a market research clustering scheme
! Clusters were created using 18 variables and conventional clustering software
! Wanted simple rules to describe cluster membership
! Experimented with a CART tree to see if there was an intuitive story

August 15, 2005Slide 104

Gym Example - Variable Definitions

CLUSTER    Cluster assigned from clustering scheme (10-level categorical coded 1-10)
ANYPOOL    Pool usage (no=0, yes=1)
ANYRAQT    Racquet ball usage (no=0, yes=1)
BABY       Have a baby (no=0, yes=1)
CLASSES    Number of classes taken
FEMALE     Are you female (no=0, yes=1)
FIT        Fitness score
HOME       Home ownership (no=0, yes=1)
IPAKPRIC   Index variable for package price
MONFEE     Monthly fee paid
NFAMMEN    Number of family members
NSUPPS     Number of supplements/vitamins/frozen dinners purchased
OFFAER     Number of off-peak aerobics classes attended
ONAER      Number of on-peak aerobics classes attended
ONPOOL     Number of on-peak pool uses
ONRCT      Number of on-peak racquet ball uses
PERSTRN    Personal trainer (no=0, yes=1)
PLRQTPCT   Percent of pool and racquet ball usage
SAERDIF    Difference between number of on- and off-peak aerobics visits
SMALLBUS   Small business discount (no=0, yes=1)
TANNING    Number of visits to tanning salon
TPLRCT     Total number of pool and racquet ball uses

August 15, 2005Slide 105

CART Navigator

August 15, 2005Slide 106

Classic CART output: Tree Sequence

  =========================
       TREE SEQUENCE
  =========================
  Dependent variable: CLUSTER

         Terminal    Test Set                Resubstitution   Complexity
  Tree     Nodes     Relative Cost           Relative Cost    Parameter
  -----------------------------------------------------------------------
    1       2937     0.679 +/- 0.002             0.577        0.000000
   10**      115     0.659 +/- 0.002             0.646        0.000096
   23         11     0.667 +/- 0.002             0.665        0.000340
   24          9     0.667 +/- 0.002             0.666        0.000351
   25          8     0.668 +/- 0.002             0.667        0.000704
   26          7     0.672 +/- 0.002             0.670        0.003301
   27          6     0.676 +/- 0.002             0.675        0.004508
   28          5     0.684 +/- 0.002             0.684        0.008158
   29          4     0.732 +/- 0.002             0.730        0.041258
   30          3     0.806 +/- 0.001             0.805        0.066986
   31          2     0.889 +/- .309661E-04       0.889        0.075907
   32          1     1.000 +/- .602225E-04       1.000        0.099990

  Initial misclassification cost = 0.900
  Initial class assignment = 3

August 15, 2005Slide 107

GUI CART Output: Skeleton Tree Diagram

August 15, 2005Slide 108

Navigator – Summary Reports

August 15, 2005Slide 109

Gains Charts

! The x-axis represents the % of the data included and the y-axis represents the percentage of the target class included.
! The 45-degree line maps the % of the target class you would expect if each node were a random sample of the population.
! The blue curved line represents the cumulative % of class 1 (column 5 in the grid) versus the cumulative % of the total population (column 6), with the data ordered from the richest to the poorest node.
! The vertical distance between these two lines depicts the gains or lift at each point along the x-axis.
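The curve's points can be built from per-node counts roughly as follows (an illustrative sketch, not the Navigator's own code):

    def gains_points(nodes):
        """nodes: list of (class1_count, total_count) per terminal node."""
        nodes = sorted(nodes, key=lambda n: n[0] / n[1], reverse=True)  # richest first
        total1 = sum(n[0] for n in nodes)
        total = sum(n[1] for n in nodes)
        points, c1, c = [(0.0, 0.0)], 0, 0
        for n1, n in nodes:
            c1, c = c1 + n1, c + n
            points.append((c / total, c1 / total1))   # (% of data, % of class 1)
        return points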

August 15, 2005Slide 110

Variable Importance

August 15, 2005Slide 111

Variable Importance Measured by Impurity Improvement

! Only look at primary splitters and SURROGATES, not COMPETITORS
! Reason: competitors are trying to make a particular split
– If prevented from making the split at one node they will try again at the next node
– The next-node attempt could be the SAME split conceptually, leading to double counting
! Focus is on improvement among surrogates; non-surrogates are not counted
! Hence IMPORTANCE is a measure relative to a given tree structure
! Changing the tree structure could yield different importance measures

August 15, 2005Slide 112

Variable Importance Caution

! Importance is a function of the OVERALL tree, including the deepest nodes
! Suppose you grow a large exploratory tree — review importances
! Then find an optimal tree via test set or CV, yielding a smaller tree
! The optimal tree is the SAME as the exploratory tree in the top nodes
! YET importances might be quite different
! WHY? Because the larger tree uses more nodes to compute the importance
! When comparing results be sure to compare similar or same-sized trees

August 15, 2005Slide 113

Variable Importance & Number of Surrogates Used

! Importance is determined by the number of surrogates tracked
! Tables below are derived from the SAME tree
! WAVE data set example

  Allowing up to 5 Surrogates        Allowing Only 1 Surrogate
  V07   100.000                      V15   100.000
  V15    93.804                      V11    73.055
  V08    78.773                      V14    71.180
  V11    68.529                      V09    32.635
  V14    66.770                      V08    28.914
  V06    65.695                      V10    18.787
  V16    61.355                      V12    12.268
  V09    37.319                      V17     5.699
  V12    32.853                      V04     0.000
  V05    32.767                      V07     0.000

August 15, 2005Slide 114

Navigator – Prediction Success

August 15, 2005Slide 115

Navigator – Misclassification

August 15, 2005Slide 116

Scoring Data

August 15, 2005Slide 117

Applying the Tree to New Data: Called Dropping Data Down the Tree

! Remember that missing values are not a problem
– CART will use surrogates (if available)
! Core results go to the SAVE file
– classification for each case
– specific terminal node this case reached
– complete path down the tree
– whether the class was correct or not
! Can also save other variables
– up to 50 user-specified variables such as ID
– will want ID variables for tracking and merging
– optionally, all variables used in the original model
– splitters, surrogates

August 15, 2005Slide 118

Variables Saved By CASE

! One record saved for each case
! RESPONSE: classification assigned by CART
! NODE: terminal node number
! DEPTH: depth level of the terminal node
! PATH(n): non-terminal node number at each depth

August 15, 2005Slide 119

Records of the Output Dataset

  ID   NODE   RESPONSE   DEPTH   PATH1   PATH2   PATH3   PATH4
   1     5       1         5       2       5       6       7
   2     7       0         3       2       5      -7       0
   3     1       0         4       2       3       4      -1
   4     5       1         5       2       5       6       7
   5     5       1         5       2       5       6       7
   6     4       0         5       2       5       6       7
   7     4       0         5       2       5       6       7

! CASE 2: drops down into NODE = 7, predicted RESPONSE = 0, final DEPTH = 3
! CASE 2: From root node 1, splits to node 2. Splits again to node 5. Splits again, reaching terminal node 7.

August 15, 2005Slide 120

Regression Trees

August 15, 2005Slide 121

Boston Housing Data

! Harrison, D. and D. Rubinfeld. Hedonic Housing Prices & Demand For Clean Air. Journal of Environmental Economics and Management, v5, 81-102, 1978
– 506 census tracts in the City of Boston for the year 1970
– Goal: study the relationship between quality-of-life variables and property values
– MV       median value of owner-occupied homes in tract ('000s)
– CRIM     per capita crime rate
– NOX      concentration of nitrogen oxides (pphm)
– AGE      percent built before 1940
– DIS      weighted distance to centers of employment
– RM       average number of rooms per house
– LSTAT    percent of neighborhood 'lower SES'
– RAD      accessibility to radial highways
– ZN       percent land zoned for lots
– CHAS     borders Charles River (0/1)
– INDUS    percent non-retail business
– TAX      tax rate
– PT       pupil-teacher ratio

August 15, 2005Slide 122

About Regression Trees

! Improvement is the reduction in variance due to the split
! R-squared is measured as: 1 − relative resubstitution cost
– In the Boston case: 1 − .076 = .924
! Regression trees are not as strong with highly linear data
– Use another tool like MARS
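For a regression tree the improvement measure is the drop in (case-weighted) variance of the target rather than the drop in Gini impurity. A minimal sketch (illustrative only; helper names are hypothetical):

    def variance(y):
        m = sum(y) / len(y)
        return sum((v - m) ** 2 for v in y) / len(y)

    def regression_improvement(y_parent, y_left, y_right):
        # variance of the parent minus the case-weighted variances of the children
        n = len(y_parent)
        return (variance(y_parent)
                - (len(y_left) / n) * variance(y_left)
                - (len(y_right) / n) * variance(y_right))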

August 15, 2005Slide 123

Practical Advice

August 15, 2005Slide 124

Practical Advice — First Runs

! Start with ERROR EXPLORE
! Set FORMAT=9 to get more decimal places printed in output
! Fast runs since only one tree is grown
! If you have a trivial model (perfect classifiers) you'll learn quickly, and won't waste time testing
! If your resubstitution error rate is very high, test-based results will only be worse
! Look for some good predictor variables while still exploring

August 15, 2005Slide 125

Practical Advice — Set Complexity to a Non-Zero Value

! Review your TREE SEQUENCE
! How large a tree is likely to be needed to give accurate results?
! Gives you an idea of where to set complexity to limit tree growth
! Consider the sample output — the maximal tree grown had 354 nodes, many more than credibly needed
! So choose a complexity that will limit the maximal tree to, say, 100 nodes
– May have to guess at a complexity value
– Just use something other than 0 for large problems
! Each cross-validation tree will grow until the target complexity is reached
– Limiting trees to 100 instead of 300 nodes could substantially reduce your run times without risking error

August 15, 2005Slide 126

Practical Advice — Set the SE Rule to Suit Your Data

! The default of SERULE=1 could trim trees back too far
! New wisdom: SERULE=0 could give you trees that are too large to absorb
! Could try something in between (SERULE=.5 or SERULE=.25)
! Once you have settled on a model and test results you don't need the SE rule at all
– go back to ERROR EXPLORE and use complexity to select the tree
– within an interactive analysis choose another tree from the TREE SEQUENCE with the PICK command
– PICK NODES=10

August 15, 2005Slide 127

Practical Advice — Set Up a Battery of Runs

! Prepare batches of CART runs with varying control parameters
! Experienced CART users realize many runs may be necessary to really understand the data
! Experiment with splitting rules
– GINI is the default and should always be used
– TWOING will differ for multi-class problems
– TWOING POWER=1 will develop different trees
! The POWER option can break an impasse; it sometimes allows very difficult problems to make headway

August 15, 2005Slide 128

Practical Advice — Experiment with PRIORS

! PRIORS EQUAL is the default — best chance of good results
! Olshen favors PRIORS MIX — a bow in the direction of the data
! PRIORS DATA is least likely to give satisfactory results
! Try a grid search. For a binary classification tree...
– march the priors from (.3, .7) to (.7, .3)
– see if a shift in priors grows a more interesting tree
! Remember, priors shift CART's emphasis from one class to another
! A small shift in priors can make an unbalanced tree more balanced and cause CART to prune a tree a little differently
! The testing and pruning process will keep trees honest

August 15, 2005Slide 129

Practical Advice — Experiment with TEST Methods

! CV=10 should be used for small data sets
! Rerun CV with a new random number seed
– e.g. SEED 100, 200, 500
! Check for general agreement of the estimated error rate
! Splits and the initial tree will not change
– ONLY the ERROR RATE estimate and possibly the optimal tree size will change
! Try ERROR P=p
– where p = .1 or .25 or .5, etc.
– Very fast compared to CV
– Problematic when the sample is very small

August 15, 2005Slide 130

Practical Advice — Prepare a Summary of All Your Runs

! Print out just the TREE SEQUENCES
! In a separate report print out just the primary split variables
! In a separate report print out the top of the VARIABLE IMPORTANCE list

August 15, 2005Slide 131

Practical Advice for CART Analyses — Variable Selection

! In theory CART can find the important variables for you
! The theory holds when you have massive data sets (say 50,000 per level)
! In practice you need to be judicious and help
! Eliminate nonsense variables such as IDs and account numbers
– They probably track other information
! Eliminate variations of the dependent variable
– You might have Y and LOGY in the data set
– You might have a close but imperfect variant of the DPV
! Can control these nuisance variables with the EXCLUDE command

August 15, 2005Slide 132

Practical Advice for CART Analyses — Variable Selection II

! Worthwhile to think about variables that should matter
! Begin by excluding those that should not — TO GET STARTED
! CART's performance on a model with few variables could be much better than with a larger search set
! The reason is the potential for mistracking in the INITIAL tree
– Say a chance variation makes X50 a primary splitter
– Testing finds that this does not hold up
– The error rate on the tree is high and results are disappointing
! If the fluke variable is not even in the search set then the bad split doesn't happen
– Instead of fluky X50 a less improving but solid variable is used
– Testing supports it and the final tree performs better

August 15, 2005Slide 133

Troubleshooting Problem: Cross-Validation Breaks Down

! CART requires at least one case in each level of the DPV
! If there are only a few cases for a given level in the data set, a random test set during cross-validation might get no observations in that level
! May need to aggregate levels, OR
! Delete levels with very low representation from the problem

August 15, 2005Slide 134

NO TREE GENERATED Message

! CART will produce a TREE SEQUENCE and nothing else when the test error rate for a tree of any size is higher than for the root node
! In this case CART prunes ALL branches away, leaving no tree
! Could also be caused by SERULE=1 and a large SE
! Trees were actually generated - including the initial tree and all CV trees
! CART maintains that there is nothing worthwhile to print
! Need some output to diagnose
! Try an exploratory tree with ERROR EXPLORE
! Grows the maximal tree but does not test
! The maximum size of any initial tree or CV tree is controlled by the COMPLEXITY parameter

August 15, 2005Slide 135

The End

Topics of interest that are not covered here

! Ask your instructor to send you information on these subjects