Data Analytics CMIS Short Course Part II
Day 1 Part 2: Trees
Sam Buttrey
December 2015
Regression Trees

• The usual set-up: numeric responses y1, …, yn; predictors Xi for each yi
  – These might be numeric, categorical, logical…
• Start with all the y’s in one node; measure the “impurity” of that node (the extent to which the y’s are spread out)
• The object is to divide the observations into sub-nodes of high purity (that is, with similar y’s)
Impurity Measure

• Any measure of impurity should be 0 when all y’s are the same; otherwise > 0
• Natural choice for continuous y’s: compute the prediction ŷ and take the impurity to be
  D = Σi (yi – ŷ)²
  – RSS (the deviance for the Normal model) is preferable to the SD; impurity should scale with sample size
  – This is “just right” if the y’s are Normal and unweighted (we care about every observation equally)
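As a concrete illustration (a minimal Python sketch, not the internals of tree() or rpart()), the RSS impurity of a node is:

```python
def impurity(y):
    """Deviance D = sum_i (y_i - ybar)^2 for the y's in one node."""
    ybar = sum(y) / len(y)
    return sum((yi - ybar) ** 2 for yi in y)

# A pure node has impurity 0; spread-out y's give impurity > 0.
print(impurity([5.0, 5.0, 5.0]))   # 0.0
print(impurity([1.0, 2.0, 3.0]))   # 2.0
```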
![Page 4: Data Analytics CMIS Short Course part II Day 1 Part 2: Trees Sam Buttrey December 2015](https://reader031.vdocument.in/reader031/viewer/2022012914/5a4d1b1a7f8b9ab05999324e/html5/thumbnails/4.jpg)
Reducing Impurity

• R implementations in tree(), rpart()
• Both measure impurity by RSS by default
• Now consider each X column in turn:
1. If Xj is numeric, divide the y’s into two pieces, one where Xj ≤ a and one where Xj > a (a “split”)
  – Try every a; there are at most n – 1 of these
  • E.g. Alcohol < 4; Alcohol < 4.5, etc.; Price < 3; Price < 3.5, etc.; Sodium < 10, …
Reducing Impurity, cont’d

• In the left and right “child” nodes, compute separate means ŷL, ŷR and separate deviances DL and DR
• The decrease in deviance (i.e. increase in purity) for this split is D – (DL + DR)
• Our goal is to find the split for which this decrease is largest – equivalently, for which DL + DR is smallest (i.e. for which the two resulting nodes are purest)
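The exhaustive search over thresholds can be sketched in a few lines of Python (an illustration of the idea; rpart’s actual search is optimized C code):

```python
def rss(ys):
    """Deviance (RSS) of a node holding the y's in ys."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(x, y):
    """Try every threshold a on a numeric predictor; return the split
    minimizing DL + DR (equivalently, maximizing the decrease D - (DL + DR))."""
    best_a, best_sum = None, float("inf")
    for a in sorted(set(x))[:-1]:        # at most n - 1 distinct splits
        dl = rss([yi for xi, yi in zip(x, y) if xi <= a])
        dr = rss([yi for xi, yi in zip(x, y) if xi > a])
        if dl + dr < best_sum:
            best_a, best_sum = a, dl + dr
    return best_a, best_sum

a, d = best_split([1, 2, 3, 10, 11, 12], [1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
print(a, d)   # 3 0.0: splitting at x <= 3 gives two perfectly pure children
```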
Beer Example

• “Root” impurity is 20320 calories² (n = 35)
• Candidate splits and the resulting child-node deviances:

| Split | Left child | Right child | Sum DL + DR |
|---|---|---|---|
| Alc ≤ 4 vs. Alc > 4 | n = 3, D = 1241 | n = 32, D = 8995 | 10236 |
| Alc ≤ 4.5 vs. Alc > 4.5 | n = 10, D = 7308 | n = 25, D = 1584 | 8892 (best among these) |
| Alc ≤ 4.9 vs. Alc > 4.9 | n = 27, D = 14778 | n = 8, D = 929 | 15707 |
| Price ≤ 2.75 vs. Price > 2.75 | n = 21, D = 15279 | n = 14, D = 3647 | 18926 |
| Cost ≤ 0.46 vs. Cost > 0.46 | n = 21, D = 15279 | n = 14, D = 3647 | 18926 |
| Sod ≤ 10 vs. Sod > 10 | n = 11, D = 9965 | n = 24, D = 9431 | 19396 |
Reducing Impurity

2. If Xj is categorical, split the data into two pieces, one with one subset of the categories and one with the rest (a “split”)
  – For k categories, there are 2^(k–1) – 1 of these
  • E.g. divide men from women, (sub/air) from (surface/supply), (old/young) from (med.)
• Measure the decrease in deviance exactly as before
• Select the best split among all possibilities
  – Subject to rules on minimum node size, etc.
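The count 2^(k–1) – 1 can be checked with a short enumeration (a Python illustration, not rpart’s actual search): fixing one category on the left avoids counting each two-way partition twice.

```python
from itertools import combinations

def categorical_splits(categories):
    """Enumerate the 2^(k-1) - 1 two-way partitions of k categories.
    Fix one category on the left so each split is counted only once."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            if len(left) < len(cats):      # skip the trivial all-vs-none split
                splits.append((left, set(cats) - left))
    return splits

# k = 3 categories -> 2^2 - 1 = 3 candidate splits
print(len(categorical_splits(["old", "young", "med"])))   # 3
```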
The Recursion
• Now split the two “child” nodes (say, #2 and #3) separately
• #2’s split will normally be different from #3’s; it could be on the same variable used at the root, but usually won’t be
• Split #2 into 4 and 5, and #3 into 6 and 7, so as to decrease the impurity as much as possible; then split the resulting children
  – Node q’s children are numbered 2q and 2q + 1
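The recursion and the 2q / 2q + 1 numbering can be sketched as follows (a toy single-predictor Python version, with assumed stopping rules of a minimum node size and node purity):

```python
def rss(ys):
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def grow(x, y, node_id=1, min_size=3, nodes=None):
    """Recursion sketch: node q's children are numbered 2q and 2q + 1.
    Greedily splits one numeric predictor x; stops at small or pure nodes."""
    if nodes is None:
        nodes = {}
    nodes[node_id] = list(y)
    if len(y) < min_size or rss(y) == 0 or len(set(x)) == 1:
        return nodes
    # choose the threshold a minimizing DL + DR
    _, a = min(
        (rss([yi for xi, yi in zip(x, y) if xi <= c])
         + rss([yi for xi, yi in zip(x, y) if xi > c]), c)
        for c in sorted(set(x))[:-1]
    )
    left = [(xi, yi) for xi, yi in zip(x, y) if xi <= a]
    right = [(xi, yi) for xi, yi in zip(x, y) if xi > a]
    grow(*zip(*left), node_id=2 * node_id, min_size=min_size, nodes=nodes)
    grow(*zip(*right), node_id=2 * node_id + 1, min_size=min_size, nodes=nodes)
    return nodes

tree = grow([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
print(sorted(tree))   # [1, 2, 3]: the root and its two children
```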
Some Practical Considerations

1. When do we stop splitting?
  – Clearly, too much splitting leads to over-fitting
  – Stay tuned for this one
2. It feels bad to create child nodes with only one observation – maybe even with fewer than 10
3. Adding deviances implies we think they’re on the same scale – that is, we assume homoscedasticity when we use RSS as our impurity measure
Prediction
• For a new case, find the terminal node it falls into (based on its X’s)
• Predict the average of the y’s there– SDs are harder to come by!
• Diagnostics: residuals, fitted values, within-node variance vs. mean
• Instead of fitting a plane in the X space, we’re dividing it up into oblong pieces and assuming constant y in each piece
R Example

• R has the tree and rpart libraries for trees
• Example 1: beer
  – There’s a tie for best split
• plot() plus text() draws pictures
  – Or use rpart.plot() from the rpart.plot library
• In this case, the linear model is better…
  – Unless you include both Price and Cost…
  – The tree model is unaffected by “multi-collinearity”
• Trees are easy to understand; it’s easy to make nice pictures; but there are almost no theoretical results
Tree Example (1985 CPS wage)

• Training set (n = 427), test set (n = 107)
• Step 1: produce a tree for Wage
  – Notice heteroscedasticity, as reported by meanvar()
  – We hope for a roughly flat picture, indicating constant variance across leaves
  – In this case, let’s take the log of Wage
  – Let tree stop splitting according to its defaults
    • We will discuss these shortly!
R Interlude

• Wage tree using the log of wage
• Compare performance to several linear models
• In this example the tree is not as strong as a reasonable linear model…
• …But sometimes it’s better…
• …And stay tuned for ensembles of models
Tree Benefits

• Trees are interpretable and easy to understand
• They extend naturally to classification, including the case with more than two classes
• Insensitive to monotonic transformations of the X variables (unlike the linear model)
  – Reduces the impact of outliers
• Interactions are included automatically
• Smart missing-value handling
  – Both when building the tree and when predicting
Tree Benefits and Drawbacks

• (Almost) entirely automatic model generation
  – The model is local, rather than global like regression
• No problem when columns overlap (e.g. beer) or when # columns > # rows
• On the other hand…
• Abrupt decision boundaries look weird
• Some problems in life are approximately linear, at least after lots of analyst input
• No inference – just test-set performance
Bias vs Variance

[Diagram: nested model spaces – “Linear” inside “Linear Plus” inside “All Relationships” – with points marking the Best Linear Model, the Best LM with transformations and interactions, the Best Tree, and the True Model]
Bias vs Variance, cont’d

[Same diagram, now annotated with “Bias”: the gap between the best model in a restricted class and the True Model]
Stopping Rules

• Defaults (in rpart):
  – Do not split a node with < 20 observations
  – Do not create a node with < 7
  – R² must increase by .01 at each step (this value is the complexity parameter cp)
  – Maximum depth = 30 (!)
• The plan is to intentionally grow a tree that’s too big, then “prune” it back to the optimal size using…
• Cross-validation!
Cross-validation in Rpart

• Cross-validation is done automatically
  – Results appear in the cp table of the rpart object

          CP nsplit rel error   xerror     xstd
  1 0.162166      0  1.000000 1.001213 0.064532
  2 0.045349      1  0.837834 0.842130 0.054346
  3 0.042957      2  0.792485 0.835233 0.053927
  4 0.029107      3  0.749529 0.780557 0.054524
  5 0.028094      4  0.720422 0.774898 0.055108
  6 0.015638      5  0.692328 0.751739 0.053513
  7 0.013300      6  0.676690 0.776740 0.061803
  8 0.010000      7  0.663390 0.785274 0.062075

• rel error: impurity as a fraction of the impurity at the root – always goes down for larger trees
• xerror: cross-validated error – often goes down and then back up
• Find the minimum xerror value…
• …And use the corresponding cp (rounding upwards)
Pruning Recap

• In the wage example:
  wage.rp <- rpart(LW ~ ., data = wage)
  plotcp(wage.rp)          # show minimum
  prune(wage.rp, cp = .02)
• The “optimal” tree size is random – it depends on the cross-validation
• Or prune by the one-SE rule: select the smallest size whose xerror < (min xerror + 1 corresponding SE)
  – Less variable?
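Both selection rules can be sketched directly from the xerror and xstd columns of the cp table shown earlier (a Python illustration of the logic, not rpart’s own code):

```python
# The cp table, keyed here as (CP, nsplit, xerror, xstd);
# values copied from the wage example's table shown earlier.
rows = [
    (0.162166, 0, 1.001213, 0.064532),
    (0.045349, 1, 0.842130, 0.054346),
    (0.042957, 2, 0.835233, 0.053927),
    (0.029107, 3, 0.780557, 0.054524),
    (0.028094, 4, 0.774898, 0.055108),
    (0.015638, 5, 0.751739, 0.053513),
    (0.013300, 6, 0.776740, 0.061803),
    (0.010000, 7, 0.785274, 0.062075),
]

best = min(rows, key=lambda r: r[2])        # row with the minimum xerror
print(best[1])                              # 5: the minimum-xerror tree size

# One-SE rule: smallest tree whose xerror is within one SE of the minimum
threshold = best[2] + best[3]
one_se = min((r for r in rows if r[2] <= threshold), key=lambda r: r[1])
print(one_se[1])                            # 3: a smaller, less variable choice
```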
Rpart Features

• rpart.control() sets things like minimum leaf size and minimum within-leaf deviance
• For y’s like Poisson counts, it computes the Poisson deviance
• Methods also exist for exponential data (e.g. component lifetimes) with censoring
• Case weights can be applied
• Most importantly, rpart can also handle binary or multinomial categorical y’s
Missing Value Handling

• Building: surrogate splits
  – Splits that “look like” the best split, to be used at prediction time if the “best” split’s variable has NAs
  – Cases with missing values are deleted, but…
  – (tree) na.tree.replace() for categoricals
• Predicting: use the average y at the stopping point
• We really want to avoid deleting cases, because in real life lots of observations are missing at least a little data
Classification

• Two of the big problems in statistics:
  – Regression: estimate E(yi | Xi) when y is continuous (numeric)
  – Classification: predict yi | Xi when y is categorical
• Example 1: two classes (“0” and “1”)
• One method: logistic regression
• Choose a prediction threshold c, say 0.5
• If the predicted p > c, classify the object into class 1; otherwise classify it into class 0
• For > 2 categories, the logit model can be extended
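The threshold rule is a one-liner (a Python sketch with an assumed threshold of 0.5):

```python
def classify(p, c=0.5):
    """Threshold rule: class 1 when the predicted probability exceeds c."""
    return 1 if p > c else 0

print([classify(p) for p in [0.1, 0.49, 0.51, 0.9]])   # [0, 0, 1, 1]
```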
Classification, cont’d

• Object: produce a rule that classifies new observations accurately
  – Or that assigns a good probability estimate
  – …Or at least rank-orders observations well
• Measures of quality: area under the ROC curve, misclassification rate, deviance, or something else
• Issue: rare (or common) events are hard to classify, since the “null model” is so good
  – E.g. “No airplane will ever crash”
Other Classifiers

• Other techniques include neural nets…
• …And classification trees
• Y is categorical; the X’s can be either
• The model says that at each leaf the distribution of the categories of Y is binomial/multinomial
• Multinomial deviance at leaf t: D(t) = –2 Σj ntj log ptj, summing across the k categories
• Reduce deviance by splitting, just as with regression trees
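The leaf deviance D(t) = –2 Σj nj log pj, with pj = nj / n the class proportions in the leaf, can be sketched as (a Python illustration; counts taken from the two-class example coming up):

```python
from math import log

def leaf_deviance(counts):
    """Multinomial deviance at a leaf: D(t) = -2 * sum_j n_j * log(p_j),
    where p_j = n_j / n is the class proportion in the leaf (0 log 0 = 0)."""
    n = sum(counts)
    return -2 * sum(nj * log(nj / n) for nj in counts if nj > 0)

print(round(leaf_deviance([151, 49]), 1))   # 222.7 for a 151-vs-49 node
```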
Classification Tree

• Same machinery as the regression tree
• Still need to find the optimal size of tree
  – As before, with plotcp()/prune()
• Example: Fisher iris data (3 classes); predict species based on measurements
• plot() + text(), or rpart.plot(), as before
• Example: spam data
  – Logistic regression is tricky to build here
More on Classification Trees

• predict() works with a classification tree, just as with regression ones
• Follow each observation to its leaf; a plurality “vote” determines the classification
• By default, predict() produces a vector of class proportions in each observation’s chosen node
• Use type="class" to choose the most probable class
• As always, there is more to a good model than just the raw misclassification rate
Splitting Rules

• Several kinds of splits are available for classification trees in R
• It’s not a good idea to split so as to minimize the misclassification rate
• Example: a parent node with 151/200 (75.5% “Yes”) splits into children with 52/100 (52% “Yes”) and 99/100 (99% “Yes”)
  – Parent misclass rate: 49/200 = 0.245
  – Children’s misclass rate: (1 + 48)/(100 + 100) = 0.245 – no improvement
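The arithmetic above can be checked directly (a small Python sketch of the majority-class rule):

```python
def misclass(yes, n):
    """Number misclassified when a node predicts its majority class."""
    return min(yes, n - yes)

parent = misclass(151, 200)                       # 49 wrong: rate 0.245
children = misclass(99, 100) + misclass(52, 100)  # 1 + 48 = 49: rate 0.245
print(parent == children)   # True: no apparent gain from this split
```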
Misclass Rate

• Minimizing the misclassification rate deprives us of splits that produce more homogeneous children, if all children continue to be majority 1 (or 0)…
• …So the misclassification rate is a particularly weak criterion for very rare or very common events
• In our example, the deviance starts at –2 [151 log (151/200) + 49 log (49/200)] ≈ 222.7 and goes to
• –2 {[99 log (99/100) + 1 log (1/100)] + [52 log (52/100) + 48 log (48/100)]} ≈ 149.7 – a decrease!
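Checking that arithmetic (a self-contained Python sketch of the deviance comparison, with log the natural logarithm):

```python
from math import log

def dev(counts):
    """Multinomial deviance -2 * sum_j n_j log(n_j / n) of a node."""
    n = sum(counts)
    return -2 * sum(c * log(c / n) for c in counts if c > 0)

before = dev([151, 49])                # root node: about 222.7
after = dev([99, 1]) + dev([52, 48])   # the two children: about 149.7
print(before > after)                  # True: the deviance decreases,
                                       # even though the misclass rate doesn't
```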
Other Choices of Criterion

• Gini index within node t: G(t) = 1 – Σj p(j)²
  – Smallest when one p is 1 and the others are 0
• The Gini value for a split is then the weighted average of the two child-node Gini indices
• Information (also called “entropy”): E(t) = – Σj p(j) log p(j)
• Compare to the deviance, D(t) = –2 Σj nj log p(j) – not much difference
• These values are not displayed in the rpart printout
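Both criteria are easy to compute from the class proportions (a Python sketch; rpart computes these internally):

```python
from math import log

def gini(ps):
    """Gini index 1 - sum_j p(j)^2; 0 when one class has probability 1."""
    return 1 - sum(p * p for p in ps)

def entropy(ps):
    """Information / entropy: -sum_j p(j) log p(j), with 0 log 0 = 0."""
    return -sum(p * log(p) for p in ps if p > 0)

print(gini([1.0, 0.0]))           # 0.0: a pure node
print(round(gini([0.5, 0.5]), 2)) # 0.5: the two-class maximum
```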
Multinomial Example

• Exactly the same setup when there are multiple classes
  – Whereas logistic regression gets lots more complicated
  – Other natural choice here: neural networks?
• Example: optical digit data
  – No obvious stopping rule here (?)