Lecture 6 – Tree-based methods, Bagging and Random Forests · 2020-02-09
Lecture 6 – Tree-based methods, Bagging and Random Forests
Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University.
Email: [email protected]
Summary of Lecture 5 (I/IV)
When choosing models/methods, we are interested in how well they will perform when faced with new, unseen data.
The new data error
$E_{\text{new}} \triangleq \mathbb{E}_\star\!\left[ E\big(\hat{y}(\mathbf{x}_\star; \mathcal{T}),\, y_\star\big) \right]$
describes how well a method (which is trained using data set $\mathcal{T}$) will perform “in production”.
E is for instance mean squared error (regression) or misclassification (classification).
The overall goal in supervised machine learning is to achieve small Enew.
$E_{\text{train}} \triangleq \frac{1}{n} \sum_{i=1}^{n} E\big(\hat{y}(\mathbf{x}_i; \mathcal{T}),\, y_i\big)$ is the training data error.
Not a good estimate of $E_{\text{new}}$.
1 / 29 [email protected]
Summary of Lecture 5 (II/IV)
Two methods for estimating Enew:
1. Hold-out validation data approach: Randomly split the data into a training set and a hold-out validation set. Learn the model using the training set. Estimate $E_{\text{new}}$ using the hold-out validation set.
2. k-fold cross-validation: Randomly split the data into k parts (or folds) of roughly equal size.
a) The first fold is kept aside as a validation set and the model is learned using only the remaining k − 1 folds. $E_{\text{new}}$ is estimated on the validation set.
b) The procedure is repeated k times, each time a different fold is treated as the validation set.
c) The average of all k estimates is taken as the final estimate of $E_{\text{new}}$.
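The k-fold procedure above can be sketched in a few lines. This is an illustrative sketch, assuming NumPy; `fit`, `predict`, and `loss` are placeholder callables for any learner and error measure (the names are mine, not from the lecture).

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, loss, k=5, seed=0):
    """Estimate E_new by k-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)     # random split
    folds = np.array_split(idx, k)                       # k folds of ~equal size
    errors = []
    for i in range(k):
        val = folds[i]                                   # fold i: validation set
        train = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = fit(X[train], y[train])
        errors.append(loss(y[val], predict(model, X[val])))
    return np.mean(errors)   # average of the k estimates of E_new
```

For example, with a constant-mean model, `fit` returns the training mean and `predict` broadcasts it over the validation inputs.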
Summary of Lecture 5 (III/IV)
$E_{\text{new}} = \underbrace{\mathbb{E}_\star\big[ (f(\mathbf{x}_\star) - f_0(\mathbf{x}_\star))^2 \big]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\star\big[ \mathbb{E}_{\mathcal{T}}\big[ (\hat{y}(\mathbf{x}_\star; \mathcal{T}) - f(\mathbf{x}_\star))^2 \big] \big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$

where $f(\mathbf{x}) = \mathbb{E}_{\mathcal{T}}[\hat{y}(\mathbf{x}; \mathcal{T})]$
• Bias: The inability of a method to describe the complicated patterns we would like it to describe.
• Variance: How sensitive a method is to the training data.
The more prone a model is to adapt to complicated patterns in the data, the higher the model complexity (or model flexibility).
Summary of Lecture 5 (IV/IV)
[Figure: error versus model complexity. $E_{\text{new}}$ is the sum of bias², variance, and the irreducible error; underfitting at low complexity, overfitting at high complexity.]
Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.
Contents – Lecture 6
1. Classification and regression trees (CART)
2. Bagging – a general variance reduction technique
3. Random forests
The idea behind tree-based methods
In both regression and classification settings we seek a function $\hat{y}(\mathbf{x})$ which maps the input $\mathbf{x}$ into a prediction.
One flexible way of designing this function is to partition the input space into disjoint regions and fit a simple model in each region.
[Figure: a two-dimensional input space ($x_1$, $x_2$) partitioned into regions, with training data points marked.]
• Classification: majority vote within the region.
• Regression: mean of the training data within the region.
Finding the partition
The key challenge in using this strategy is to find a good partition.
Even if we restrict our attention to seemingly simple regions (e.g. “boxes”), finding an optimal partition w.r.t. minimizing the training error is computationally infeasible!
Instead, we use a “greedy” approach: recursive binary splitting.
1. Select one input variable $x_j$ and a cut-point $s$. Partition the input space into two half-spaces,
$\{\mathbf{x} : x_j < s\}$ and $\{\mathbf{x} : x_j \geq s\}$.
2. Repeat this splitting for each region until some stopping criterion is met (e.g., no region contains more than 5 training data points).
Recursive binary splitting
[Figure: left, the input space ($x_1$, $x_2$) partitioned into regions $R_1, \ldots, R_5$ by cut-points $s_1, \ldots, s_4$; right, the equivalent tree representation with internal nodes $x_1 \leq s_1$, $x_2 \leq s_2$, $x_1 \leq s_3$, $x_2 \leq s_4$ and leaves $R_1, \ldots, R_5$.]
Regression trees (I/II)
Once the input space is partitioned into L regions, $R_1, R_2, \ldots, R_L$, the prediction model is
$\hat{y}_\star = \sum_{\ell=1}^{L} \hat{y}_\ell \, \mathbb{I}\{\mathbf{x}_\star \in R_\ell\},$
where $\mathbb{I}\{\mathbf{x}_\star \in R_\ell\}$ is the indicator function
$\mathbb{I}\{\mathbf{x} \in R_\ell\} = \begin{cases} 1 & \text{if } \mathbf{x} \in R_\ell \\ 0 & \text{if } \mathbf{x} \notin R_\ell \end{cases}$
and $\hat{y}_\ell$ is a constant prediction within each region. For regression trees we use
$\hat{y}_\ell = \text{average}\{y_i : \mathbf{x}_i \in R_\ell\}$
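As a concrete illustration (not from the slides), the region-wise prediction model can be written down for a one-dimensional input with fixed cut-points. `fit_region_means` and `predict_regions` are hypothetical helper names, and NumPy is assumed; in a real tree the regions would come from recursive splitting rather than being given.

```python
import numpy as np

def fit_region_means(x, y, cuts):
    """Average of the training outputs in each region R_l defined by `cuts`."""
    regions = np.digitize(x, cuts)          # region index l for each training point
    return np.array([y[regions == l].mean() for l in range(len(cuts) + 1)])

def predict_regions(x_new, cuts, means):
    # y(x) = sum_l ybar_l * I{x in R_l}: a constant prediction per region
    return means[np.digitize(x_new, cuts)]
```

With a single cut-point, the two region means are simply the averages of the training outputs on either side of it.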
Regression trees (II/II) – How do we find the partition?
Recursive binary splitting is greedy: each split is made in order to minimize the loss without looking ahead at future splits.
For any $j$ and $s$, define
$R_1(j, s) = \{\mathbf{x} \mid x_j < s\}$ and $R_2(j, s) = \{\mathbf{x} \mid x_j \geq s\}$.
We then seek the pair $(j, s)$ that minimizes
$\sum_{i:\,\mathbf{x}_i \in R_1(j,s)} \big(y_i - \hat{y}_1(j,s)\big)^2 + \sum_{i:\,\mathbf{x}_i \in R_2(j,s)} \big(y_i - \hat{y}_2(j,s)\big)^2,$
where
$\hat{y}_1(j,s) = \text{average}\{y_i : \mathbf{x}_i \in R_1(j,s)\}$ and $\hat{y}_2(j,s) = \text{average}\{y_i : \mathbf{x}_i \in R_2(j,s)\}$.
This optimization problem is easily solved by “brute force”, by evaluating all possible splits.
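The brute-force search can be sketched directly. This is an illustrative implementation (the name `best_split` is mine), assuming NumPy; it evaluates every candidate cut-point midway between consecutive unique values of each input variable.

```python
import numpy as np

def best_split(X, y):
    """Exhaustively find the split (j, s) minimizing the summed squared error."""
    best = (None, None, np.inf)                  # (j, s, loss)
    n, p = X.shape
    for j in range(p):
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:     # midpoints between sorted values
            left, right = X[:, j] < s, X[:, j] >= s
            loss = ((y[left] - y[left].mean()) ** 2).sum() + \
                   ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best
```

Because each candidate `s` lies strictly between two observed values, both half-spaces are guaranteed to be non-empty.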
Classification trees
Classification trees are constructed similarly to regression trees, but with two differences.
Firstly, the class prediction for each region is based on the proportion of data points from each class in that region. Let
$\hat{\pi}_{\ell m} = \frac{1}{n_\ell} \sum_{i:\,\mathbf{x}_i \in R_\ell} \mathbb{I}\{y_i = m\}$
be the proportion of training observations in the $\ell$th region that belong to the $m$th class. Then we approximate
$p(y = m \mid \mathbf{x}) \approx \sum_{\ell=1}^{L} \hat{\pi}_{\ell m} \, \mathbb{I}\{\mathbf{x} \in R_\ell\}$
Classification trees
Secondly, the squared loss used to construct the tree needs to be replaced by a measure suitable to categorical outputs.
Three common error measures are:
Misclassification error: $1 - \max_m \hat{\pi}_{\ell m}$
Entropy/deviance: $-\sum_{m=1}^{M} \hat{\pi}_{\ell m} \log \hat{\pi}_{\ell m}$
Gini index: $\sum_{m=1}^{M} \hat{\pi}_{\ell m} (1 - \hat{\pi}_{\ell m})$
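The three measures are straightforward to compute from the vector of class proportions in a region. A minimal sketch assuming NumPy, using the usual convention $0 \cdot \log 0 = 0$ for the entropy:

```python
import numpy as np

def misclassification(pi):
    """1 - max_m pi_m : fraction not belonging to the majority class."""
    return 1.0 - np.max(pi)

def entropy(pi):
    """-sum_m pi_m log pi_m, with the convention 0 * log 0 = 0."""
    pi = pi[pi > 0]
    return -np.sum(pi * np.log(pi))

def gini(pi):
    """sum_m pi_m (1 - pi_m)."""
    return np.sum(pi * (1.0 - pi))
```

A pure region (all points from one class) gives zero for all three measures; an evenly mixed binary region maximizes them.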
Classification error measures
For a binary classification problem (M = 2)
[Figure: misclassification error, Gini index, and deviance plotted against the class-1 probability $\hat{\pi}_{1m}$; all three are zero at $\hat{\pi}_{1m} = 0$ and $\hat{\pi}_{1m} = 1$ and maximal at $\hat{\pi}_{1m} = 0.5$.]
ex) Spam data
Classification tree for spam data:
[Figure: classification tree for the spam data, with splits on features such as cf_!, wf_remove, cf_$, caps_max, wf_free and wf_hp; each node shows the predicted class, the class proportion, and the fraction of data reaching it.]

Test error: 11.3 % (tree) vs. 10.9 % (LDA)
Improving CART
The flexibility/complexity of classification and regression trees (CART) is decided by the tree depth.
To obtain a small bias the tree needs to be grown deep, but this results in a high variance!
The performance of (simple) CARTs is often unsatisfactory!
To improve the practical performance:
• Pruning – grow a deep tree (small bias) which is then pruned into a smaller one (reduce variance).
• Ensemble methods – average or combine multiple trees.
  – Bagging and random forests (this lecture)
  – Boosted trees (next lecture)
Probability detour - Variance reduction by averaging
Let $z_b$, $b = 1, \ldots, B$ be identically distributed random variables with mean $\mathbb{E}[z_b] = \mu$ and variance $\mathrm{Var}[z_b] = \sigma^2$. Let $\rho$ be the correlation between distinct variables.
Then,
$\mathbb{E}\Big[\frac{1}{B}\sum_{b=1}^{B} z_b\Big] = \mu, \qquad \mathrm{Var}\Big[\frac{1}{B}\sum_{b=1}^{B} z_b\Big] = \underbrace{\frac{1-\rho}{B}\,\sigma^2}_{\text{small for large } B} + \rho\,\sigma^2.$
The variance is reduced by averaging (if $\rho < 1$)!
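The variance formula can be checked numerically. The sketch below is my own construction, assuming NumPy: it builds equicorrelated variables as $z_b = \sqrt{\rho}\,w + \sqrt{1-\rho}\,e_b$ with independent standard normals $w, e_b$, so that $\sigma^2 = 1$ and the pairwise correlation is exactly $\rho$.

```python
import numpy as np

# Monte Carlo check of Var[(1/B) sum_b z_b] = (1-rho)/B * sigma^2 + rho * sigma^2.
rng = np.random.default_rng(0)
B, rho, n_sim = 10, 0.5, 200_000

w = rng.standard_normal((n_sim, 1))          # shared component (induces correlation rho)
e = rng.standard_normal((n_sim, B))          # independent components
z = np.sqrt(rho) * w + np.sqrt(1 - rho) * e  # sigma^2 = 1, Corr(z_b, z_c) = rho

empirical = z.mean(axis=1).var()             # variance of the average of B variables
theory = (1 - rho) / B + rho                 # the slide's formula with sigma^2 = 1
```

With these settings the theoretical value is 0.55, and the empirical estimate agrees to within Monte Carlo error; the `rho * sigma^2` term is the part that no amount of averaging removes.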
Bagging (I/II)
For now, assume that we have access to $B$ independent datasets $\mathcal{T}^1, \ldots, \mathcal{T}^B$. We can then train a separate deep tree $\hat{y}^b(\mathbf{x})$ for each dataset $b = 1, \ldots, B$.
• Each $\hat{y}^b(\mathbf{x})$ has a low bias but high variance.
• By averaging,
$\hat{y}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^b(\mathbf{x}),$
the bias is kept small, but the variance is reduced by a factor $B$!
Bagging (II/II)
Obvious problem: we only have access to one training dataset.
Solution: bootstrap the data!
• Sample $n$ times with replacement from the original training data $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.
• Repeat $B$ times to generate $B$ “bootstrapped” training datasets $\mathcal{T}^1, \ldots, \mathcal{T}^B$.
For each bootstrapped dataset $\mathcal{T}^b$ we train a tree $\hat{y}^b(\mathbf{x})$. The average,
$\hat{y}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^b(\mathbf{x}),$
is called “bootstrap aggregation”, or bagging.
Bagging - Toy example
ex) Assume that we have a training set
T = {(x1, y1), (x2, y2), (x3, y3), (x4, y4)}.
We generate, say, B = 3 datasets by bootstrapping:
$\mathcal{T}^1 = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_3, y_3), (\mathbf{x}_3, y_3)\}$
$\mathcal{T}^2 = \{(\mathbf{x}_1, y_1), (\mathbf{x}_4, y_4), (\mathbf{x}_4, y_4), (\mathbf{x}_4, y_4)\}$
$\mathcal{T}^3 = \{(\mathbf{x}_1, y_1), (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_2, y_2)\}$
Compute $B = 3$ (deep) regression trees $\hat{y}^1(\mathbf{x})$, $\hat{y}^2(\mathbf{x})$ and $\hat{y}^3(\mathbf{x})$, one for each dataset $\mathcal{T}^1$, $\mathcal{T}^2$, and $\mathcal{T}^3$, and average:
$\hat{y}_{\text{bag}}(\mathbf{x}) = \frac{1}{3} \sum_{b=1}^{3} \hat{y}^b(\mathbf{x})$
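Bootstrapping as in the toy example amounts to drawing n indices with replacement. A minimal sketch assuming NumPy (`bootstrap_datasets` is a name I made up):

```python
import numpy as np

def bootstrap_datasets(n, B, seed=0):
    """Index sets for B bootstrapped datasets, each of size n, drawn with replacement."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n, size=n) for _ in range(B)]
```

Each index set selects rows of the original training data; repeated indices correspond to duplicated training pairs, exactly as in the datasets $\mathcal{T}^1$, $\mathcal{T}^2$, $\mathcal{T}^3$ above.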
ex) Predicting US Supreme Court behavior
Random forest classifier built on SCDB data1 to predict the votes of Supreme Court justices:
Y ∈ {affirm, reverse, other}
Result: 70% correct classifications.
D. M. Katz, M. J. Bommarito II and J. Blackman. A General Approach for Predicting the Behavior of the Supreme Court of the United States. arXiv:1612.03473v2, January 2017.
[Figure from the paper: Justice-term accuracy heatmap (1915–2015) compared against the M = 10 null model; green cells mark Justice-terms where the model outperformed the baseline, pink cells where it matched or underperformed it.]
1 http://supremecourtdatabase.org
ex) Predicting US Supreme Court behavior
Not only have random forests proven to be “unreasonably effective” in a wide array of supervised learning contexts, but in our testing, random forests outperformed other common approaches including support vector machines [. . . ] and feedforward artificial neural network models such as multi-layer perceptron
— Katz, Bommarito II and Blackman (arXiv:1612.03473v2)
Random forests
Bagging can drastically improve the performance of CART!
However, the $B$ bootstrapped datasets are correlated ⇒ the variance reduction due to averaging is diminished.
Idea: De-correlate the B trees by randomly perturbing each tree.
A random forest is constructed by bagging, but for each split in each tree only a random subset of $q \leq p$ inputs are considered as splitting variables.
Rule of thumb: q = √p for classification trees and q = p/3 for regression trees.2
2 Proposed by Leo Breiman, inventor of random forests.
Random forest pseudo-code
Algorithm: Random forest for regression
1. For b = 1 to B (can run in parallel):
   (a) Draw a bootstrap dataset $\mathcal{T}^b$ of size $n$ from $\mathcal{T}$.
   (b) Grow a regression tree by repeating the following steps until a minimum node size is reached:
      i. Select $q$ out of the $p$ input variables uniformly at random.
      ii. Find the variable $x_j$ among the $q$ selected, and the corresponding split point $s$, that minimize the squared error.
      iii. Split the node into two children with $\{x_j \leq s\}$ and $\{x_j > s\}$.
2. The final model is the average of the $B$ ensemble members,
$\hat{y}^{\text{rf}}_\star = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^b_\star.$
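The algorithm above can be sketched as runnable code, assuming NumPy. To keep the sketch short it grows depth-1 trees (stumps) instead of trees of arbitrary depth, but it follows the same structure: bootstrap, random selection of q split variables, best squared-error split, and averaging. All function names are mine.

```python
import numpy as np

def fit_stump(X, y, q, rng):
    """Depth-1 tree: best squared-error split among q randomly chosen inputs."""
    feats = rng.choice(X.shape[1], size=q, replace=False)   # step (b)i: q of p inputs
    best = None
    for j in feats:                                         # step (b)ii: best (j, s)
        for s in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= s
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, s, y[left].mean(), y[~left].mean())
    if best is None:                 # chosen features constant in this sample
        return (feats[0], X[0, feats[0]], y.mean(), y.mean())
    return best[1:]                  # (j, s, left-mean, right-mean)

def random_forest(X, y, B=25, q=1, seed=0):
    """Bagged, feature-subsampled stumps; returns an averaging predictor."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(B):                                      # step 1 (parallelizable)
        idx = rng.integers(0, len(y), size=len(y))          # step (a): bootstrap
        members.append(fit_stump(X[idx], y[idx], q, rng))
    def predict(x):                                         # step 2: average members
        return np.mean([lm if x[j] <= s else rm for (j, s, lm, rm) in members])
    return predict
```

On a one-dimensional step-function dataset, every stump recovers the step, so the ensemble average does too; with deeper trees and $p > 1$ inputs the random feature selection is what de-correlates the members.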
Random forests
Recall: for i.d. random variables $\{z_b\}_{b=1}^{B}$,
$\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} z_b\Big) = \frac{1-\rho}{B}\,\sigma^2 + \rho\,\sigma^2$
The random input selection used in random forests:
• increases the bias, but often very slowly,
• adds to the variance ($\sigma^2$) of each tree,
• reduces the correlation ($\rho$) of the trees.
The reduction in correlation is typically the dominant effect ⇒ there is an overall reduction in MSE!
ex) Toy regression model
For the toy model previously considered. . .
[Figure: test error (roughly 1.8 to 3.0) versus the number of trees (0 to 100) for a random forest, bagging, and a single regression tree.]
Overfitting?
The complexity of a bagging/random forest model increases with an increasing number of trees $B$.
Will this lead to overfitting as B increases?
No – more ensemble members does not increase the flexibility of the model!
Regression case:
$\hat{y}^{\text{rf}}_\star = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^b_\star \;\to\; \mathbb{E}\left[\hat{y}_\star \mid \mathcal{T}\right], \quad \text{as } B \to \infty,$
where the expectation is w.r.t. the randomness in the data bootstrapping and input selection.
Advantages of random forests
Random forests have several computational advantages:
• Embarrassingly parallelizable!
• Using $q < p$ potential split variables reduces the computational cost of each split.
• We could bootstrap fewer than $n$, say $\sqrt{n}$, data points when creating $\mathcal{T}^b$ – very useful for “big data” problems.
. . . and they also come with some other benefits:
• Often works well off-the-shelf – few tuning parameters
• Requires little or no input preparation
• Implicit input selection
ex) Automatic music generation
ALYSIA: automated music generation using random forests.
• The user specifies the lyrics.
• ALYSIA generates accompanying music via a rhythm model and a melody model.
• Trained on a corpus of pop songs.
[Excerpt from the ALYSIA paper: three songs were created with ALYSIA, the first two with lyrics written by the authors and the third borrowing its lyrics from a 1917 Vaudeville song. The songs differ mainly in the explore/exploit parameter: Why Do I Still Miss You used more exploration, while Everywhere and I'm Always Chasing Rainbows used a higher exploit setting. For each lyrical line, ALYSIA produced between fifteen and thirty melodic variations to choose from. Recordings are posted at http://bit.ly/2eQHado.]
https://www.youtube.com/watch?v=whgudcj82_I
https://www.withalysia.com/
M. Ackerman and D. Loker. Algorithmic Songwriting with ALYSIA. In: Correia J., Ciesielski V., Liapis A. (eds) Computational Intelligence in Music, Sound, Art and Design. EvoMUSART, 2017.
A few concepts to summarize lecture 6
CART: Classification and regression trees. A class of nonparametric methods based on partitioning the input space into regions and fitting a simple model for each region.
Recursive binary splitting: A greedy method for partitioning the input space into “boxes” aligned with the coordinate axes.
Gini index and deviance: Commonly used error measures for constructing classification trees.
Ensemble methods: Umbrella term for methods that average or combine multiple models.
Bagging: Bootstrap aggregating. An ensemble method based on the statistical bootstrap.
Random forests: Bagging of trees, combined with random feature selection for further variance reduction (and computational gains).