
Lecture 6 – Tree-based methods, Bagging and Random Forests

Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University.

Email: niklas.wahlstrom@it.uu.se


Summary of Lecture 5 (I/IV)

When choosing models/methods, we are interested in how well they will perform when faced with new, unseen data.

The new data error
$$E_{\text{new}} \triangleq \mathbb{E}_\star\big[E\big(y(x_\star;\mathcal{T}),\, y_\star\big)\big]$$
describes how well a method (which is trained using the data set $\mathcal{T}$) will perform "in production".

E is, for instance, the mean squared error (regression) or the misclassification rate (classification).

The overall goal in supervised machine learning is to achieve small Enew.

$$E_{\text{train}} \triangleq \frac{1}{n}\sum_{i=1}^{n} E\big(y(x_i;\mathcal{T}),\, y_i\big)$$
is the training data error. It is not a good estimate of Enew.

Summary of Lecture 5 (II/IV)

Two methods for estimating Enew:

1. Hold-out validation data approach: Randomly split the data into a training set and a hold-out validation set. Learn the model using the training set. Estimate Enew using the hold-out validation set.

2. k-fold cross-validation: Randomly split the data into k parts (or folds) of roughly equal size.

a) The first fold is kept aside as a validation set and the model is learned using only the remaining k − 1 folds. Enew is estimated on the validation set.

b) The procedure is repeated k times, each time a different fold is treated as the validation set.

c) The average of all k estimates is taken as the final estimate of Enew.
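To make the procedure concrete, here is a minimal NumPy sketch of k-fold cross-validation; the function name, the squared-error loss and the fit/predict interface are illustrative choices, not part of the lecture.

```python
import numpy as np

def kfold_estimate(X, y, fit, predict, k=5, seed=0):
    """Estimate Enew by k-fold cross-validation.

    fit(X_train, y_train) -> model; predict(model, X_val) -> predictions.
    """
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)            # k folds of roughly equal size
    errors = []
    for i in range(k):
        val = folds[i]                                     # fold i is held out
        train = np.concatenate(folds[:i] + folds[i + 1:])  # remaining k-1 folds
        model = fit(X[train], y[train])
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(errors)                    # average of the k estimates
```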


Summary of Lecture 5 (III/IV)

$$E_{\text{new}} = \underbrace{\mathbb{E}_\star\Big[\big(f(x_\star) - f_0(x_\star)\big)^2\Big]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\star\Big[\mathbb{E}_{\mathcal{T}}\big[\big(y(x_\star;\mathcal{T}) - f(x_\star)\big)^2\big]\Big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

where $f(x) = \mathbb{E}_{\mathcal{T}}[y(x;\mathcal{T})]$.

• Bias: The inability of a method to describe the complicated patterns we would like it to describe.
• Variance: How sensitive a method is to the training data.

The more prone a model is to adapt to complicated patterns in the data, the higher the model complexity (or model flexibility).


Summary of Lecture 5 (IV/IV)

(Figure: Enew decomposed into bias², variance and irreducible error, plotted against model complexity; low complexity underfits, high complexity overfits.)

Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.


Contents – Lecture 6

1. Classification and regression trees (CART)
2. Bagging – a general variance reduction technique
3. Random forests


The idea behind tree-based methods

In both regression and classification settings we seek a function y(x) which maps the input x into a prediction.

One flexible way of designing this function is to partition the input space into disjoint regions and fit a simple model in each region.

(Figure: a two-dimensional input space with axes x1 and x2, partitioned into regions containing the training data.)

• Classification: Majority vote within the region.
• Regression: Mean of the training data within the region.


Finding the partition

The key challenge in using this strategy is to find a good partition.

Even if we restrict our attention to seemingly simple regions (e.g. "boxes"), finding an optimal partition w.r.t. minimizing the training error is computationally infeasible!

Instead, we use a “greedy” approach: recursive binary splitting.

1. Select one input variable $x_j$ and a cut-point $s$. Partition the input space into two half-spaces,
$$\{x : x_j < s\} \quad\text{and}\quad \{x : x_j \ge s\}.$$

2. Repeat this splitting for each region until some stopping criterion is met (e.g., no region contains more than 5 training data points).


Recursive binary splitting

(Figure: left, a partitioning of the two-dimensional input space (x1, x2) into regions R1–R5 by the cut-points s1–s4; right, the equivalent tree representation with internal nodes x1 ≤ s1, x2 ≤ s2, x1 ≤ s3 and x2 ≤ s4, and leaves R1–R5.)


Regression trees (I/II)

Once the input space is partitioned into $L$ regions $R_1, R_2, \ldots, R_L$, the prediction model is
$$y_\star = \sum_{\ell=1}^{L} y_\ell\, \mathbb{I}\{x_\star \in R_\ell\},$$
where $\mathbb{I}\{x_\star \in R_\ell\}$ is the indicator function
$$\mathbb{I}\{x \in R_\ell\} = \begin{cases} 1 & \text{if } x \in R_\ell \\ 0 & \text{if } x \notin R_\ell \end{cases}$$
and $y_\ell$ is a constant prediction within each region. For regression trees we use
$$y_\ell = \operatorname{average}\{y_i : x_i \in R_\ell\}.$$
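A minimal sketch of this prediction rule, assuming the regions are already given as boolean membership functions; the helper names and the example regions are made up for illustration.

```python
import numpy as np

def region_predictions(X, y, regions):
    """regions: list of functions x -> bool defining R_1, ..., R_L."""
    # y_l = average of the training targets that fall in region R_l
    return [np.mean([yi for xi, yi in zip(X, y) if r(xi)]) for r in regions]

def predict(x, regions, y_hat):
    # y(x) = sum_l y_l * I{x in R_l}; the regions are disjoint, so only one term is active
    return sum(y_l * r(x) for r, y_l in zip(regions, y_hat))

# e.g. two regions split on the first input at 2.0
regions = [lambda x: x[0] < 2.0, lambda x: x[0] >= 2.0]
```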


Regression trees (II/II) – How do we find the partition?

Recursive binary splitting is greedy – each split is made in order to minimize the loss without looking ahead at future splits.

For any $j$ and $s$, define
$$R_1(j, s) = \{x \mid x_j < s\} \quad\text{and}\quad R_2(j, s) = \{x \mid x_j \ge s\}.$$

We then seek the pair $(j, s)$ that minimizes
$$\sum_{i:\, x_i \in R_1(j,s)} \big(y_i - y_1(j,s)\big)^2 + \sum_{i:\, x_i \in R_2(j,s)} \big(y_i - y_2(j,s)\big)^2,$$
where
$$y_1(j,s) = \operatorname{average}\{y_i : x_i \in R_1(j,s)\}, \qquad y_2(j,s) = \operatorname{average}\{y_i : x_i \in R_2(j,s)\}.$$

This optimization problem is easily solved by "brute force", evaluating all possible splits.
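The brute-force search over (j, s) takes only a few lines; the NumPy sketch below (the function name and the choice of mid-points as candidate cut-points are my own) evaluates every input variable and every possible split.

```python
import numpy as np

def best_split(X, y):
    """Return the pair (j, s) minimizing the summed squared errors of the two half-spaces."""
    n, p = X.shape
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(p):
        values = np.unique(X[:, j])
        # candidate cut-points: mid-points between consecutive observed values of x_j
        for s in (values[:-1] + values[1:]) / 2:
            left = X[:, j] < s
            right = ~left
            loss = ((y[left] - y[left].mean()) ** 2).sum() \
                 + ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s
```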


Classification trees

Classification trees are constructed similarly to regression trees, but with two differences.

Firstly, the class prediction for each region is based on the proportion of data points from each class in that region. Let
$$\pi_{\ell m} = \frac{1}{n_\ell} \sum_{i:\, x_i \in R_\ell} \mathbb{I}\{y_i = m\}$$
be the proportion of training observations in the $\ell$th region that belong to the $m$th class. Then we approximate
$$p(y = m \mid x) \approx \sum_{\ell=1}^{L} \pi_{\ell m}\, \mathbb{I}\{x \in R_\ell\}.$$


Classification trees

Secondly, the squared loss used to construct the tree needs to be replaced by a measure suitable for categorical outputs.

Three common error measures are:

Misclassification error: $1 - \max_m \pi_{\ell m}$

Entropy/deviance: $-\sum_{m=1}^{M} \pi_{\ell m} \log \pi_{\ell m}$

Gini index: $\sum_{m=1}^{M} \pi_{\ell m}(1 - \pi_{\ell m})$
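All three measures are simple functions of the class proportions of a region; a small NumPy sketch (the function name is mine):

```python
import numpy as np

def node_impurity(pi):
    """pi: class proportions pi_{l,1}, ..., pi_{l,M} of one region."""
    pi = np.asarray(pi, dtype=float)
    misclassification = 1.0 - pi.max()
    nonzero = pi[pi > 0]                          # convention: 0 * log 0 = 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    gini = (pi * (1.0 - pi)).sum()
    return misclassification, entropy, gini

print(node_impurity([1.0, 0.0]))   # pure region:  (0.0, 0.0, 0.0)
print(node_impurity([0.5, 0.5]))   # mixed region: (0.5, 0.693..., 0.5)
```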


Classification error measures

For a binary classification problem (M = 2)

(Figure: the misclassification error, Gini index and deviance plotted as functions of the class-1 probability for one region.)


ex) Spam data

Classification tree for spam data:

(Figure: classification tree for the spam data, with splits on features such as cf_!, wf_remove, cf_$, caps_max, wf_free and wf_hp.)

Test error: tree 11.3 %, LDA 10.9 %.


Improving CART

The flexibility/complexity of classification and regression trees (CART) is decided by the tree depth.

To obtain a small bias the tree needs to be grown deep, but this results in a high variance!

The performance of (simple) CARTs is often unsatisfactory!

To improve the practical performance:
• Pruning – grow a deep tree (small bias) which is then pruned into a smaller one (reduced variance); see the sketch after this list.
• Ensemble methods – average or combine multiple trees.
  • Bagging and random forests (this lecture)
  • Boosted trees (next lecture)
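As an aside, one standard way to prune is cost-complexity pruning, exposed in scikit-learn through the ccp_alpha parameter; a rough sketch on synthetic data (the dataset and the choice of selecting alpha on a validation set are illustrative, not prescribed by the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # low bias, high variance
path = deep.cost_complexity_pruning_path(X_train, y_train)           # candidate pruning levels
# larger ccp_alpha -> more pruning -> smaller tree -> lower variance
scores = [DecisionTreeClassifier(ccp_alpha=a, random_state=0)
          .fit(X_train, y_train).score(X_val, y_val) for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
```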


Probability detour - Variance reduction by averaging

Let $z_b$, $b = 1, \ldots, B$, be identically distributed random variables with mean $\mathbb{E}[z_b] = \mu$ and variance $\mathrm{Var}[z_b] = \sigma^2$. Let $\rho$ be the correlation between distinct variables.

Then,
$$\mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \mu, \qquad \mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \underbrace{\frac{1-\rho}{B}\,\sigma^2}_{\text{small for large } B} + \rho\sigma^2.$$

The variance is reduced by averaging (if $\rho < 1$)!
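A quick Monte Carlo check of the variance formula, drawing B correlated Gaussians from an equicorrelation covariance matrix (the parameter values are arbitrary):

```python
import numpy as np

B, sigma2, rho = 10, 4.0, 0.3
# equicorrelation covariance: sigma^2 on the diagonal, rho * sigma^2 off the diagonal
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))

rng = np.random.default_rng(0)
z = rng.multivariate_normal(np.zeros(B), cov, size=200_000)  # each row is one draw of (z_1, ..., z_B)

print(z.mean(axis=1).var())                     # empirical variance of the average
print((1 - rho) / B * sigma2 + rho * sigma2)    # theory: (1 - rho)/B * sigma^2 + rho * sigma^2 = 1.48
```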


Bagging (I/II)

For now, assume that we have access to $B$ independent datasets $\mathcal{T}^1, \ldots, \mathcal{T}^B$. We can then train a separate deep tree $y^b(x)$ on each dataset $b = 1, \ldots, B$.

• Each $y^b(x)$ has a low bias but high variance.
• By averaging,
$$y_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} y^b(x),$$
the bias is kept small, but the variance is reduced by a factor $B$!


Bagging (II/II)

Obvious problem: We only have access to one training dataset.

Solution: Bootstrap the data!

• Sample $n$ times with replacement from the original training data $\mathcal{T} = \{x_i, y_i\}_{i=1}^{n}$.
• Repeat $B$ times to generate $B$ "bootstrapped" training datasets $\mathcal{T}^1, \ldots, \mathcal{T}^B$.

For each bootstrapped dataset $\mathcal{T}^b$ we train a tree $y^b(x)$. Averaging these,
$$y_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} y^b(x),$$
is called "bootstrap aggregation", or bagging.
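A minimal sketch of bagging deep regression trees with scikit-learn (in practice one could use sklearn.ensemble.BaggingRegressor directly; this version spells out the bootstrap and the averaging, and the function names are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=100, seed=0):
    """Train B deep trees, one per bootstrapped dataset T^1, ..., T^B."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # sample n points with replacement
        tree = DecisionTreeRegressor()         # grown deep (no depth limit) -> low bias
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X_new):
    # y_bag(x) = (1/B) * sum_b y^b(x)
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)
```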


Bagging - Toy example

ex) Assume that we have a training set

$$\mathcal{T} = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)\}.$$

We generate, say, $B = 3$ datasets by bootstrapping:
$$\mathcal{T}^1 = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_3, y_3)\}$$
$$\mathcal{T}^2 = \{(x_1, y_1), (x_4, y_4), (x_4, y_4), (x_4, y_4)\}$$
$$\mathcal{T}^3 = \{(x_1, y_1), (x_1, y_1), (x_2, y_2), (x_2, y_2)\}$$

Compute $B = 3$ (deep) regression trees $y^1(x)$, $y^2(x)$ and $y^3(x)$, one for each dataset $\mathcal{T}^1$, $\mathcal{T}^2$ and $\mathcal{T}^3$, and average
$$y_{\text{bag}}(x) = \frac{1}{3}\sum_{b=1}^{3} y^b(x).$$


ex) Predicting US Supreme Court behavior

Random forest classifier built on SCDB data¹ to predict the votes of Supreme Court justices:
$$Y \in \{\text{affirm}, \text{reverse}, \text{other}\}$$
Result: 70% correct classifications.

D. M. Katz, M. J. Bommarito II and J. Blackman. A General Approach for Predicting the Behavior of the Supreme Court of the United States. arXiv:1612.03473v2, January 2017.

(Figure 5 from the paper: "Justice-Term Accuracy Heatmap Compared Against M = 10 Null Model (1915–2015)". Green cells indicate Justice-terms where the model outperformed the baseline; pink cells indicate terms where it only matched or underperformed it. The model performs better for some Justices and time periods than others, but shows consistent performance across many Justices and historical contexts.)

¹ http://supremecourtdatabase.org

ex) Predicting US Supreme Court behavior

Not only have random forests proven to be "unreasonably effective" in a wide array of supervised learning contexts, but in our testing, random forests outperformed other common approaches including support vector machines [. . . ] and feedforward artificial neural network models such as multi-layer perceptron

— Katz, Bommarito II and Blackman (arXiv:1612.03473v2)


Random forests

Bagging can drastically improve the performance of CART!

However, the B bootstrapped datasets are correlated ⇒ the variance reduction due to averaging is diminished.

Idea: De-correlate the B trees by randomly perturbing each tree.

A random forest is constructed by bagging, but for each split in each tree only a random subset of $q \le p$ inputs is considered as splitting variables.

Rule of thumb: $q = \sqrt{p}$ for classification trees and $q = p/3$ for regression trees.²

² Proposed by Leo Breiman, inventor of random forests.

Random forest pseudo-code

Algorithm: Random forest for regression

1. For $b = 1$ to $B$ (can run in parallel):
   (a) Draw a bootstrap data set $\mathcal{T}^b$ of size $n$ from $\mathcal{T}$.
   (b) Grow a regression tree by repeating the following steps until a minimum node size is reached:
      i. Select $q$ out of the $p$ input variables uniformly at random.
      ii. Find the variable $x_j$ among the $q$ selected, and the corresponding split point $s$, that minimize the squared error.
      iii. Split the node into two children with $\{x_j \le s\}$ and $\{x_j > s\}$.
2. The final model is the average of the $B$ ensemble members,
$$y^{\text{rf}}_\star = \frac{1}{B}\sum_{b=1}^{B} y^b_\star.$$
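In scikit-learn the random input selection of step 1(b)i corresponds to the max_features argument; a short sketch (the hyperparameter values are illustrative, not prescribed by the lecture):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# q = p/3 inputs per split for regression (Breiman's rule of thumb)
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, n_jobs=-1)

# q = sqrt(p) inputs per split for classification
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", n_jobs=-1)

# usage, assuming X_train, y_train, X_test are available:
# reg.fit(X_train, y_train); y_pred = reg.predict(X_test)
```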


Random forests

Recall: for identically distributed random variables $\{z_b\}_{b=1}^{B}$,
$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} z_b\right) = \frac{1-\rho}{B}\,\sigma^2 + \rho\sigma^2.$$

The random input selection used in random forests:
• increases the bias, but often very slowly,
• adds to the variance ($\sigma^2$) of each tree,
• reduces the correlation ($\rho$) of the trees.

The reduction in correlation is typically the dominant effect ⇒ there is an overall reduction in MSE!


ex) Toy regression model

For the toy model previously considered...

(Figure: test error as a function of the number of trees for a random forest, bagging, and a single regression tree.)


Overfitting?

The complexity of a bagging/random forest model increases with an increasing number of trees B.

Will this lead to overfitting as B increases?

No – adding more ensemble members does not increase the flexibility of the model!

Regression case:
$$y^{\text{rf}}_\star = \frac{1}{B}\sum_{b=1}^{B} y^b_\star \;\to\; \mathbb{E}\left[y_\star \mid \mathcal{T}\right] \quad \text{as } B \to \infty,$$
where the expectation is w.r.t. the randomness in the data bootstrapping and input selection.


Advantages of random forests

Random forests have several computational advantages:
• Embarrassingly parallelizable!
• Using q < p potential split variables reduces the computational cost of each split.
• We could bootstrap fewer than n, say √n, data points when creating $\mathcal{T}^b$ – very useful for "big data" problems.

...and they also come with some other benefits:
• Often works well off-the-shelf – few tuning parameters.
• Requires little or no input preparation.
• Implicit input selection.


ex) Automatic music generation

ALYSIA: automated music generation using random forests.

• The user specifies the lyrics.
• ALYSIA generates accompanying music via
  - a rhythm model
  - a melody model
• Trained on a corpus of pop songs.

(Excerpt from the paper: three songs were created with ALYSIA – "Why Do I Still Miss You", "Everywhere" and "I'm Always Chasing Rainbows" – with recordings posted at http://bit.ly/2eQHado. The figure shows an original song co-written with ALYSIA; see https://www.youtube.com/watch?v=whgudcj82_I and https://www.withalysia.com/.)

M. Ackerman and D. Loker. Algorithmic Songwriting with ALYSIA. In: Correia J., Ciesielski V., Liapis A. (eds) Computational Intelligence in Music, Sound, Art and Design. EvoMUSART, 2017.


A few concepts to summarize lecture 6

CART: Classification and regression trees. A class of nonparametric methods based on partitioning the input space into regions and fitting a simple model for each region.

Recursive binary splitting: A greedy method for partitioning the input space into "boxes" aligned with the coordinate axes.

Gini index and deviance: Commonly used error measures for constructing classification trees.

Ensemble methods: Umbrella term for methods that average or combine multiple models.

Bagging: Bootstrap aggregating. An ensemble method based on the statistical bootstrap.

Random forests: Bagging of trees, combined with random feature selection for further variance reduction (and computational gains).

