
Page 1: Lecture 13 - web.engr.oregonstate.edu

Lecture 13

Oct-27-2007

Page 2: Lecture 13 - web.engr.oregonstate.edu

Bagging
• Generate T random samples from the training set by bootstrapping.
• Learn a sequence of classifiers h1, h2, …, hT, one from each sample, using base learner L.
• To classify an unknown sample X, let each classifier predict.
• Take a simple majority vote to make the final prediction.

Simple scheme, works well in many situations!
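A minimal sketch of this scheme (not from the slides): it assumes a scikit-learn-style base learner with fit/predict, labels in {-1, +1}, and illustrative names.

import numpy as np

def bagging_train(X, y, base_learner, T=25, rng=None):
    """Train T classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)        # draw n indices with replacement
        h = base_learner().fit(X[idx], y[idx])  # learn h_t with base learner L
        classifiers.append(h)
    return classifiers

def bagging_predict(classifiers, X):
    """Simple majority vote over the T classifiers (labels in {-1, +1})."""
    votes = np.sum([h.predict(X) for h in classifiers], axis=0)
    return np.sign(votes)                       # ties come out as 0 here

With a high-variance base learner such as a deep decision tree, this is the usual bagged-trees setup.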

Page 3: Lecture 13 - web.engr.oregonstate.edu

Bias/Variance for classifiers
• Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data.
• Variance arises when the classifier overfits the data: minor variations in the training set cause the classifier to fit differently.
• Clearly you would like to have a low bias and low variance classifier!
– Typically, low bias classifiers (which tend to overfit) have high variance.
– High bias classifiers (which tend to underfit) have low variance.
– We have a trade-off.

Page 4: Lecture 13 - web.engr.oregonstate.edu

Effect of Algorithm Parameters on Bias and Variance

• k-nearest neighbor: increasing k typically increases bias and reduces variance

• decision trees of depth D: increasing D typically increases variance and reduces bias
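One way to see both trends empirically is to compare training and test accuracy while varying k and D; the sketch below assumes scikit-learn and a synthetic dataset, neither of which comes from the lecture.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):        # larger k: more bias, less variance
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    print(f"k={k:3d}  train={knn.score(Xtr, ytr):.2f}  test={knn.score(Xte, yte):.2f}")

for D in (1, 3, 10, None):       # deeper trees: less bias, more variance
    tree = DecisionTreeClassifier(max_depth=D, random_state=0).fit(Xtr, ytr)
    print(f"D={D}  train={tree.score(Xtr, ytr):.2f}  test={tree.score(Xte, yte):.2f}")

A large gap between training and test accuracy is the usual signature of high variance; low accuracy on both is the signature of high bias.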

Page 5: Lecture 13 - web.engr.oregonstate.edu

Why does bagging work?

• Bagging takes the average of multiple models, which reduces the variance.

• This suggests that bagging works best with low bias and high variance classifiers.
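A toy numerical illustration of the averaging argument (purely illustrative, not from the slides): the variance of an average of T roughly independent predictions shrinks by about a factor of T, while any bias is unchanged. In practice the bootstrap models are correlated, so the actual reduction is smaller.

import numpy as np

rng = np.random.default_rng(0)
T = 25                                                               # number of bagged models
# pretend each model's prediction = true value + independent noise
single   = 1.0 + rng.normal(0.0, 1.0, size=10000)                    # one model
averaged = 1.0 + rng.normal(0.0, 1.0, size=(10000, T)).mean(axis=1)  # bagged average
print(np.var(single))    # about 1.0
print(np.var(averaged))  # about 1/T = 0.04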

Page 6: Lecture 13 - web.engr.oregonstate.edu

Boosting
• Also an ensemble method: the final prediction is a combination of the predictions of multiple classifiers.
• What is different?
– It is iterative. In boosting, successive classifiers depend on their predecessors: look at the errors made by previous classifiers to decide what to focus on in the next iteration over the data. In bagging, the individual classifiers were independent.
– All training examples are used in each iteration, but with different weights: more weight on difficult examples (the ones on which we made mistakes in the previous iterations).

Page 7: Lecture 13 - web.engr.oregonstate.edu

Adaboost: Illustration

[Figure: start from the original data, uniformly weighted; learn h1(x), update the weights, learn h2(x), update the weights again, and so on up to hM(x); the final classifier H(X) combines h1(x), …, hM(x).]

Page 8: Lecture 13 - web.engr.oregonstate.edu

The AdaBoost Algorithm

CS434 Fall 2007

Page 9: Lecture 13 - web.engr.oregonstate.edu

The AdaBoost Algorithm
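The algorithm itself appeared as a figure that did not survive the transcript. Below is a minimal sketch of the standard AdaBoost update for binary labels in {-1, +1}; the names and the sample_weight interface are assumptions for illustration, not taken from the slide.

import numpy as np

def adaboost_train(X, y, base_learner, M=50):
    n = len(X)
    D = np.full(n, 1.0 / n)                     # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(M):
        h = base_learner().fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D[pred != y])              # weighted training error
        if eps == 0 or eps >= 0.5:              # perfect or no better than chance: stop (kept simple for the sketch)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # weight of this classifier
        D *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct examples
        D /= D.sum()                            # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, np.array(alphas)

def adaboost_predict(hypotheses, alphas, X):
    f = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))  # weighted vote
    return np.sign(f)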

Page 10: Lecture 13 - web.engr.oregonstate.edu

AdaBoost (Example)

Original training set: equal weights for all training samples

Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

Page 11: Lecture 13 - web.engr.oregonstate.edu

AdaBoost (Example)

ROUND 1

Page 12: Lecture 13 - web.engr.oregonstate.edu

AdaBoost (Example)

ROUND 2

Page 13: Lecture 13 - web.engr.oregonstate.edu

AdaBoost (Example)

ROUND 3

Page 14: Lecture 13 - web.engr.oregonstate.edu

AdaBoost (Example)

Page 15: Lecture 13 - web.engr.oregonstate.edu

Weighted Error
• AdaBoost calls L with a set of prespecified weights.
• It is often straightforward to convert a base learner L to take into account an input distribution D.
– Decision trees?
– K Nearest Neighbor?
– Naïve Bayes?
• When it is not straightforward, we can resample the training data S according to D and then feed the new data set into the learner.
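A minimal sketch of that resampling workaround, assuming D is a probability distribution over the training examples (non-negative, summing to 1); the function name is illustrative.

import numpy as np

def resample_by_weights(X, y, D, rng=None):
    """Draw a new training set of the same size, sampling examples according to D."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=len(X), replace=True, p=D)
    return X[idx], y[idx]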

Page 16: Lecture 13 - web.engr.oregonstate.edu

Boosting Decision Stumps
Decision stumps: very simple rules of thumb that test a condition on a single attribute.

Among the most commonly used base classifiers – truly weak!

Boosting with decision stumps has been shown to achieve better performance compared to unbounded decision trees.
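A minimal weighted decision stump sketch (a threshold test on a single attribute), written to match the fit(X, y, sample_weight) interface assumed in the AdaBoost sketch earlier; it is an illustration, not the implementation behind the reported results.

import numpy as np

class DecisionStump:
    """Predict +1 or -1 by thresholding one attribute; brute-force search over splits."""

    def fit(self, X, y, sample_weight):
        best_err = np.inf
        for j in range(X.shape[1]):                       # each attribute
            for thr in np.unique(X[:, j]):                # each candidate threshold
                for sign in (+1, -1):                     # which side predicts +1
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = np.sum(sample_weight[pred != y])
                    if err < best_err:
                        best_err, self.j, self.thr, self.sign = err, j, thr, sign
        return self

    def predict(self, X):
        return self.sign * np.where(X[:, self.j] <= self.thr, 1, -1)

Plugging DecisionStump into the adaboost_train sketch above gives boosted decision stumps.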

Page 17: Lecture 13 - web.engr.oregonstate.edu

Boosting Performance
• Comparing C4.5, boosting decision stumps, and boosting C4.5 on 27 UCI data sets
– C4.5 is a popular decision tree learner

Page 18: Lecture 13 - web.engr.oregonstate.edu

Boosting vs Bagging of Decision Trees

Page 19: Lecture 13 - web.engr.oregonstate.edu

Overfitting?
• Boosting drives training error to zero; will it overfit?
• Curious phenomenon:
– Boosting is often robust to overfitting (not always)
– Test error continues to decrease even after training error goes to zero

Page 20: Lecture 13 - web.engr.oregonstate.edu

Explanation with Margins

f(x) = Σ_{l=1}^{L} w_l ⋅ h_l(x)

Margin = y ⋅ f(x)

Histogram of functional margin for ensemble just after achieving zero training error
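One way to compute those functional margins for the AdaBoost sketch above (the alphas play the role of the weights w_l on the slide; dividing by Σ|alpha| is an assumption that keeps the margins in [-1, 1]):

import numpy as np

def functional_margins(hypotheses, alphas, X, y):
    f = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))  # f(x) = sum_l w_l h_l(x)
    f = f / np.sum(np.abs(alphas))     # normalize so margins lie in [-1, 1]
    return y * f                       # positive: correct; close to 1: confident

Plotting a histogram of these values gives the kind of figure the slide refers to.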

Page 21: Lecture 13 - web.engr.oregonstate.edu

Effect of Boosting: Maximizing Margin

No examples with small margins!

Even after reaching zero training error, the margins of the examples keep increasing. This is one reason that the generalization error may continue decreasing.

Page 22: Lecture 13 - web.engr.oregonstate.edu

Bias/variance analysis of Boosting

• In the early iterations, boosting is primarily a bias-reducing method.

• In later iterations, it appears to be primarily a variance-reducing method.

Page 23: Lecture 13 - web.engr.oregonstate.edu

What you need to know about ensemble methods?

• Bagging: a randomized algorithm based on bootstrapping
– What is bootstrapping
– Variance reduction
– What learning algorithms will be good for bagging?

• Boosting:
– Combine weak classifiers (i.e., slightly better than random)
– Training using the same data set but different weights
– How to update weights?
– How to incorporate weights in learning (DT, KNN, Naïve Bayes)
– One explanation for not overfitting: maximizing the margin