Boosting CMPUT 466/551 Principal Source: CMU


Page 1:

Boosting

CMPUT 466/551

Principal Source: CMU

Page 2:

Boosting Idea

We have a weak classifier, i.e., its error rate is only slightly better than 0.5.

Boosting combines many such weak learners to make a strong classifier (whose error rate is much less than 0.5).

Page 3:

Boosting: Combining Classifiers

What is a ‘weighted sample’?

Page 4:

Discrete Ada(ptive)boost Algorithm

• Create a weight distribution W(x) over the N training points
• Initialize $W_0(x_i) = 1/N$ for all $x_i$; set step T = 0
• At each iteration T:
  – Train weak classifier $C_T(x)$ on the data using weights $W_T(x)$
  – Get error rate $\varepsilon_T$. Set $\alpha_T = \log\big((1 - \varepsilon_T)/\varepsilon_T\big)$
  – Calculate $W_{T+1}(x_i) = W_T(x_i)\,\exp\big[\alpha_T\, I(y_i \ne C_T(x_i))\big]$
• Final classifier: $C_{\mathrm{FINAL}}(x) = \mathrm{sign}\big[\sum_i \alpha_i\, C_i(x)\big]$
• Assumes the weak method $C_T$ can use weights $W_T(x)$
  – If this is hard, we can sample the training data using $W_T(x)$
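A minimal sketch of this loop in Python, using scikit-learn decision stumps as the weak learner (the helper names, the weight renormalization, and the early-stopping check are additions for the sketch, not part of the algorithm as stated above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=25):
    """Discrete AdaBoost sketch; labels y must be in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                        # W_0(x_i) = 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # weak learner uses the weights directly
        miss = (stump.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()            # weighted error rate eps_T
        if err == 0 or err >= 0.5:                 # stop if the stump is perfect or no better than chance
            break
        alpha = np.log((1 - err) / err)            # alpha_T = log((1 - eps_T) / eps_T)
        w *= np.exp(alpha * miss)                  # up-weight the misclassified points
        w /= w.sum()                               # renormalize (for numerical stability)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """C_FINAL(x) = sign( sum_T alpha_T * C_T(x) )."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```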

Page 5:

Real Adaboost Algorithm

• Create a weight distribution W(x) over the N training points
• Initialize $W_0(x_i) = 1/N$ for all $x_i$; set step T = 0
• At each iteration T:
  – Train weak classifier $C_T(x)$ on the data using weights $W_T(x)$
    • Obtain class probabilities $p_T(x_i)$ for each data point $x_i$
  – Set $f_T(x) = \frac{1}{2}\log\big[\,p_T(x)/(1 - p_T(x))\,\big]$
  – Calculate $W_{T+1}(x_i) = W_T(x_i)\,\exp\big[-y_i\, f_T(x_i)\big]$ for all $x_i$
• Final classifier: $C_{\mathrm{FINAL}}(x) = \mathrm{sign}\big[\sum_t f_t(x)\big]$
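A corresponding sketch of the Real AdaBoost update, again with scikit-learn stumps supplying the class probabilities (the clipping constant eps is an addition to keep f_T finite when a leaf is pure):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, n_rounds=25, eps=1e-6):
    """Real AdaBoost sketch; labels y must be in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                        # W_0(x_i) = 1/N
    trees = []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, y, sample_weight=w)
        pos = list(tree.classes_).index(1)
        p = np.clip(tree.predict_proba(X)[:, pos], eps, 1 - eps)  # p_T(x_i): weighted estimate of P(y=+1 | x_i)
        f = 0.5 * np.log(p / (1 - p))              # f_T(x) = 1/2 log[ p_T(x) / (1 - p_T(x)) ]
        w *= np.exp(-y * f)                        # W_{T+1}(x_i) = W_T(x_i) exp(-y_i f_T(x_i))
        w /= w.sum()
        trees.append(tree)
    return trees

def real_adaboost_predict(trees, X, eps=1e-6):
    """C_FINAL(x) = sign( sum_t f_t(x) )."""
    total = np.zeros(len(X))
    for tree in trees:
        pos = list(tree.classes_).index(1)
        p = np.clip(tree.predict_proba(X)[:, pos], eps, 1 - eps)
        total += 0.5 * np.log(p / (1 - p))
    return np.sign(total)
```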

Page 6:

Boosting With Decision Stumps

Page 7:

First classifier

Page 8:

First 2 classifiers

Page 9:

First 3 classifiers

Page 10:

Final Classifier learned by Boosting

Page 11:

Performance of Boosting with Stumps

Problem:

$$Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5) \\ -1 & \text{otherwise} \end{cases}$$

$X_j$ are standard Gaussian variables.

About 1000 positive and 1000 negative training examples.

10,000 test observations.

The weak classifier is a “stump”, i.e., a two-terminal-node classification tree.
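A sketch of how this simulated data set can be generated; the threshold is the median of a $\chi^2_{10}$ distribution, i.e. $\chi^2_{10}(0.5) \approx 9.34$ (the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def make_chi2_example(n, d=10, seed=0):
    """Features: d independent standard Gaussians.
    Label: +1 when the squared radius exceeds the chi-squared median, else -1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    threshold = chi2.ppf(0.5, df=d)                # chi^2_10(0.5), roughly 9.34
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y

X_train, y_train = make_chi2_example(2000)         # roughly 1000 positive, 1000 negative
X_test, y_test = make_chi2_example(10000, seed=1)
```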

Page 12:

AdaBoost is Special

1. The properties of the exponential loss function cause the AdaBoost algorithm to be simple.

2. AdaBoost has a closed-form solution in terms of the minimized training-set error on the weighted data.

• This simplicity is very special and not true for all loss functions!

Page 13:

Boosting: An Additive Model

Consider the additive model:

$$f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m)$$

Can we minimize this cost function?

$$\min_{\{\beta_m, \gamma_m\}_{1}^{M}} \; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i; \gamma_m)\right)$$

N: number of training data points

L: loss function

b: basis functions

This joint optimization is non-convex and hard!

Boosting takes a greedy approach.

Page 14:

Boosting: Forward stagewise greedy search

Adding basis functions one at a time
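In the notation of the previous slide, each greedy step keeps the previously fitted terms fixed and solves a single-term problem (a standard statement of forward stagewise modeling):

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big), \qquad f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$$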

Page 15:

Boosting As Additive Model

• Simple case: squared-error loss
• Forward stagewise modeling amounts to just fitting the residuals from the previous iteration (see the sketch after the equations below)
• Squared-error loss is not robust for classification

$$L(y, f(x)) = \tfrac{1}{2}\,\big(y - f(x)\big)^2$$

$$L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big) = \tfrac{1}{2}\,\big(y_i - f_{m-1}(x_i) - \beta\, b(x_i; \gamma)\big)^2 = \tfrac{1}{2}\,\big(r_{im} - \beta\, b(x_i; \gamma)\big)^2$$

where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual from the previous iteration.
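A minimal sketch of this residual-fitting view, using regression stumps as the basis functions (the shrinkage factor `learning_rate` is an addition not discussed on the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stagewise_squared_error(X, y, n_rounds=100, learning_rate=0.1):
    """Forward stagewise additive modeling under squared-error loss:
    each round fits a regression stump to the current residuals r = y - f."""
    y = np.asarray(y, dtype=float)
    f = np.zeros_like(y)                           # f_0(x) = 0
    basis = []
    for _ in range(n_rounds):
        residual = y - f                           # r_im = y_i - f_{m-1}(x_i)
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residual)                     # least-squares fit to the residuals
        f += learning_rate * stump.predict(X)
        basis.append(stump)
    return basis

def stagewise_predict(basis, X, learning_rate=0.1):
    return sum(learning_rate * b.predict(X) for b in basis)
```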

Page 16:

Boosting As Additive Model

• AdaBoost for classification:
  – $L(y, f(x)) = \exp(-y\, f(x))$, the exponential loss function
  – Margin $\equiv y\, f(x)$

Note that we use a property of the exponential loss function at this step. Many other loss functions (e.g. absolute loss) would start getting in the way…

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, G(x_i)\big)$$
$$= \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big[-y_i\big(f_{m-1}(x_i) + \beta\, G(x_i)\big)\big]$$
$$= \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big(-y_i f_{m-1}(x_i)\big)\, \exp\!\big(-\beta\, y_i G(x_i)\big)$$

Page 17:

Boosting As Additive Model

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big(-y_i f_{m-1}(x_i)\big)\, \exp\!\big(-\beta\, y_i G(x_i)\big) = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big(-\beta\, y_i G(x_i)\big),$$

where $w_i^{(m)} = \exp\!\big(-y_i f_{m-1}(x_i)\big)$.

Splitting the sum into correctly and incorrectly classified points:

$$\sum_{i=1}^{N} w_i^{(m)} \exp\!\big(-\beta\, y_i G(x_i)\big) = e^{-\beta}\!\!\sum_{y_i = G(x_i)}\!\! w_i^{(m)} \;+\; e^{\beta}\!\!\sum_{y_i \ne G(x_i)}\!\! w_i^{(m)}$$
$$= \big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G(x_i)\big) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}$$

First assume that β is constant, and minimize G:

Page 18:

Boosting As Additive Model

First assume that β is a constant, and minimize over G:

$$G_m = \arg\min_{G} \left[\big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G(x_i)\big) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}\right] = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G(x_i)\big)$$

So if we choose G such that the training error $\mathrm{err}_m$ on the weighted data is minimized, that is our optimal G:

$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \ne G_m(x_i)\big)}{\sum_{i=1}^{N} w_i^{(m)}}$$

Page 19:

Boosting As Additive Model

Next, assume we have found this G; given G, we minimize over β:

$$H(\beta) = e^{-\beta}\,(1 - \mathrm{err}_m) + e^{\beta}\, \mathrm{err}_m$$
$$\frac{\partial H(\beta)}{\partial \beta} = -e^{-\beta}\,(1 - \mathrm{err}_m) + e^{\beta}\, \mathrm{err}_m = 0$$
$$e^{2\beta} = \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$$
$$\beta_m = \frac{1}{2} \log\!\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$$

Another property of the exponential loss function is that we get an especially simple derivative.
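As a quick sanity check on this closed form, the short sketch below compares $\beta_m = \tfrac{1}{2}\log\big((1-\mathrm{err}_m)/\mathrm{err}_m\big)$ with a brute-force minimization of $H(\beta)$; the error rate 0.3 is an arbitrary illustrative value:

```python
import numpy as np

err = 0.3                                          # illustrative weighted error rate err_m

betas = np.linspace(-3.0, 3.0, 100001)
H = np.exp(-betas) * (1 - err) + np.exp(betas) * err   # H(beta) from the derivation above

beta_numeric = betas[np.argmin(H)]                 # brute-force minimizer
beta_closed = 0.5 * np.log((1 - err) / err)        # closed-form beta_m

print(beta_numeric, beta_closed)                   # both approximately 0.4236
```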

Page 20:

Boosting: Practical Issues

• When to stop?
  – Most improvement comes from the first 5 to 10 classifiers
  – Significant gains up to about 25 classifiers
  – Generalization error can continue to improve even after training error is zero!
  – Methods (see the sketch below):
    • Cross-validation
    • Discrete estimate of expected generalization error $E_G$

• How are bias and variance affected?
  – Variance usually decreases
  – Boosting can give a reduction in both bias and variance
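A sketch of validation-based stopping with scikit-learn's AdaBoostClassifier: hold out part of the data, track the staged validation accuracy, and keep the number of rounds that scores best (the split size, stump depth, and 200-round cap are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X, y: any labelled data set (for instance the simulated example sketched earlier)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
boost.fit(X_tr, y_tr)

# staged_score yields the validation accuracy after 1, 2, ..., 200 boosting rounds
val_acc = list(boost.staged_score(X_val, y_val))
best_rounds = int(np.argmax(val_acc)) + 1
print("best number of rounds on the validation data:", best_rounds)
```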

Page 21:

Boosting: Practical Issues

• When can boosting have problems?
  – Not enough data
  – Really weak learner
  – Really strong learner
  – Very noisy data
    • Although this can be mitigated, e.g. by detecting outliers or by regularization methods
    • Boosting can be used to detect noise: look for very high weights

Page 22:

Features of Exponential Loss

• Advantages
  – Leads to a simple decomposition into observation weights + weak classifier
  – Smooth, with gradually changing derivatives
  – Convex

• Disadvantages
  – Incorrectly classified outliers may get weighted too heavily (exponentially increased weights), leading to over-sensitivity to noise

$$J(F) = E\big[e^{-y\, F(x)}\big]$$

Page 23:

Squared Error Loss

Explanation of Fig. 10.4:

$$(y - f)^2 = y^2 - 2yf + f^2 = 1 - 2yf + (yf)^2 \qquad (\text{note } y^2 = 1, \text{ so } f^2 = (yf)^2)$$
$$= (1 - yf)^2 = (1 - u)^2, \qquad \text{where } u \equiv yf \text{ is the margin}$$

Page 24:

Other Loss Functions For Classification

• Logistic loss:

$$L_{\text{Logistic}}(y, f(x)) = \log\!\big(1 + e^{-y f(x)}\big)$$

• Very similar population minimizer to the exponential loss
• Similar behavior for positive margins, very different behavior for negative margins
• Logistic loss is more robust against outliers and misspecified data

Page 25:

Other Loss Functions For Classification

• Hinge loss (SVM):

$$L_{\text{HINGE}}(y, f(x)) = \big[1 - y f(x)\big]_{+}$$

• General hinge loss (SVM):

$$L_{\text{GEN-HINGE}}(y, f(x)) = \big[1 - y f(x)\big]_{+}^{q}, \qquad q > 1$$

• These can give improved robustness or accuracy, but require more complex optimization methods:
  – Boosting with exponential loss is linear optimization
  – SVM is quadratic optimization
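To make the robustness comparison concrete, the sketch below evaluates the classification losses discussed above as functions of the margin y·f(x) (the grid of margin values is arbitrary):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)                # values of y * f(x)

losses = {
    "0-1":      (margins < 0).astype(float),       # misclassification loss
    "exp":      np.exp(-margins),                  # exponential (AdaBoost)
    "logistic": np.log(1 + np.exp(-margins)),      # binomial deviance
    "hinge":    np.maximum(0.0, 1 - margins),      # SVM hinge loss
    "squared":  (1 - margins) ** 2,                # squared error, written in terms of the margin
}

for name, values in losses.items():
    print(f"{name:>8}: {np.round(values, 2)}")

# Exponential and squared losses blow up for large negative margins (badly
# misclassified points), while logistic and hinge losses grow only linearly --
# which is why the latter are more robust to outliers.
```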

Page 26:

Robustness of Different Loss Functions

Page 27:

Loss Functions for Regression

• Squared-error loss weights outliers very highly
  – More sensitive to noise and long-tailed error distributions
• Absolute loss
• Huber loss is a hybrid:

$$L(y, f(x)) = \begin{cases} \big[y - f(x)\big]^2 & \text{if } |y - f(x)| \le \delta \\ 2\delta\big(|y - f(x)| - \delta/2\big) & \text{otherwise} \end{cases}$$
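A small sketch comparing the three regression losses on a few residuals r = y − f(x); the Huber loss matches squared error near zero but grows only linearly for outliers (δ = 1.0 is an arbitrary choice):

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # [y - f(x)]^2 when |y - f(x)| <= delta, else 2*delta*(|y - f(x)| - delta/2)
    return np.where(np.abs(r) <= delta,
                    r ** 2,
                    2 * delta * (np.abs(r) - delta / 2))

residuals = np.array([0.1, 0.5, 1.0, 3.0, 10.0])   # the last two behave like outliers
print(squared_loss(residuals))                     # outliers dominate: 9.0 and 100.0
print(absolute_loss(residuals))                    # linear: 3.0 and 10.0
print(huber_loss(residuals))                       # quadratic near zero, linear in the tails: 5.0 and 19.0
```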

Page 28:

Robust Loss Functions for Regression

Page 29:

Boosting and SVM

• Boosting increases the margin “yf(x)” by additive stagewise optimization

• SVM also maximizes the margin “yf(x)”
• The difference is in the loss function:
  – AdaBoost uses the exponential loss, while SVM uses the “hinge loss” function
• SVM is more robust to outliers than AdaBoost
• Boosting can turn weak base classifiers into a strong one; SVM is itself a strong classifier

Page 30:

Summary

• Boosting combines weak learners to obtain a strong one

• From the optimization perspective, boosting is a forward stagewise minimization that increases the classification/regression margin

• Its robustness depends on the choice of loss function

• Boosting with trees is claimed to be the “best off-the-shelf classification” algorithm

• Boosting can overfit!