
Page 1

Lecture 20: AdaBoost

Aykut Erdem, December 2018, Hacettepe University

Page 2

Last time… Bias/Variance Tradeoff

!2

Graphical illustration of bias and variance. (http://scott.fortmann-roe.com/docs/BiasVariance.html)

slide by David Sontag

Page 3

Last time… Bagging
• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D.
• Bootstrap sampling: Given a set D containing N training examples, create D' by drawing N examples at random with replacement from D.
• Bagging (a minimal code sketch follows after this slide):
  - Create k bootstrap samples D1 ... Dk.
  - Train a distinct classifier on each Di.
  - Classify a new instance by majority vote / average.

!3

slide by David Sontag
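To make the bagging recipe on the slide above concrete, here is a minimal sketch in Python. The dataset, the base learner, and the factory callable are placeholders (any object with `fit`/`predict` will do); this is an illustrative sketch, not code from the lecture.

```python
import numpy as np

def bag_predict(base_learner_factory, X, y, X_new, k=25, rng=None):
    """Bagging sketch: train k classifiers on bootstrap samples of (X, y)
    and classify X_new by majority vote.
    base_learner_factory: callable returning an object with fit(X, y) and predict(X).
    Labels are assumed to be in {-1, +1}."""
    rng = np.random.default_rng(rng)
    n = len(y)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)   # bootstrap: draw N examples with replacement
        clf = base_learner_factory()
        clf.fit(X[idx], y[idx])            # train a distinct classifier on D_i
        votes.append(clf.predict(X_new))
    votes = np.stack(votes)                # shape (k, len(X_new))
    # majority vote over the k classifiers (a tie maps to 0)
    return np.sign(votes.sum(axis=0))
```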

Page 4

Last time… Random Forests

!4

slide by Nando de Freitas

[Figure: Random Forests for classification or regression; individual trees t = 1, 2, 3. From the book of Hastie, Friedman and Tibshirani.]

Page 5

Boosting

!5

Page 6

Boosting Ideas
• Main idea: use a weak learner to create a strong learner.
• Ensemble method: combine base classifiers returned by the weak learner.
• Finding simple, relatively accurate base classifiers is often not hard.
• But how should the base classifiers be combined?

!6

slide by Mehryar Mohri

Page 7

Example: "How May I Help You?"
• Goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
  - "yes I'd like to place a collect call long distance please" (Collect)
  - "operator I need to make a call but I need to bill it to my office" (ThirdNumber)
  - "yes I'd like to place a call on my master card please" (CallingCard)
  - "I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill" (BillingCredit)
• Observation:
  - easy to find "rules of thumb" that are "often" correct
    • e.g.: "IF 'card' occurs in the utterance THEN predict 'CallingCard'"
  - hard to find a single highly accurate prediction rule

!7 [Gorin et al.]

slide by Rob Schapire

Page 8

Boosting: Intuition
• Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
• Output class: (weighted) vote of each classifier
  - Classifiers that are most "sure" will vote with more conviction
  - Classifiers will be most "sure" about a particular part of the space
  - On average, do better than a single classifier!
• But how do you
  - force classifiers to learn about different parts of the input space?
  - weigh the votes of different classifiers?

!8

slide by Aarti Singh & Barnabas Poczos

Page 9

Boosting [Schapire, 1989]
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
• On each iteration t:
  - weight each training example by how incorrectly it was classified
  - learn a hypothesis ht
  - compute a strength for this hypothesis, αt
• Final classifier: a linear combination of the votes of the different classifiers, weighted by their strength
• Practically useful
• Theoretically interesting

!9

slide by Aarti Singh & Barnabas Poczos

Page 10

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!10

Greedy algorithm: for m = 1, ..., M
• Pick a weak classifier hm
• Adjust weights: misclassified examples get "heavier"
• αm set according to the weighted error of hm

slide by Raquel Urtasun
[Source: G. Shakhnarovich, Lecture 5: Boosting, decision trees (TTIC), January 14, 2013]
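Written out, the greedy scheme above builds an additive model. The following is a sketch of the quantities involved, with notation chosen to match the slides; the exact weight update and the choice of αm are specified later in the AdaBoost slides.

```latex
% Ensemble after M rounds: a weighted vote of the weak classifiers
f_M(x) = \sum_{m=1}^{M} \alpha_m\, h_m(x),
\qquad
\hat{y}(x) = \operatorname{sign}\!\big(f_M(x)\big).
```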

Page 11

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!11

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 12

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!12

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 13

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!13

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 14

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!14

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 15

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!15

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 16

Boosting: Intuition
• Want to pick weak classifiers that contribute something to the ensemble.

!16

(Greedy algorithm repeated from the previous slide.)

slide by Raquel Urtasun
[Source: G. Shakhnarovich]

Page 17

First Boosting Algorithms
• [Schapire '89]:
  - first provable boosting algorithm
• [Freund '90]:
  - "optimal" algorithm that "boosts by majority"
• [Drucker, Schapire & Simard '92]:
  - first experiments using boosting
  - limited by practical drawbacks
• [Freund & Schapire '95]:
  - introduced the "AdaBoost" algorithm
  - strong practical advantages over previous boosting algorithms

!17

slide by Rob Schapire

Page 18

The AdaBoost Algorithm

!18

Page 19

Toy Example

weak hypotheses = vertical or horizontal half-planes

!19

Minimize the error

… in a series of rounds t = 1, ..., T. One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example i on round t is denoted Dt(i). Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the base learner is forced to focus on the hard examples in the training set.

The base learner's job is to find a base classifier ht appropriate for the distribution Dt. (Base classifiers were also called rules of thumb or weak prediction rules in Section 1.) In the simplest case, the range of each ht is binary, i.e., restricted to {−1, +1}; the base learner's job then is to minimize the error

  εt = Pr_{i∼Dt}[ht(xi) ≠ yi].

Once the base classifier ht has been received, AdaBoost chooses a parameter αt that intuitively measures the importance that it assigns to ht. In the figure, we have deliberately left the choice of αt unspecified. For binary ht, we typically set

  αt = (1/2) ln((1 − εt)/εt)    (1)

as in the original description of AdaBoost given by Freund and Schapire [32]. More on choosing αt follows in Section 3. The distribution Dt is then updated using the rule shown in the figure. The final or combined classifier H is a weighted majority vote of the base classifiers, where αt is the weight assigned to ht.

3 Analyzing the training error

The most basic theoretical property of AdaBoost concerns its ability to reduce the training error, i.e., the fraction of mistakes on the training set. Specifically, Schapire and Singer [70], in generalizing a theorem of Freund and Schapire [32], show that the training error of the final classifier H is bounded as follows:

  (1/m) |{i : H(xi) ≠ yi}| ≤ (1/m) Σi exp(−yi f(xi)) = Πt Zt    (2)

where henceforth we define

  f(x) = Σt αt ht(x)    (3)

so that H(x) = sign(f(x)). (For simplicity of notation, we write Σi and Σt as shorthand for Σ_{i=1}^{m} and Σ_{t=1}^{T}, respectively.) The inequality follows from the fact that exp(−yi f(xi)) ≥ 1 if yi ≠ H(xi). The equality can be proved straightforwardly by unraveling the recursive definition of Dt.

For binary ht, typically use

  αt = (1/2) ln((1 − εt)/εt)

slide by Rob Schapire
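As a quick check of formula (1) against the toy example on the next slides (where the slide values are rounded), the first round has ε1 = 0.30, so

```latex
\alpha_1 = \tfrac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1}
         = \tfrac{1}{2}\ln\frac{0.70}{0.30}
         \approx 0.42 .
```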

Page 20

Round 1

!20

h1: ε1 = 0.30, α1 = 0.42; updated distribution D2

slide by Rob Schapire

Page 21

Round 1

!21

h1: ε1 = 0.30, α1 = 0.42; updated distribution D2

slide by Rob Schapire

Page 22

Round 1

!22

h1: ε1 = 0.30, α1 = 0.42; updated distribution D2

slide by Rob Schapire

Page 23

Round 2

!23

h2: ε2 = 0.21, α2 = 0.65; updated distribution D3

slide by Rob Schapire

Page 24

Round 2

!24

h2: ε2 = 0.21, α2 = 0.65; updated distribution D3

slide by Rob Schapire

Page 25

Round 2

!25

h2: ε2 = 0.21, α2 = 0.65; updated distribution D3

slide by Rob Schapire

Page 26

Round 3

!26

h3: ε3 = 0.14, α3 = 0.92

slide by Rob Schapire

Page 27

Round 3

!27

h3: ε3 = 0.14, α3 = 0.92

slide by Rob Schapire

Page 28

Final Hypothesis

!28

Hfinal = sign(0.42 h1 + 0.65 h2 + 0.92 h3)

slide by Rob Schapire

Page 29

Voted combination of classifiers
• The general problem here is to try to combine many simple "weak" classifiers into a single "strong" classifier
• We consider voted combinations of simple binary ±1 component classifiers, where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others

!29

slide by Tommi S. Jaakkola
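The voted combination itself was an equation on the original slide and did not survive extraction. A plausible reconstruction, consistent with the notation used on the following slides (h(x; θi) denotes a component classifier with parameters θi), is:

```latex
h_M(x) = \operatorname{sign}\!\Big(\alpha_1 h(x;\theta_1) + \alpha_2 h(x;\theta_2) + \cdots + \alpha_M h(x;\theta_M)\Big),
\qquad \alpha_i \ge 0 .
```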

Page 30

Components: Decision stumps
• Consider the following simple family of component classifiers generating ±1 labels: a signed threshold on a single coordinate of the input, h(x; θ). These are called decision stumps (a code sketch follows after this slide).
• Each decision stump pays attention to only a single component of the input vector

!30

slide by Tommi S. Jaakkola
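A minimal weighted decision-stump learner in Python, as an illustrative sketch rather than the notation of the original slide: it searches for the coordinate, threshold, and sign that minimize the weighted training error, which is exactly the weak learner a boosting round needs. The helpers `fit_stump` and `stump_predict` are names introduced here for illustration.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: h(x) = s * sign(x[:, k] - thresh).
    X: (n, d) array, y: labels in {-1, +1}, w: nonnegative weights summing to 1.
    Returns (k, thresh, s, weighted_error)."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)
    for k in range(d):
        # candidate thresholds: one below all values, then midpoints of sorted values
        vals = np.unique(X[:, k])
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for thresh in thresholds:
            pred = np.where(X[:, k] > thresh, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * pred != y))   # weighted 0/1 error
                if err < best[3]:
                    best = (k, thresh, s, err)
    return best

def stump_predict(stump, X):
    k, thresh, s, _ = stump
    return s * np.where(X[:, k] > thresh, 1, -1)
```

These two helpers are reused as the weak learner in the AdaBoost sketch given after the full pseudocode (page 44).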

Page 31

Voted combinations (cont'd.)
• We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive
• While there are many options for the loss function, we consider here only a simple exponential loss

!31

slide by Tommi S. Jaakkola
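The exponential loss referred to here, written for the voted combination f(x) over n training pairs (xi, yi) with yi ∈ {−1, +1}. The slide's own equation was lost in extraction; this is the standard form, consistent with the exp(−αt yi ht(xi)) factor used later in the AdaBoost weight update.

```latex
J(\alpha, \theta) = \sum_{i=1}^{n} \exp\!\big(-y_i\, f(x_i)\big),
\qquad
f(x) = \sum_{m} \alpha_m\, h(x;\theta_m).
```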

Page 32

Modularity, errors, and loss
• Consider adding the mth component:

!32

slide by Tommi S. Jaakkola

Page 33

Modularity, errors, and loss
• Consider adding the mth component:

!33

slide by Tommi S. Jaakkola

Page 34

Modularity, errors, and loss
• Consider adding the mth component:
• So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).

!34

slide by Tommi S. Jaakkola
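The step these slides allude to, with the slide equations lost in extraction, is the standard decomposition of the exponential loss when the mth component is added to the current combination f_{m−1}:

```latex
\sum_{i=1}^{n} \exp\!\big(-y_i\,[\,f_{m-1}(x_i) + \alpha_m h(x_i;\theta_m)\,]\big)
  = \sum_{i=1}^{n} W_i^{(m-1)}\, \exp\!\big(-y_i\, \alpha_m\, h(x_i;\theta_m)\big),
\qquad
W_i^{(m-1)} = \exp\!\big(-y_i\, f_{m-1}(x_i)\big).
```

The new component and its vote therefore only see the old loss through the example weights W_i^{(m−1)}, which are largest on the examples the current ensemble gets wrong.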

Page 35

Empirical exponential loss (cont'd.)
• To increase modularity we'd like to further decouple the optimization of h(x; θm) from the associated votes αm
• To this end we select h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm

!35

slide by Tommi S. Jaakkola

Page 36

Empirical exponential loss (cont'd.)
• We find the h(x; θm) that minimizes the resulting criterion
• We can also normalize the weights so that they sum to one

!36

slide by Tommi S. Jaakkola

Page 37

Empirical exponential loss (cont'd.)
• We find the h(x; θm) that minimizes the weighted training error, with weights given by the normalized W's
• The vote αm is subsequently chosen to minimize the resulting weighted loss

!37

slide by Tommi S. Jaakkola
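Carrying out that last minimization in closed form gives the following, written with the normalized weights W̃ and the weighted error ε̂m of h(x; θm). The slide equations did not survive extraction, so this is a standard reconstruction; it matches the αt used in the AdaBoost pseudocode on the following slides.

```latex
\hat{\epsilon}_m = \sum_{i=1}^{n} \tilde{W}_i^{(m-1)}\,\big[\, y_i \ne h(x_i;\theta_m) \,\big],
\qquad
\hat{\alpha}_m = \tfrac{1}{2}\,\ln\frac{1-\hat{\epsilon}_m}{\hat{\epsilon}_m}.
```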

Page 38

The AdaBoost Algorithm

!38

slide by Jiri Matas and Jan Šochman

Page 39

The AdaBoost Algorithm

!39

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}

slide by Jiri Matas and Jan Šochman

Page 40

The AdaBoost Algorithm

!40

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m

slide by Jiri Matas and Jan Šochman

Page 41

The AdaBoost Algorithm

!41

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  • Find ht = argmin_{hj ∈ H} εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop

[Figure: t = 1]

slide by Jiri Matas and Jan Šochman

Page 42

The AdaBoost Algorithm

!42

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  • Find ht = argmin_{hj ∈ H} εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop
  • Set αt = (1/2) log((1 − εt)/εt)

[Figure: t = 1]

slide by Jiri Matas and Jan Šochman

Page 43

The AdaBoost Algorithm

!43

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  • Find ht = argmin_{hj ∈ H} εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop
  • Set αt = (1/2) log((1 − εt)/εt)
  • Update Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor

[Figure: t = 1]

slide by Jiri Matas and Jan Šochman

Page 44

The AdaBoost Algorithm

!44

Given: (x1, y1), ..., (xm, ym); xi ∈ X, yi ∈ {−1, +1}
Initialise weights D1(i) = 1/m
For t = 1, ..., T:
  • Find ht = argmin_{hj ∈ H} εj = Σ_{i=1}^{m} Dt(i) ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop
  • Set αt = (1/2) log((1 − εt)/εt)
  • Update Dt+1(i) = Dt(i) exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor
Output the final classifier:
  H(x) = sign(Σ_{t=1}^{T} αt ht(x))

(A runnable sketch of this pseudocode follows after this slide.)

[Figure: training error vs. step, shown at t = 1]

slide by Jiri Matas and Jan Šochman
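Putting the pseudocode above into runnable form, here is a minimal Python sketch. It reuses the illustrative `fit_stump` / `stump_predict` helpers sketched after the decision-stump slide (those names are not from the lecture), and follows the weight update and αt from this slide.

```python
import numpy as np

def adaboost_fit(X, y, T=40):
    """AdaBoost with decision stumps. y must be in {-1, +1}.
    Returns a list of (stump, alpha) pairs."""
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D1(i) = 1/m
    ensemble = []
    for t in range(T):
        stump = fit_stump(X, y, D)               # weak learner: minimize weighted error
        eps = stump[3]
        if eps >= 0.5:                           # weak learner no better than chance: stop
            break
        eps = max(eps, 1e-12)                    # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # alpha_t = 1/2 log((1 - eps_t)/eps_t)
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)        # reweight: mistakes get heavier
        D = D / D.sum()                          # divide by the normalisation factor Z_t
        ensemble.append((stump, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(alpha * stump_predict(stump, X) for stump, alpha in ensemble)
    return np.sign(scores)
```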

Page 45

The AdaBoost Algorithm

!45

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 2]

slide by Jiri Matas and Jan Šochman

Page 46

The AdaBoost Algorithm

!46

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 3]

slide by Jiri Matas and Jan Šochman

Page 47

The AdaBoost Algorithm

!47

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 4]

slide by Jiri Matas and Jan Šochman

Page 48

The AdaBoost Algorithm

!48

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 5]

slide by Jiri Matas and Jan Šochman

Page 49

The AdaBoost Algorithm

!49

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 6]

slide by Jiri Matas and Jan Šochman

Page 50

The AdaBoost Algorithm

!50

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 7]

slide by Jiri Matas and Jan Šochman

Page 51

The AdaBoost Algorithm

!51

(Algorithm repeated from the previous slide.)

[Figure: training error vs. step, shown at t = 40]

slide by Jiri Matas and Jan Šochman

Page 52

Reweighting

!52

slide by Jiri Matas and Jan Šochman

Page 53

Reweighting

!53

slide by Jiri Matas and Jan Šochman

Page 54

Reweighting

!54

slide by Jiri Matas and Jan Šochman

Page 55

Boosting results – Digit recognition
• Boosting often (but not always):
  - robust to overfitting
  - test set error decreases even after training error is zero

!55 [Schapire, 1989]

Empirical Observations [Mehryar Mohri, Foundations of Machine Learning]: Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore:

[Figure: left, training and test error (%) of boosted C4.5 decision trees vs. # rounds (log scale), Schapire et al., 1998; right, cumulative distribution of margins.]

Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset as reported by Schapire et al. [69]. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting. The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

The margin is a number in [−1, +1] and is positive if and only if H correctly classifies the example. Moreover, as before, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most a quantity that depends on the fraction of training examples with margin below some threshold θ > 0, on the complexity of the base classifiers, and on the training set size, for any such θ, with high probability. Note that this bound is entirely independent of T, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2, which shows the cumulative distribution of margins of the training examples on the "letter" dataset. In this case, even after the training error reaches zero, boosting continues to increase the margins of the training examples, effecting a corresponding drop in the test error.

Although the margins theory gives a qualitative explanation of the effectiveness of boosting, quantitatively, the bounds are rather weak. Breiman [9], for instance, …

slide by Carlos Guestrin
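For reference, the margin plotted above is the normalized vote of the combined classifier. The defining equation did not survive extraction; the standard definition from Schapire et al. is assumed here:

```latex
\operatorname{margin}_f(x_i, y_i) = \frac{y_i \sum_{t} \alpha_t\, h_t(x_i)}{\sum_{t} |\alpha_t|} \in [-1, +1].
```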

Page 56

Application: Detecting Faces
• Training Data
  - 5000 faces
    • all frontal
  - 300 million non-faces
    • 9500 non-face images

!56

Application to face detection [Viola and Jones, 2001]

[Viola & Jones]

slide by Rob Schapire

Page 57

Application: Detecting Faces
• Problem: find faces in a photograph or movie
• Weak classifiers: detect light/dark rectangles in the image (a sketch follows after this slide)
• Many clever tricks to make it extremely fast and accurate

!57 [Viola & Jones]

slide by Rob Schapire
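The light/dark rectangle weak classifiers are built on Haar-like rectangle features, which can be evaluated in constant time using an integral image. Below is an illustrative sketch, not the authors' code; `img` is assumed to be a 2-D grayscale array, and the feature/threshold layout is simplified.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended,
    so rectangle sums need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """A two-rectangle Haar-like feature: left half minus right half.
    A weak classifier thresholds this value; AdaBoost picks the best
    features and thresholds and weights their votes."""
    half = w // 2
    return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, w - half)
```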

Page 58

Boosting vs. Logistic Regression

!58

Logistic regression:
• Minimize log loss
• Define f(x) using predefined features xj (a linear classifier)
• Jointly optimize over all weights w0, w1, w2, …

Boosting:
• Minimize exp loss
• Define f(x) using weak classifiers ht(x) chosen dynamically to fit the data (not a linear classifier)
• Weights αt learned incrementally, one per iteration t

slide by Aarti Singh
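The two losses named above, written out for a score f(x) and labels y ∈ {−1, +1}. These are the standard forms, stated here because the slide's formulas were lost in extraction:

```latex
\text{log loss: } \sum_{i} \ln\!\big(1 + e^{-y_i f(x_i)}\big),
\qquad
\text{exp loss: } \sum_{i} e^{-y_i f(x_i)}.
```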

Page 59

Boosting vs. Bagging

!59

Bagging:
• Resamples data points
• Weight of each classifier is the same
• Only variance reduction

Boosting:
• Reweights data points (modifies their distribution)
• Weight is dependent on classifier's accuracy
• Both bias and variance reduced (the learning rule becomes more complex with iterations)

slide by Aarti Singh

Page 60

Next Lecture: K-Means Clustering

!60