

Introduction to Boosting

The University of Manchester

February 2, 2012


Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details


Boosting - Introduction

Train one base learner at a time.

Focus it on the mistakes of its predecessors.

Weight it based on how ‘useful’ it is in the ensemble (not on its training error).


Boosting - Introduction

Under very mild assumptions on the base learners (“weak learning” - they must perform better than random guessing):

1 Training error is eventually reduced to zero (we can fit the training data).

2 Generalisation error continues to be reduced even after training error is zero (there is no overfitting).


Boosting - Introduction

The most popular ensemble algorithm is a boosting algorithm called “Adaboost”.

Adaboost is short for “Adaptive Boosting”, because the algorithm adapts weights on the base learners and training examples.

There are many explanations of precisely what Adaboost does and why it is so successful - but the basic idea is simple!


Adaboost - Pseudo-code

Adaboost Pseudo-code

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for


Adaboost - Loss Function

To see precisely how Adaboost is implemented, we need to understand its loss function:

L(H) = \sum_{i=1}^{N} e^{-m_i}    (1)

m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)    (2)

m_i is the voting margin. There are N examples and K base learners. α_k is the weight of the kth learner, and h_k(x_i) is its prediction on x_i.
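To make equations (1) and (2) concrete, here is a minimal numpy sketch that computes the margins and the exponential loss for a toy ensemble; the arrays y, preds and alphas are invented placeholders, not part of the original slides.

import numpy as np

# Toy setup: N = 4 examples, K = 3 base learners, labels and predictions in {-1, +1}.
y = np.array([+1, -1, +1, -1])              # true labels y_i
preds = np.array([[+1, +1, +1],             # preds[i, k] = h_k(x_i)
                  [-1, -1, +1],
                  [+1, -1, -1],
                  [-1, +1, -1]])
alphas = np.array([0.8, 0.5, 0.3])          # learner weights alpha_k

# Voting margin m_i = y_i * sum_k alpha_k * h_k(x_i)   (equation 2)
margins = y * (preds @ alphas)

# Exponential loss L(H) = sum_i exp(-m_i)               (equation 1)
loss = np.exp(-margins).sum()
print(margins, loss)

A correctly classified example with a confident ensemble vote gets a large positive margin and contributes almost nothing to the loss; a misclassified example gets a negative margin and contributes more than 1.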


Adaboost - Loss Function

Why Exponential Loss?

Loss is higher when a prediction is wrong.

Loss is steeper when a prediction is wrong.

Precise reasons later...

Adaboost - Loss Function

[Figure]

Adaboost - Example Weights

After k learners, the example weights used to train the (k + 1)th learner are:

D_k(i) \propto e^{-m_i}    (3)

= e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)}    (4)

= e^{-y_i \sum_{j=1}^{k-1} \alpha_j h_j(x_i)} \, e^{-y_i \alpha_k h_k(x_i)}    (5)

\propto D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (6)

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (7)

Z_k normalises D_k(i) such that ∑_{i=1}^{N} D_k(i) = 1.

Larger when m_i is negative.

Directly proportional to the loss function.
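As a sketch of how this update might look in code (numpy, with hypothetical arrays y, h and D_prev standing in for the labels, the kth learner's predictions and the previous weights D_{k-1}):

import numpy as np

def update_example_weights(D_prev, y, h, alpha):
    """AdaBoost weight update: D_k(i) is proportional to D_{k-1}(i) * exp(-alpha * y_i * h_k(x_i))."""
    D = D_prev * np.exp(-alpha * y * h)   # up-weight mistakes (y*h = -1), down-weight correct ones
    return D / D.sum()                    # dividing by Z_k makes the weights sum to 1

Misclassified examples have their weight multiplied by e^{α_k} and correctly classified ones by e^{−α_k}, before the renormalisation by Z_k.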


Adaboost - Learner Weights

Once the kth learner is trained, we evaluate its predictions on the training data, h_k(x_i), and assign the weight:

\alpha_k = \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k},    (8)

\epsilon_k = \sum_{i=1}^{N} D_{k-1}(i) \, \delta[h_k(x_i) \neq y_i]    (9)

ε_k is the weighted error of the kth learner.

The log in the learner weight is related to the exponential in the loss function.
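A corresponding sketch for equations (8) and (9), again with hypothetical numpy arrays (labels y, predictions h, previous weights D_prev):

import numpy as np

def learner_weight(D_prev, y, h):
    """Weighted error eps_k (equation 9) and learner weight alpha_k (equation 8)."""
    eps = np.sum(D_prev * (h != y))           # sum_i D_{k-1}(i) * delta[h_k(x_i) != y_i]
    alpha = 0.5 * np.log((1.0 - eps) / eps)   # assumes 0 < eps < 0.5 (weak learning)
    return eps, alpha

Note that α_k > 0 exactly when ε_k < 0.5, and it grows without bound as ε_k approaches 0.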

Adaboost - Learner Weights

[Figure]

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for


Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i : D_0(i) ← 1/N
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for


Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}.
    h_k ← base learner trained on D.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for


Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}.
    h_k ← base learner trained on D.
    ε_k ← ∑_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for


Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}.
    h_k ← base learner trained on D.
    ε_k ← ∑_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    Set example weights based on ensemble predictions.
end for


Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): Example i weight after learner k
α_k: Learner k weight

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}.
    h_k ← base learner trained on D.
    ε_k ← ∑_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    D_k(i) ← D_{k−1}(i) e^{−α_k y_i h_k(x_i)} / Z_k
end for


Adaboost - Pseudo-code

Decision rule:

H(x) = \operatorname{sign}\left( \sum_{k=1}^{K} \alpha_k h_k(x) \right)    (10)
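Putting the pseudo-code and the decision rule together, the following is one plausible end-to-end sketch in numpy. It uses decision stumps as base learners (a common but not mandated choice), trains each stump on the weighted data directly rather than on a resampled set, and all function and variable names are invented for illustration.

import numpy as np

def train_stump(X, y, D):
    """Exhaustive search for the decision stump (feature, threshold, polarity)
    with the lowest weighted error under the example weights D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]  # (feature index, threshold, polarity)

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_train(X, y, K=20):
    """AdaBoost loop following the slides' pseudo-code; labels y must be in {-1, +1}."""
    N = len(y)
    D = np.full(N, 1.0 / N)                    # D_0(i) = 1/N
    learners, alphas = [], []
    for _ in range(K):
        stump = train_stump(X, y, D)           # train base learner with weighted sample
        h = stump_predict(stump, X)            # test base learner on all data
        eps = np.sum(D * (h != y))             # weighted error eps_k
        if eps >= 0.5:                         # weak-learning assumption violated: stop early
            break
        eps = max(eps, 1e-12)                  # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight alpha_k
        D = D * np.exp(-alpha * y * h)         # up-weight mistakes, down-weight correct examples
        D /= D.sum()                           # normalise by Z_k
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    """Decision rule H(x) = sign(sum_k alpha_k h_k(x))   (equation 10)."""
    votes = np.zeros(len(X))
    for stump, alpha in zip(learners, alphas):
        votes += alpha * stump_predict(stump, X)
    return np.sign(votes)

On a toy one-dimensional dataset such as X = np.array([[0.], [1.], [2.], [3.]]) with y = np.array([-1, -1, +1, +1]), a single stump already separates the classes, and adaboost_predict recovers y exactly.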


Adaboost - Demo

[Demo]


Adaboost - Training Error

The training error rate of Adaboost reaches 0 for large enough K.

This is a consequence of the exponential loss combined with the weak learning assumption.

If the base learner weighted error is always ε_k ≤ 1/2 − γ, then:

e_{tr} \leq \left( \sqrt{1 - 4\gamma^2} \right)^{K}    (11)
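To get a feel for how quickly this bound shrinks, here is a small numeric illustration (the values of γ and K are arbitrary):

import numpy as np

# Training-error bound e_tr <= (sqrt(1 - 4*gamma^2))^K    (equation 11)
for gamma in (0.05, 0.1, 0.2):
    for K in (10, 100, 500):
        bound = np.sqrt(1 - 4 * gamma ** 2) ** K
        print(f"gamma={gamma:.2f}  K={K:4d}  bound={bound:.3e}")

Even a small edge over random guessing drives the bound toward zero as K grows; once the bound drops below 1/N, the training error must be exactly zero, since e_tr is always a multiple of 1/N.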

Adaboost - Training Error

[Figure]

Adaboost - Training Error

Let’s derive the bound!

1 Show that the exponential loss is an upper bound on the classification error.

2 Show that the loss can be written as a product of the normalisation constants.

3 Minimise the normalisation constant at each step (this derives the learner weight rule).

4 Write this constant in terms of the weighted error at each step.


Adaboost - Training Error

Show that the exponential loss is an upper bound on the classification error.

e_{tr} = \frac{1}{N} \sum_{i=1}^{N} \delta[H(x_i) \neq y_i]    (12)

= \frac{1}{N} \sum_{i=1}^{N} \delta[m_i \leq 0]    (13)

\leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (14)

This holds because x ≤ 0 implies e^{−x} ≥ 1, and e^{−x} ≥ 0 for all x, so the exponential dominates the indicator pointwise.
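A one-line numeric check of the pointwise inequality δ[m ≤ 0] ≤ e^{−m} used in (14):

import numpy as np

m = np.linspace(-3, 3, 601)                              # a grid of margins
assert np.all((m <= 0).astype(float) <= np.exp(-m))      # delta[m <= 0] <= exp(-m) everywhere
print("inequality holds on the grid")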


Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (15)

= D_0(i) \prod_{j=1}^{k} \frac{e^{-y_i \alpha_j h_j(x_i)}}{Z_j}    (16)

= \frac{1}{N} \, e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)} \, \frac{1}{\prod_{j=1}^{k} Z_j}    (17)

e^{-m_i} = N D_k(i) \prod_{j=1}^{k} Z_j    (18)


Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

e^{-m_i} = N D_k(i) \prod_{j=1}^{k} Z_j    (19)

e_{tr} \leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (20)

= \frac{1}{N} \sum_{i=1}^{N} N D_k(i) \prod_{j=1}^{k} Z_j    (21)

= \left[ \sum_{i=1}^{N} D_k(i) \right] \prod_{j=1}^{k} Z_j    (22)

= \prod_{j=1}^{k} Z_j    (23)


Adaboost - Training Error

Minimise the normalisation constant at each step.

Z_k = \sum_{i=1}^{N} D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (24)

= \sum_{h_k(x_i) \neq y_i} D_{k-1}(i) \, e^{\alpha_k} + \sum_{h_k(x_i) = y_i} D_{k-1}(i) \, e^{-\alpha_k}    (25)

= \epsilon_k e^{\alpha_k} + (1 - \epsilon_k) e^{-\alpha_k}    (26)

Adaboost - Training Error

[Figure]

Adaboost - Training Error

Minimise the normalisation constant at each step.

Z_k = \epsilon_k e^{\alpha_k} + (1 - \epsilon_k) e^{-\alpha_k}    (27)

\frac{\partial Z_k}{\partial \alpha_k} = \epsilon_k e^{\alpha_k} - (1 - \epsilon_k) e^{-\alpha_k}    (28)

0 = \epsilon_k e^{2\alpha_k} - (1 - \epsilon_k)    (29)

e^{2\alpha_k} = \frac{1 - \epsilon_k}{\epsilon_k}    (30)

\alpha_k = \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}    (31)

Minimising Z_k gives us the learner weight rule.
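A quick numeric sanity check that this closed form really is the minimiser of Z_k (the value of ε_k is arbitrary):

import numpy as np

eps = 0.3                                          # an arbitrary weighted error < 0.5
alpha_star = 0.5 * np.log((1 - eps) / eps)         # closed-form minimiser   (equation 31)

def Z(a):
    """Z_k as a function of the learner weight alpha_k."""
    return eps * np.exp(a) + (1 - eps) * np.exp(-a)

grid = np.linspace(0.0, 2.0, 20001)
print(grid[np.argmin(Z(grid))], alpha_star)        # the grid minimiser matches the formula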


Adaboost - Training Error

Write the normalisation constant using the weighted error.

Z_k = \epsilon_k e^{\alpha_k} + (1 - \epsilon_k) e^{-\alpha_k}    (32)

\alpha_k = \frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}    (33)

Z_k = \epsilon_k e^{\frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}} + (1 - \epsilon_k) e^{-\frac{1}{2} \log \frac{1 - \epsilon_k}{\epsilon_k}}    (34)

= 2 \sqrt{\epsilon_k (1 - \epsilon_k)}    (35)

So Z_k depends only on ε_k, and the weak learning assumption says that ε_k < 0.5.
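And a check that the simplified expression (35) agrees with the original form (32) once α_k is plugged in (ε values arbitrary):

import numpy as np

for eps in (0.1, 0.3, 0.45):
    alpha = 0.5 * np.log((1 - eps) / eps)
    Z_direct = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)   # equation 32 with alpha_k inserted
    Z_closed = 2 * np.sqrt(eps * (1 - eps))                       # equation 35
    print(f"eps={eps:.2f}  {Z_direct:.6f}  {Z_closed:.6f}")       # identical, and < 1 for eps < 0.5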


Adaboost - Training Error

We bound the training error using e_tr ≤ ∏_{j=1}^{k} Z_j.

We've found that Z_j is a function of the weighted error: Z_j = 2√(ε_j(1 − ε_j)).

So we can bound the training error: e_tr ≤ ∏_{j=1}^{k} 2√(ε_j(1 − ε_j)).

Let ε_j ≤ 1/2 − γ, and we can write e_tr ≤ (√(1 − 4γ²))^k.

If γ > 0, then this bound will shrink to 0 when k is large enough.


Adaboost - Training Error Summary

We have seen an upper bound on the training error which decreases as K increases.

We can prove the bound by considering the loss function as a bound on the error rate.

Examining the role of the normalisation constant in this bound reveals the learner weight rule.

Examining the role of γ in the bound shows why we can achieve 0 training error even with weak learners.
