
ICML 2009 Tutorial

Survey of Boosting from an Optimization Perspective

Part I: Entropy Regularized LPBoost
Part II: Boosting from an Optimization Perspective

Manfred K. Warmuth (UCSC)
S.V.N. Vishwanathan (Purdue & Microsoft Research)

Updated: March 23, 2010


Outline

1 Introduction to Boosting

2 What is Boosting?

3 Entropy Regularized LPBoost

4 Overview of Boosting algorithms

5 Conclusion and Open Problems

Introduction to Boosting

Setup for Boosting

[Giants of the field: Schapire, Freund]

examples: 11 apples
+1 if artificial, −1 if natural

goal: classification


Setup for Boosting

[Figure: the ±1-labeled examples in the (feature 1, feature 2) plane]

+1/−1 examples
weight d_n ≈ point size
separable


Weak hypotheses

[Figure: decision-stump boundaries in the (feature 1, feature 2) plane]

weak hypotheses: decision stumps on the two features
a single stump can't do it

goal: find a convex combination of weak hypotheses that classifies all examples correctly
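To make the weak-hypothesis class concrete, here is a minimal Python sketch of a decision stump on one feature (illustrative only; the names and the >= orientation are assumptions, not from the tutorial):

```python
import numpy as np

def stump(feature, threshold, sign):
    """Decision stump: thresholds a single feature and predicts +/-1;
    sign in {+1, -1} flips which side is positive."""
    def h(X):
        return sign * np.where(X[:, feature] >= threshold, 1.0, -1.0)
    return h

# e.g. a stump that splits on the first feature at 0.5:
h = stump(feature=0, threshold=0.5, sign=+1)
```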


Boosting: 1st iteration

[Figure: first decision stump splitting the examples]

First hypothesis: error = 1/11, edge = 9/11

low error = high edge
edge = 1 − 2 · error
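A one-line check of the relation on this slide, with one mistake among 11 uniformly weighted examples:

```python
error = 1 / 11          # weighted error of the first hypothesis
edge = 1 - 2 * error    # edge = 1 - 2 * error
print(edge)             # 0.8181... = 9/11, as on the slide
```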


Update after 1st

[Figure: example weights after the first update]

Misclassified examples: increased weights

After update: edge of the hypothesis has decreased


Before 2nd iteration

[Figure: current example weights]

Hard examples: high weight


Boosting: 2nd hypothesis

[Figure: second decision stump]

Pick hypotheses with high (weighted) edge


Update after 2nd

[Figure: example weights after the second update]

After update: edges of all past hypotheses should be small


3rd hypothesis

[Figure: third decision stump]


Update after 3rd

[Figure: example weights after the third update]


4th hypothesis

[Figure: fourth decision stump]


Update after 4th

[Figure: example weights after the fourth update]


Final convex combination of all hypotheses

Decision: $\sum_{t=1}^T w_t h_t(x) \ge 0$?

[Figure: combined decision boundary; shading shows positive total weight vs. negative total weight]
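A sketch of this decision rule (assuming each weak hypothesis returns ±1 predictions):

```python
import numpy as np

def master_predict(X, hypotheses, w):
    """Sign of the convex combination sum_t w_t h_t(x)."""
    return np.sign(sum(w_t * h(X) for w_t, h in zip(w, hypotheses)))
```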


Protocol of Boosting [FS97]

Maintain a distribution on the N ±1-labeled examples

At iteration t = 1, ..., T:
- Receive a "weak" hypothesis h_t of high edge
- Update d^{t-1} to d^t: more weight on "hard" examples

Output a convex combination of the weak hypotheses: $\sum_{t=1}^T w_t h_t(x)$

Two sets of weights:
- distribution d on the examples
- distribution w on the hypotheses
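The protocol as a code skeleton (a sketch: `oracle` and `update` are stand-ins for the abstract steps above, and each hypothesis is assumed to map X to ±1 predictions):

```python
import numpy as np

def boost(X, y, oracle, update, T):
    N = len(y)
    d = np.full(N, 1.0 / N)          # distribution d on the examples
    hypotheses, w = [], []
    for t in range(T):
        h = oracle(X, y, d)          # "weak" hypothesis of high edge
        u = y * h(X)                 # accuracies u_n = y_n h(x_n)
        d, w_t = update(d, u)        # more weight on "hard" examples
        hypotheses.append(h)
        w.append(w_t)
    w = np.array(w) / np.sum(w)      # distribution w on the hypotheses
    return hypotheses, w
```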


Data representation

$u^t_n := y_n h_t(x_n)$   (perfect +1, opposite −1, neutral 0)

y_n:       −1  −1  −1  −1  +1  +1  +1  +1
h_1(x_n):  −1  −1  −1  +1  +1  +1  +1  −1
u^1_n:     +1  +1  +1  −1  +1  +1  +1  −1
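The table reproduced numerically (labels and h_1 predictions copied from the slide):

```python
import numpy as np

y  = np.array([-1, -1, -1, -1, +1, +1, +1, +1])  # labels y_n
h1 = np.array([-1, -1, -1, +1, +1, +1, +1, -1])  # predictions h_1(x_n)
u1 = y * h1                                      # u^1_n = y_n h_1(x_n)
print(u1)  # [ 1  1  1 -1  1  1  1 -1]
```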


Edge vs. margin [Br99]

Edge of hypothesis h_t for a distribution d on the examples ($d \in \mathcal{P}^N$):

$\sum_{n=1}^N u^t_n\, d_n$   (the weighted accuracy of the hypothesis)

Margin of example n for the current hypothesis weighting w ($w \in \mathcal{P}^T$):

$\sum_{t=1}^T u^t_n\, w_t$   (the weighted accuracy of the example)
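Collecting the accuracies u^t_n into a T × N matrix U, both quantities are matrix-vector products (a sketch):

```python
import numpy as np

def edges(U, d):
    """Edge of each hypothesis (row of U) under distribution d."""
    return U @ d

def margins(U, w):
    """Margin of each example (column of U) under mixture w."""
    return U.T @ w
```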


AdaBoost

Initialize t = 0 and $d^0_n = \frac{1}{N}$

For t = 1, ..., T:

- Get h_t whose edge w.r.t. the current distribution is $1 - 2\epsilon_t$
- Set $w_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
- Update the distribution:
  $d^t_n = \frac{d^{t-1}_n \exp(-w_t u^t_n)}{\sum_{n'} d^{t-1}_{n'} \exp(-w_t u^t_{n'})}$

Final hypothesis: $\mathrm{sgn}\left(\sum_{t=1}^T w_t h_t(\cdot)\right)$
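A compact sketch of the algorithm above; `weak_learner(X, y, d)` is an assumed stand-in that returns a ±1-valued hypothesis with weighted error 0 < ε < 1/2 under d:

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    N = len(y)
    d = np.full(N, 1.0 / N)              # d^0_n = 1/N
    hypotheses, w = [], []
    for t in range(T):
        h = weak_learner(X, y, d)
        u = y * h(X)                     # u^t_n = y_n h_t(x_n)
        eps = d[u < 0].sum()             # weighted error; edge = 1 - 2 eps
        w_t = 0.5 * np.log((1 - eps) / eps)
        d = d * np.exp(-w_t * u)         # up-weight mistakes
        d = d / d.sum()                  # normalize
        hypotheses.append(h)
        w.append(w_t)
    # final hypothesis: sgn(sum_t w_t h_t(.))
    return lambda Xq: np.sign(sum(w_t * h(Xq) for w_t, h in zip(w, hypotheses)))
```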


Objectives

Edge

Edges of past hypotheses should be small after the update:
minimize the maximum edge of the past hypotheses

Margin

Choose the convex combination of weak hypotheses
that maximizes the minimum margin

[Figure: margins of the examples in the (feature 1, feature 2) plane]

Which margin?

SVM: 2-norm margin (weights on examples)
Boosting: 1-norm margin (weights on base hypotheses)

Connection between the objectives?


Edge vs. margin

min max edge = max min margin:

$\min_{d \in S^N} \; \max_{q=1,\dots,t-1} \; \underbrace{u^q \cdot d}_{\text{edge of hypothesis } q} \;=\; \max_{w \in S^{t-1}} \; \min_{n=1,\dots,N} \; \underbrace{\sum_{q=1}^{t-1} u^q_n w_q}_{\text{margin of example } n}$

Linear Programming duality


Boosting as zero-sum-game [FS97]

Rock, Paper, Scissors game

Gain matrix (row player vs. column player):

              R       P       S
              w1      w2      w3
R  d1         0       1      -1
P  d2        -1       0       1
S  d3         1      -1       0

A single row is a pure strategy of the row player; d is a mixed strategy
A single column is a pure strategy of the column player; w is a mixed strategy

Row player minimizes, column player maximizes:
payoff $= d^\top U w = \sum_{i,j} d_i U_{i,j} w_j$


Optimum strategy

              w1=.33  w2=.33  w3=.33
R  d1=.33       0       1      -1
P  d2=.33      -1       0       1
S  d3=.33       1      -1       0

Min-max theorem:

$\min_d \max_w d^\top U w \;=\; \min_d \max_j d^\top U e_j \;=\; \max_w \min_d d^\top U w \;=\; \max_w \min_i e_i^\top U w$
$=$ value of the game (0 in the example)

($e_j$ is a pure strategy)
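The row player's problem is a small linear program; a sketch using scipy's linprog that recovers the uniform optimal strategy and game value 0:

```python
import numpy as np
from scipy.optimize import linprog

U = np.array([[ 0.,  1., -1.],    # Rock, Paper, Scissors gain matrix
              [-1.,  0.,  1.],
              [ 1., -1.,  0.]])
n_rows, n_cols = U.shape

# min_d max_j (U^T d)_j as an LP in z = (d_1, ..., d_n, v):
# minimize v  s.t.  U^T d - v <= 0,  sum(d) = 1,  d >= 0.
c = np.r_[np.zeros(n_rows), 1.0]
A_ub = np.c_[U.T, -np.ones(n_cols)]
b_ub = np.zeros(n_cols)
A_eq = np.r_[np.ones(n_rows), 0.0][None, :]
bounds = [(0, None)] * n_rows + [(None, None)]   # v is free
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
print(res.x[:n_rows], res.x[-1])  # d* ~ (.33, .33, .33), value ~ 0
```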


Connection to Boosting?

Rows are the examples

Columns $u^q$ encode the weak hypotheses $h_q$

Row sum: margin of an example

Column sum: edge of a weak hypothesis

Value of game:

min max edge = max min margin

von Neumann's Minimax Theorem


Edges/margins

              w1=.33  w2=.33  w3=.33   margin
R  d1=.33       0       1      -1        0
P  d2=.33      -1       0       1        0
S  d3=.33       1      -1       0        0
edge            0       0       0

min margin = 0, max edge = 0, value of game = 0


New column added: boosting

              w1=.44  w2=0   w3=.22  w4=.33   margin
R  d1=.22       0      1      -1       1       .11
P  d2=.33      -1      0       1       1       .11
S  d3=.44       1     -1       0      -1       .11
edge           .11   -.22     .11     .11

Value of game increases from 0 to .11


Row added: on-line learning

              w1=.33  w2=.44  w3=.22   margin
R  d1=0         0       1      -1       .22
P  d2=.22      -1       0       1      -.11
S  d3=.44       1      -1       0      -.11
   d4=.33      -1       1      -1      -.11
edge          -.11    -.11    -.11

Value of game decreases from 0 to -.11


Boosting: maximize margin incrementally

iteration 1:        iteration 2:              iteration 3:
       w^1_1               w^2_1  w^2_2              w^3_1  w^3_2  w^3_3
d^1_1    0          d^2_1    0     -1         d^3_1    0     -1      1
d^1_2    1          d^2_2    1      0         d^3_2    1      0     -1
d^1_3   -1          d^2_3   -1      1         d^3_3   -1      1      0

In each iteration, solve an optimization problem to update d

Column player / oracle provides a new hypothesis

Boosting is a column generation method in the d domain
and coordinate descent in the w domain

What is Boosting?

Boosting = greedy method for increasing the margin

Converges to the optimum margin w.r.t. all hypotheses

Want a small number of iterations


Assumption on next weak hypothesis

For the current weighting of the examples,
the oracle returns a hypothesis of edge ≥ g

Goal

For a given ε, produce a convex combination of weak hypotheses
with soft margin ≥ g − ε

Number of iterations: $O\left(\frac{\ln N}{\epsilon^2}\right)$


Recall min max thm

$\min_{d \in S^N} \; \max_{q=1,\dots,t} \; \underbrace{u^q \cdot d}_{\text{edge of hypothesis } q} \;=\; \max_{w \in S^t} \; \min_{n=1,\dots,N} \; \underbrace{\left(\sum_{q=1}^{t} u^q_n w_q\right)}_{\text{margin of example } n}$


Visualizing the margin

[Figure: the margin in the (feature 1, feature 2) plane]


Min max thm - inseparable case

Slack variables in the w domain = capping in the d domain:

$\min_{d \in S^N,\, d \le \frac{1}{\nu}\mathbf{1}} \; \max_{q=1,\dots,t} \; \underbrace{u^q \cdot d}_{\text{edge of hypothesis } q} \;=\; \max_{w \in S^t,\, \psi \ge 0} \; \min_{n=1,\dots,N} \; \underbrace{\left(\sum_{q=1}^{t} u^q_n w_q + \psi_n\right)}_{\text{soft margin of example } n} \;-\; \frac{1}{\nu}\sum_{n=1}^N \psi_n$


Visualizing the soft margin

[Figure: the soft margin with slack ψ, plotted against hypothesis 1 and hypothesis 2]


LPBoost

[Figure: LP objective value $P^t_{\text{LP}}$ as a function of d_1]

Choose the distribution that minimizes the maximum edge of the current hypotheses by solving

$P^t_{\text{LP}} \;=\; \min_{\sum_n d_n = 1,\; d \le \frac{1}{\nu}\mathbf{1}} \; \max_{q=1,\dots,t} \; u^q \cdot d$

All weight is put on examples with minimum soft margin
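The d-step above written as an LP, again via scipy's linprog (a sketch; U holds the accuracies u^q_n of the hypotheses seen so far):

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_distribution(U, nu):
    """Minimize gamma = max_q u^q . d over capped distributions d <= 1/nu;
    U is the t x N matrix whose row q holds the accuracies u^q_n."""
    t, N = U.shape
    c = np.r_[np.zeros(N), 1.0]                      # minimize gamma
    A_ub = np.c_[U, -np.ones(t)]                     # U d - gamma <= 0
    b_ub = np.zeros(t)
    A_eq = np.r_[np.ones(N), 0.0][None, :]           # sum_n d_n = 1
    bounds = [(0.0, 1.0 / nu)] * N + [(None, None)]  # capping; gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:N], res.x[-1]                      # d and the min-max edge
```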


Entropy Regularized LPBoost

$\min_{\sum_n d_n = 1,\; d \le \frac{1}{\nu}\mathbf{1}} \; \max_{q=1,\dots,t} \; u^q \cdot d \;+\; \frac{1}{\eta}\,\Delta(d, d^0)$

$d_n = \frac{\exp(-\eta \cdot \text{soft margin of example } n)}{Z}$   ("soft min")

Form of the weights appeared first in the ν-Arc algorithm [RSS+00]

Regularization in the d domain makes the problem strongly convex

Gradient of the dual is Lipschitz continuous in w [e.g. HL93, RW97]
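The form of the weights on this slide, as a sketch that ignores the capping/slack terms (so the "soft margin" is just the margin):

```python
import numpy as np

def erlpboost_weights(U, w, eta, d0):
    """Soft-min weighting: concentrates d0 on small-margin examples;
    larger eta moves it toward LPBoost's hard min."""
    margins = U.T @ w                 # margin of each example under w
    d = d0 * np.exp(-eta * margins)   # exp(-eta * margin), times the prior
    return d / d.sum()                # divide by Z to normalize
```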


The effect of entropy regularization

Different distributions on the examples:

[Figure, left: LPBoost weights (lots of zeros / brittle)]
[Figure, right: ERLPBoost weights (smoother)]

Overview of Boosting algorithms

AdaBoost [FS97]

$d^t_n := \frac{d^{t-1}_n \exp(-w_t u^t_n)}{\sum_{n'} d^{t-1}_{n'} \exp(-w_t u^t_{n'})}$

where $w_t$ is chosen so that $\sum_{n'} d^{t-1}_{n'} \exp(-w\, u^t_{n'})$ is minimized,
i.e. setting the derivative w.r.t. w to zero gives

$\frac{\sum_n u^t_n\, d^{t-1}_n \exp(-w_t u^t_n)}{\sum_{n'} d^{t-1}_{n'} \exp(-w_t u^t_{n'})} \;=\; u^t \cdot d^t \;=\; 0$

Easy to implement

Adjusts the distribution so that the edge of the last hypothesis is zero

Gets within half of the optimal hard margin [RSD07], but only in the limit


Corrective versus totally corrective

Processing the last hypothesis versus all past hypotheses:

Corrective      Totally corrective
AdaBoost        LPBoost
LogitBoost      TotalBoost
AdaBoost*       SoftBoost
SS, Colt08      ERLPBoost


From AdaBoost to ERLPBoost

AdaBoost (as interpreted in [KW99, La99])

Primal:  $\min_d \Delta(d, d^{t-1})$   s.t. $d \cdot u^t = 0$, $\|d\|_1 = 1$

Dual:    $\max_w -\ln \sum_n d^{t-1}_n \exp(-\eta u^t_n w_t)$   s.t. $w \ge 0$

Achieves half of the optimum hard margin in the limit

AdaBoost* [RW05]

Primal:  $\min_d \Delta(d, d^{t-1})$   s.t. $d \cdot u^t \le \gamma_t$, $\|d\|_1 = 1$

Dual:    $\max_w -\ln \sum_n d^{t-1}_n \exp(-\eta u^t_n w_t) - \gamma_t \|w\|_1$   s.t. $w \ge 0$

where the edge bound $\gamma_t$ is adjusted downward by a heuristic

Good iteration bound for reaching the optimum hard margin


SoftBoost [WGR07]

Primal:  $\min_d \Delta(d, d^0)$   s.t. $\|d\|_1 = 1$, $d \le \frac{1}{\nu}\mathbf{1}$, $d \cdot u^q \le \gamma_t$ for $1 \le q \le t$

Dual:    $\min_{w, \psi} -\ln \sum_n d^0_n \exp\left(-\eta \sum_{q=1}^t u^q_n w_q - \eta \psi_n\right) - \frac{1}{\nu}\|\psi\|_1 - \gamma_t \|w\|_1$   s.t. $w \ge 0$, $\psi \ge 0$

where the edge bound $\gamma_t$ is adjusted downward by a heuristic

Good iteration bound for reaching the soft margin

ERLPBoost [WGV08]

Primal:  $\min_{d, \gamma} \gamma + \frac{1}{\eta}\Delta(d, d^0)$   s.t. $\|d\|_1 = 1$, $d \le \frac{1}{\nu}\mathbf{1}$, $d \cdot u^q \le \gamma$ for $1 \le q \le t$

Dual:    $\min_{w, \psi} -\frac{1}{\eta}\ln \sum_n d^0_n \exp\left(-\eta \sum_{q=1}^t u^q_n w_q - \eta \psi_n\right) - \frac{1}{\nu}\|\psi\|_1$   s.t. $w \ge 0$, $\|w\|_1 = 1$, $\psi \ge 0$

where for the iteration bound $\eta$ is fixed to $\max\left(\frac{2}{\epsilon}\ln\frac{N}{\nu}, \frac{1}{2}\right)$

Good iteration bound for reaching the soft margin


Corrective ERLPBoost [SS08]

Primal:  $\min_d \sum_{q=1}^t w_q (u^q \cdot d) + \frac{1}{\eta}\Delta(d, d^0)$   s.t. $\|d\|_1 = 1$, $d \le \frac{1}{\nu}\mathbf{1}$

Dual:    $\min_{\psi} -\frac{1}{\eta}\ln \sum_n d^0_n \exp\left(-\eta \sum_{q=1}^t u^q_n w_q - \eta \psi_n\right) - \frac{1}{\nu}\|\psi\|_1$   s.t. $\psi \ge 0$

where for the iteration bound $\eta$ is fixed to $\max\left(\frac{2}{\epsilon}\ln\frac{N}{\nu}, \frac{1}{2}\right)$

Good iteration bound for reaching the soft margin


Iteration bounds

Corrective      Totally corrective
AdaBoost        LPBoost
LogitBoost      TotalBoost
AdaBoost*       SoftBoost
SS, Colt08      ERLPBoost

Strong oracle: returns a hypothesis with maximum edge
Weak oracle: returns a hypothesis with edge ≥ g

In $O\left(\frac{\ln \frac{N}{\nu}}{\epsilon^2}\right)$ iterations:
within ε of the maximum soft margin for the strong oracle,
or within ε of g for the weak oracle

Ditto for the hard margin case

In $O\left(\frac{\ln N}{g^2}\right)$ iterations: consistency with the weak oracle


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              0       0       0       0       0
d1 = .125    +1      -.95    -.93    -.91    -.99      -
d2 = .125    +1      -.95    -.93    -.91    -.99      -
d3 = .125    +1      -.95    -.93    -.91    -.99      -
d4 = .125    +1      -.95    -.93    -.91    -.99      -
d5 = .125    -.98    +1      -.93    -.91    +.99      -
d6 = .125    -.97    -.96    +1      -.91    +.99      -
d7 = .125    -.97    -.95    -.94    +1      +.99      -
d8 = .125    -.97    -.95    -.93    -.92    +.99      -
edge         .0137  -.7075  -.6900  -.6725   .0000

value: -1


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              1       0       0       0       0
d1 = 0       +1      -.95    -.93    -.91    -.99      1
d2 = 0       +1      -.95    -.93    -.91    -.99      1
d3 = 0       +1      -.95    -.93    -.91    -.99      1
d4 = 0       +1      -.95    -.93    -.91    -.99      1
d5 = 1       -.98    +1      -.93    -.91    +.99     -.98
d6 = 0       -.97    -.96    +1      -.91    +.99     -.97
d7 = 0       -.97    -.95    -.94    +1      +.99     -.97
d8 = 0       -.97    -.95    -.93    -.92    +.99     -.97
edge         -.98     1      -.93    -.91     .99

value: -1, -.98


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              0       1       0       0       0
d1 = 0       +1      -.95    -.93    -.91    -.99     -.95
d2 = 0       +1      -.95    -.93    -.91    -.99     -.95
d3 = 0       +1      -.95    -.93    -.91    -.99     -.95
d4 = 0       +1      -.95    -.93    -.91    -.99     -.95
d5 = 0       -.98    +1      -.93    -.91    +.99      1
d6 = 1       -.97    -.96    +1      -.91    +.99     -.96
d7 = 0       -.97    -.95    -.94    +1      +.99     -.95
d8 = 0       -.97    -.95    -.93    -.92    +.99     -.95
edge         -.97    -.96     1      -.91     .99

value: -1, -.98, -.96


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              0       0       1       0       0
d1 = 0       +1      -.95    -.93    -.91    -.99     -.93
d2 = 0       +1      -.95    -.93    -.91    -.99     -.93
d3 = 0       +1      -.95    -.93    -.91    -.99     -.93
d4 = 0       +1      -.95    -.93    -.91    -.99     -.93
d5 = 0       -.98    +1      -.93    -.91    +.99     -.93
d6 = 0       -.97    -.96    +1      -.91    +.99      1
d7 = 1       -.97    -.95    -.94    +1      +.99     -.94
d8 = 0       -.97    -.95    -.93    -.92    +.99     -.93
edge         -.97    -.95    -.94     1       .99

value: -1, -.98, -.96, -.94


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              0       0       0       1       0
d1 = 0       +1      -.95    -.93    -.91    -.99     -.91
d2 = 0       +1      -.95    -.93    -.91    -.99     -.91
d3 = 0       +1      -.95    -.93    -.91    -.99     -.91
d4 = 0       +1      -.95    -.93    -.91    -.99     -.91
d5 = 0       -.98    +1      -.93    -.91    +.99     -.91
d6 = 0       -.97    -.96    +1      -.91    +.99     -.91
d7 = 0       -.97    -.95    -.94    +1      +.99      1
d8 = 1       -.97    -.95    -.93    -.92    +.99     -.92
edge         -.97    -.95    -.94    -.92     .99

value: -1, -.98, -.96, -.94, -.92


LPBoost may require Ω(N) iterations

              w1      w2      w3      w4      w5     margin
              .5      .0026   0       0       .4975
d1 = .497    +1      -.95    -.93    -.91    -.99     .0051
d2 = 0       +1      -.95    -.93    -.91    -.99     .0051
d3 = 0       +1      -.95    -.93    -.91    -.99     .0051
d4 = 0       +1      -.95    -.93    -.91    -.99     .0051
d5 = 0       -.98    +1      -.93    -.91    +.99     .0051
d6 = .490    -.97    -.96    +1      -.91    +.99     .0051
d7 = 0       -.97    -.95    -.94    +1      +.99     .0051
d8 = .013    -.97    -.95    -.93    -.92    +.99     .0051
edge         .0051   .0051   .9055   .9100   .0051

value: -1, -.98, -.96, -.94, -.92, .0051

No ties!


LPBoost may return bad final hypothesis

How good is the master hypothesis returned by LPBoost compared
to the best possible convex combination of hypotheses?

Any linearly separable dataset can be reduced to a dataset on which
LPBoost misclassifies all examples by

adding a bad example

adding a bad hypothesis


Adding a bad example

              w1      w2      w3      w4      w5     margin
              .5      .0026   0       0       .4975
d1 = 0       +1      -.95    -.93    -.91    -.99     .0051
d2 = 0       +1      -.95    -.93    -.91    -.99     .0051
d3 = 0       +1      -.95    -.93    -.91    -.99     .0051
d4 = 0       +1      -.95    -.93    -.91    -.99     .0051
d5 = 0       -.98    +1      -.93    -.91    +.99     .0051
d6 = 0       -.97    -.96    +1      -.91    +.99     .0051
d7 = 0       -.97    -.95    -.94    +1      +.99     .0051
d8 = 0       -.97    -.95    -.93    -.92    +.99     .0051
d9 = 1       -.03    -.03    -.03    -.03    -.03    -.03
edge         -.03    -.03    -.03    -.03    -.03

value: -1, -.98, -.96, -.94, -.92, -.03


Adding a bad hypothesis

              w1      w2      w3      w4      w5      w6     margin
              0       0       0       0       0       1
d1 = 0       +1      -.95    -.93    -.91    -.99    -.01    .0051
d2 = 0       +1      -.95    -.93    -.91    -.99    -.01    .0051
d3 = 0       +1      -.95    -.93    -.91    -.99    -.01    .0051
d4 = 0       +1      -.95    -.93    -.91    -.99    -.01    .0051
d5 = 0       -.98    +1      -.93    -.91    +.99    -.01    .0051
d6 = 0       -.97    -.96    +1      -.91    +.99    -.01    .0051
d7 = 0       -.97    -.95    -.94    +1      +.99    -.01    .0051
d8 = 0       -.97    -.95    -.93    -.92    +.99    -.01    .0051
d9 = 1       -.03    -.03    -.03    -.03    -.03    -.02    .0051
edge         -.03    -.03    -.03    -.03    -.03    -.02

value: -1, -.98, -.96, -.94, -.92, -.03


Adding a bad hypothesis

              w1      w2      w3      w4      w5      w6     margin
              0       0       0       0       0       1
d1 = 0       +1      -.95    -.93    -.91    -.99    -.01    -.01
d2 = 0       +1      -.95    -.93    -.91    -.99    -.01    -.01
d3 = 0       +1      -.95    -.93    -.91    -.99    -.01    -.01
d4 = 0       +1      -.95    -.93    -.91    -.99    -.01    -.01
d5 = 0       -.98    +1      -.93    -.91    +.99    -.01    -.01
d6 = 0       -.97    -.96    +1      -.91    +.99    -.01    -.01
d7 = 0       -.97    -.95    -.94    +1      +.99    -.01    -.01
d8 = 0       -.97    -.95    -.93    -.92    +.99    -.01    -.01
d9 = 1       -.03    -.03    -.03    -.03    -.03    -.02    -.02
edge         -.03    -.03    -.03    -.03    -.03    -.02

value: -1, -.98, -.96, -.94, -.92, -.03, -.02



Adding a bad hypothesis

              w1      w2      w3      w4      w5      w6     margin
              .5      0       0       0       .5      0
d1 = 0       +1      -.95    -.93    -.91    -.99    -.01    +.005
d2 = 0       +1      -.95    -.93    -.91    -.99    -.01    +.005
d3 = 0       +1      -.95    -.93    -.91    -.99    -.01    +.005
d4 = 0       +1      -.95    -.93    -.91    -.99    -.01    +.005
d5 = 0       -.98    +1      -.93    -.91    +.99    -.01    +.005
d6 = 0       -.97    -.96    +1      -.91    +.99    -.01    +.01
d7 = 0       -.97    -.95    -.94    +1      +.99    -.01    +.01
d8 = 0       -.97    -.95    -.93    -.92    +.99    -.01    +.01
d9 = 1       -.03    -.03    -.03    -.03    -.03    -.02    -.03


Synopsis

LPBoost is often unstable

For safety, add relative entropy regularization

Corrective algorithms:
sometimes easier to code, fast per iteration

Totally corrective algorithms:
smaller number of iterations, faster overall when ε is small

Weak versus strong oracle makes a big difference in practice


$O\left(\frac{\ln N}{\epsilon^2}\right)$ iteration bounds

Good

The bound is a major design tool

Any reasonable boosting algorithm should have this bound

Bad

The bound is weak: $\frac{\ln N}{\epsilon^2} \ge N$ (i.e., the bound exceeds even the trivial N iterations) whenever

ε = .01:  N ≤ 1.2 · 10^5
ε = .001: N ≤ 1.7 · 10^7

Why are totally corrective algorithms much better in practice?
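A quick numeric check of the crossover points quoted above:

```python
import numpy as np

# Find the largest N for which ln(N)/eps^2 >= N, i.e. where the bound
# is still weaker than the trivial bound of N iterations.
for eps in (0.01, 0.001):
    N = np.arange(2.0, 1e8, 1000.0)
    print(eps, N[np.log(N) / eps**2 >= N].max())
    # ~1.16e5 for eps=.01 and ~1.66e7 for eps=.001,
    # matching the slide's 1.2*10^5 and 1.7*10^7 up to rounding
```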


Lower bounds on the number of iterations

Majority of $\Omega\left(\frac{\ln N}{g^2}\right)$ hypotheses needed for achieving consistency with a weak oracle of guarantee g [Fr95]

Easy: $\Omega\left(\frac{1}{\epsilon^2}\right)$ iteration bound for getting within ε of the hard margin with a strong oracle

Harder: $\Omega\left(\frac{\ln N}{\epsilon^2}\right)$ iteration bound for a strong oracle [Ne83?]

Conclusion and Open Problems

Conclusion

Adding relative entropy regularization to LPBoost
leads to a good boosting algorithm

Boosting is an instantiation of the MaxEnt and MinxEnt principles
[Jaynes 57, Kullback 59]

Relative entropy regularization
smoothes the one-norm regularization

Open

When hypotheses have one-sided error, then
$O\left(\frac{\ln N}{\epsilon}\right)$ iterations suffice [As00, HW03]

Does ERLPBoost have an $O\left(\frac{\ln N}{\epsilon}\right)$ bound when the hypotheses are one-sided?

Replace geometric optimizers by entropic ones

Compare ours with Freund's algorithms that don't just cap, but forget examples


Acknowledgment

Rob Schapire and Yoav Freund for pioneering Boosting

Gunnar Rätsch for bringing in optimization

Karen Glocer for helping with figures and plots