A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Sachin Gupta and Nitish Lakhanpal
November 1, 2010
Introduction
Scenario: a gambler wants to let his fellow gamblers make bets on his behalf
He will wager a fixed sum of money on every race, but will apportion his money among his friends based on how well they do
Obviously, if he knew which of his friends would do the best, he would just give him all of the money
He wants an allocation strategy that will mimic the payoff he would have attained had he bet with the luckiest friend
The paper discusses a simple algorithm for solving these dynamic allocation problems
On-line Allocation Model
Agent A has N options or strategies to choose from
Number the strategies with the integers 1, ..., N
At each time step t = 1, 2, ..., T:
Allocator A decides on a distribution p^t over the strategies: p^t_i ≥ 0 is the amount allocated to strategy i, and ∑_{i=1}^{N} p^t_i = 1
Each strategy suffers a loss l^t_i, which is decided by the environment
The loss suffered by A is ∑_i p^t_i · l^t_i, i.e. the average loss of the strategies with respect to A's chosen allocation rule
This loss function is called the mixture loss
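The paper's Hedge(β) algorithm solves this allocation problem with multiplicative weights: keep one weight per strategy, allocate proportionally to the weights, and shrink each weight by β^loss after the round. A minimal sketch in Python (the function and variable names are ours, not the paper's):

```python
def hedge(losses, beta=0.9):
    """Run Hedge(beta) over a sequence of loss vectors, one per round.

    losses: list of rounds, each a list of N per-strategy losses in [0, 1].
    Weights start equal; after each round every weight is multiplied by
    beta ** loss, so persistently bad strategies lose influence.
    Returns the cumulative mixture loss sum_t p^t . l^t.
    """
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                        # p^t_i = w^t_i / sum_j w^t_j
        total += sum(pi * li for pi, li in zip(p, l))   # mixture loss this round
        w = [wi * beta ** li for wi, li in zip(w, l)]   # multiplicative update
    return total
```

With losses [[0.0, 1.0]] repeated (strategy 0 is always best), the cumulative mixture loss stays bounded by roughly ln(N)/(1 − β) on top of the best strategy's loss, rather than growing like T/2 as a fixed uniform allocation would.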
Bounds on Losses
The loss suffered by any strategy is bounded, so that without loss of generality l^t_i ∈ [0, 1]
No assumption is made about the form of the loss vectors l^t or the manner in which they are generated
The adversary's choice of l^t may even depend on the allocator's chosen mixture p^t
The goal of algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy
A attempts to minimize the net loss L_A − min_i L_i, where
L_A = ∑_{t=1}^{T} p^t · l^t, the total cumulative loss suffered by algorithm A on the first T trials
L_i = ∑_{t=1}^{T} l^t_i, the cumulative loss of strategy i
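These quantities are mechanical to compute; a tiny helper (ours, exercised on a made-up loss sequence) makes the definitions concrete:

```python
def net_loss(allocations, losses):
    """Net loss L_A - min_i L_i of an allocator.

    allocations: the allocator's p^t for each round (each sums to 1).
    losses: the environment's l^t for each round, entries in [0, 1].
    L_A = sum_t p^t . l^t and L_i = sum_t l^t_i, as on the slide.
    """
    L_A = sum(sum(pi * li for pi, li in zip(p, l))
              for p, l in zip(allocations, losses))
    per_strategy = [sum(l[i] for l in losses) for i in range(len(losses[0]))]
    return L_A - min(per_strategy)
```

For a uniform allocator over two strategies with losses [[0, 1], [1, 0], [0, 1]], the mixture loss is L_A = 1.5 while the best strategy has L = 1, so the net loss is 0.5.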
Repeated n Decision Game
The Hedge algorithm can be represented as a repeated n-decision game in a very natural way.
Let there be payoff vectors p^t, with entries p^t_i, equivalent to our loss vectors l^t
Let there be a decision space ∆_n consisting of decisions x^t, equivalent to our (normalized) weight vectors w^t
Proof of Regret Bound for Exponential Updating
We prove this in the repeated n-decision game framework (equivalent to the paper's formulation)
Lemma 1: For any ε > 0, the x^t of the weighted-majority algorithm Hedge(ε) maximizes

\[ \frac{1}{\varepsilon} H(x) + p^1 \cdot x + \dots + p^{t-1} \cdot x \]

over x ∈ ∆_n
Proof of Lemma 1 (Regularity)
By definition, we have for any x ∈ ∆_n (writing P^t_i = ∑_{s<t} p^s_i for the cumulative payoff of decision i),

\[ \frac{1}{\varepsilon} H(x) + p^1 \cdot x + \dots + p^{t-1} \cdot x = \sum_i \left( \frac{1}{\varepsilon}\, x_i \log \frac{1}{x_i} + P^t_i\, x_i \right) \]

By simple algebra, the above is equal to

\[ \sum_i \frac{1}{\varepsilon} \left( x_i \log \frac{1}{x_i} + \varepsilon\, x_i P^t_i \right) = \frac{1}{\varepsilon} \sum_i x_i \log \frac{e^{\varepsilon P^t_i}}{x_i} \]

For the x_i chosen by the algorithm, x_i = e^{εP^t_i} / Z^t with Z^t = ∑_i e^{εP^t_i}, so the above expression is (1/ε) ∑ x_i log Z^t = (1/ε) log Z^t, because ∑ x_i = 1. Using Jensen's inequality, for any x ∈ ∆_n we have:

\[ \frac{1}{\varepsilon} \sum_i x_i \log \frac{e^{\varepsilon P^t_i}}{x_i} \le \frac{1}{\varepsilon} \log \sum_i x_i\, \frac{e^{\varepsilon P^t_i}}{x_i} = \frac{1}{\varepsilon} \log Z^t \]
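Lemma 1 is easy to sanity-check numerically: the exponential-weights point x_i ∝ e^{εP_i} should score at least as high on (1/ε)H(x) + P · x as any other point of the simplex, and its value should equal (1/ε) log Z. A small check in pure Python (all names ours):

```python
import math
import random

def objective(x, P, eps):
    """(1/eps) * H(x) + P . x, where H(x) = sum_i x_i log(1/x_i)."""
    H = sum(-xi * math.log(xi) for xi in x if xi > 0)
    return H / eps + sum(xi * Pi for xi, Pi in zip(x, P))

def hedge_point(P, eps):
    """The exponential-weights choice: x_i proportional to exp(eps * P_i)."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)
    return [wi / Z for wi in w]

random.seed(0)
P, eps = [0.3, -1.2, 0.8], 0.5
x_star = hedge_point(P, eps)
best = objective(x_star, P, eps)
# Lemma 1: no other point of the simplex scores higher, and the
# maximum value is (1/eps) * log Z^t (the Jensen equality case).
for _ in range(1000):
    u = [random.random() for _ in P]
    s = sum(u)
    x = [ui / s for ui in u]          # a random point of the simplex
    assert objective(x, P, eps) <= best + 1e-12
```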
Proof of Lemma 2 (Stability)
Lemma 2: For any ε, M > 0, t ≥ 1 and p^1, p^2, ..., p^t ∈ [−M, M]^n,

\[ \left| p^t \cdot x^{t+1} - p^t \cdot x^t \right| \le 4\varepsilon M^2 \]

Proof
Note first that P^{t+1}_i − M ≤ P^t_i ≤ P^{t+1}_i + M, and hence

\[ e^{\varepsilon P^{t+1}_i} e^{-\varepsilon M} \le e^{\varepsilon P^t_i} \le e^{\varepsilon P^{t+1}_i} e^{\varepsilon M} \]

Summing the right-hand inequality over i gives Z^t ≤ Z^{t+1} e^{εM}; combined with the left-hand inequality, this gives

\[ x^t_i = \frac{e^{\varepsilon P^t_i}}{Z^t} \ge e^{-2\varepsilon M}\, \frac{e^{\varepsilon P^{t+1}_i}}{Z^{t+1}} = e^{-2\varepsilon M}\, x^{t+1}_i \qquad \forall i \le n \]

Finally, since e^{−s} ≥ 1 − s for s ≥ 0, we have that x^t_i ≥ (1 − 2εM) x^{t+1}_i
Proof of Lemma 2 (Stability)
Let λ = 2εM. First, if λ > 1, notice that the lemma is trivial, because then 4εM² > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume that λ ∈ [0, 1]. Let z^t ∈ R^n be the unique vector such that

\[ x^t = (1 - \lambda)\, x^{t+1} + \lambda\, z^t \]

Then we claim that z^t ∈ ∆_n. The fact that z^t_i ≥ 0 follows directly from the bound x^t_i ≥ (1 − λ) x^{t+1}_i above. The fact that ∑_i z^t_i = 1 follows from ∑_i x^t_i = 1 and ∑_i x^{t+1}_i = 1: summing the defining equation gives 1 = (1 − λ) + λ ∑_i z^t_i
Proof of Lemma 2 (Stability)
Finally,

\[ x^t \cdot p^t - x^{t+1} \cdot p^t = -\lambda\, x^{t+1} \cdot p^t + \lambda\, z^t \cdot p^t \]

Since y · p^t ∈ [−M, M] for all y ∈ ∆_n, the magnitude of the above quantity is at most 2λM = 4εM², as required by the lemma
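The stability bound can also be checked empirically: run the exponential-weights update against random payoffs in [−M, M]^n and confirm that consecutive decisions never shift the payoff by more than 4εM². A sketch (ours, with hypothetical parameter values):

```python
import math
import random

def exp_weights(P, eps):
    """x_i proportional to exp(eps * P_i) over cumulative payoffs P."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)
    return [wi / Z for wi in w]

random.seed(1)
eps, M, n = 0.05, 1.0, 4
P = [0.0] * n                                      # cumulative payoffs P^t_i
for t in range(200):
    p = [random.uniform(-M, M) for _ in range(n)]  # adversarial payoff in [-M, M]^n
    x_t = exp_weights(P, eps)                      # x^t, before seeing p^t
    P = [Pi + pi for Pi, pi in zip(P, p)]
    x_next = exp_weights(P, eps)                   # x^{t+1}
    drift = abs(sum(pi * (a - b) for pi, a, b in zip(p, x_next, x_t)))
    assert drift <= 4 * eps * M * M + 1e-12        # Lemma 2's bound: 4*eps*M^2
```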
Regret Bound
Theorem: For any n, T ≥ 1, M ≥ 0, and for any p^1, p^2, ..., p^T ∈ [−M, M]^n, the weighted-majority algorithm achieves

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} \]

Proof
The stability-regularization lemma, combined with Lemma 1, implies that

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} \left| x^{t+1} \cdot p^t - x^t \cdot p^t \right| + \frac{1}{T} \max_{x, x' \in \Delta_n} \left| \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right| \]
Regret Bound
Since we have shown 0 ≤ H(x) ≤ log(n), we have

\[ \frac{1}{T} \max_{x, x' \in \Delta_n} \left| \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right| \le \frac{\log n}{T\varepsilon} \]

Lemma 2 bounds the “stability” term. Putting these together gives

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} \]
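The slides leave ε free. Minimizing 4εM² + log(n)/(Tε) over ε (a step not on the slides, but immediate from calculus: set the derivative to zero) gives ε* = √(log n / (4M²T)), at which the two terms balance and the regret is at most 4M√(log n / T). A quick numeric confirmation:

```python
import math

def bound(eps, M, n, T):
    """The slide's regret bound 4*eps*M^2 + log(n)/(T*eps)."""
    return 4 * eps * M * M + math.log(n) / (T * eps)

M, n, T = 1.0, 10, 1000
eps_star = math.sqrt(math.log(n) / (4 * M * M * T))  # balances the two terms
optimal = 4 * M * math.sqrt(math.log(n) / T)         # resulting O(sqrt(log n / T)) regret
assert abs(bound(eps_star, M, n, T) - optimal) < 1e-9
for eps in (eps_star / 2, eps_star * 2):             # nearby choices are worse
    assert bound(eps, M, n, T) >= optimal
```

Note that the balanced choice of ε requires knowing M, n, and the horizon T in advance.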
Applications
The framework is quite general and can be applied to a number of learning problems
Take a decision space ∆, a space of outcomes Ω, and a bounded loss function λ : ∆ × Ω → [0, 1]
At each time step t, the learning algorithm selects a decision δ^t from ∆ and receives an outcome ω^t from Ω, suffering loss λ(δ^t, ω^t). In general, we allow the learner to select a distribution D^t over the space of decisions, suffering the expected loss of a decision randomly selected according to D^t
Expected loss: Λ(D^t, ω^t) = E_{δ∼D^t}[λ(δ, ω^t)]
Assuming the learner has access to a set of N experts, we can decide on an appropriate distribution D^t
At each time step t, each expert i produces its own distribution E^t_i on ∆ and suffers loss Λ(E^t_i, ω^t)
The learner needs to combine the distributions produced by the experts so as to suffer an expected loss “not much worse” than that of the best expert
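One natural way (the paper's construction, up to notation) to combine the experts is to run Hedge over the experts themselves, using l^t_i = Λ(E^t_i, ω^t) as expert i's loss, and to play the mixture D^t = ∑_i p^t_i E^t_i. By linearity of expectation the mixture's loss equals the Hedge mixture loss, so the regret bound carries over. A sketch for a finite decision space (names ours):

```python
def combine(expert_dists, weights):
    """Learner's mixture D^t = sum_i p^t_i * E^t_i over a finite decision space.

    expert_dists: one distribution per expert, all over the same decisions.
    weights: Hedge's current distribution p^t over the experts.
    The result is itself a distribution, and its expected loss is the
    weighted average of the experts' expected losses.
    """
    k = len(expert_dists[0])
    return [sum(p * E[j] for p, E in zip(weights, expert_dists)) for j in range(k)]
```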
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)
Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Run the algorithm Hedge(β), treating each expert as a strategy
Hedge(β) produces a distribution p_t on the set of experts that is used to construct the mixture distribution D_t = Σ_{i=1}^N p_i^t E_i^t
For any outcome ω_t, the loss suffered by Hedge(β) is then Λ(D_t, ω_t) = Σ_{i=1}^N p_i^t Λ(E_i^t, ω_t)
If we define l_i^t = Λ(E_i^t, ω_t), then the loss suffered by the learner is p_t · l_t
It follows that for any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β), used as described above, is at most
Σ_{t=1}^T Λ(D_t, ω_t) ≤ min_i Σ_{t=1}^T Λ(E_i^t, ω_t) + √(2L ln N) + ln N
where L ≤ T is an assumed bound on the expected loss of the best expert and β = g(L/ln N)
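A minimal sketch of this allocation loop (my own illustration of the multiplicative update w_i ← w_i · β^{l_i}; the list-of-lists input format is an assumption):

```python
def hedge(losses, beta):
    """Hedge(beta): keep one weight per expert and shrink each weight
    multiplicatively by beta**loss after every trial.

    losses: T rounds, each a list of N expert losses l_i^t in [0, 1]
    beta:   learning-rate parameter in (0, 1)
    Returns the learner's cumulative expected loss, sum_t p_t . l_t.
    """
    n = len(losses[0])
    w = [1.0] * n                                    # uniform initial weights
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                     # distribution p_t over experts
        total += sum(pi * li for pi, li in zip(p, l))  # expected loss p_t . l_t
        w = [wi * beta ** li for wi, li in zip(w, l)]  # multiplicative update
    return total
```

With one perfect expert among three (loss 0 every round) and β = 1/2, the learner's cumulative loss stays below the bound ln(3)/(1 − β) ≈ 2.2 no matter how many rounds are played.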
Rock, Paper, Scissors
The loss function λ can represent an arbitrary matrix game, such as Rock, Paper, Scissors
Here ∆ = Ω = {R, P, S}
Loss function:

             ω
          R    P    S
   δ  R  1/2   1    0
      P   0   1/2   1
      S   1    0   1/2

δ represents the learner's play; ω is the adversary's play
λ(δ, ω) is the learner's loss: 1 if he loses, 0 if he wins, 1/2 if he ties
The cumulative loss is simply the expected number of losses in a series of rounds of game play
The results of the study show that the paper's algorithm achieves expected loss nearly identical to that of the best expert
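The matrix above translates directly into a lookup table; this small sketch (my own illustration, not code from the paper) accumulates the learner's loss over a series of rounds:

```python
# Rock, Paper, Scissors loss matrix from the learner's perspective:
# loss is 1 if the learner loses, 0 if he wins, 1/2 on a tie.
LOSS = {
    ('R', 'R'): 0.5, ('R', 'P'): 1.0, ('R', 'S'): 0.0,
    ('P', 'R'): 0.0, ('P', 'P'): 0.5, ('P', 'S'): 1.0,
    ('S', 'R'): 1.0, ('S', 'P'): 0.0, ('S', 'S'): 0.5,
}

def cumulative_loss(plays):
    """Total loss over a sequence of (delta, omega) = (learner, adversary) rounds."""
    return sum(LOSS[d, w] for d, w in plays)
```

For instance, playing R against S, P, R in turn gives losses 0 + 1 + 1/2 = 1.5.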
Boosting
Bias-Variance Tradeoff
Expected error of learner = variance + squared bias = sensitivity of prediction + systematic error
Low bias → high variance
Example: 1st-order curves vs. 9th-order curves
Boosting
Can we reduce variance without increasing bias?
Use committees
Bias
The committee's performance is at least as good as the average performance of its members
Variance
Spurious patterns outvoted
Boosting
Intuition
Break the training set up into samples focusing on the "hardest" parts of the space
This forces the weak learners to generate new hypotheses → fewer mistakes are made on these parts
Boosting
Strong Learning
For a given ε, δ > 0 and access to random examples, outputs a hypothesis with error at most ε with probability 1 − δ.
A strong learner is a classifier that is arbitrarily well correlated with the true classification.
Weak Learning
Same as above, except that ε ≥ 1/2 − γ, where γ > 0 or decreases as 1/p for a polynomial p
Intuitively, a weak learner is a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing)
Boosting
The Algorithm
Given a distribution over the training set, calculate weights over the examples
Compute a normalized distribution
Feed the normalized distribution to a weak learning algorithm
Generate a new set of weights
Output a hypothesis combining the outputs of the weak hypotheses by a weighted majority vote
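The final step, the weighted majority vote, can be sketched as follows. In AdaBoost each round's vote is weighted by log(1/β_t), so rounds with lower error count more (the function name and list-based interface here are my own choices):

```python
import math

def final_hypothesis(betas, weak_outputs):
    """Combine T weak hypotheses on one example by a weighted majority vote.

    betas:        the beta_t computed in each boosting round, 0 < beta_t < 1
    weak_outputs: h_t(x) in {0, 1} for a single example x, one entry per round
    Output 1 iff the log(1/beta_t)-weighted vote reaches half the total weight.
    """
    weights = [math.log(1.0 / b) for b in betas]
    vote = sum(a * h for a, h in zip(weights, weak_outputs))
    return 1 if vote >= 0.5 * sum(weights) else 0
```

With three rounds of equal β_t, two out of three weak hypotheses voting 1 yields output 1, while only one vote yields 0.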
Boosting
Rules for generating a new set of weights
Calculate the error of the weak hypothesis
ε_t = Σ_{i=1}^N p_i^t |h_t(x_i) − y_i|
Set β_t = ε_t / (1 − ε_t)
Calculate the new weights as:
w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
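One round of these reweighting rules can be sketched directly from the formulas (the function name and list-based interface are my own; the arithmetic follows the slide):

```python
def adaboost_reweight(w, preds, labels):
    """One round of AdaBoost's reweighting.

    w:      current weights w_i^t over the N training examples
    preds:  weak hypothesis outputs h_t(x_i) in [0, 1]
    labels: true labels y_i in {0, 1}
    Returns (eps_t, beta_t, new weights w_i^{t+1}).
    """
    total = sum(w)
    p = [wi / total for wi in w]                  # normalized distribution p_i^t
    eps = sum(pi * abs(h - y) for pi, h, y in zip(p, preds, labels))
    beta = eps / (1.0 - eps)                      # beta_t = eps_t / (1 - eps_t)
    # Correctly predicted examples shrink by a full factor of beta_t;
    # badly predicted examples keep (most of) their weight.
    new_w = [wi * beta ** (1.0 - abs(h - y))
             for wi, h, y in zip(w, preds, labels)]
    return eps, beta, new_w
```

For example, with uniform weights over four examples and one mistake, ε_t = 1/4, β_t = 1/3, the three correct examples are down-weighted to 1/3 and the misclassified one keeps weight 1.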
Relationship Between Boosting and Weighted Majority
Dual relationship between Hedge and AdaBoost
“There is a direct mapping or reduction of the boosting problem to the on-line allocation problem”
The examples used in AdaBoost can be treated as strategies and the weak hypotheses as trials, giving a rehashing of AdaBoost in the terminology of Hedge
We also need to reverse the loss function, because in AdaBoost our loss is large if we encounter a bad prediction, while in Hedge it is large if we encounter a good prediction
Loss is kept large for bad predictions in AdaBoost in order to focus on the "hard" examples
Boosting — Critique
Not enough data
Can’t produce something out of nothing
Base learner too weak
Committee may not be able to come up with a good hypothesis
Base learner too strong
Individual learner may overfit on its own
Susceptibility to noisy data
Works too hard to find patterns in outliers
Boosting — A Bayesian Perspective
In a Bayesian framework with examples generated from some distribution, we are given a set of binary hypotheses $h_i$, $i \in \{1, \ldots, T\}$, with the goal of combining them in an optimal way. We should predict the label $y$ with the highest likelihood.
Predict 1 if $\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)]$
And predict 0 if $\Pr[y = 0 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 1 \mid h_1(x), \ldots, h_T(x)]$
Assuming that the event $h_t(x) \ne y$ is conditionally independent of the actual label $y$ and of the predictions of the other hypotheses, we can use Bayes' rule to rewrite our selection criterion as:
Predict 1 if
\[
\Pr[y = 1] \prod_{t : h_t(x) = 0} \varepsilon_t \prod_{t : h_t(x) = 1} (1 - \varepsilon_t)
\;>\;
\Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \varepsilon_t) \prod_{t : h_t(x) = 1} \varepsilon_t
\]
and 0 otherwise, where $\varepsilon_t = \Pr[h_t(x) \ne y]$
Adding in the trivial hypothesis $h_0$, which always predicts 1, replacing $\Pr[y = 0]$ with $\varepsilon_0$, and taking the log of both sides yields a rule identical to the combination rule generated by AdaBoost
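A small numeric check of this equivalence (the error rates, priors, and predictions are assumed values for illustration): the product-form criterion agrees with a weighted vote whose weights are $\log\frac{1-\varepsilon_t}{\varepsilon_t} = \log\frac{1}{\beta_t}$, as in AdaBoost's combination rule.

```python
import math

eps = [0.1, 0.3, 0.4]          # assumed eps_t = Pr[h_t(x) != y]
prior1, prior0 = 0.5, 0.5      # Pr[y = 1], Pr[y = 0]
preds = [1, 0, 1]              # h_1(x), ..., h_T(x) for some input x

# Left/right sides of the product-form criterion.
lhs, rhs = prior1, prior0
for e, h in zip(eps, preds):
    lhs *= e if h == 0 else (1 - e)
    rhs *= (1 - e) if h == 0 else e

# Equivalent log-domain vote: predict 1 iff the weighted vote for 1 wins.
vote = math.log(prior1 / prior0)
for e, h in zip(eps, preds):
    w = math.log((1 - e) / e)          # = log(1/beta_t)
    vote += w if h == 1 else -w

assert (lhs > rhs) == (vote > 0)       # the two criteria agree
print("predict", 1 if vote > 0 else 0)
```

Taking logs turns each product ratio $\frac{1-\varepsilon_t}{\varepsilon_t}$ into an additive vote, which is why the two rules must always agree.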
Multiclass Extensions of AdaBoost
We modify our setup so that the set of labels is now $Y = \{1, 2, \ldots, k\}$
Correspondingly, our weak learners return hypotheses $h_t : X \to Y$
We still start with a set of weights over examples and get a hypothesis based on this set from the weak learner
We calculate the error of the hypothesis as
\[
\varepsilon_t = \sum_{i=1}^{N} p_i^t \, [[h_t(x_i) \ne y_i]]
\]
where $[[z]]$ is 1 if $z$ holds and 0 otherwise
Let $\beta_t = \dfrac{\varepsilon_t}{1 - \varepsilon_t}$
The weights are updated as $w_i^{t+1} = w_i^t \, \beta_t^{\,1 - [[h_t(x_i) \ne y_i]]}$
The final hypothesis is
\[
h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [[h_t(x) = y]]
\]
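The update above can be sketched end to end in a few lines (AdaBoost.M1-style). The toy data set and the pool of threshold rules used as the weak learner are assumptions made for illustration, not part of the paper.

```python
import numpy as np

X = np.array([0.1, 0.4, 0.6, 0.9, 0.2, 0.8])
y = np.array([0, 0, 1, 2, 0, 2])           # labels from Y = {0, 1, 2}
K, T = 3, 5

# Weak learner: the threshold rule with lowest weighted error from a pool.
def weak_learn(p):
    best = None
    for thr in np.unique(X):
        for lo in range(K):
            for hi in range(K):
                h = np.where(X < thr, lo, hi)
                err = p[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, h)
    return best

w = np.ones(len(X))
hyps, betas = [], []
for t in range(T):
    p = w / w.sum()                        # p_i^t
    eps, h = weak_learn(p)
    if eps >= 0.5:                         # M1 requires eps_t < 1/2
        break
    beta = eps / (1 - eps)
    w = w * beta ** (1 - (h != y))         # w_i^{t+1} = w_i^t * beta_t^(1-[[h_t(x_i)!=y_i]])
    hyps.append(h)
    betas.append(max(beta, 1e-12))         # guard so log(1/beta) stays finite

# Final hypothesis: weighted plurality vote with weights log(1/beta_t).
votes = np.zeros((len(X), K))
for h, b in zip(hyps, betas):
    for i, lab in enumerate(h):
        votes[i, lab] += np.log(1 / b)
pred = votes.argmax(axis=1)
print(pred)
```

Note that each weak rule here only separates two groups of labels, yet the log-weighted vote combines several such rules into a full multiclass hypothesis.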
Notes on the Multiclass Extensions of AdaBoost
Has no way of forcing the weak learner to discriminate between labels that are especially hard to distinguish
More sophisticated extensions exist, some of which are discussed in the paper
Can also approach the problem by converting it into many binary problems, boosting on each of those, and then patching the results together
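The binary-reduction approach in the last bullet can be sketched as one-vs-all: train one binary booster per label (here, binary AdaBoost over threshold stumps) and patch the results together by picking the label whose booster returns the highest real-valued score. The toy 1-D data set is an assumption for illustration.

```python
import numpy as np

def boost_binary(X, z, T=10):
    """Binary AdaBoost over threshold stumps; returns a real-valued scorer."""
    w = np.ones(len(X)) / len(X)
    stumps = []
    for _ in range(T):
        best = None
        for thr in np.unique(X):
            for s in (1, -1):                  # stump: -s if x < thr else s
                h = np.where(X < thr, -s, s)
                err = w[h != z].sum()
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = min(max(err, 1e-9), 1 - 1e-9)    # clamp to keep alpha finite
        alpha = 0.5 * np.log((1 - err) / err)
        h = np.where(X < thr, -s, s)
        w = w * np.exp(-alpha * z * h)         # reweight toward mistakes
        w /= w.sum()
        stumps.append((alpha, thr, s))
    return lambda x: sum(a * (-s if x < thr else s) for a, thr, s in stumps)

def one_vs_all(X, y, labels):
    # One binary problem per label: "this label" (+1) versus the rest (-1).
    scorers = {lab: boost_binary(X, np.where(y == lab, 1.0, -1.0))
               for lab in labels}
    return lambda x: max(labels, key=lambda lab: scorers[lab](x))

X = np.array([-2.0, -1.5, 0.1, 0.2, 1.8, 2.2])
y = np.array([0, 0, 1, 1, 2, 2])
predict = one_vs_all(X, y, [0, 1, 2])
preds = [predict(x) for x in X]
print(preds)
```

Comparing real-valued scores rather than hard ±1 outputs is what makes the patched-together predictor coherent when several binary boosters (or none) claim an input.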